Re: possible new option for wal_sync_method - Mailing list pgsql-hackers
From | Dan Scales |
---|---|
Subject | Re: possible new option for wal_sync_method |
Date | |
Msg-id | 1994285609.3054894.1330727302115.JavaMail.root@zimbra-prod-mbox-4.vmware.com |
In response to | Re: possible new option for wal_sync_method (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hi,

> Got any result so far?

I measured the results with barrier=0, and yes, you are correct -- it seems that most of the benefit of the open_direct wal_sync_method is probably from not doing the barrier operation at the end of fsync():

wal_sync_method           fdatasync   open_direct   open_sync
no archive, barrier=1:        17309         18507       17138
no archive, barrier=0:        17771         18369       18045
archive, barrier=1:           15789         16592       15645
archive, barrier=0:           16616         16785       16547

It took me a while to look through Linux and understand why barrier=1 had such an effect, even for disks with battery-backed caches. As you pointed out, the barrier operation not only flushes the disk cache, but also has some queue implications, particularly for Linux releases below 2.6.37. I've been using 2.6.32, and in that case, the barrier at the end of fsync requires that all previously-queued operations be finished before the barrier occurs and flushes the disk cache. This means that each fsync of the WAL log is likely waiting for completely unrelated in-flight operations on the data files. That is why getting rid of the fsync of the WAL log has such a good performance win, even for disks that don't need a disk cache flush (because the cache is battery-backed).

This option will probably have less benefit for Linux 2.6.37 and above, where barriers are eliminated and operations are written more specifically in terms of disk cache flushes.

fsync() on ext3 (even for Linux 2.6.37 and above) does still wait for any outstanding meta-data transaction to commit. So, there is still another reason to put the WAL log and data files on different logical disks (even if backed by the same physical disk).

It does still seem to me that sync_file_range() is unsafe in the case of non-battery-backed disk write caches, since it doesn't sync the disk cache. However, if sync_file_range() were being used to optimize checkpoint fsyncs, then one final fsync() to an unused file on the same block device would do the trick of flushing the disk cache.
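A minimal sketch of that last idea, assuming Linux and glibc: write back each data file with sync_file_range(), then issue a single fsync() on a scratch file on the same block device. The function names writeback_file() and checkpoint_sync(), and the flush_fd scratch file, are hypothetical and chosen only for illustration; this is not PostgreSQL code. Whether fsync() on a clean file actually reaches the device cache depends on the filesystem and kernel version, so that is an assumption to verify, not a guarantee.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Start writeback of all dirty pages of one file and wait for it to finish.
 * offset = 0 and nbytes = 0 mean "from the start to the end of the file".
 * Note: sync_file_range() does not write file metadata and does not flush
 * the disk's volatile write cache.
 * Returns 0 on success, -1 on error (errno is set).
 */
int writeback_file(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/*
 * Write back every file in fds[], then call fsync() once on flush_fd,
 * a hypothetical scratch file assumed to live on the same block device,
 * in the hope that this single fsync() triggers the disk cache flush.
 * Returns 0 on success, -1 on the first error (errno is set).
 */
int checkpoint_sync(const int *fds, int nfds, int flush_fd)
{
    for (int i = 0; i < nfds; i++) {
        if (writeback_file(fds[i]) != 0)
            return -1;
    }
    return fsync(flush_fd);
}
```

A caller would open the data files and the scratch file once, then call checkpoint_sync(fds, nfds, flush_fd) at the end of a checkpoint instead of calling fsync() on each data file.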
Dan

----- Original Message -----
From: "Andres Freund" <andres@anarazel.de>
To: pgsql-hackers@postgresql.org
Cc: "Dan Scales" <scales@vmware.com>
Sent: Monday, February 27, 2012 12:43:49 PM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks. From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not. So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.

The experiments I know of that played with disabling write caches nearly always had the result that write caching was worth the overhead of syncing.

> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush. The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array. And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.

Well, I think it has some I/O queue implications which could explain some of the difference. In that regard I think it heavily depends on the kernel version, as that's an area which has had loads of pretty radical changes in nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for.) I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations. It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).

I think you are right there. I think the metadata issue could be relieved a lot by growing files in much larger increments than is currently done. I have seen profiles which indicated that lots of time was spent on increasing the file size. I would be very interested in seeing how much changes in that area would benefit real-world benchmarks.

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive. In that case, this
> wal_sync_method option might help ext4 as well.

The journaling code for ext4 is significantly different, so I think it very well might play a role here, although you're probably right and it won't be in *_sync_file.

> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls a f_op->fsync_file operation). So sync_file_range() may
> not be as useful as thought.

Hm, I need to check that. I thought it invoked that path somewhere.

> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3. I don't think that I have seen a
> significant difference for the DBT2 workload with the ext3 option
> data=ordered.

You have not? Interesting again, because I have seen results that differed by an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.

Got any result so far?

Thanks,

Andres
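
The suggestion in the quoted message to grow files in much larger increments could look roughly like the sketch below, which rounds the requested size up to a multiple of a chunk size and allocates the whole chunk in one call, so that the filesystem sees fewer size-changing metadata operations. The function name extend_in_chunks() and the chunk parameter are hypothetical; the thread does not say how PostgreSQL would implement this, and the sketch assumes a POSIX system that provides posix_fallocate().

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Make sure the file behind fd is at least `needed` bytes long, growing it
 * to the next multiple of `chunk` bytes in a single allocation rather than
 * one small block at a time. Never shrinks the file.
 * Returns 0 on success or an errno-style error number, following the
 * posix_fallocate() convention. Overflow of `needed + chunk` is not checked.
 */
int extend_in_chunks(int fd, off_t needed, off_t chunk)
{
    struct stat st;

    if (chunk <= 0 || needed < 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= needed)
        return 0;                       /* already large enough */

    /* Round the target size up to a whole number of chunks. */
    off_t new_size = ((needed + chunk - 1) / chunk) * chunk;

    /* Allocate only the range beyond the current end of file. */
    return posix_fallocate(fd, st.st_size, new_size - st.st_size);
}
```

With a chunk of, say, 16 MB, a file that needs one more 8 kB block would be extended once and then absorb many later block additions without another size change; the right chunk size is workload-dependent and would need benchmarking.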