Thread: Re: [pgsql-hackers-win32] win32 performance - fsync question
> > Magnus prepared a trivial patch which added the O_SYNC flag for
> > windows and mapped it to FILE_FLAG_WRITE_THROUGH in win32_open.c.
>
> Attached is this trivial patch. As Merlin says, it needs some
> more reliability testing. But the numbers are at least reasonable - it
> *seems* like it's doing the right thing (as long as you turn
> off write cache). And it's certainly a significant
> performance increase - it brings the speed almost up to the
> same as linux.

I have now run a bunch of pull-the-plug tests on this patch (literally
pulling the plug, yes, to the point of some of my co-workers thinking
I'm crazy).

My results are:

First, baseline:
* Linux, with fsync (default), write-cache disabled: no data corruption
* Linux, with fsync (default), write-cache enabled: usually no data
  corruption, but two runs which had
* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I got:
  2005-02-24 12:19:54 LOG: could not open file "C:/Program Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0, segment 16): No such file or directory
  but the data in the database was consistent.

Almost all runs showed a line along the lines of:
2005-02-24 11:22:41 LOG: record with zero length at 0/A450548

In the final test, the BIOS decided the disk was giving up and
reassigned it as 0Mb. Required two extra cold boots, then it was back
up to 20Gb. Still no data loss.

My test was three clients doing lots of inserts and updates, some in
transactions, some bare. In some tests, I kicked in a manual vacuum
while at it. Then I yanked the power cord, rebooted, manually started pg,
and verified that the data in the db came up with the same values the
clients reported as last committed. I also ran vacuum verbose on all
tables after it was back up to see if there were any warnings.
Test machine is a 1GHz Celeron, 256Mb RAM and a Maxtor IDE disk.

It'd of course be good if others could also test, but I'm getting the
feeling that this patch at least doesn't make things worse than before :-)
And it's *a lot* faster.

//Magnus
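[Editorial sketch] The "fsync" and "osync" cases in the results differ in when the flush is requested: either by an explicit call after the write, or by opening the file so that every write is synchronous. The patch itself is C code in win32_open.c and is not reproduced in this thread, so the following is only a minimal Python illustration of the same distinction at the POSIX level. The function names `write_with_fsync` and `write_with_osync` are hypothetical, and the `O_SYNC` fallback is an assumption made so the sketch runs on platforms where Python does not expose that flag.

```python
import os

def write_with_fsync(path, payload):
    """Write payload, then request a flush with an explicit fsync call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # separate system call asking the OS to flush this file
    finally:
        os.close(fd)

def write_with_osync(path, payload):
    """Write payload through a descriptor opened with O_SYNC, if available."""
    # Assumption: where the os module has no O_SYNC (e.g. Windows builds of
    # Python), fall back to 0, which degrades this to an ordinary write.
    sync_flag = getattr(os, "O_SYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        os.write(fd, payload)  # with O_SYNC, each write is itself synchronous
    finally:
        os.close(fd)
```

Neither variant says anything about the drive's own write cache: both only control what the operating system is asked to do, which is why the results list write-cache enabled and disabled separately.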
> In the final test, the BIOS decided the disk was giving up and
> reassigned it as 0Mb.. Required two extra cold boots, then it was back
> up to 20Gb. Still no data loss.

I think it would be fun to re-run these tests with MySQL...

Chris
> My results are:
> First, baseline:
> * Linux, with fsync (default), write-cache disabled: no data corruption
> * Linux, with fsync (default), write-cache enabled: usually no data
> corruption, but two runs which had
> * Win32, with fsync, write-cache disabled: no data corruption
> * Win32, with fsync, write-cache enabled: no data corruption
> * Win32, with osync, write cache disabled: no data corruption
> * Win32, with osync, write cache enabled: no data corruption. Once I
> got:
> 2005-02-24 12:19:54 LOG: could not open file "C:/Program
> Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
> segment 16): No such file or directory

In case anyone is wondering, you can turn off write caching on FreeBSD,
for a terrible performance loss...

http://freebsd.active-venture.com/handbook/configtuning-disk.html#AEN8015

Chris
"Magnus Hagander" <mha@sollentuna.net> writes: > My results are: > Fisrt, baseline: > * Linux, with fsync (default), write-cache disabled: no data corruption > * Linux, with fsync (default), write-cache enabled: usually no data > corruption, but two runs which had That makes sense. > * Win32, with fsync, write-cache disabled: no data corruption > * Win32, with fsync, write-cache enabled: no data corruption > * Win32, with osync, write cache disabled: no data corruption > * Win32, with osync, write cache enabled: no data corruption. Once I > got: > 2005-02-24 12:19:54 LOG: could not open file "C:/Program > Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0, > segment 16): No such file or directory > but the data in the database was consistent. It disturbs me that you couldn't produce data corruption in the cases where it theoretically should occur. Seems like this is an indication that your test was insufficiently severe, or that there is something going on we don't understand. regards, tom lane
> "Magnus Hagander" <mha@sollentuna.net> writes: >> My results are: >> Fisrt, baseline: >> * Linux, with fsync (default), write-cache disabled: no data corruption >> * Linux, with fsync (default), write-cache enabled: usually no data >> corruption, but two runs which had > > That makes sense. > >> * Win32, with fsync, write-cache disabled: no data corruption >> * Win32, with fsync, write-cache enabled: no data corruption >> * Win32, with osync, write cache disabled: no data corruption >> * Win32, with osync, write cache enabled: no data corruption. Once I >> got: >> 2005-02-24 12:19:54 LOG: could not open file "C:/Program >> Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0, >> segment 16): No such file or directory >> but the data in the database was consistent. > > It disturbs me that you couldn't produce data corruption in the cases > where it theoretically should occur. Seems like this is an indication > that your test was insufficiently severe, or that there is something > going on we don't understand. > I was thinking about that. A few years back, Microsoft had some serious issues with write caching drives. They were taken to task for losing data if Windows shut down too fast, especially on drives with a large cache. MS is big enough and bad enough to get all the info they need from the various drive makers to know how to handle write cache flushing. Even the stuff that isn't documented. If anyone has a very good debugger and/or emulator or even a logic analyzer, it would be interesting to see if MS sends commands to the drives after a disk write or a set of disk writes. Also, I would like to see this test performed on NTFS and FAT32, and see if you are more likely to lose data on FAT32.
"Magnus Hagander" <mha@sollentuna.net> writes: > * Linux, with fsync (default), write-cache enabled: usually no data > corruption, but two runs which had Are you verifying that all the data that was committed was actually stored? Or just verifying that the database works properly after rebooting? I'm a bit surprised that the write-cache lead to a corrupt database, and not merely lost transactions. I had the impression that drives still handled the writes in the order received. You may find that if you check this case again that the "usually no data corruption" is actually "usually lost transactions but no corruption". -- greg
Greg Stark <gsstark@mit.edu> writes:
> I'm a bit surprised that the write-cache led to a corrupt database, and not
> merely lost transactions. I had the impression that drives still handled the
> writes in the order received.

There'd be little point in having a cache if they did, I should think.
I thought the point of the cache was to allow the disk to schedule I/O
in an order that minimizes seek time (ie, such a disk has got its own
elevator queue or similar).

> You may find that if you check this case again that the "usually no data
> corruption" is actually "usually lost transactions but no corruption".

That's a good point, but it seems difficult to be sure of the last
reportedly-committed transaction in a powerfail situation. Maybe if you
drive the test from a client on another machine?

regards, tom lane
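[Editorial sketch] The two views in this exchange lead to different failure modes. If a drive drains its cache in issue order, a power failure loses only a tail of recent writes. If it reorders by sector, a later write can reach the platter while an earlier one does not, which is what turns lost transactions into corruption. The following is a toy Python simulation of that difference. It models no real drive firmware; the one-pass ascending-sector "elevator" and the names `surviving_writes` and `has_gap` are assumptions made for illustration.

```python
def surviving_writes(issued, completed_before_powerfail, reorder):
    """Return the sequence numbers of writes that reached the platter.

    issued: list of (seq, sector) pairs in the order the host issued them,
            with seq numbered 1, 2, 3, ...
    completed_before_powerfail: how many cached writes the drive finished
    reorder: if True, the drive services its cache in ascending sector
             order (a simple elevator); if False, in issue order
    """
    queue = sorted(issued, key=lambda w: w[1]) if reorder else list(issued)
    return sorted(seq for seq, _ in queue[:completed_before_powerfail])

def has_gap(survivors):
    """True if a later write survived while an earlier one did not."""
    return survivors != list(range(1, len(survivors) + 1))
```

With `issued = [(1, 900), (2, 100), (3, 500), (4, 50)]` and two writes completed, the in-order drive keeps writes 1 and 2, while the reordering drive keeps writes 2 and 4, so `has_gap` is true only in the second case.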
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > I'm a bit surprised that the write-cache led to a corrupt database, and not
> > merely lost transactions. I had the impression that drives still handled the
> > writes in the order received.
>
> There'd be little point in having a cache if they did, I should think.
> I thought the point of the cache was to allow the disk to schedule I/O
> in an order that minimizes seek time (ie, such a disk has got its own
> elevator queue or similar).

If that were the case then SCSI drives that ship with write caching disabled
and using tagged command queuing instead would perform poorly.

I think the main motivation for write caching on IDE drives is that the IDE
protocol forces commands to be issued synchronously. So you can't send a
second command until the first command has completed. Without write caching
that limits the write bandwidth tremendously.

Write caching is being used here as a poor man's TCQ.

--
greg
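[Editorial sketch] A rough back-of-the-envelope model shows why strictly synchronous commands limit write bandwidth so badly without a cache. If each write must be on the platter before the next command can be sent, consecutive writes to adjacent sectors each wait roughly one full rotation. The Python sketch below encodes only that simplification. The 7200 rpm figure and the 8192-byte write size used in the example are assumptions chosen for illustration; neither appears in the thread, and real drives will differ.

```python
def max_sync_writes_per_second(rpm):
    """Upper bound on back-to-back synchronous writes to adjacent sectors,
    assuming each write costs one full platter rotation."""
    return rpm / 60.0

def sync_write_bandwidth(rpm, bytes_per_write):
    """Bytes per second achievable under the one-write-per-rotation model."""
    return max_sync_writes_per_second(rpm) * bytes_per_write
```

Under these assumptions, `max_sync_writes_per_second(7200)` is 120.0, and `sync_write_bandwidth(7200, 8192)` is 983040.0 bytes per second, a little under 1 MB/s.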