Re: Maximum transaction rate - Mailing list pgsql-general
From | Greg Smith |
---|---|
Subject | Re: Maximum transaction rate |
Msg-id | alpine.GSO.2.01.0903171553240.1708@westnet.com |
In response to | Re: Maximum transaction rate (Marco Colombo <pgsql@esiway.net>) |
List | pgsql-general |
On Tue, 17 Mar 2009, Marco Colombo wrote:

> If LVM/dm is lying about fsync(), all this is moot. There's no point
> talking about disk caches.

I decided to run some tests to see what's going on there, and it looks like some of my quick criticism of LVM might not actually be valid--it's only the performance that is problematic, not necessarily the reliability. It appears to support fsync just fine. I tested with kernel 2.6.22, so certainly not before the recent changes to LVM behavior improving this area, but with the bugs from earlier kernels squashed (like the crummy HPA support circa 2.6.18-2.6.19, see https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 ).

You can do a quick test of fsync rate using sysbench; I got the idea from http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ (their command has some typos; a fixed one is below). If fsync is working properly, you'll get something near the rotation rate of the disk. If it's lying, you'll see a much higher number. I couldn't get the current sysbench-0.4.11 to compile (a bunch of X complaints from libtool), but the old 0.4.8 I had around still works fine.

Let's start with a regular ext3 volume. Here's what I see against a 7200 RPM disk (=120 rotations/second) with the default caching turned on:

$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep \"Requests/sec\""
$ fsynctest
6469.36 Requests/sec executed

That's clearly lying as expected (I ran all these a couple of times and am just reporting one for brevity's sake; I snipped some other redundant output too).
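A quick sanity check on those numbers (a sketch; the 7200 RPM figure is just the drive used in this test): an honest fsync on a spinning disk can complete at most about once per platter rotation, so the ceiling is roughly RPM/60.

```shell
# Sketch: the honest-fsync ceiling for a spinning disk is roughly one
# commit per platter rotation, i.e. RPM / 60 per second.
rpm=7200
ceiling=$((rpm / 60))
echo "expected ceiling: ${ceiling} commits/sec"

# A measured rate far above that ceiling (like the 6469/sec above)
# means the write cache, not the platter, acknowledged the fsync.
observed=6469
if [ "$observed" -gt $((ceiling * 10)) ]; then
    echo "write cache is lying about fsync"
fi
```

The 10x margin is arbitrary; in practice the lying case is orders of magnitude over the ceiling, so anything in between is unambiguous.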
I followed the suggestions at http://www.postgresql.org/docs/current/static/wal-reliability.html to turn off the cache and tested again:

$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
   *    Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf
/dev/sdf:
 setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
        Write cache
$ fsynctest
106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6469.36 Requests/sec executed

Great: I was expecting ~120 commits/sec from a 7200 RPM disk, and that's what I get when caching is off. Now, let's switch to an LVM volume on a different partition of that disk and run the same test to see if anything changes:

$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6499.11 Requests/sec executed

Based on this test, it looks to me like fsync works fine on LVM. It must be passing that down to the physical disk correctly, or I'd still be seeing inflated rates. If you've got a physical disk that lies about fsync and you put a database on it, you're screwed whether or not you use LVM; nothing is different on LVM than in the regular case. A battery-backed caching controller should also handle fsync fine if it turns off the physical disk cache, which most of them do--and, again, you're no more or less exposed to that particular problem with LVM than with a regular filesystem.

The thing barriers help with is making it possible to flush just the ext3 journal metadata, when combined with hard drives that support the appropriate cache flushing mechanism (what hdparm calls "FLUSH CACHE EXT"; see http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html ).
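If you want to compare several of these runs in a script, the rate can be pulled out of sysbench's output line mechanically. A small sketch (the sample strings mimic the 0.4.x output shown above; the 10x-over-ceiling threshold is an arbitrary choice of mine, and `check_rate` is a hypothetical helper, not part of sysbench):

```shell
# Extract the numeric rate from a sysbench "Requests/sec executed" line
# and flag results a spinning disk could not honestly produce.
check_rate() {
    rate=$(printf '%s\n' "$1" | awk '/Requests\/sec/ {print int($1)}')
    ceiling=$(( $2 / 60 ))                  # ~1 honest fsync per rotation
    if [ "$rate" -gt $((ceiling * 10)) ]; then
        echo "suspicious: ${rate}/sec vs ~${ceiling}/sec ceiling"
    else
        echo "plausible: ${rate}/sec"
    fi
}

check_rate "6502.67 Requests/sec executed" 7200   # LVM, cache on
check_rate "112.78 Requests/sec executed" 7200    # LVM, cache off
```

Run against the LVM numbers above, the first call reports suspicious and the second plausible, which is exactly the cache-on/cache-off split you'd expect.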
That way you can prioritize flushing just the metadata needed to prevent filesystem corruption while still fully caching less critical, regular old writes. In that situation, performance could be greatly improved over turning off caching altogether. However, in the PostgreSQL case, the fsync hammer doesn't benefit from this optimization anyway--all the database writes are going to get forced out by that no matter what, before the database considers them reliable. Proper barrier support might be helpful in the case where you're using a database on a shared disk that has other files being written to as well, basically allowing caching on those while forcing the database blocks to physical disk, but that presumes the Linux fsync implementation is more sophisticated than I believe it currently is.

As far as I can tell, the main open question I didn't directly test here is whether LVM does any write reordering that can impact database use because it doesn't handle write barriers properly. According to https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does not, and I never got the impression that was impacted by the LVM layer before. The concern is nicely summarized by the comment from Xman at http://lwn.net/Articles/283161/ :

"fsync will block until the outstanding requests have been sync'd to disk, but it doesn't guarantee that subsequent I/O's to the same fd won't potentially also get completed, and potentially ahead of the I/O's submitted prior to the fsync. In fact it can't make such guarantees without functioning barriers."

Since we know LVM does not have functioning barriers, this would seem to be one area where PostgreSQL would be vulnerable. But since ext3 doesn't have barriers turned on by default either (except on some recent SuSE systems), it's not unique to an LVM setup, and if this were really a problem it would be nailing people everywhere. I believe the WAL design handles this situation.
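The reason the WAL design tolerates that reordering risk is the ordering discipline itself: a data page is never modified until the log record describing the change has been forced to stable storage. A minimal sketch of that pattern (the file names are hypothetical, and `sync FILE` here stands in for the fsync call PostgreSQL actually issues; it needs a reasonably recent coreutils):

```shell
wal=/tmp/demo_wal
data=/tmp/demo_data
rm -f "$wal" "$data"

# 1. Append the log record describing the intended change...
echo "UPDATE page=1 old=A new=B" >> "$wal"

# 2. ...and force it to stable storage BEFORE touching the data page.
sync "$wal"

# 3. Only now modify the data page. If we crash between steps 2 and 3,
#    replaying the WAL record reconstructs the page; if we crash before
#    step 2, the change simply never happened.
echo "B" > "$data"
sync "$data"
```

Because every commit waits on step 2's flush, a write that gets reordered behind the fsync can only be a data-page write, which recovery can redo from the log anyway.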
There are some known limitations to Linux fsync that I remain somewhat concerned about, independently of LVM, like "ext3 fsync() only does a journal commit when the inode has changed" (see http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). Because of the way its files are preallocated, the PostgreSQL WAL is supposed to function just fine even if you're using fdatasync after WAL writes, which also wouldn't touch the journal (although last time I checked, fdatasync was implemented as a full fsync on Linux). Since the new ext4 is more aggressive at delaying writes than ext3, it will be interesting to see if that uncovers some subtle race conditions here that have been lying dormant so far.

I leave it as an exercise to the dedicated reader to modify the sysbench test to use O_SYNC/O_DIRECT, in order to re-test LVM for the situation you'd get by setting wal_sync_method=open_sync; how to do that is mentioned briefly at http://sysbench.sourceforge.net/docs/

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD