From: Greg Smith
Subject: Re: Maximum transaction rate
Msg-id: alpine.GSO.2.01.0903171553240.1708@westnet.com
In response to: Re: Maximum transaction rate (Marco Colombo <pgsql@esiway.net>)
List: pgsql-general
On Tue, 17 Mar 2009, Marco Colombo wrote:

> If LVM/dm is lying about fsync(), all this is moot. There's no point
> talking about disk caches.

I decided to run some tests to see what's going on there, and it looks
like some of my quick criticism of LVM might not actually be valid--it's
only the performance that is problematic, not necessarily the reliability.
It appears to support fsync just fine.  I tested with kernel 2.6.22, so
certainly without any of the recent changes to LVM behavior improving this
area, but with the bugs from earlier kernels squashed (like the
crummy HPA support circa 2.6.18-2.6.19, see
https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 )

You can do a quick test of the fsync rate using sysbench; I got the idea from
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
(their command has some typos; a fixed one is below)

If fsync is working properly, you'll get something near the RPM rate of
the disk.  If it's lying, you'll see a much higher number.
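If you don't have sysbench handy, the same sanity check can be approximated
in a few lines of Python.  This is my own rough sketch, not sysbench (the
fsync_rate helper is made up, and it does sequential rather than random
writes), but the honest-vs-lying distinction shows up the same way:

```python
import os
import tempfile
import time

def fsync_rate(n=200, size=8192):
    """Time n write+fsync cycles on a scratch file; return requests/sec."""
    fd, path = tempfile.mkstemp()
    try:
        block = b"\0" * size
        start = time.time()
        for _ in range(n):
            os.write(fd, block)
            os.fsync(fd)  # ask the kernel to push the data to the platter
        elapsed = time.time() - start
        return n / elapsed
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    # Near the disk's rotation rate (~120/sec for 7200 RPM) means fsync
    # is honest; thousands per second means something in the stack is lying.
    print("%.2f requests/sec" % fsync_rate())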

I couldn't get the current sysbench-0.4.11 to compile (a bunch of
complaints from libtool), but the old 0.4.8 I had around still works fine.
Let's start with a regular ext3 volume.  Here's what I see against a 7200
RPM disk (=120 rotations/second) with the default caching turned on:

$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep \"Requests/sec\""
$ fsynctest
  6469.36 Requests/sec executed

That's clearly lying as expected (I ran all these a couple of times and am
just reporting one result for brevity's sake; snipped some other redundant
stuff too).  I followed the suggestions at
http://www.postgresql.org/docs/current/static/wal-reliability.html to turn
off the cache and tested again:

$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
            *    Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf

/dev/sdf:
  setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
                 Write cache
$ fsynctest
   106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
  6469.36 Requests/sec executed

Great:  I was expecting ~120 commits/sec from a 7200 RPM disk, and that's
what I get when caching is off.

Now, let's switch to using an LVM volume on a different partition of
that disk, and run the same test to see if anything changes.

$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
  6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
   112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
  6499.11 Requests/sec executed

Based on this test, it looks to me like fsync works fine on LVM.  It must
be passing that down to the physical disk correctly, or I'd still be seeing
inflated rates.  If you've got a physical disk that lies about fsync, and
you put a database on it, you're screwed whether or not you use LVM;
nothing is different on LVM compared with the regular case.  A
battery-backed caching controller should also handle fsync fine if it
turns off the physical disk cache, which most of them do--and, again,
you're no more or less exposed to that particular problem with LVM than
with a regular filesystem.

The thing that barriers help with is that they make it possible to
optimize flushing ext3 journal metadata when combined with hard drives
that support the appropriate cache flushing mechanism (what hdparm calls
"FLUSH CACHE EXT"; see

http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html

).  That way you can prioritize flushing just the metadata needed to
prevent filesystem corruption while still fully caching less critical
regular old writes.  In that situation, performance could be greatly
improved over turning off caching altogether.  However, in the PostgreSQL
case, the fsync hammer doesn't benefit from this optimization anyway--all
the database writes are going to get forced out by it no matter what
before the database considers them reliable.  Proper barrier support
might be helpful in the case where you're using a database on a shared
disk that has other files being written to as well, basically allowing
caching on those while forcing the database blocks to physical disk, but
that presumes the Linux fsync implementation is more sophisticated than I
believe it currently is.

As far as I can tell, the main open question I didn't directly test here
is whether LVM does any write reordering that can impact database use
because it doesn't handle write barriers properly.  According to
https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does
not, and I never got the impression that was impacted by the LVM layer
before.  The concern is nicely summarized by the comment from Xman at
http://lwn.net/Articles/283161/ :

"fsync will block until the outstanding requests have been sync'd to
disk, but it doesn't guarantee that subsequent I/O's to the same fd won't
potentially also get completed, and potentially ahead of the I/O's
submitted prior to the fsync. In fact it can't make such guarantees
without functioning barriers."

Since we know LVM does not have functioning barriers, this would seem to
be one area where PostgreSQL would be vulnerable.  But since ext3 doesn't
have barriers turned on by default either (except on some recent SuSE
systems), it's not unique to an LVM setup, and if this were really a
problem it would be nailing people everywhere.  I believe the WAL design
handles this situation.

There are some known limitations to Linux fsync that I remain somewhat
concerned about, independently of LVM, like "ext3 fsync() only does a
journal commit when the inode has changed" (see
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).  The
way files are preallocated, the PostgreSQL WAL is supposed to function
just fine even if you're using fdatasync after WAL writes, which also
wouldn't touch the journal (last time I checked, fdatasync was implemented
as a full fsync on Linux).  Since the new ext4 is more aggressive at
delaying writes than ext3, it will be interesting to see if that uncovers
some subtle race conditions here that have been lying dormant so far.
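The preallocation trick is easy to see in miniature.  The sketch below is my
own illustration, not PostgreSQL code (the helper names preallocate_wal and
wal_write are made up): the segment is zero-filled and fsync'd once up front,
so later in-place overwrites never change the file size, and fdatasync has no
inode metadata it needs to push through the journal:

```python
import os

def preallocate_wal(path, segment_size=16 * 1024 * 1024):
    """Zero-fill a WAL-style segment so later writes never change its size."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"\0" * segment_size)
        os.fsync(fd)  # size and other metadata are now durable, once
    finally:
        os.close(fd)

def wal_write(path, offset, record):
    """Overwrite in place; fdatasync can skip the unchanged inode metadata."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, record, offset)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

Whether fdatasync actually skips the journal commit depends on the
filesystem and kernel version, which is exactly the behavior in question
above.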

I leave it as an exercise to the dedicated reader to modify the sysbench
test to use O_SYNC/O_DIRECT in order to re-test LVM for the situation
where you've changed wal_sync_method=open_sync; how to do that is
mentioned briefly at http://sysbench.sourceforge.net/docs/
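For a quick approximation of the open_sync case without sysbench, the core
of it is just opening the file with O_SYNC so every write() is synchronous.
This is my own sketch (the osync_rate helper is made up, and I'm not
showing O_DIRECT, which adds buffer-alignment requirements on top of this):

```python
import os
import tempfile
import time

def osync_rate(n=200, size=8192):
    """Write n blocks through an O_SYNC descriptor; return writes/sec."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # O_SYNC makes each write() durable on return, which is roughly what
    # wal_sync_method=open_sync asks for, with no separate fsync call.
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    try:
        block = b"\0" * size
        start = time.time()
        for _ in range(n):
            os.write(fd, block)
        return n / (time.time() - start)
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    print("%.2f writes/sec" % osync_rate())
```

As with the fsync test, a rate near the disk's rotation rate suggests the
sync is real; a wildly higher number suggests a cache is absorbing it.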

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
