Re: Maximum transaction rate - Mailing list pgsql-general

From Marco Colombo
Subject Re: Maximum transaction rate
Msg-id 49C03EB2.3080706@esiway.net
In response to Re: Maximum transaction rate  (Greg Smith <gsmith@gregsmith.com>)
Responses Re: Maximum transaction rate  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-general
Greg Smith wrote:
> On Tue, 17 Mar 2009, Marco Colombo wrote:
>
>> If LVM/dm is lying about fsync(), all this is moot. There's no point
>> talking about disk caches.
>
> I decided to run some tests to see what's going on there, and it looks
> like some of my quick criticism of LVM might not actually be valid--it's
> only the performance that is problematic, not necessarily the
> reliability. Appears to support fsync just fine.  I tested with kernel
> 2.6.22, so certainly not before the recent changes to LVM behavior
> improving this area, but with the bugs around here from earlier kernels
> squashed (like crummy HPA support circa 2.6.18-2.6.19, see
> https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 )

I've run tests too; you can see them here:
https://www.redhat.com/archives/linux-lvm/2009-March/msg00055.html
in case you're looking for something trivial (write/fsync loop).
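
For readers who don't want to follow the link, a write/fsync loop of that kind can be sketched in a few lines of Python. This is not the program from the linked post, just a minimal illustration; the function name fsync_rate and its defaults are made up for this example.

```python
import os
import tempfile
import time


def fsync_rate(path, iterations=200, payload=b"x" * 512):
    """Write a small block and fsync it `iterations` times; return fsyncs/second.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, payload)
            os.fsync(fd)  # returns once the kernel reports the data as flushed
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return iterations / elapsed


if __name__ == "__main__":
    # The temporary file is created in the current directory, so run this
    # from the filesystem you want to measure.
    with tempfile.NamedTemporaryFile(dir=".", delete=False) as tmp:
        name = tmp.name
    try:
        print("%.0f fsyncs/sec" % fsync_rate(name))
    finally:
        os.unlink(name)
```

Run it once with the drive's write cache enabled and once with it disabled, and compare the two rates.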

> You can do a quick test of fsync rate using sysbench; got the idea from
> http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
> (their command has some typos, fixed one below)
>
> If fsync is working properly, you'll get something near the RPM rate of
> the disk.  If it's lying, you'll see a much higher number.

Same results. -W1 gives a 50x speedup, so it must be waiting for something
at disk level with -W0.
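
The "near the RPM rate" rule of thumb can be written down as arithmetic. The sketch below assumes a single writer that has to wait for roughly one platter rotation per fsync; the function names, the slack factor and the sample figures are invented for illustration and are not measurements from this thread.

```python
def rotation_limit(rpm):
    """Rotations per second: a rough ceiling for one-at-a-time fsyncs."""
    return rpm / 60.0


def looks_cached(measured_per_sec, rpm, slack=2.0):
    """True if the measured fsync rate is well above what the platters allow.

    `slack` is an arbitrary safety factor chosen for this sketch.
    """
    return measured_per_sec > slack * rotation_limit(rpm)


print(rotation_limit(7200))      # 120.0
print(looks_cached(110, 7200))   # False: plausible for a real flush
print(looks_cached(6000, 7200))  # True: something is caching the writes
```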

[...]

> Based on this test, it looks to me like fsync works fine on LVM.  It
> must be passing that down to the physical disk correctly or I'd still be
> seeing inflated rates.  If you've got a physical disk that lies about
> fsync, and you put a database on it, you're screwed whether or not you
> use LVM; nothing different on LVM than in the regular case.  A
> battery-backed caching controller should also handle fsync fine if it
> turns off the physical disk cache, which most of them do--and, again,
> you're no more or less exposed to that particular problem with LVM than
> a regular filesystem.

That was my initial understanding.

> The thing that barriers helps out with is that it makes it possible to
> optimize flushing ext3 journal metadata when combined with hard drives
> that support the appropriate cache flushing mechanism (what hdparm calls
> "FLUSH CACHE EXT"; see
> http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html
> ).  That way you can prioritize flushing just the metadata needed to
> prevent filesystem corruption while still fully caching less critical
> regular old writes.  In that situation, performance could be greatly
> improved over turning off caching altogether.  However, in the
> PostgreSQL case, the fsync hammer doesn't appreciate this optimization
> anyway--all the database writes are going to get forced out by that no
> matter what before the database considers them reliable.  Proper
> barriers support might be helpful in the case where you're using a
> database on a shared disk that has other files being written to as well,
> basically allowing caching on those while forcing the database blocks to
> physical disk, but that presumes the Linux fsync implementation is more
> sophisticated than I believe it currently is.

This is the same conclusion I came to. Moreover, once you have barriers
passed down to the disks, it would be nice to have a userland API to send
them to the kernel. Any application managing a 'journal' or 'log' type
of object would benefit from that. I'm not familiar with PG internals,
but it's likely you have some records you just want to be ordered, and
you could do something like write-barrier-write-barrier-...-fsync instead of
write-fsync-write-fsync-... Currently fsync() (and friends: O_SYNC,
fdatasync(), O_DSYNC) is the only way to enforce ordering on writes
from userland.
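
To make the two patterns concrete, here is a rough Python sketch. The helper names are hypothetical, and since no userland barrier call exists, the second function only marks with a comment where one would go.

```python
import os


def write_all(fd, data):
    """Write all of `data`, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_ordered(fd, records):
    """write-fsync-write-fsync: each record is flushed before the next is written."""
    for rec in records:
        write_all(fd, rec)
        os.fsync(fd)  # the only ordering point available from userland


def append_batched(fd, records):
    """write-...-fsync: one flush at the end, no ordering among the records."""
    for rec in records:
        write_all(fd, rec)
        # A userland write barrier would go here, if such a call existed.
    os.fsync(fd)
```

The first form pays one full flush per record; the second pays one flush in total but, without barriers, promises nothing about the order in which the records reach the platters.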

> Far as I can tell, the main open question I didn't directly test here is
> whether LVM does any write reordering that can impact database use
> because it doesn't handle write barriers properly.  According to
> https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it
> does not, and I never got the impression that was impacted by the LVM
> layer before.  The concern is nicely summarized by the comment from Xman
> at http://lwn.net/Articles/283161/ :
>
> "fsync will block until the outstanding requests have been sync'd do
> disk, but it doesn't guarantee that subsequent I/O's to the same fd
> won't potentially also get completed, and potentially ahead of the I/O's
> submitted prior to the fsync. In fact it can't make such guarantees
> without functioning barriers."

Sure, but from userland you can't set barriers. If you fsync() after each
write you want ordered, there can't be any "subsequent I/O" (unless
there are many different processes concurrently writing to the file
w/o synchronization).

> Since we know LVM does not have functioning barriers, this would seem to
> be one area where PostgreSQL would be vulnerable.  But since ext3
> doesn't have barriers turned on by default either (except some recent SuSE
> system), it's not unique to a LVM setup, and if this were really a
> problem it would be nailing people everywhere.  I believe the WAL design
> handles this situation.

Well well. Ext3 is definitely in the lucky area. The journal on most ext3
instances is contiguous on disk. The disk doesn't reorder requests simply
because they are already in order... only when the journal wraps around is
there an (extremely) small window of vulnerability. You need to write a
carefully crafted torture program to get any chance of observing that... such
a program exists, and triggers the problem (leaving an inconsistent fs) almost
50% of the time. But it's extremely unlikely you'll see it happen in real
workloads.

http://lwn.net/Articles/283168/

.TM.
