Thread: Maximum transaction rate
I'm using postgresql 8.3.6 through JDBC, and trying to measure the maximum transaction rate on a given Linux box. I wrote a test program that: - Creates a table with two int columns and no indexes, - loads the table through a configurable number of threads, with each transaction writing one row and then committing, (auto commit is false), and - reports transactions/sec. The postgres configuration regarding syncing is standard: fsync = on, synchronous_commit = on, wal_sync_method = fsync. My linux kernel is 2.6.27.19-78.2.30.fc9.i686. The transaction rates I'm getting seem way too high: 2800-2900 with one thread, 5000-7000 with ten threads. I'm guessing that writes aren't really reaching the disk. Can someone suggest how to figure out where, below postgres, someone is lying about writes reaching the disk? Jack
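A quick way to sanity-check this from outside PostgreSQL - a minimal sketch only, with an arbitrary file name and loop count, not the JDBC test described above - is to time a bare write()/fsync() loop on the same filesystem. If the reported rate is far above the drive's rotations per second, something below the filesystem is absorbing the flushes:
==========================================================
/*
** fsync_rate.c - rough fsyncs-per-second probe (sketch only).
** Build with: cc -o fsync_rate fsync_rate.c
*/
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: fsync_rate <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) { perror("open"); exit(1); }

    const int loops = 1000;
    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < loops; i++) {
        char byte = 0;
        /* rewrite the same byte and force it out, like a tiny commit */
        if (pwrite(fd, &byte, 1, 0) != 1) { perror("pwrite"); exit(1); }
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    /* a 7200 RPM disk that really waits for the platters can only do ~120/sec */
    printf("%.0f fsyncs/sec\n", loops / secs);
    return 0;
}
==========================================================
On hardware that honors fsync, the result should be near the disk's rotation rate; thousands per second on a single plain disk means some layer is caching the flush.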
Jack Orenstein <jack.orenstein@hds.com> writes: > The transaction rates I'm getting seem way too high: 2800-2900 with > one thread, 5000-7000 with ten threads. I'm guessing that writes > aren't really reaching the disk. Can someone suggest how to figure out > where, below postgres, someone is lying about writes reaching the > disk? AFAIK there are two trouble sources in recent Linux machines: LVM and the disk drive itself. LVM is apparently broken by design --- it simply fails to pass fsync requests. If you're using it you have to stop. (Which sucks, because it's exactly the kind of thing DBAs tend to want.) Otherwise you need to reconfigure your drive to not cache writes. I forget the incantation for that but it's in the PG list archives. regards, tom lane
On Fri, 6 Mar 2009, Tom Lane wrote: > Otherwise you need to reconfigure your drive to not cache writes. > I forget the incantation for that but it's in the PG list archives. There's a discussion of this in the docs now, http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html hdparm -I lets you check if write caching is on, hdparm -W lets you toggle it off. That's for ATA disks; SCSI ones can use sdparm instead, but usually it's something you can adjust more permanently in the card configuration or BIOS for those. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 6 Mar 2009, Greg Smith wrote: > On Fri, 6 Mar 2009, Tom Lane wrote: > >> Otherwise you need to reconfigure your drive to not cache writes. >> I forget the incantation for that but it's in the PG list archives. > > There's a dicussion of this in the docs now, > http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html How does turning off write caching on the disk stop the problem with LVM? It still seems like you have to get the data out of the OS buffer, and if fsync() doesn't do that for you....
On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote: > On Fri, 6 Mar 2009, Greg Smith wrote: > >> On Fri, 6 Mar 2009, Tom Lane wrote: >> >>> Otherwise you need to reconfigure your drive to not cache writes. >>> I forget the incantation for that but it's in the PG list archives. >> >> There's a dicussion of this in the docs now, >> http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html > > How does turning off write caching on the disk stop the problem with LVM? It > still seems like you have to get the data out of the OS buffer, and if > fsync() doesn't do that for you.... I think he was saying otherwise (if you're not using LVM and you still have this super high transaction rate) you'll need to turn off the drive's write caches. I kinda wondered at it for a second too.
On Fri, 6 Mar 2009, Ben Chobot wrote: > How does turning off write caching on the disk stop the problem with LVM? It doesn't. Linux LVM is awful and broken, I was just suggesting more details on what you still need to check even when it's not involved. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Scott Marlowe wrote: > On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote: >> On Fri, 6 Mar 2009, Greg Smith wrote: >> >>> On Fri, 6 Mar 2009, Tom Lane wrote: >>> >>>> Otherwise you need to reconfigure your drive to not cache writes. >>>> I forget the incantation for that but it's in the PG list archives. >>> There's a dicussion of this in the docs now, >>> http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html >> How does turning off write caching on the disk stop the problem with LVM? It >> still seems like you have to get the data out of the OS buffer, and if >> fsync() doesn't do that for you.... > > I think he was saying otherwise (if you're not using LVM and you still > have this super high transaction rate) you'll need to turn off the > drive's write caches. I kinda wondered at it for a second too. > And I'm still wondering. The problem with LVM, AFAIK, is missing support for write barriers. Once you disable the write-back cache on the disk, you no longer need write barriers. So I'm missing something, what else does LVM do to break fsync()? It was my understanding that disabling disk caches was enough. .TM.
Marco Colombo <pgsql@esiway.net> writes: > And I'm still wondering. The problem with LVM, AFAIK, is missing support > for write barriers. Once you disable the write-back cache on the disk, > you no longer need write barriers. So I'm missing something, what else > does LVM do to break fsync()? I think you're imagining that the disk hardware is the only source of write reordering, which isn't the case ... various layers in the kernel can reorder operations before they get sent to the disk. regards, tom lane
Tom Lane wrote: > Marco Colombo <pgsql@esiway.net> writes: >> And I'm still wondering. The problem with LVM, AFAIK, is missing support >> for write barriers. Once you disable the write-back cache on the disk, >> you no longer need write barriers. So I'm missing something, what else >> does LVM do to break fsync()? > > I think you're imagining that the disk hardware is the only source of > write reordering, which isn't the case ... various layers in the kernel > can reorder operations before they get sent to the disk. > > regards, tom lane You mean some layer (LVM) is lying about the fsync()?
write(A); fsync(); ... write(B); fsync(); ... write(C); fsync();
You mean that the process may be awakened after the first fsync() while A is still somewhere in OS buffers and not sent to disk yet, so it's possible that B gets to the disk BEFORE A. And if the system crashes, A never hits the platters while B (possibly) does. Is this what you mean by "write reordering"? But doesn't this break any application with transactional-like behavior, such as sendmail? The problem being third parties: if sendmail declares "ok, I saved the message" (*after* a fsync()) to the SMTP client, it's actually lying 'cause the message hasn't hit the platters yet. The same applies to an IMAP/POP server, say. Well, it applies to anything using fsync(). I mean, all this with disk caches in write-thru mode? It's the OS lying, not the disks? Wait, this breaks all journaled FSes as well: a DM device is just a block device to them, and if it's lying about synchronous writes the whole purpose of the journal is defeated... I find it hard to believe, I have to say. .TM.
Marco Colombo <pgsql@esiway.net> writes: > You mean some layer (LVM) is lying about the fsync()? Got it in one. regards, tom lane
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: > Marco Colombo <pgsql@esiway.net> writes: > > You mean some layer (LVM) is lying about the fsync()? > > Got it in one. > I wouldn't think this would be a problem with the proper battery backed raid controller correct? Joshua D. Drake > regards, tom lane > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 13 Mar 2009, Joshua D. Drake wrote: > On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: >> Marco Colombo <pgsql@esiway.net> writes: >>> You mean some layer (LVM) is lying about the fsync()? >> >> Got it in one. >> > > I wouldn't think this would be a problem with the proper battery backed > raid controller correct? It seems to me that all you get with a BBU-enabled card is the ability to get bursts of writes out of the OS faster. So you still have the problem, it's just less likely to be encountered.
On Fri, 2009-03-13 at 11:17 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > > > On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: > >> Marco Colombo <pgsql@esiway.net> writes: > >>> You mean some layer (LVM) is lying about the fsync()? > >> > >> Got it in one. > >> > > > > I wouldn't think this would be a problem with the proper battery backed > > raid controller correct? > > It seems to me that all you get with a BBU-enabled card is the ability to > get burts of writes out of the OS faster. So you still have the problem, > it's just less like to be encountered. A BBU controller is about more than that. It is also supposed to be about data integrity. The ability to have unexpected outages and have the drives stay consistent because the controller remembers the state (if that is a reasonable way to put it). Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 13 Mar 2009, Joshua D. Drake wrote: >> It seems to me that all you get with a BBU-enabled card is the ability to >> get burts of writes out of the OS faster. So you still have the problem, >> it's just less like to be encountered. > > A BBU controller is about more than that. It is also supposed to be > about data integrity. The ability to have unexpected outages and have > the drives stay consistent because the controller remembers the state > (if that is a reasonable way to put it). Of course. But if you can't reliably flush the OS buffers (because, say, you're using LVM so fsync() doesn't work), then you can't say what actually has made it to the safety of the raid card.
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > Of course. But if you can't reliably flush the OS buffers (because, say, > you're using LVM so fsync() doesn't work), then you can't say what > actually has made it to the safety of the raid card. Good point. So the next question of course is, does EVMS do it right? http://evms.sourceforge.net/ This is actually a pretty significant issue. Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > > >> It seems to me that all you get with a BBU-enabled card is the ability to > >> get burts of writes out of the OS faster. So you still have the problem, > >> it's just less like to be encountered. > > > > A BBU controller is about more than that. It is also supposed to be > > about data integrity. The ability to have unexpected outages and have > > the drives stay consistent because the controller remembers the state > > (if that is a reasonable way to put it). > > Of course. But if you can't reliably flush the OS buffers (because, say, > you're using LVM so fsync() doesn't work), then you can't say what > actually has made it to the safety of the raid card. Wait, actually a good BBU RAID controller will disable the cache on the drives. So everything that is cached is already on the controller vs. the drives itself. Or am I missing something? Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote: > Wait, actually a good BBU RAID controller will disable the cache on > the > drives. So everything that is cached is already on the controller vs. > the drives itself. > > Or am I missing something? Maybe I'm missing something, but a BBU controller moves the "safe point" from the platters to the controller, but it doesn't move it all the way into the OS. So, if the software calls fsync, but fsync doesn't actually push the data to the controller, you are still at risk... right?
On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote: > > On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote: >> >> Wait, actually a good BBU RAID controller will disable the cache on the >> drives. So everything that is cached is already on the controller vs. >> the drives itself. >> >> Or am I missing something? > > Maybe I'm missing something, but a BBU controller moves the "safe point" > from the platters to the controller, but it doesn't move it all the way into > the OS. > > So, if the software calls fsync, but fsync doesn't actually push the data to > the controller, you are still at risk... right? Ding!
Scott Marlowe wrote: > On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote: >> So, if the software calls fsync, but fsync doesn't actually push the data to >> the controller, you are still at risk... right? > > Ding! > I've been doing some googling; now I'm not sure that not supporting barriers implies not supporting (or lying about) blkdev_issue_flush(). It seems that it's pretty common (and well-defined) for block devices to report -EOPNOTSUPP at BIO_RW_BARRIER requests. Device mapper apparently falls in this category. See: http://lkml.org/lkml/2007/5/25/71 - this is an interesting discussion on barriers and flushing. It seems to me that PostgreSQL needs both ordered and synchronous writes, maybe at different times (not that EVERY write must be both ordered and synchronous). You can emulate ordered writes with single, synchronous ones, although at a price. You can't emulate synchronous writes with just barriers.
OPTIMAL: write-barrier-write-barrier-write-barrier-flush
SUBOPTIMAL: write-flush-write-flush-write-flush
As I understand it, fsync() should always issue a real flush: it's unrelated to the barriers issue. There's no API to issue ordered writes (or barriers) at user level, AFAIK. (Uhm... O_DIRECT maybe implies that?) FS code may internally issue barrier requests to the block device, for its own purposes (e.g. journal updates), but there's no userland API for that. Yet, there's no reference to DM not supporting flush correctly in the whole thread... actually there are references to the opposite. DM devices are defined as FLUSHABLE. Also see: http://lkml.org/lkml/2008/2/26/41 - but it seems to me that all this discussion is under the assumption that disks have write-back caches. "The alternative is to disable the disk write cache." says it all. .TM.
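To put those two patterns in userland terms - this is only a sketch, and the barrier() call mentioned in the comment is deliberately fictional, since no such system call is exposed to applications - the SUBOPTIMAL pattern is the only portable one available today:
==========================================================
/* ordered_writes.c - sketch: ordering records from userland today */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* SUBOPTIMAL but possible: every record pays for a full cache flush,
 * even when all we need is "don't reorder me with the next record". */
static int append_record(int fd, const char *rec)
{
    if (write(fd, rec, strlen(rec)) != (ssize_t) strlen(rec))
        return -1;
    return fsync(fd);   /* flush = ordering + durability; no cheaper call exists */
}

int main(void)
{
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    /* OPTIMAL would be: write A; barrier(); write B; barrier(); write C; fsync();
     * but there is no barrier() visible to applications, so we do this instead: */
    if (append_record(fd, "record A\n") != 0) return 1;
    if (append_record(fd, "record B\n") != 0) return 1;
    if (append_record(fd, "record C\n") != 0) return 1;

    close(fd);
    return 0;
}
==========================================================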
On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: > Scott Marlowe wrote: > Also see: > http://lkml.org/lkml/2008/2/26/41 > but it seems to me that all this discussion is under the assuption that > disks have write-back caches. > "The alternative is to disable the disk write cache." says it all. If this applies to RAID-based caches as well, then performance is going to completely tank for users of Linux + PostgreSQL using LVM. Joshua D. Drake > > .TM. > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Joshua D. Drake wrote: > On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: >> Scott Marlowe wrote: > >> Also see: >> http://lkml.org/lkml/2008/2/26/41 >> but it seems to me that all this discussion is under the assuption that >> disks have write-back caches. >> "The alternative is to disable the disk write cache." says it all. > > If this applies to raid based cache as well then performance is going to > completely tank. For users of Linux + PostgreSQL using LVM. > > Joshua D. Drake Yet that's not the point. The point is safety. I may have a lightly loaded database, with low write rate, but still I want it to be reliable. I just want to know if disabling the caches makes it reliable or not. People on LK seem to think it does. And it seems to me they may have a point. fsync() is a flush operation on the block device, not a write barrier. LVM doesn't pass write barriers down, but that doesn't mean it doesn't perform a flush when requested to. .TM.
On Sun, 2009-03-15 at 01:48 +0100, Marco Colombo wrote: > Joshua D. Drake wrote: > > On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: > >> Scott Marlowe wrote: > > > >> Also see: > >> http://lkml.org/lkml/2008/2/26/41 > >> but it seems to me that all this discussion is under the assuption that > >> disks have write-back caches. > >> "The alternative is to disable the disk write cache." says it all. > > > > If this applies to raid based cache as well then performance is going to > > completely tank. For users of Linux + PostgreSQL using LVM. > > > > Joshua D. Drake > > Yet that's not the point. The point is safety. I may have a lightly loaded > database, with low write rate, but still I want it to be reliable. I just > want to know if disabling the caches makes it reliable or not. I understand but disabling cache is not an option for anyone I know. So I need to know the other :) Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Joshua D. Drake wrote: > > I understand but disabling cache is not an option for anyone I know. So > I need to know the other :) > > Joshua D. Drake > Come on, how many people/organizations do you know who really need 30+ MB/s sustained write throughput in the disk subsystem but can't afford a battery-backed controller at the same time? Something must take care of writing the data in the disk cache to permanent storage; write-thru caches, battery-backed controllers, and write barriers are all alternatives; choose the one you like most. The problem here is fsync(). We know that not fsync()'ing gives you a big performance boost, but that's not the point. I want to choose, and I want a true fsync() when I ask for one. Because if the data doesn't even make it to the disk cache, the whole point about wt, bb and wb is moot. .TM.
Tom Lane wrote: > Jack Orenstein <jack.orenstein@hds.com> writes: >> The transaction rates I'm getting seem way too high: 2800-2900 with >> one thread, 5000-7000 with ten threads. I'm guessing that writes >> aren't really reaching the disk. Can someone suggest how to figure out >> where, below postgres, someone is lying about writes reaching the >> disk? > > AFAIK there are two trouble sources in recent Linux machines: LVM and > the disk drive itself. LVM is apparently broken by design --- it simply > fails to pass fsync requests. If you're using it you have to stop. > (Which sucks, because it's exactly the kind of thing DBAs tend to want.) > Otherwise you need to reconfigure your drive to not cache writes. > I forget the incantation for that but it's in the PG list archives. Hmm, are you sure this is what is happening? In my understanding LVM is not passing down barriers (generally - it seems to in some limited circumstances), which means it is not safe on any storage drive that has write cache enabled. This seems to be the very same issue Linux had for ages before ext3 got barrier support (not sure if even today all filesystems have that). So in my understanding LVM is safe on disks that have write cache disabled or "behave" as one (like a controller with a battery backed cache). For storage with write caches it seems to be unsafe, even if the filesystem supports barriers and it has them enabled (which I don't think all have), which is basically where all of Linux was not too long ago. Stefan
On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > So in my understanding LVM is safe on disks that have write cache disabled > or "behave" as one (like a controller with a battery backed cache). > For storage with write caches it seems to be unsafe, even if the filesystem > supports barriers and it has them enabled (which I don't think all have) > which is basically what all of linux was not too long ago. I definitely didn't have this problem with SCSI drives directly attached to a machine under pgsql on ext2 back in the day (way back, like 5 to 10 years ago). IDE / PATA drives, on the other hand, definitely suffered with having write caches enabled.
Stefan Kaltenbrunner wrote: > So in my understanding LVM is safe on disks that have write cache > disabled or "behave" as one (like a controller with a battery backed > cache). What about drive write caches on battery backed raid controllers? Do the controllers ensure the drive cache gets flushed prior to releasing the cached write blocks?
Scott Marlowe wrote: > On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner > <stefan@kaltenbrunner.cc> wrote: >> So in my understanding LVM is safe on disks that have write cache disabled >> or "behave" as one (like a controller with a battery backed cache). >> For storage with write caches it seems to be unsafe, even if the filesystem >> supports barriers and it has them enabled (which I don't think all have) >> which is basically what all of linux was not too long ago. > > I definitely didn't have this problem with SCSI drives directly > attached to a machine under pgsql on ext2 back in the day (way back, > like 5 to 10 years ago). IDE / PATA drives, on the other hand, > definitely suffered with having write caches enabled. I guess that's likely because most SCSI drives (at least back in the day) had write caches turned off by default (whereas IDE drives had them turned on). The Linux kernel docs actually have some stuff on the barrier implementation ( http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=Documentation/block/barrier.txt;hb=HEAD) which seems to explain some of the issues related to that. Stefan
John R Pierce wrote: > Stefan Kaltenbrunner wrote: >> So in my understanding LVM is safe on disks that have write cache >> disabled or "behave" as one (like a controller with a battery backed >> cache). > > what about drive write caches on battery backed raid controllers? do > the controllers ensure the drive cache gets flushed prior to releasing > the cached write blocks ? If LVM/dm is lying about fsync(), all this is moot. There's no point talking about disk caches. BTW, this discussion is continuing on the linux-lvm mailing list: https://www.redhat.com/archives/linux-lvm/2009-March/msg00025.html I have some PG databases on LVM systems, so I need to know for sure whether I have to move them elsewhere. It seemed to me the right place for asking about the issue. Someone there pointed out that fsync() is not LVM's responsibility. Correct. For sure, there's an API (or more than one) a filesystem uses to force a flush on the underlying block device, and for sure it has to be called while inside the fsync() system call. So "lying to fsync()" is maybe more correct than "lying about fsync()". .TM.
On Tue, 17 Mar 2009, Marco Colombo wrote: > If LVM/dm is lying about fsync(), all this is moot. There's no point > talking about disk caches. I decided to run some tests to see what's going on there, and it looks like some of my quick criticism of LVM might not actually be valid--it's only the performance that is problematic, not necessarily the reliability. Appears to support fsync just fine. I tested with kernel 2.6.22, so certainly not before the recent changes to LVM behavior improving this area, but with the bugs around here from earlier kernels squashed (like crummy HPA support circa 2.6.18-2.6.19, see https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 ) You can do a quick test of fsync rate using sysbench; got the idea from http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ (their command has some typos, fixed one below) If fsync is working properly, you'll get something near the RPM rate of the disk. If it's lying, you'll see a much higher number. I couldn't get the current sysbench-0.4.11 to compile (bunch of X complaints from libtool), but the old 0.4.8 I had around still works fine. Let's start with a regular ext3 volume. Here's what I see against a 7200 RPM disk (=120 rotations/second) with the default caching turned on:
$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep \"Requests/sec\""
$ fsynctest
6469.36 Requests/sec executed
That's clearly lying as expected (and I ran all these a couple of times, just reporting one for brevity's sake; snipped some other redundant stuff too). I followed the suggestions at http://www.postgresql.org/docs/current/static/wal-reliability.html to turn off the cache and tested again:
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
* Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf
/dev/sdf: setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
Write cache
$ fsynctest
106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6469.36 Requests/sec executed
Great: I was expecting ~120 commits/sec from a 7200 RPM disk, and that's what I get when caching is off. Now, let's switch to using an LVM volume on a different partition of that disk, and run the same test to see if anything changes.
$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6499.11 Requests/sec executed
Based on this test, it looks to me like fsync works fine on LVM. It must be passing that down to the physical disk correctly or I'd still be seeing inflated rates. If you've got a physical disk that lies about fsync, and you put a database on it, you're screwed whether or not you use LVM; nothing different on LVM than in the regular case. A battery-backed caching controller should also handle fsync fine if it turns off the physical disk cache, which most of them do--and, again, you're no more or less exposed to that particular problem with LVM than a regular filesystem. The thing that barriers help with is that they make it possible to optimize flushing ext3 journal metadata when combined with hard drives that support the appropriate cache flushing mechanism (what hdparm calls "FLUSH CACHE EXT"; see http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html ).
That way you can prioritize flushing just the metadata needed to prevent filesystem corruption while still fully caching less critical regular old writes. In that situation, performance could be greatly improved over turning off caching altogether. However, in the PostgreSQL case, the fsync hammer doesn't appreciate this optimization anyway--all the database writes are going to get forced out by that no matter what before the database considers them reliable. Proper barriers support might be helpful in the case where you're using a database on a shared disk that has other files being written to as well, basically allowing caching on those while forcing the database blocks to physical disk, but that presumes the Linux fsync implementation is more sophisticated than I believe it currently is. Far as I can tell, the main open question I didn't directly test here is whether LVM does any write reordering that can impact database use because it doesn't handle write barriers properly. According to https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does not, and I never got the impression that was impacted by the LVM layer before. The concern is nicely summarized by the comment from Xman at http://lwn.net/Articles/283161/ : "fsync will block until the outstanding requests have been sync'd do disk, but it doesn't guarantee that subsequent I/O's to the same fd won't potentially also get completed, and potentially ahead of the I/O's submitted prior to the fsync. In fact it can't make such guarantees without functioning barriers." Since we know LVM does not have functioning barriers, this would seem to be one area where PostgreSQL would be vulnerable. But since ext3 doesn't have barriers turned on by default either (except some recent SuSE systems), it's not unique to an LVM setup, and if this were really a problem it would be nailing people everywhere. I believe the WAL design handles this situation. There are some known limitations to Linux fsync that I remain somewhat concerned about, independently of LVM, like "ext3 fsync() only does a journal commit when the inode has changed" (see http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The way files are preallocated, the PostgreSQL WAL is supposed to function just fine even if you're using fdatasync after WAL writes, which also wouldn't touch the journal (last time I checked fdatasync was implemented as a full fsync on Linux). Since the new ext4 is more aggressive at delaying writes than ext3, it will be interesting to see if that uncovers some subtle race conditions here that have been lying dormant so far. I leave it as an exercise to the dedicated reader to modify the sysbench test to use O_SYNC/O_DIRECT in order to re-test LVM for the situation if you changed wal_sync_method=open_sync; how to do that is mentioned briefly at http://sysbench.sourceforge.net/docs/ -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > There are some known limitations to Linux fsync that I remain somewhat > concerned about, independantly of LVM, like "ext3 fsync() only does a > journal commit when the inode has changed" (see > http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The > way files are preallocated, the PostgreSQL WAL is supposed to function > just fine even if you're using fdatasync after WAL writes, which also > wouldn't touch the journal (last time I checked fdatasync was > implemented as a full fsync on Linux). Since the new ext4 is more Indeed it does. I wonder if there should be an optional fsync mode in postgres that would turn fsync() into fchmod (fd, 0644); fchmod (fd, 0664); to work around this issue. For example, the program below will show one write per disk revolution if you leave the fchmod() calls in there, and run many times faster (i.e. lying) if you remove them. This is with ext3 on a standard IDE drive with the write cache enabled, and no LVM or anything between them.
==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("usage: fs <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    int i;
    for (i = 0; i < 100; i++) {
        char byte = 0;
        pwrite(fd, &byte, 1, 0);
        /* dirty the inode so ext3's fsync is forced into a journal commit */
        fchmod(fd, 0644); fchmod(fd, 0664);
        fsync(fd);
    }
    return 0;
}
==========================================================
On Tue, 17 Mar 2009, Ron Mayer wrote: > I wonder if there should be an optional fsync mode > in postgres should turn fsync() into > fchmod (fd, 0644); fchmod (fd, 0664); > to work around this issue. The test I haven't had time to run yet is to turn the bug exposing program you were fiddling with into a more accurate representation of WAL activity, to see if that chmod still changes the behavior there. I think the most dangerous possibility here is if you create a new WAL segment and immediately fill it, all in less than a second. Basically, what XLogFileInit does: -Open with O_RDWR | O_CREAT | O_EXCL -Write XLogSegSize (16MB) worth of zeros -fsync Followed by simulating what XLogWrite would do if you fed it enough data to force a segment change: -Write a new 16MB worth of data -fsync If you did all that in under a second, would you still get a filesystem flush each time? From the description of the problem I'm not so sure anymore. I think that's how tight the window would have to be for this issue to show up right now, you'd only be exposed if you filled a new WAL segment faster than the associated journal commit happened (basically, a crash when WAL write volume >16MB/s in a situation where new segments are being created). But from what I've read about ext4 I think that window for mayhem might widen on that filesystem--that's what got me reading up on this whole subject recently, before this thread even started. The other ameliorating factor here is that in order for this to bite you, I think you'd need to have another, incorrectly ordered write somewhere else that could happen before the delayed write. Not sure where that might be possible in the PostgreSQL WAL implementation yet. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
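A rough sketch of that two-step sequence - the file name and the hard-coded 16MB size are just stand-ins, and no real WAL record formatting is attempted - might look like this:
==========================================================
/* wal_fill_test.c - mimic XLogFileInit followed by an immediate segment fill */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEGSIZE (16 * 1024 * 1024)      /* stand-in for XLogSegSize */

static char buf[SEGSIZE];

int main(void)
{
    /* step 1: create the segment the way XLogFileInit does and zero-fill it */
    int fd = open("fake_wal_segment", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0, sizeof buf);
    if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf) { perror("write"); return 1; }
    if (fsync(fd) != 0) { perror("fsync (init)"); return 1; }

    /* step 2: immediately overwrite the whole segment, as a fast WAL writer would */
    memset(buf, 'x', sizeof buf);
    if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) { perror("pwrite"); return 1; }
    if (fsync(fd) != 0) { perror("fsync (fill)"); return 1; }

    puts("segment created and refilled; both fsyncs returned");
    close(fd);
    return 0;
}
==========================================================
The interesting question is whether both fsyncs still turn into real flushes when the whole thing completes inside the filesystem's journal commit interval.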
Greg Smith wrote: > On Tue, 17 Mar 2009, Marco Colombo wrote: > >> If LVM/dm is lying about fsync(), all this is moot. There's no point >> talking about disk caches. > > I decided to run some tests to see what's going on there, and it looks > like some of my quick criticism of LVM might not actually be valid--it's > only the performance that is problematic, not necessarily the > reliability. Appears to support fsync just fine. I tested with kernel > 2.6.22, so certainly not before the recent changes to LVM behavior > improving this area, but with the bugs around here from earlier kernels > squashed (like crummy HPA support circa 2.6.18-2.6.19, see > https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 ) I've run tests too, you can seen them here: https://www.redhat.com/archives/linux-lvm/2009-March/msg00055.html in case you're looking for something trivial (write/fsync loop). > You can do a quick test of fsync rate using sysbench; got the idea from > http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ > (their command has some typos, fixed one below) > > If fsync is working properly, you'll get something near the RPM rate of > the disk. If it's lying, you'll see a much higher number. Same results. -W1 gives x50 speedup, it must be waiting for something at disk level with -W0. [...] > Based on this test, it looks to me like fsync works fine on LVM. It > must be passing that down to the physical disk correctly or I'd still be > seeing inflated rates. If you've got a physical disk that lies about > fsync, and you put a database on it, you're screwed whether or not you > use LVM; nothing different on LVM than in the regular case. A > battery-backed caching controller should also handle fsync fine if it > turns off the physical disk cache, which most of them do--and, again, > you're no more or less exposed to that particular problem with LVM than > a regular filesystem. That was my initial understanding. > The thing that barriers helps out with is that it makes it possible to > optimize flushing ext3 journal metadata when combined with hard drives > that support the appropriate cache flushing mechanism (what hdparm calls > "FLUSH CACHE EXT"; see > http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html > ). That way you can prioritize flushing just the metadata needed to > prevent filesystem corruption while still fully caching less critical > regular old writes. In that situation, performance could be greatly > improved over turning off caching altogether. However, in the > PostgreSQL case, the fsync hammer doesn't appreciate this optimization > anyway--all the database writes are going to get forced out by that no > matter what before the database considers them reliable. Proper > barriers support might be helpful in the case where you're using a > database on a shared disk that has other files being written to as well, > basically allowing caching on those while forcing the database blocks to > physical disk, but that presumes the Linux fsync implementation is more > sophisticated than I believe it currently is. This is the same conclusion I came to. Moreover, once you have barriers passed down to the disks, it would be nice to have a userland API to send them to the kernel. Any application managing a 'journal' or 'log' type of object, would benefit from that. 
I'm not familiar with PG internals, but it's likely you can have some records you just want to be ordered, and you can do something like write-barrier-write-barrier-...-fsync instead of write-fsync-write-fsync-... Currently fsync() (and friends, O_SYNC, fdatasync(), O_DSYNC) is the only way to enforce ordering on writes from userland. > Far as I can tell, the main open question I didn't directly test here is > whether LVM does any write reordering that can impact database use > because it doesn't handle write barriers properly. According to > https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it > does not, and I never got the impression that was impacted by the LVM > layer before. The concern is nicely summarized by the comment from Xman > at http://lwn.net/Articles/283161/ : > > "fsync will block until the outstanding requests have been sync'd do > disk, but it doesn't guarantee that subsequent I/O's to the same fd > won't potentially also get completed, and potentially ahead of the I/O's > submitted prior to the fsync. In fact it can't make such guarantees > without functioning barriers." Sure, but from userland you can't set barriers. If you fsync() after each write you want ordered, there can't be any "subsequent I/O" (unless there are many different processes concurrently writing to the file w/o synchronization). > Since we know LVM does not have functioning barriers, this would seem to > be one area where PostgreSQL would be vulnerable. But since ext3 > doesn't have barriers turned by default either (except some recent SuSE > system), it's not unique to a LVM setup, and if this were really a > problem it would be nailing people everywhere. I believe the WAL design > handles this situation. Well well. Ext3 is definitely in the lucky area. The journal on most ext3 instances is contiguous on disk. The disk won't reorder requests, simply because they are already ordered... only when the journal wraps around is there an (extremely) small window of vulnerability. You need to write a carefully crafted torture program to get any chance of observing that... such a program exists, and it triggers the problem (leaving an inconsistent fs) almost 50% of the time. But it's extremely unlikely you can see it happen in real workloads. http://lwn.net/Articles/283168/ .TM.
Ron Mayer wrote: > Greg Smith wrote: >> There are some known limitations to Linux fsync that I remain somewhat >> concerned about, independantly of LVM, like "ext3 fsync() only does a >> journal commit when the inode has changed" (see >> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The >> way files are preallocated, the PostgreSQL WAL is supposed to function >> just fine even if you're using fdatasync after WAL writes, which also >> wouldn't touch the journal (last time I checked fdatasync was >> implemented as a full fsync on Linux). Since the new ext4 is more > > Indeed it does. > > I wonder if there should be an optional fsync mode > in postgres should turn fsync() into > fchmod (fd, 0644); fchmod (fd, 0664); > to work around this issue. Question is... why do you care if the journal is not flushed on fsync? Only the file data blocks need to be, if the inode is unchanged. > For example this program below will show one write > per disk revolution if you leave the fchmod() in there, > and run many times faster (i.e. lying) if you remove it. > This with ext3 on a standard IDE drive with the write > cache enabled, and no LVM or anything between them. > > ========================================================== > /* > ** based on http://article.gmane.org/gmane.linux.file-systems/21373 > ** http://thread.gmane.org/gmane.linux.kernel/646040 > */ > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <unistd.h> > #include <stdio.h> > #include <stdlib.h> > > int main(int argc,char *argv[]) { > if (argc<2) { > printf("usage: fs <filename>\n"); > exit(1); > } > int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666); > int i; > for (i=0;i<100;i++) { > char byte; > pwrite (fd, &byte, 1, 0); > fchmod (fd, 0644); fchmod (fd, 0664); > fsync (fd); > } > } > ========================================================== > I ran the program above, w/o the fchmod()s. $ time ./test2 testfile real 0m0.056s user 0m0.001s sys 0m0.008s This is with ext3+LVM+raid1+sata disks with hdparm -W1. With -W0 I get: $ time ./test2 testfile real 0m1.014s user 0m0.000s sys 0m0.008s Big difference. The fsync() there does its job. The same program runs with a x3 slowdown with the fsyncs, but that's expected, it's doing twice the writes, and in different places. .TM.
On Wed, 18 Mar 2009, Marco Colombo wrote: > If you fsync() after each write you want ordered, there can't be any > "subsequent I/O" (unless there are many different processes cuncurrently > writing to the file w/o synchronization). Inside PostgreSQL, each of the database backend processes ends up writing blocks to the database disk, if they need to allocate a new buffer and the one they are handed is dirty. You can easily have several of those writing to the same 1GB underlying file on disk. So that prerequisite is there. The main potential for a problem here would be if a stray unsynchronized write from one of those backends happened in a way that wasn't accounted for by the WAL+checkpoint design. What I was suggesting is that the way that synchronization happens in the database provides some defense from running into problems in this area. The way backends handle writes themselves is also why your suggestion about the database being able to utilize barriers isn't really helpful. Those trickle out all the time, and normally you don't even have to care about ordering them. The only time you do need to care is at checkpoint time, and there only a hard line is really practical--all writes up to that point, period. Trying to implement ordered writes for everything that happened before then would complicate the code base, which isn't going to happen for such a platform+filesystem specific feature, one that really doesn't offer much acceleration from the database's perspective. > only when the journal wraps around there's a (extremely) small window of > vulnerability. You need to write a careful crafted torture program to > get any chance to observe that... such program exists, and triggers the > problem Yeah, I've been following all that. The PostgreSQL WAL design works on ext2 filesystems with no journal at all. Some people even put their pg_xlog directory onto ext2 filesystems for best performance, relying on the WAL to be the journal. As long as fsync is honored correctly, the WAL writes should be re-writing already allocated space, which makes this category of journal mayhem not so much of a problem. But when I read about fsync doing unexpected things, that gets me more concerned. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Marco Colombo wrote: > Ron Mayer wrote: >> Greg Smith wrote: >>> There are some known limitations to Linux fsync that I remain somewhat >>> concerned about, independantly of LVM, like "ext3 fsync() only does a >>> journal commit when the inode has changed" (see >>> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).... >> I wonder if there should be an optional fsync mode >> in postgres should turn fsync() into >> fchmod (fd, 0644); fchmod (fd, 0664); 'course I meant: "fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);" >> to work around this issue. > Question is... why do you care if the journal is not flushed on fsync? > Only the file data blocks need to be, if the inode is unchanged. You don't - but ext3 fsync won't even push the file data blocks through the disk cache unless the inode was changed. The point is that ext3 only does the "write barrier" processing that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) commands on inode changes, not data changes. And with no FLUSH CACHE or SYNCHRONIZE CACHE, the data blocks may sit in the disk's cache after the fsync() as well. PS: not sure if this is still true - last time I tested it was Nov 2006. Ron
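If such an optional mode were ever added, the workaround itself is tiny; a sketch (the function name is invented here, and the two mode values only need to differ so the inode gets dirtied before the flush):
==========================================================
#include <sys/stat.h>
#include <unistd.h>

/* fsync variant that dirties the inode first, so ext3 is pushed into a
 * journal commit (and hence a drive cache flush) even when only data
 * blocks changed.  Sketch only; no error handling on the fchmod calls. */
int fsync_inode_dirty(int fd)
{
    (void) fchmod(fd, 0644);    /* flip the mode bits ...                     */
    (void) fchmod(fd, 0664);    /* ... and flip them back: inode is now dirty */
    return fsync(fd);
}
==========================================================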
Greg Smith wrote: > On Wed, 18 Mar 2009, Marco Colombo wrote: > >> If you fsync() after each write you want ordered, there can't be any >> "subsequent I/O" (unless there are many different processes >> cuncurrently writing to the file w/o synchronization). > > Inside PostgreSQL, each of the database backend processes ends up > writing blocks to the database disk, if they need to allocate a new > buffer and the one they are handed is dirty. You can easily have > several of those writing to the same 1GB underlying file on disk. So > that prerequisite is there. The main potential for a problem here would > be if a stray unsynchronized write from one of those backends happened > in a way that wasn't accounted for by the WAL+checkpoint design. Wow, that would be quite a bug. That's why I wrote "w/o synchronization". "stray" + "unaccounted" + "concurrent" smells like the recipe for an explosive to me :) > What I > was suggesting is that the way that synchronization happens in the > database provides some defense from running into problems in this area. I hope it's "full defence". If you have two processes doing write(); fsync(); on the same file at the same time, either there are no order requirements, or it will go boom sooner or later... fsync() works inside a single process, but any system call may put the process to sleep, and who knows when it will be awakened and what other processes did to that file meanwhile. I'm pretty confident that PG code protects access to shared resources with synchronization primitives. Anyway, I was referring to WAL writes... due to the nature of a log, it's hard to think of many unordered writes and of concurrent access w/o synchronization. But inside a critical region there can be more than one single write, and you may need to enforce an order, but no more than that before the final fsync(). If so, userland-originated barriers instead of full fsync()'s may help with performance. But I'm speculating. > The way backends handle writes themselves is also why your suggestion > about the database being able to utilize barriers isn't really helpful. > Those trickle out all the time, and normally you don't even have to care > about ordering them. The only you do need to care, at checkpoint time, > only a hard line is really practical--all writes up to that point, > period. Trying to implement ordered writes for everything that happened > before then would complicate the code base, which isn't going to happen > for such a platform+filesystem specific feature, one that really doesn't > offer much acceleration from the database's perspective. I don't know the internals of WAL writing, so I can't really reply on that. >> only when the journal wraps around there's a (extremely) small window >> of vulnerability. You need to write a careful crafted torture program >> to get any chance to observe that... such program exists, and triggers >> the problem > > Yeah, I've been following all that. The PostgreSQL WAL design works on > ext2 filesystems with no journal at all. Some people even put their > pg_xlog directory onto ext2 filesystems for best performance, relying on > the WAL to be the journal. As long as fsync is honored correctly, the > WAL writes should be re-writing already allocated space, which makes > this category of journal mayhem not so much of a problem. But when I > read about fsync doing unexpected things, that gets me more concerned. Well, that's highly dependent on your expectations :) I don't expect an fsync to trigger a journal commit if metadata hasn't changed.
That's obviously true for metadata-only journals (like most of them, with the notable exception of ext3 in data=journal mode). Yet, if you're referring to this: http://article.gmane.org/gmane.linux.file-systems/21373 - well, that seems to me to be the same usual thing/bug: fsync() allows disks to lie when it comes to caching writes. Nothing new under the sun. Barriers don't change much, because they don't replace a flush. They're about consistency, not durability. So even with full barriers support, an fsync implementation needs to end up in a disk cache flush, to be fully compliant with its own semantics. .TM.
On Wed, Mar 18, 2009 at 10:58:39PM +0100, Marco Colombo wrote: > I hope it's "full defence". If you have two processes doing at the > same time write(); fsycn(); on the same file, either there are no order > requirements, or it will boom sooner or later... fsync() works inside > a single process, but any system call may put the process to sleep, and > who knows when it will be awakened and what other processes did to that > file meanwhile. I'm pretty confident that PG code protects access to > shared resources with synchronization primitives. Generally PG uses O_SYNC on open, so it's only one system call, not two. And the file it's writing to is generally preallocated (not always though). > Well, that's highly dependant on your expectations :) I don't expect > a fsync to trigger a journal commit, if metadata hasn't changed. That's > obviuosly true for metadata-only journals (like most of them, with > notable exceptions of ext3 in data=journal mode). Really the only thing needed is that the WAL entry reaches disk before the actual data does. AIUI as long as you have that the situation is recoverable. Given that the actual data probably won't be written for a while it'd need to go pretty wonky before you see an issue. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Ron Mayer wrote: > Marco Colombo wrote: >> Ron Mayer wrote: >>> Greg Smith wrote: >>>> There are some known limitations to Linux fsync that I remain somewhat >>>> concerned about, independantly of LVM, like "ext3 fsync() only does a >>>> journal commit when the inode has changed" (see >>>> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).... >>> I wonder if there should be an optional fsync mode >>> in postgres should turn fsync() into >>> fchmod (fd, 0644); fchmod (fd, 0664); > 'course I meant: "fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);" >>> to work around this issue. >> Question is... why do you care if the journal is not flushed on fsync? >> Only the file data blocks need to be, if the inode is unchanged. > > You don't - but ext3 fsync won't even push the file data blocks > through a disk cache unless the inode was changed. > > The point is that ext3 only does the "write barrier" processing > that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) > commands on inode changes, not data changes. And with no FLUSH > CACHE or SYNCHRONINZE IDE the data blocks may sit in the disks > cache after the fsync() as well. Yes, but we knew it already, didn't we? It's always been like that: with IDE disks and write-back cache enabled, fsync just waits for the disk to report completion, and disks lie about that. Write barriers enforce ordering: WHEN writes are committed to disk, they will be in order, but that doesn't mean NOW. Ordering is enough for a FS journal, the only requirement being consistency. Anyway, it's the block device's job to control disk caches. A filesystem is just a client to the block device: it posts a flush request, and what happens depends on the block device code. The FS doesn't talk to disks directly. And a write barrier is not a flush request, it is a "please do not reorder" request. On fsync(), ext3 issues a flush request to the block device; that's all it's expected to do. Now, some block devices may implement write barriers by issuing FLUSH commands to the disk, but that's another matter. A FS shouldn't rely on that. You can replace a barrier with a flush (not as efficiently), but not the other way around. If a block device driver issues FLUSH for a barrier, and doesn't issue a FLUSH for a flush, well, it's a buggy driver, IMHO. .TM.
On Wed, 18 Mar 2009, Martijn van Oosterhout wrote: > Generally PG uses O_SYNC on open Only if you change wal_sync_method=open_sync. That's the very last option PostgreSQL will try--only if none of the others are available will it use that. Last time I checked the default value for that parameter broke down like this by platform:
open_datasync (O_DSYNC): Solaris, Windows (I think there's a PG wrapper involved for Win32)
fdatasync: Linux (even though the OS just provides a fake wrapper around fsync for that call)
fsync_writethrough: Mac OS X
fsync: FreeBSD
That makes Solaris the only UNIX{-ish} OS where the default is a genuine sync write. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Martijn van Oosterhout wrote: > Generally PG uses O_SYNC on open, so it's only one system call, not > two. And the file it's writing to is generally preallocated (not > always though). It has to wait for I/O completion on write() then, so it has to go to sleep. If two different processes do a write(), you don't know which will be awakened first. Preallocation doesn't mean much here, since with O_SYNC you expect a physical write to be done (with the whole sleep/HW interrupt/SW interrupt/awake dance). It's true that you may expect the writes to be carried out in order, and that might be enough. I'm not sure tho. >> Well, that's highly dependant on your expectations :) I don't expect >> a fsync to trigger a journal commit, if metadata hasn't changed. That's >> obviuosly true for metadata-only journals (like most of them, with >> notable exceptions of ext3 in data=journal mode). > > Really the only thing needed is that the WAL entry reaches disk before > the actual data does. AIUI as long as you have that the situation is > recoverable. Given that the actual data probably won't be written for a > while it'd need to go pretty wonky before you see an issue. You're giving up Durability here. In a closed system, that doesn't mean much, but when you report "payment accepted" to third parties, you can't forget about it later. The requirement you stated is for Consistency only. That's what a journaled FS cares about, i.e. no need for fsck (internal consistency checks) after a crash. It may be acceptable for a remote standby backup: you replay as much of the WAL as is available after the crash (the part you managed to copy, that is). But you know there can be lost transactions. It may be acceptable or not. Sometimes it's not. Sometimes you must be sure the data is on the platters before you report "committed". Sometimes when you say "fsync!" you mean "I want data flushed to disk NOW, and I really mean it!". :) .TM.
Hello, As a continued follow-up to this thread, Tim Post replied on the LVM list to this effect: " If a logical volume spans physical devices where write caching is enabled, the results of fsync() can not be trusted. This is an issue with device mapper, lvm is one of a few possible customers of DM. Now it gets interesting: Enter virtualization. When you have something like this: fsync -> guest block device -> block tap driver -> CLVM -> iscsi -> storage -> physical disk. Even if device mapper passed along the write barrier, would it be reliable? Is every part of that chain going to pass the same along, and how many opportunities for re-ordering are presented in the above? So, even if its fixed in DM, can fsync() still be trusted? I think, at the least, more testing should be done with various configurations even after a suitable patch to DM is merged. What about PGSQL users using some kind of elastic hosting? Given the craze in 'cloud' technology, its an important question to ask (and research). Cheers, --Tim " Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Marco Colombo wrote: > Yes, but we knew it already, didn't we? It's always been like > that, with IDE disks and write-back cache enabled, fsync just > waits for the disk reporting completion and disks lie about I've looked hard, and I have yet to see a disk that lies. ext3, OTOH, seems to lie. IDE drives happily report whether they support write barriers or not, which you can see with the command:
%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
I've tested about a dozen drives, and I've never seen one that claims to support flushing but doesn't. And I haven't seen one that doesn't support it that was made less than half a decade ago. IIRC, ATA-5 specs from 2000 made supporting this mandatory. Linux kernels since 2005 or so check for this feature. It'll happily tell you which of your devices don't support it:
%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
And for devices that do, it will happily send IDE FLUSH CACHE commands to IDE drives that support the feature. At the same time Linux kernels started sending the very similar SCSI SYNCHRONIZE CACHE commands. > Anyway, it's the block device job to control disk caches. A > filesystem is just a client to the block device, it posts a > flush request, what happens depends on the block device code. > The FS doesn't talk to disks directly. And a write barrier is > not a flush request, is a "please do not reorder" request. > On fsync(), ext3 issues a flush request to the block device, > that's all it's expected to do. But AFAICT ext3 fsync() only tells the block device to flush disk caches if the inode was changed. Or, at least empirically, if I modify a file and do fsync(fd) on ext3, it does not wait until the disk has spun to where it's supposed to spin. But if I put a couple of fchmod()'s right before the fsync() it does.
I am jumping into this thread late, and maybe this has already been
stated clearly, but from my experience benchmarking, LVM does *not* lie
about fsync() on the servers I've configured. An fsync() goes to the
physical device. You can see it clearly by setting the write cache on
the RAID controller to write-through policy: performance decreases to
what the disks can do. And my colleagues and clients have tested yanking
the power plug and checking that the data got to the RAID controller's
battery-backed cache, many many times. In other words, the data is safe
and durable, even on LVM.

However, I have never tried to do this on volumes that span multiple
physical devices, because LVM can't take an atomic snapshot across them,
which completely negates the benefit of LVM for my purposes. So I always
create one logical disk in the RAID controller, and then carve that up
with LVM, partitions, etc. however I please.

I almost surely know less about this topic than anyone on this thread.

Baron
On Thu, Mar 19, 2009 at 12:49:52AM +0100, Marco Colombo wrote:
> It has to wait for I/O completion on write(), then, so it has to go to
> sleep. If two different processes do a write(), you don't know which
> will be awakened first. Preallocation doesn't mean much here, since with
> O_SYNC you expect a physical write to be done (with the whole sleep/
> HW interrupt/SW interrupt/awake dance). It's true that you may expect
> the writes to be carried out in order, and that might be enough. I'm
> not sure, though.

True, but the relative wakeup order of two different processes is not
important, since by definition they are working on different
transactions. As long as the WAL writes for a single transaction (in a
single process) are not reordered, you're fine.

The benefit of a non-overwriting storage manager is that you don't need
to worry about undos. Any incomplete transaction is uncommitted, and so
any data produced by that transaction is ignored.

> It may be acceptable or not. Sometimes it's not. Sometimes you must be
> sure the data is on platters before you report "committed". Sometimes
> when you say "fsync!" you mean "I want the data flushed to disk NOW,
> and I really mean it!". :)

Of course. Committing a transaction comes down to flipping a single bit.
Before you flip it, all the WAL data for that transaction must have hit
disk. And you don't tell the client the transaction has committed until
the flipped bit has hit disk. And fsync had better do what you're asking
(how fast is just a performance issue, as long as it's done).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
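The commit ordering described here can be written down as a deliberately
over-explicit sketch (hypothetical helper names, not PostgreSQL
internals; error handling omitted). The point is only the ordering: the
transaction's WAL must be durable before the commit record, and the
commit record must be durable before the client hears "committed". In
practice the flushes can often be combined, since WAL is written
sequentially.

/* Sketch of the commit ordering; all guarantees hang on fsync()
 * actually reaching stable storage. */
#include <stddef.h>
#include <unistd.h>

void commit_transaction(int wal_fd,
                        const void *wal_records, size_t wal_len,
                        const void *commit_record, size_t commit_len,
                        void (*reply_to_client)(void))
{
    /* 1. The transaction's WAL data goes out first...            */
    write(wal_fd, wal_records, wal_len);   /* error handling omitted */
    fsync(wal_fd);                         /* ...and must be on disk... */

    /* 2. ...before the commit record ("flipping the bit") is written
     *    and made durable in turn.                               */
    write(wal_fd, commit_record, commit_len);
    fsync(wal_fd);

    /* 3. Only now may the client be told the transaction committed. */
    reply_to_client();
}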
Martijn van Oosterhout wrote:
> True, but the relative wakeup order of two different processes is not
> important, since by definition they are working on different
> transactions. As long as the WAL writes for a single transaction (in a
> single process) are not reordered, you're fine.

I'm not totally sure, but I think I understand what you mean here:
independent transactions, by definition, don't care about relative
ordering.

.TM.
Ron Mayer wrote:
> Marco Colombo wrote:
>> Yes, but we knew it already, didn't we? It's always been like
>> that, with IDE disks and write-back cache enabled, fsync just
>> waits for the disk reporting completion and disks lie about
>
> I've looked hard, and I have yet to see a disk that lies.

No, "lie" in the sense that they report completion before the data hits
the platters. Of course, that's the expected behaviour with write-back
caches.

> ext3, OTOH, seems to lie.

ext3 simply doesn't know: it interfaces with a block device, which does
the caching (OS level) and the reordering (e.g. the elevator algorithm).
ext3 doesn't directly send commands to the disk, nor does it manage the
OS cache.

When software RAID and device mapper come into play, you have "virtual"
block devices built on top of other block devices. My home desktop has
ext3 on top of a dm device (/dev/mapper/something, a LV set up by LVM in
this case), on top of a raid1 device (/dev/mdX), on top of /dev/sdaX and
/dev/sdbX, which, in a way, are themselves block devices built on
others, /dev/sda and /dev/sdb (you don't actually send commands to
partitions, do you? although the mapping "sector offset relative to
partition -> real sector on disk" is trivial). Each of these layers
potentially caches writes and reorders them; that's the job of a block
device, although it really makes sense only for the last one, the one
that controls the disk. Anyway, there isn't much ext3 can do beyond
posting write-barrier and flush requests to the block device at the top
of the "stack".

> IDE drives happily report whether they support write barriers
> or not, which you can see with the command:
> %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

Of course, a write barrier is not a cache flush. A flush is synchronous,
a write barrier asynchronous. The disk supports flushing, not write
barriers. Well, technically if you can control the ordering of the
requests, that's barriers proper; with SCSI you can, IIRC. But a cache
flush is, well, a flush.

> Linux kernels since 2005 or so check for this feature. They'll
> happily tell you which of your devices don't support it:
> %dmesg | grep 'disabling barriers'
> JBD: barrier-based sync failed on md1 - disabling barriers
> And for devices that do, the kernel will happily send IDE FLUSH CACHE
> commands to IDE drives that support the feature. At the same time,
> Linux kernels started sending the very similar SCSI SYNCHRONIZE CACHE
> commands.
>> Anyway, it's the block device's job to control disk caches. A
>> filesystem is just a client to the block device; it posts a
>> flush request, and what happens depends on the block device code.
>> The FS doesn't talk to disks directly. And a write barrier is
>> not a flush request, it's a "please do not reorder" request.
>> On fsync(), ext3 issues a flush request to the block device,
>> that's all it's expected to do.
>
> But AFAICT ext3's fsync() only tells the block device to
> flush disk caches if the inode was changed.

No, ext3 posts a write barrier request when the inode changes and it
commits the journal, which is not a flush. [*]

> Or, at least empirically: if I modify a file and do
> fsync(fd) on ext3, it does not wait for the platter to spin
> around to where the data needs to be written. But if I put
> a couple of fchmod()'s right before the fsync(), it does.

If you were right, and ext3 didn't wait, it would make no difference on
fsync whether the disk cache is enabled or not. My test shows a 50x
speedup when turning the disk cache on. So for sure ext3 is waiting for
the block device to report completion.
It's the block device that - on flush - doesn't issue a FLUSH command to
the disk.

.TM.

[*] A barrier ends up in a FLUSH for the disk, but that doesn't mean it's
synchronous, like a real flush. Even journal updates done with barriers
don't mean "hit the disk now"; they just mean "keep order" when writing.
If you turn off automatic page cache flushing and you have zero memory
pressure, a write request with a barrier may stay forever in the OS
cache, at least in theory. Imagine you don't have bdflush and nothing
reclaims resources: days of activity may stay in RAM, as far as write
barriers are concerned.

Now someone types 'sync' as root. The block device starts flushing dirty
pages, reordering writes but honoring barriers: it reorders anything up
to the first barrier, posts write requests to the disk, issues a FLUSH
command, then waits until the flush is completed. Then it "consumes" the
barrier and starts processing writes again, reordering them up to the
next barrier, and so on. So yes, a barrier turns into a FLUSH command
for the disk. But in this scenario, days have passed since the original
write/barrier request from the filesystem.

Compare with an fsync(). Even in the above scenario, an fsync() should
end up in a FLUSH command to the disk, and wait for the request to
complete, before awakening the process that issued it. So the filesystem
has to request a flush operation from the block device, not a barrier.
And so it does. If it turns out that the block device just issues writes
but no FLUSH command to the disks, that's not the FS's fault. And issuing
barrier requests won't change anything.

All this in theory. In practice there may be implementation details that
make things different. I've read that in the Linux kernel at some time
(maybe even now) only one outstanding write barrier is possible in the
stack of block devices. So I guess that a second write barrier request
triggers a real disk flush. That's why when you use fchmod() repeatedly,
you see all those flushes. But technically it's a side effect, and I
think on closer analysis you may notice it's always lagging one request
behind, which you don't see just by looking at numbers or listening to
disk noise. So, multiple journal commits may really help in having the
disk cache flushed as a side effect, but I think the bug is elsewhere.
The day Linux supports multiple outstanding write-barrier requests, that
stops working. It's the block device that should be fixed, so that it
performs a cache FLUSH when the filesystem asks for a flush. Why block
devices don't do that today is a mystery to me, but I think there must
be something that I'm missing.

Anyway, the point here is that using LVM is no less safe than using an
IDE block device directly. There may be filesystems that on fsync issue
not only a flush request but also a journal commit, with attached
barrier requests, thus getting the Right Thing done by double side
effect. And yes, ext3 is NOT among them, unless you trigger those
commits with the fchmod() dance.
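A crude way to check which case you are in, and to see the kind of 50x
difference mentioned above, is to time a loop of fsync()'d writes and
compare the rate against the drive's rotational limit (for a 7200 rpm
disk, roughly 120 synchronous writes per second at best). The following
is a sketch with an arbitrary test file name; run it once with the
write-back cache enabled and once after disabling it with hdparm -W0 on
the drive under test. Rates in the thousands mean some layer is caching
the writes rather than flushing them.

/* Sketch: measure how many fsync()'d one-byte writes complete per second. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 500;
    int fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    struct timeval t0, t1;
    char c = 'x';

    if (fd < 0) {
        perror("open");
        return 1;
    }

    gettimeofday(&t0, NULL);
    for (int i = 0; i < iterations; i++) {
        write(fd, &c, 1);   /* error handling omitted for brevity */
        fsync(fd);
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fsync'd writes in %.2f s = %.0f/sec\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}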
Hi,

Martijn van Oosterhout wrote:
> And fsync had better do what you're asking
> (how fast is just a performance issue, as long as it's done).

Where are we on this issue? I've read all of this thread and the one on
the lvm-linux mailing list as well, but still don't feel confident.

In the following scenario:

fsync -> filesystem -> physical disk

I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
the physical disk upon fsync(), to do what it's told: flush the cache(s)
to disk. Further, I'm also assuming the physical disk is flushable (i.e.
it correctly implements the blkdev_issue_flush() call). Here we can be
pretty certain that fsync works as advertised, I think.

The unanswered question to me is: what happens if I add LVM in between,
as follows?

fsync -> filesystem -> device mapper (lvm) -> physical disk(s)

Again, assume the filesystem issues a blkdev_issue_flush() to the lower
layer and the physical disks are all flushable (and implement that
correctly). How does the device mapper behave?

I'd expect it to forward the blkdev_issue_flush() call to all affected
devices and only return after the last one has confirmed and completed
flushing its caches. Is that the case?

I've also read about the newish write barriers and about filesystems
implementing fsync with such write barriers. That seems fishy to me and
would of course break in combination with LVM (which doesn't completely
support write barriers, AFAIU). However, that's clearly the filesystem
side of the story and has not much to do with whether fsync lies on top
of LVM or not.

Help in clarifying this issue is greatly appreciated.

Kind Regards

Markus Wanner
Markus Wanner wrote:
> Hi,
>
> Martijn van Oosterhout wrote:
>> And fsync had better do what you're asking
>> (how fast is just a performance issue, as long as it's done).
>
> Where are we on this issue? I've read all of this thread and the one on
> the lvm-linux mailing list as well, but still don't feel confident.
>
> In the following scenario:
>
> fsync -> filesystem -> physical disk
>
> I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
> the physical disk upon fsync(), to do what it's told: flush the cache(s)
> to disk. Further, I'm also assuming the physical disk is flushable (i.e.
> it correctly implements the blkdev_issue_flush() call). Here we can be
> pretty certain that fsync works as advertised, I think.
>
> The unanswered question to me is: what happens if I add LVM in between,
> as follows?
>
> fsync -> filesystem -> device mapper (lvm) -> physical disk(s)
>
> Again, assume the filesystem issues a blkdev_issue_flush() to the lower
> layer and the physical disks are all flushable (and implement that
> correctly). How does the device mapper behave?
>
> I'd expect it to forward the blkdev_issue_flush() call to all affected
> devices and only return after the last one has confirmed and completed
> flushing its caches. Is that the case?
>
> I've also read about the newish write barriers and about filesystems
> implementing fsync with such write barriers. That seems fishy to me and
> would of course break in combination with LVM (which doesn't completely
> support write barriers, AFAIU). However, that's clearly the filesystem
> side of the story and has not much to do with whether fsync lies on top
> of LVM or not.
>
> Help in clarifying this issue is greatly appreciated.
>
> Kind Regards
>
> Markus Wanner

Well, AFAIK, the summary would be:

1) adding LVM to the chain makes no difference;

2) you still need to disable the write-back cache on IDE/SATA disks for
fsync() to work properly;

3) without LVM and with the write-back cache enabled, due to current(?)
limitations in the Linux kernel, with some journaled filesystems (but
not ext3 in data=writeback or data=ordered mode; I'm not sure about
data=journal), you may be less vulnerable if you use fsync() (or
O_SYNC). "Less vulnerable" means that all pending changes are committed
to disk except the very last one.

So:

- write-back cache + ext3 = unsafe
- write-back cache + other fs = (depending on the fs)[*] safer, but not 100% safe
- write-back cache + LVM + any fs = unsafe
- write-through cache + any fs = safe
- write-through cache + LVM + any fs = safe

[*] the fs must use (directly or indirectly via journal commit) a write
barrier on fsync(). Ext3 doesn't (it does when the inode changes, but
that happens only once a second).

If you want both speed and safety, use a battery-backed controller (and
write-through cache on the disks, but the controller should enforce that
when you plug the disks in). It's the usual "Fast, Safe, Cheap: choose
two".

This is an interesting article:
http://support.microsoft.com/kb/234656/en-us/

Note how for all three kinds of disk (IDE/SATA/SCSI) they say: "Disk
caching should be disabled in order to use the drive with SQL Server".
They don't mention write barriers.

.TM.