Thread: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

[PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Marti Raudsepp
Date:
Hi list,

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior. Users might be in for a nasty
performance surprise when re-building their Postgres with newer Linux
headers (as was I), so I propose that this change should be made in
all supported releases.

-- commit message --
Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Linux kernel headers from 2.6.33 (and later) change the behavior of the
O_SYNC flag. Previously O_SYNC was aliased to O_DSYNC, which caused
PostgreSQL to use fdatasync as the default instead.

Starting with kernels 2.6.33 and later, the definitions of O_DSYNC and
O_SYNC differ. When built with headers from these newer kernels,
PostgreSQL will default to using open_datasync. This patch reverts the
Linux default to fdatasync, which has had much more testing over time
and also significantly better performance.
-- end commit message --

Earlier kernel headers defined O_SYNC and O_DSYNC to 0x1000
2.6.33 and later define O_SYNC=0x101000 and O_DSYNC=0x1000 (since old
behavior on most FS-es was always equivalent to POSIX O_DSYNC)

More details at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3

Currently PostgreSQL's include/access/xlogdefs.h defaults to using
open_datasync when O_SYNC != O_DSYNC, otherwise fdatasync is used.

Since other platforms might want to default to fdatasync in the
future, too, I defined a new PLATFORM_DEFAULT_SYNC_METHOD constant in
include/port/linux.h. I don't know if this is the best way to do it.

Regards,
Marti

Attachment

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Marti Raudsepp <marti@juffo.org> writes:
> PostgreSQL's default settings change when built with Linux kernel
> headers 2.6.33 or newer. As discussed on the pgsql-performance list,
> this causes a significant performance regression:
> http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

> NB! I am not proposing to change the default -- to the contrary --
> this patch restores old behavior.

I'm less than convinced this is the right approach ...

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere?  Should we be rethinking the default preference order?
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Marti Raudsepp
Date:
On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm less than convinced this is the right approach ...
>
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere?  Should we be rethinking the default preference order?

Sure, maybe for PostgreSQL 9.1

But the immediate problem is older releases (8.1 - 9.0) specifically
on Linux. Something as innocuous as re-building your DB on a newer
kernel will radically affect performance -- even when the DB kernel
didn't change.

So I think we should aim to fix old versions first. Do you disagree?

Regards,
Marti


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Marti Raudsepp <marti@juffo.org> writes:
> On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If open_dsync is so bad for performance on Linux, maybe it's bad
>> everywhere?  Should we be rethinking the default preference order?

> So I think we should aim to fix old versions first. Do you disagree?

What's that got to do with it?
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Marti Raudsepp
Date:
On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marti Raudsepp <marti@juffo.org> writes:
>> On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If open_dsync is so bad for performance on Linux, maybe it's bad
>>> everywhere?  Should we be rethinking the default preference order?
>
>> So I think we should aim to fix old versions first. Do you disagree?
>
> What's that got to do with it?

I'm not sure what you're asking.

Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?

Regards,
Marti


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Marti Raudsepp <marti@juffo.org> writes:
> On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What's that got to do with it?

> I'm not sure what you're asking.

> Surely changing the default wal_sync_method for all OSes in
> maintenance releases is out of the question, no?

Well, if we could leave well enough alone it would be fine with me,
but I think our hand is being forced by the Linux kernel hackers.
I don't really think that "change the default on Linux" is that
much nicer than "change the default everywhere" when it comes to
what we ought to consider back-patching.  In any case, you're getting
ahead of the game: we need to decide on the desired behavior first and
then think about what to patch.  Do the performance results that were
cited show that open_dsync is generally inferior to fdatasync?  If so,
why would we think that that conclusion is Linux-specific?
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Andres Freund
Date:
On Friday 05 November 2010 19:13:47 Tom Lane wrote:
> Marti Raudsepp <marti@juffo.org> writes:
> > PostgreSQL's default settings change when built with Linux kernel
> > headers 2.6.33 or newer. As discussed on the pgsql-performance list,
> > this causes a significant performance regression:
> > http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
> > 
> > NB! I am not proposing to change the default -- to the contrary --
> > this patch restores old behavior.
> 
> I'm less than convinced this is the right approach ...
> 
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere?  Should we be rethinking the default preference order?
I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also otherwise) it 
increases the amount of synchronous writes the os has to do tremendously.

* It removes about all benefits of XLogBackgroundFlush() 
* It removes any chances of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write outy 

Whats the theory about placing it so high in the preferences list?

Andres


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Marti Raudsepp
Date:
On Fri, Nov 5, 2010 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't really think that "change the default on Linux" is that
> much nicer than "change the default everywhere" when it comes to
> what we ought to consider back-patching.  In any case, you're getting
> ahead of the game: we need to decide on the desired behavior first and
> then think about what to patch.

We should be trying to guarantee the stability of maintenance
releases. "Stability" includes consistent defaults. The fact that
Linux now distinguishes between these two flags has a very surprising
effect on PostgreSQL's defaults; an effect that wasn't intended by any
developer, is not documented anywhere, and certainly won't be
anticipated by users.

Do you reject this premise?

As newer distros are adopting 2.6.33+ kernels, more and more people
will shoot themselves in the foot by this change. I am also worried
that it will have a direct effect on PostgreSQL adoption.

Regards,
Marti


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Greg Smith
Date:
Tom Lane wrote:
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere?  Should we be rethinking the default preference order?
>   

And I've seen the expected sync write performance gain over fdatasync on 
a system with a battery-backed cache running VxFS on Linux, because 
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and 
therefore reducing cache pollution from WAL writes.  This doesn't work 
by default on Solaris because they have a special system call you have 
to execute for direct output, but if you trick the OS into doing that 
via mount options you can observe it there too.  The last serious tests 
of this area I saw on that platform were from Jignesh, and they 
certainly didn't show a significant performance regression running in 
sync mode.  I vaguely recall seeing a set once that showed a minor loss 
compared to fdatasync, but it was too close to make any definitive 
statement about reordering.

I haven't seen any report yet of a serious performance regression in the 
new Linux case that was written by someone who understands fully how 
fsync and drive cache flushing are supposed to interact.  It's been 
obvious for a year now that the reports from Phoronix about this had no 
idea what they were actually testing.  I didn't see anything from 
Marti's report that definitively answers whether this is anything other 
than Linux finally doing the right thing to flush drive caches out when 
sync writes happen.  There may be a performance regression here related 
to WAL data going out in smaller chunks than it used to, but in all the 
reports I've seen it that hasn't been isolated well enough to consider 
making any changes yet--to tell if it's a performance loss or a 
reliability gain we're seeing.

I'd like to see some output from the 9.0 test_fsync on one of these 
RHEL6 systems on a system without a battery backed write cache as a 
first step here.  That should start to shed some light on what's 
happening.  I just bumped up the priority on the pending upgrade of my 
spare laptop to the RHEL6 beta I had been trying to find time for, so I 
can investigate this further myself.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD




Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On Friday 05 November 2010 19:13:47 Tom Lane wrote:
>> If open_dsync is so bad for performance on Linux, maybe it's bad
>> everywhere?  Should we be rethinking the default preference order?

> I fail to see how it could be beneficial on *any* non-buggy platform.
> Especially with small wal_buffers and larger commits (but also otherwise) it 
> increases the amount of synchronous writes the os has to do tremendously.

> * It removes about all benefits of XLogBackgroundFlush() 
> * It removes any chances of reordering after writing.
> * It makes AdvanceXLInsertBuffer synchronous if it has to write outy 

> Whats the theory about placing it so high in the preferences list?

I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable.  But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then.  That's why I want to revisit the choice of
default and not just go for "minimum" change.
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Andres Freund
Date:
On Friday 05 November 2010 22:53:37 Greg Smith wrote:
> > If open_dsync is so bad for performance on Linux, maybe it's bad
> > everywhere?  Should we be rethinking the default preference order?
> >
> >   
> 
> And I've seen the expected sync write performance gain over fdatasync on 
> a system with a battery-backed cache running VxFS on Linux, because 
> working open_[d]sync means O_DIRECT writes bypassing the OS cache, and 
> therefore reducing cache pollution from WAL writes.
Which looks like a setup where you definitely need to know what you do. I.e. 
don't need support from wal_sync_method by default being open_fdatasync...

Andres


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
> I think the original idea was that if you had a dedicated WAL drive then
> sync-on-write would be reasonable.  But that was a very long time ago
> and I'm not sure that the system's behavior is anything like what it was
> then; for that matter I'm not sure we had proof that it was an optimal
> choice even back then.  That's why I want to revisit the choice of
> default and not just go for "minimum" change.

What plaforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> What plaforms do we need to test to get a reasonable idea? Solaris,
> FreeBSD, Windows?

At least.  I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
On 11/5/10 3:31 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> What plaforms do we need to test to get a reasonable idea? Solaris,
>> FreeBSD, Windows?
> 
> At least.  I'm hoping that Greg Smith will take the lead on testing
> this, since he seems to have spent the most time in the area so far.

I could test at least 1 version of Solaris, I think.

Greg, any recommendations on pgbench parameters?

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Greg Smith
Date:
Tom Lane wrote:
> I'm hoping that Greg Smith will take the lead on testing
> this, since he seems to have spent the most time in the area so far.
>   

It's not coincidence that the chapter of my book I convinced the 
publisher to release as a sample is the one that covers this area; this 
mess has been visibly approaching for some time now.  I'm going to put 
RHEL6 onto a system and start collecting some proper slowdown numbers 
this week, then pass along a suggested test regime for others.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD




Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Greg Smith
Date:
Marti Raudsepp wrote:
> PostgreSQL's default settings change when built with Linux kernel
> headers 2.6.33 or newer. As discussed on the pgsql-performance list,
> this causes a significant performance regression:
> http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
>
> NB! I am not proposing to change the default -- to the contrary --
> this patch restores old behavior.

Following our standard community development model, I've put this patch 
onto our CommitFest list:  
https://commitfest.postgresql.org/action/patch_view?id=432 and assigned 
myself as the reviewer.  I didn't look at this until now because I 
already had some patch development and review work to finish before the 
CommitFest deadline we just crossed.  Now I can go back to reviewing 
other people's work.

P.S. There is no pgsql-patch list anymore; everything goes through the 
hackers list now.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
All,

So, this week I've had my hands on a medium-high-end test system where I
could test various wal_sync_methods.  This is a 24-core Intel Xeon
machine with 72GB of ram, and 8 internal 10K SAS disks attached to a
raid controller with 512MB BBU write cache.  2 of the disks are in a
RAID1, which supports both an Ext4 partition and an XFS partition.  The
remaining disks are in a RAID10 which only supports a single pgdata
partition.

This is running on RHEL6, Linux Kernel: 2.6.32-71.el6.x86_64

I think this kind of a system much better represents our users who are
performance-conscious than testing on people's laptops or on VMs does.

I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory.  I think
the second change should be permanent; I imagine that a lot of people
who are running test_fsync are not aware that they're actually testing
the performance of /var/tmp, not whatever FS mount they wanted to test.

Here's the results.  I think you'll agree that, at least on Linux, the
benefits of o_sync and o_dsync as defaults would be highly questionable.Particularly, it seems that if O_DIRECT support
isabsent, fdatasync is
 
across-the-board faster:

=============

test_fsync with directIO, on 2 drives, XFS tuned:

Loops = 10000

Simple write:       8k write                      198629.457/second

Compare file sync methods using one write:       open_datasync 8k write        14798.263/second       open_sync 8k
write           14316.864/second       8k write, fdatasync           12198.871/second       8k write, fsync
 12371.843/second
 

Compare file sync methods using two writes:       2 open_datasync 8k writes      7362.805/second       2 open_sync 8k
writes         7156.685/second       8k write, 8k write, fdatasync 10613.525/second       8k write, 8k write, fsync
10597.396/second

Compare open_sync with different sizes:       open_sync 16k write           13631.816/second       2 open_sync 8k
writes         7645.038/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close        11427.096/second       8k write, close, fsync
11321.220/second


test_fsync with directIO, on 6 drives RAID10, XFS tuned:

Loops = 10000

Simple write:       8k write                      196494.537/second

Compare file sync methods using one write:       open_datasync 8k write        14909.974/second       open_sync 8k
write           14559.326/second       8k write, fdatasync           11046.025/second       8k write, fsync
 11046.916/second
 

Compare file sync methods using two writes:       2 open_datasync 8k writes      7349.223/second       2 open_sync 8k
writes         7667.395/second       8k write, 8k write, fdatasync  9560.495/second       8k write, 8k write, fsync
9557.287/second
 

Compare open_sync with different sizes:       open_sync 16k write           12060.049/second       2 open_sync 8k
writes         7650.746/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close         9377.107/second       8k write, close, fsync
9251.233/second



test_fsync without directIO on RAID1, Ext4, data=journal:

Loops = 10000

Simple write:       8k write                      150514.005/second

Compare file sync methods using one write:       open_datasync 8k write         4012.070/second       open_sync 8k
write            5476.898/second       8k write, fdatasync            5512.649/second       8k write, fsync
  5803.814/second
 

Compare file sync methods using two writes:       2 open_datasync 8k writes      2910.401/second       2 open_sync 8k
writes         2817.377/second       8k write, 8k write, fdatasync  5041.608/second       8k write, 8k write, fsync
5155.248/second
 

Compare open_sync with different sizes:       open_sync 16k write            4895.956/second       2 open_sync 8k
writes         2720.875/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close         4724.052/second       8k write, close, fsync
4694.776/second


test_fsync without directIO on RAID1, XFS, tuned:

Loops = 10000

Simple write:       8k write                      199796.208/second

Compare file sync methods using one write:       open_datasync 8k write        12553.525/second       open_sync 8k
write           12535.978/second       8k write, fdatasync           12268.298/second       8k write, fsync
 12305.875/second
 

Compare file sync methods using two writes:       2 open_datasync 8k writes      6323.835/second       2 open_sync 8k
writes         6285.169/second       8k write, 8k write, fdatasync 10893.756/second       8k write, 8k write, fsync
10752.607/second

Compare open_sync with different sizes:       open_sync 16k write           11053.510/second       2 open_sync 8k
writes         6293.270/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close        11087.482/second       8k write, close, fsync
11157.477/second


test_fsync without directIO on RAID10, 6 drives, XFS Tuned:

Loops = 10000

Simple write:       8k write                      197262.003/second

Compare file sync methods using one write:       open_datasync 8k write        12784.699/second       open_sync 8k
write           12684.512/second       8k write, fdatasync           12404.547/second       8k write, fsync
 12452.757/second
 

Compare file sync methods using two writes:       2 open_datasync 8k writes      6376.587/second       2 open_sync 8k
writes         6364.113/second       8k write, 8k write, fdatasync  9895.699/second       8k write, 8k write, fsync
9866.886/second
 

Compare open_sync with different sizes:       open_sync 16k write           10156.491/second       2 open_sync 8k
writes         6400.889/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close        11142.620/second       8k write, close, fsync
11076.393/second



--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
All,

While I have this machine available I've been trying to run some 
performance tests using pgbench and various wal_sync_methods.  However, 
I seem to be maxing out at the speed of pgbench itself; no matter which 
wal_sync_method I use (including "fsync"), it tops out at around 2750 TPS.

Of course, it's also possible that the wal_sync_method does not in fact 
make a difference in throughput.

--                                   -- Josh Berkus                                     PostgreSQL Experts Inc.
                           http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Greg Smith
Date:
Josh Berkus wrote:
> I modified test_fsync in two ways to run this; first, to make it support
> O_DIRECT, and second to make it run in the *current* directory.

Patch please?  I agree with the latter change; what test_fsync does is 
surprising.

I suggested a while ago that we refactor test_fsync to use a common set 
of source code as the database itself for detecting things related to 
wal_sync_method, perhaps just extract that whole set of DEFINE macro 
logic to somewhere else.  That happened at a bad time in the development 
cycle (right before a freeze) and nobody ever got back to the idea 
afterwards.  If this code is getting touched, and it's clear it is in 
some direction, I'd like to see things change so it's not possible for 
the two to diverge again afterwards.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us




Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
On 12/5/10 2:12 PM, Greg Smith wrote:
> Josh Berkus wrote:
>> I modified test_fsync in two ways to run this; first, to make it support
>> O_DIRECT, and second to make it run in the *current* directory.
>
> Patch please?  I agree with the latter change; what test_fsync does is
> surprising.

Attached.

Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a seperate option.

> I suggested a while ago that we refactor test_fsync to use a common set
> of source code as the database itself for detecting things related to
> wal_sync_method, perhaps just extract that whole set of DEFINE macro
> logic to somewhere else.  That happened at a bad time in the development
> cycle (right before a freeze) and nobody ever got back to the idea
> afterwards.  If this code is getting touched, and it's clear it is in
> some direction, I'd like to see things change so it's not possible for
> the two to diverge again afterwards.

I don't quite follow you.  Maybe nobody else did last time, either.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Attachment

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Making it support O_DIRECT would be possible but more complex; I don't
> see the point unless we think we're going to have open_sync_with_odirect
> as a seperate option.

Whether it's complex or not isn't really the issue.  The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Andrew Dunstan
Date:

On 12/06/2010 08:38 PM, Tom Lane wrote:
> Josh Berkus<josh@agliodbs.com>  writes:
>> Making it support O_DIRECT would be possible but more complex; I don't
>> see the point unless we think we're going to have open_sync_with_odirect
>> as a seperate option.
> Whether it's complex or not isn't really the issue.  The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.
>
>             

Indeed. We were quite confused for a while when we were dealing with 
this about a week ago, and my handwritten test program failed as 
expected but test_fsync didn't. Anything other than behaving just as the 
backend does violates POLA, in my view.

cheers

andrew


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
> Whether it's complex or not isn't really the issue.  The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.

OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.What should I have it do instead?

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
>  What should I have it do instead?

Report that it fails, and keep testing the other methods.
        regards, tom lane


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
On 12/6/10 6:13 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
>>  What should I have it do instead?
>
> Report that it fails, and keep testing the other methods.

Patch attached.  Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards.  Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test?  AFAIK,
this doesn't match any behavior which PostgreSQL has.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%.  The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user.  Would be easy
to revert if you think it's a bad idea.
    Possibly auto-sizing the number of loops based on the first fsync test
might be a good idea, but seems like going a bit too far.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Attachment

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Bruce Momjian
Date:
Josh Berkus wrote:
> On 12/6/10 6:13 PM, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
> >>  What should I have it do instead?
> > 
> > Report that it fails, and keep testing the other methods.
> 
> Patch attached.  Includes a fair amount of comment cleanup, since
> existing comments did not meet our current project standards.  Tests all
> 6 of the methods we support separately.
> 
> Some questions, though:
> 
> (1) Why are we doing the open_sync different-size write test?  AFAIK,
> this doesn't match any behavior which PostgreSQL has.

I did that so we could see the impact of doing 2 8k writes that were
both fsync'ed vs doing one 16k write and then fsync:
Compare open_sync with different sizes:       open_sync 16k write             201.323/second       2 open_sync 8k
writes          332.466/second
 

We often write multiple 8k WAL pages and then fsync on commit.

> (2) In this patch, I'm stepping down the number of loops which
> fsync_writethrough does by 90%.  The reason for that was that on the
> platforms where I tested writethrough (desktop machines), doing 10,000
> loops took 15-20 *minutes*, which seems hard on the user.  Would be easy
> to revert if you think it's a bad idea.
>     Possibly auto-sizing the number of loops based on the first fsync test
> might be a good idea, but seems like going a bit too far.

Sure, I recently increased the number, probably too much.

> (3) Should the multi-descriptor test be using writethrough on platforms
> which support it?

Uh, I didn't think that would matter because the test is to test kernel
behavior of writing to one file descriptor and fsyncing using another.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Bruce Momjian
Date:
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Making it support O_DIRECT would be possible but more complex; I don't
> > see the point unless we think we're going to have open_sync_with_odirect
> > as a seperate option.
> 
> Whether it's complex or not isn't really the issue.  The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.

The problem is that O_DIRECT was not implemented in macros but rather
down in the code:
   if (!XLogIsNeeded() && !am_walreceiver)       o_direct_flag = PG_O_DIRECT;

Which means if we just export the macros, we would still not have caught
this.  I would like to share all the defines --- I am just saying it
isn't trivial.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
> Which means if we just export the macros, we would still not have caught
> this.  I would like to share all the defines --- I am just saying it
> isn't trivial.

I just called all the define variables manually rather than relying on
the macros.  Seemed to work fine.

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Josh Berkus
Date:
Greg, All:

Results for Solaris 10u8, on ZFS on a 7-drive attached storage array:

bash-3.00# ./test_fsync -f /dbdata/pgdata/test.out
Loops = 10000

Simple write:       8k write                      59988.002/second

Compare file sync methods using one write:       open_datasync 8k write          214.125/second       (unavailable:
o_direct)      open_sync 8k write              222.155/second       (unavailable: o_direct)       8k write, fdatasync
         214.086/second       8k write, fsync                 215.035/second       (unavailable: fsync_writethrough)
 

Compare file sync methods using two writes:       2 open_datasync 8k writes       108.227/second       (unavailable:
o_direct)      2 open_sync 8k writes           106.935/second       (unavailable: o_direct)       8k write, 8k write,
fdatasync  205.525/second       8k write, 8k write, fsync       210.483/second       (unavailable: fsync_writethrough)
 

Compare open_sync with different sizes:       open_sync 16k write             211.481/second       2 open_sync 8k
writes          106.202/second
 

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       8k write, fsync, close          207.499/second       8k write, close, fsync
213.656/second




--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

From
Bruce Momjian
Date:
Josh Berkus wrote:
> On 12/6/10 6:13 PM, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
> >>  What should I have it do instead?
> >
> > Report that it fails, and keep testing the other methods.
>
> Patch attached.  Includes a fair amount of comment cleanup, since
> existing comments did not meet our current project standards.  Tests all
> 6 of the methods we support separately.
>
> Some questions, though:
>
> (1) Why are we doing the open_sync different-size write test?  AFAIK,
> this doesn't match any behavior which PostgreSQL has.

I added program output to explain this.

> (2) In this patch, I'm stepping down the number of loops which
> fsync_writethrough does by 90%.  The reason for that was that on the
> platforms where I tested writethrough (desktop machines), doing 10,000
> loops took 15-20 *minutes*, which seems hard on the user.  Would be easy
> to revert if you think it's a bad idea.
>     Possibly auto-sizing the number of loops based on the first fsync test
> might be a good idea, but seems like going a bit too far.

I did not know why writethough we always be much slower than other sync
methods so I just reduced the loop count to 2k.

> (3) Should the multi-descriptor test be using writethrough on platforms
> which support it?

Thank you for your patch.  I have applied most of it, attached.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile
index fe3e626..44419ee 100644
*** /tmp/pgdiff.17122/M3047c_Makefile    Sat Jan 15 11:53:43 2011
--- src/tools/fsync/Makefile    Fri Jan 14 22:35:30 2011
*************** override CPPFLAGS := -I$(libpq_srcdir) $
*** 16,24 ****

  OBJS= test_fsync.o

! all: test_fsync

! test_fsync: test_fsync.o | submake-libpq submake-libpgport
      $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

  clean distclean maintainer-clean:
--- 16,24 ----

  OBJS= test_fsync.o

! all: submake-libpq submake-libpgport test_fsync

! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a
      $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

  clean distclean maintainer-clean:
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index 6d9acd3..a1c2ae4 100644
*** /tmp/pgdiff.17122/wFeqrc_README    Sat Jan 15 11:53:43 2011
--- src/tools/fsync/README    Sat Jan 15 11:34:58 2011
***************
*** 1,11 ****
! src/tools/fsync/README
!
! fsync
! =====

  This program tests fsync.  The tests are described as part of the program output.

      Usage:    test_fsync [-f filename] [loops]

- Loops defaults to 5000.  The default output file is /var/tmp/test_fsync.out.
- Consider that /tmp or /var/tmp might be memory-based file systems.
--- 1,20 ----
! test_fsync
! ==========

  This program tests fsync.  The tests are described as part of the program output.

      Usage:    test_fsync [-f filename] [loops]
+
+ test_fsync is intended to give you a reasonable idea of what the fastest
+ fsync_method is on your specific system, as well as supplying diagnostic
+ information in the event of an identified I/O problem.  However,
+ differences shown by test_fsync might not make any difference in real
+ database throughput, especially since many database servers are not
+ speed-limited by their transaction logs.
+
+ The output filename defaults to test_fsync.out in the current directory.
+ test_fsync should be run in the same filesystem as your transaction log
+ directory (pg_xlog).
+
+ Loops default to 2000.  Increase this to get more accurate measurements.

diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..831e2a0 100644
*** /tmp/pgdiff.17122/OSeljb_test_fsync.c    Sat Jan 15 11:53:43 2011
--- src/tools/fsync/test_fsync.c    Sat Jan 15 11:52:03 2011
***************
*** 3,9 ****
   *
   *
   *    test_fsync.c
!  *        test various fsync() methods
   */

  #include "postgres.h"
--- 3,9 ----
   *
   *
   *    test_fsync.c
!  *        tests all supported fsync() methods
   */

  #include "postgres.h"
***************
*** 22,40 ****
  #include <unistd.h>
  #include <string.h>

!
! #ifdef WIN32
  #define FSYNC_FILENAME    "./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME    "/var/tmp/test_fsync.out"
- #endif

  #define WRITE_SIZE    (8 * 1024)    /* 8k */

  #define LABEL_FORMAT    "\t%-30s"

! int            loops = 10000;

  void        die(char *str);
  void        print_elapse(struct timeval start_t, struct timeval stop_t);
--- 22,39 ----
  #include <unistd.h>
  #include <string.h>

! /*
!  * put the temp files in the local directory
!  * unless the user specifies otherwise
!  */
  #define FSYNC_FILENAME    "./test_fsync.out"

  #define WRITE_SIZE    (8 * 1024)    /* 8k */

  #define LABEL_FORMAT    "\t%-30s"

!
! int            loops = 2000;

  void        die(char *str);
  void        print_elapse(struct timeval start_t, struct timeval stop_t);
*************** void        print_elapse(struct timeval start_
*** 42,55 ****
  int
  main(int argc, char *argv[])
  {
!     struct timeval start_t;
!     struct timeval stop_t;
!     int            tmpfile,
!                 i;
      char       *full_buf = (char *) malloc(XLOG_SEG_SIZE),
!                *buf;
!     char       *filename = FSYNC_FILENAME;

      if (argc > 2 && strcmp(argv[1], "-f") == 0)
      {
          filename = argv[2];
--- 41,54 ----
  int
  main(int argc, char *argv[])
  {
!     struct timeval start_t, stop_t;
!     int            tmpfile, i;
      char       *full_buf = (char *) malloc(XLOG_SEG_SIZE),
!                *buf, *filename = FSYNC_FILENAME;

+     /*
+      * arguments: loops and filename (optional)
+      */
      if (argc > 2 && strcmp(argv[1], "-f") == 0)
      {
          filename = argv[2];
*************** main(int argc, char *argv[])
*** 57,73 ****
          argc -= 2;
      }

!     if (argc > 1)
          loops = atoi(argv[1]);

      for (i = 0; i < XLOG_SEG_SIZE; i++)
          full_buf[i] = random();

      if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
          die("Cannot open output file.");
      if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
          die("write failed");
!     /* fsync now so later fsync's don't have to do it */
      if (fsync(tmpfile) != 0)
          die("fsync failed");
      close(tmpfile);
--- 56,77 ----
          argc -= 2;
      }

!     if (argc > 1)
          loops = atoi(argv[1]);

      for (i = 0; i < XLOG_SEG_SIZE; i++)
          full_buf[i] = random();

+     /*
+      * test if we can open the target file
+      */
      if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
          die("Cannot open output file.");
      if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
          die("write failed");
!     /*
!      * fsync now so that dirty buffers don't skew later tests
!      */
      if (fsync(tmpfile) != 0)
          die("fsync failed");
      close(tmpfile);
*************** main(int argc, char *argv[])
*** 77,83 ****
      printf("Loops = %d\n\n", loops);

      /*
!      * Simple write
       */
      printf("Simple write:\n");
      printf(LABEL_FORMAT, "8k write");
--- 81,87 ----
      printf("Loops = %d\n\n", loops);

      /*
!      * Test a simple write without fsync
       */
      printf("Simple write:\n");
      printf(LABEL_FORMAT, "8k write");
*************** main(int argc, char *argv[])
*** 95,104 ****
      print_elapse(start_t, stop_t);

      /*
!      * Compare file sync methods with one 8k write
       */
      printf("\nCompare file sync methods using one write:\n");

  #ifdef OPEN_DATASYNC_FLAG
      printf(LABEL_FORMAT, "open_datasync 8k write");
      fflush(stdout);
--- 99,111 ----
      print_elapse(start_t, stop_t);

      /*
!      * Test all fsync methods using single 8k writes
       */
      printf("\nCompare file sync methods using one write:\n");

+     /*
+      * Test open_datasync if available
+      */
  #ifdef OPEN_DATASYNC_FLAG
      printf(LABEL_FORMAT, "open_datasync 8k write");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 115,124 ****
--- 122,161 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+     /*
+      * If O_DIRECT is enabled, test that with open_datasync
+      */
+ #if PG_O_DIRECT != 0
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+         printf("\t(unavailable: o_direct on this filesystem)\n");
+     else
+     {
+         printf(LABEL_FORMAT, "open_datasync 8k direct I/O write");
+         gettimeofday(&start_t, NULL);
+         for (i = 0; i < loops; i++)
+         {
+             if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+                 die("write failed");
+             if (lseek(tmpfile, 0, SEEK_SET) == -1)
+                 die("seek failed");
+         }
+         gettimeofday(&stop_t, NULL);
+         close(tmpfile);
+         print_elapse(start_t, stop_t);
+     }
+ #else
+         printf("\t(unavailable: o_direct)\n");
+ #endif
+
  #else
      printf("\t(unavailable: open_datasync)\n");
  #endif

+ /*
+  * Test open_sync if available
+  */
  #ifdef OPEN_SYNC_FLAG
      printf(LABEL_FORMAT, "open_sync 8k write");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 135,144 ****
--- 172,211 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+     /*
+      * If O_DIRECT is enabled, test that with open_sync
+      */
+ #if PG_O_DIRECT != 0
+     printf(LABEL_FORMAT, "open_sync 8k direct I/O write");
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+         printf("\t(unavailable: o_direct on this filesystem)\n");
+     else
+     {
+         gettimeofday(&start_t, NULL);
+         for (i = 0; i < loops; i++)
+         {
+             if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+                 die("write failed");
+             if (lseek(tmpfile, 0, SEEK_SET) == -1)
+                 die("seek failed");
+         }
+         gettimeofday(&stop_t, NULL);
+         close(tmpfile);
+         print_elapse(start_t, stop_t);
+     }
+ #else
+     printf("\t(unavailable: o_direct)\n");
+ #endif
+
  #else
      printf("\t(unavailable: open_sync)\n");
  #endif

+ /*
+  * Test fdatasync if available
+  */
  #ifdef HAVE_FDATASYNC
      printf(LABEL_FORMAT, "8k write, fdatasync");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 160,165 ****
--- 227,235 ----
      printf("\t(unavailable: fdatasync)\n");
  #endif

+ /*
+  * Test fsync
+  */
      printf(LABEL_FORMAT, "8k write, fsync");
      fflush(stdout);
      if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 177,190 ****
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);

      /*
!      * Compare file sync methods with two 8k write
       */
      printf("\nCompare file sync methods using two writes:\n");

  #ifdef OPEN_DATASYNC_FLAG
!     printf(LABEL_FORMAT, "2 open_datasync 8k writes");
      fflush(stdout);
      if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
          die("Cannot open output file.");
--- 247,289 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+ /*
+  * If fsync_writethrough is available, test as well
+  */
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+     printf(LABEL_FORMAT, "8k write, fsync_writethrough");
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+         die("Cannot open output file.");
+     gettimeofday(&start_t, NULL);
+     for (i = 0; i < loops; i++)
+     {
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (fcntl(tmpfile, F_FULLFSYNC ) != 0)
+             die("fsync failed");
+         if (lseek(tmpfile, 0, SEEK_SET) == -1)
+             die("seek failed");
+     }
+     gettimeofday(&stop_t, NULL);
+     close(tmpfile);
+     print_elapse(start_t, stop_t);
+ #else
+     printf("\t(unavailable: fsync_writethrough)\n");
+ #endif

      /*
!      * Compare some of the file sync methods with
!      * two 8k writes to see if timing is different
       */
      printf("\nCompare file sync methods using two writes:\n");

+ /*
+  * Test open_datasync with and without o_direct
+  */
  #ifdef OPEN_DATASYNC_FLAG
!      printf(LABEL_FORMAT, "2 open_datasync 8k writes");
      fflush(stdout);
      if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
          die("Cannot open output file.");
*************** main(int argc, char *argv[])
*** 201,210 ****
--- 300,335 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+ #if PG_O_DIRECT != 0
+     printf(LABEL_FORMAT, "2 open_datasync direct I/O 8k writes");
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+         die("Cannot open output file.");
+     gettimeofday(&start_t, NULL);
+     for (i = 0; i < loops; i++)
+     {
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (lseek(tmpfile, 0, SEEK_SET) == -1)
+             die("seek failed");
+     }
+     gettimeofday(&stop_t, NULL);
+     close(tmpfile);
+     print_elapse(start_t, stop_t);
+ #else
+         printf("\t(unavailable: o_direct)\n");
+ #endif
+
  #else
      printf("\t(unavailable: open_datasync)\n");
  #endif

+ /*
+  * Test open_sync with and without o_direct
+  */
  #ifdef OPEN_SYNC_FLAG
      printf(LABEL_FORMAT, "2 open_sync 8k writes");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 223,230 ****
--- 348,383 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+ #if PG_O_DIRECT != 0
+     printf(LABEL_FORMAT, "2 open_sync direct I/O 8k writes");
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+         die("Cannot open output file.");
+     gettimeofday(&start_t, NULL);
+     for (i = 0; i < loops; i++)
+     {
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (lseek(tmpfile, 0, SEEK_SET) == -1)
+             die("seek failed");
+     }
+     gettimeofday(&stop_t, NULL);
+     close(tmpfile);
+     print_elapse(start_t, stop_t);
+ #else
+     printf("\t(unavailable: o_direct)\n");
+ #endif
+
+ #else
+     printf("\t(unavailable: open_sync)\n");
  #endif

+ /*
+  *    Test fdatasync
+  */
  #ifdef HAVE_FDATASYNC
      printf(LABEL_FORMAT, "8k write, 8k write, fdatasync");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 248,253 ****
--- 401,409 ----
      printf("\t(unavailable: fdatasync)\n");
  #endif

+ /*
+  * Test basic fsync
+  */
      printf(LABEL_FORMAT, "8k write, 8k write, fsync");
      fflush(stdout);
      if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 267,278 ****
--- 423,466 ----
      gettimeofday(&stop_t, NULL);
      close(tmpfile);
      print_elapse(start_t, stop_t);
+
+ /*
+  * Test fsync_writethrough if available
+  */
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+     printf(LABEL_FORMAT, "8k write, 8k write, fsync_writethrough");
+     fflush(stdout);
+     if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+         die("Cannot open output file.");
+     gettimeofday(&start_t, NULL);
+     for (i = 0; i < loops; i++)
+     {
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+             die("write failed");
+         if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+             die("fsync failed");
+         if (lseek(tmpfile, 0, SEEK_SET) == -1)
+             die("seek failed");
+     }
+     gettimeofday(&stop_t, NULL);
+     close(tmpfile);
+     print_elapse(start_t, stop_t);
+ #else
+     printf("\t(unavailable: fsync_writethrough)\n");
+ #endif

      /*
       * Compare 1 to 2 writes
       */
      printf("\nCompare open_sync with different sizes:\n");
+     printf("(This is designed to compare the cost of one large\n");
+     printf("sync'ed write and two smaller sync'ed writes.)\n");

+ /*
+  * Test open_sync with different size files
+  */
  #ifdef OPEN_SYNC_FLAG
      printf(LABEL_FORMAT, "open_sync 16k write");
      fflush(stdout);
*************** main(int argc, char *argv[])
*** 312,323 ****
  #endif

      /*
!      * Fsync another file descriptor?
       */
      printf("\nTest if fsync on non-write file descriptor is honored:\n");
      printf("(If the times are similar, fsync() can sync data written\n");
      printf("on a different descriptor.)\n");

      printf(LABEL_FORMAT, "8k write, fsync, close");
      fflush(stdout);
      gettimeofday(&start_t, NULL);
--- 500,519 ----
  #endif

      /*
!      * Test whether fsync can sync data written on a different
!      * descriptor for the same file.  This checks the efficiency
!      * of multi-process fsyncs against the same file.
!      * Possibly this should be done with writethrough on platforms
!      * which support it.
       */
      printf("\nTest if fsync on non-write file descriptor is honored:\n");
      printf("(If the times are similar, fsync() can sync data written\n");
      printf("on a different descriptor.)\n");

+     /*
+      * first write, fsync and close, which is the
+      * normal behavior without multiple descriptors
+      */
      printf(LABEL_FORMAT, "8k write, fsync, close");
      fflush(stdout);
      gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 330,343 ****
          if (fsync(tmpfile) != 0)
              die("fsync failed");
          close(tmpfile);
          if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
              die("Cannot open output file.");
-         /* do nothing but the open/close the tests are consistent. */
          close(tmpfile);
      }
      gettimeofday(&stop_t, NULL);
      print_elapse(start_t, stop_t);

       printf(LABEL_FORMAT, "8k write, close, fsync");
       fflush(stdout);
      gettimeofday(&start_t, NULL);
--- 526,547 ----
          if (fsync(tmpfile) != 0)
              die("fsync failed");
          close(tmpfile);
+         /*
+          * open and close the file again to be consistent
+          * with the following test
+          */
          if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
              die("Cannot open output file.");
          close(tmpfile);
      }
      gettimeofday(&stop_t, NULL);
      print_elapse(start_t, stop_t);

+     /*
+      * Now open, write, close, open again and fsync
+      * This simulates processes fsyncing each other's
+      * writes.
+      */
       printf(LABEL_FORMAT, "8k write, close, fsync");
       fflush(stdout);
      gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 358,375 ****
      gettimeofday(&stop_t, NULL);
      print_elapse(start_t, stop_t);

!     /* cleanup */
      free(full_buf);
      unlink(filename);

      return 0;
  }

  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
      double        total_time = (stop_t.tv_sec - start_t.tv_sec) +
-     /* usec subtraction might be negative, e.g. 5.4 - 4.8 */
      (stop_t.tv_usec - start_t.tv_usec) * 0.000001;
      double        per_second = loops / total_time;

--- 562,583 ----
      gettimeofday(&stop_t, NULL);
      print_elapse(start_t, stop_t);

!     /*
!      * cleanup
!      */
      free(full_buf);
      unlink(filename);

      return 0;
  }

+ /*
+  * print out the writes per second for tests
+  */
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
      double        total_time = (stop_t.tv_sec - start_t.tv_sec) +
      (stop_t.tv_usec - start_t.tv_usec) * 0.000001;
      double        per_second = loops / total_time;