Thread: Final(?) proposal for wal_sync_method changes

Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
After reviewing the two ongoing threads about fixing the wal_sync_method
fiasco, I think there is general agreement on these two points:

1. open_datasync shouldn't be the default choice
2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC

What I suggest we do about the latter is to invent two new 
wal_sync_method values,open_datasync_directopen_sync_direct
which are defined only on platforms that define O_DIRECT (and O_DSYNC
or O_SYNC respectively).  That puts it in the hands of the DBA whether
we try to use O_DIRECT or not.  We'll still keep the hard-wired
optimization of disabling O_DIRECT when archiving or walreceiver are
active.

Dropping open_datasync as the first-choice default is something we have
to back-patch, but I'm less sure about it being a good idea to
back-patch the rearrangement of O_DIRECT management.  Somebody who'd
explicitly specified open_sync or open_datasync as wal_sync_method
would find its behavior changing under him, which might be bad.

Comments?
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Andres Freund
Date:
On Tuesday 07 December 2010 17:24:14 Tom Lane wrote:
> After reviewing the two ongoing threads about fixing the wal_sync_method
> fiasco, I think there is general agreement on these two points:
> 
> 1. open_datasync shouldn't be the default choice
> 2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC
> 
> What I suggest we do about the latter is to invent two new
> wal_sync_method values,
>     open_datasync_direct
>     open_sync_direct
> which are defined only on platforms that define O_DIRECT (and O_DSYNC
> or O_SYNC respectively).  That puts it in the hands of the DBA whether
> we try to use O_DIRECT or not.  We'll still keep the hard-wired
> optimization of disabling O_DIRECT when archiving or walreceiver are
> active.
> 
> Dropping open_datasync as the first-choice default is something we have
> to back-patch, but I'm less sure about it being a good idea to
> back-patch the rearrangement of O_DIRECT management.  Somebody who'd
> explicitly specified open_sync or open_datasync as wal_sync_method
> would find its behavior changing under him, which might be bad.
I vote for changing the order but not doing the O_DIRECT stuff on the 
backbranches.

As I am not seeing myself or clients of mine ever using any O_*SYNC variant I 
am not strongly opionated about what to do there. But I guess adding those two 
variants is not really much work.

Thanks,

Andres


Re: Final(?) proposal for wal_sync_method changes

From
Josh Berkus
Date:
On 12/07/2010 08:24 AM, Tom Lane wrote:
> Dropping open_datasync as the first-choice default is something we have
> to back-patch, but I'm less sure about it being a good idea to
> back-patch the rearrangement of O_DIRECT management.  Somebody who'd
> explicitly specified open_sync or open_datasync as wal_sync_method
> would find its behavior changing under him, which might be bad.

I agree for the backpatch that we should just swap to fdatasync as 
default, and should not attempt to add the extra options.

In addition to the concerns above, adding new GUCS values in an update 
release is something we should only do if required for a critical 
security or data-loss bug.  And this is neither.

--                                   -- Josh Berkus                                     PostgreSQL Experts Inc.
                           http://www.pgexperts.com
 


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> I agree for the backpatch that we should just swap to fdatasync as 
> default, and should not attempt to add the extra options.

I noticed while updating the documentation for this that the current
documentation is a flat-out lie.  It claims that the preference order
for wal_sync_method is       open_datasync       fdatasync       fsync_writethrough       fsync       open_sync
ie you get the first-listed method that is supported on a given
platform.  But this is not so: actually, fsync_writethrough will
be selected as default ONLY on Windows.  There are other platforms
where the option exists, OS X being the one I have at hand.  The
misstatement is masked on OS X because it also has open_datasync
and fdatasync.  But since we are about to delete open_datasync from
the list, it's possible there are platforms where it will be exposed.

Oh, and just to add insult to injury, the above is what config.sgml
says, but postgresql.conf.sample says something different.

So I'm wondering whether we should correct the code to match the docs,
or vice versa.  The former would just be a matter of saying#elif defined(HAVE_FSYNC_WRITETHROUGH)#define
DEFAULT_SYNC_METHOD       SYNC_METHOD_FSYNC_WRITETHROUGH
 
in place of#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)#define DEFAULT_SYNC_METHOD
SYNC_METHOD_FSYNC_WRITETHROUGH
To do the latter we'd have to say something like "The default is
fdatasync if it exists, else fsync, except on Windows where it is
fsync_writethrough".

Changing the code would result in a sudden, massive performance change
if there are any platforms for which fsync_writethrough exists but not
fdatasync.  But I'm not sure if there are any.

Another point here is that it's not clear why we're selecting a
known-to-be-insecure default on OS X (where in fact all methods except
fsync_writethrough fail to push data to disk).  We've been around on
that before, of course, and maybe now is not the time to change it.

Thoughts?
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Josh Berkus
Date:
On 12/7/10 2:28 PM, Tom Lane wrote:
> Another point here is that it's not clear why we're selecting a
> known-to-be-insecure default on OS X (where in fact all methods except
> fsync_writethrough fail to push data to disk).  We've been around on
> that before, of course, and maybe now is not the time to change it.

Because nobody sane uses OSX on the server?

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 12/7/10 2:28 PM, Tom Lane wrote:
>> Another point here is that it's not clear why we're selecting a
>> known-to-be-insecure default on OS X (where in fact all methods except
>> fsync_writethrough fail to push data to disk).  We've been around on
>> that before, of course, and maybe now is not the time to change it.

> Because nobody sane uses OSX on the server?

Some of us would make the same remark about Windows.  But we go out of
our way to provide a safe default on that platform anyhow.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Andrew Dunstan
Date:

On 12/07/2010 06:11 PM, Tom Lane wrote:
> Josh Berkus<josh@agliodbs.com>  writes:
>> On 12/7/10 2:28 PM, Tom Lane wrote:
>>> Another point here is that it's not clear why we're selecting a
>>> known-to-be-insecure default on OS X (where in fact all methods except
>>> fsync_writethrough fail to push data to disk).  We've been around on
>>> that before, of course, and maybe now is not the time to change it.
>> Because nobody sane uses OSX on the server?
> Some of us would make the same remark about Windows.  But we go out of
> our way to provide a safe default on that platform anyhow.
>
>             


In practice, though, Windows is used a lot on servers and OSX isn't. 
That means we are probably going to have lots less push on this sort of 
thing from the OSX community, which is not to say that we shouldn't try 
to be just as safe on OSX as we try to be everywhere else.

cheers

andrew


Re: Final(?) proposal for wal_sync_method changes

From
"Joshua D. Drake"
Date:
On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > On 12/7/10 2:28 PM, Tom Lane wrote:
> >> Another point here is that it's not clear why we're selecting a
> >> known-to-be-insecure default on OS X (where in fact all methods except
> >> fsync_writethrough fail to push data to disk).  We've been around on
> >> that before, of course, and maybe now is not the time to change it.
> 
> > Because nobody sane uses OSX on the server?
> 
> Some of us would make the same remark about Windows.  But we go out of
> our way to provide a safe default on that platform anyhow.

Not to mention the assertion that people don't use OSX on a server is
patently false. They don't use it on big servers, but it is very popular
for the SMB (big time capital S).

JD

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt



Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
I wrote:
> Some of us would make the same remark about Windows.  But we go out of
> our way to provide a safe default on that platform anyhow.

Oh, wait: that's not the case at all.  As of recent releases, we support
open_datasync on Windows, and that's the default despite being unsafe.
You have to go and choose some nondefault drive setting to make it safe:
http://developer.postgresql.org/pgdocs/postgres/wal-reliability.html

So if we drop open_datasync from the preference list then Windows users
*will* see a sudden huge change in the default behavior.

Because open_datasync does now exist on Windows, this code in
xlogdefs.h:#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)#define DEFAULT_SYNC_METHOD
SYNC_METHOD_FSYNC_WRITETHROUGH
is actually dead code at the moment --- it can never be reached on any
platform.

I am unclear as to the reason why there is a test for
HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
leftover from a previous vision of how this all works?  Or does an
fsync() call actually fail on Windows?

I am now tempted to suggest that HAVE_FSYNC_WRITETHROUGH_ONLY should go
away altogether.  The documented and implemented behavior ought to be
that the default is "fdatasync if it exists, else fsync", full stop,
on every platform.  On both Windows and OS X, you would need to switch
to fsync_writethrough or change OS-level options to get safe behavior;
which is the same as it is today.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Because nobody sane uses OSX on the server?

>> Some of us would make the same remark about Windows.  But we go out of
>> our way to provide a safe default on that platform anyhow.

> Not to mention the assertion that people don't use OSX on a server is
> patently false. They don't use it on big servers, but it is very popular
> for the SMB (big time capital S).

Actually the previous discussions about this are coming back to me now.
With the current code, we don't actually guarantee safe flush behavior
by default on ANY of the common consumer platforms.  In the Linux case
we can't, because we can't monkey with hdparm settings.  (I think the
same is true on BSDen ... anybody know?)  On Windows and OS X we default
to open_datasync despite its not being safe on either platform.  We
previously debated switching those to fsync_writethrough which would
make them safe by default, but decided not to, partly on grounds of the
inevitable ZOMG ITS SLOW backlash and partly on grounds of keeping
cross-platform consistency.

I don't think it's a good idea to reopen the fsync_writethrough debate
right now, certainly not for something we're contemplating
back-patching.  I think what we'd better do is ensure that that is
*not* selected as the default, on either Windows or OS X.  So we need to
get rid of the HAVE_FSYNC_WRITETHROUGH_ONLY hack.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Josh Berkus
Date:
> I am unclear as to the reason why there is a test for
> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
> leftover from a previous vision of how this all works?  Or does an
> fsync() call actually fail on Windows?

No, fsync responds fine.  It just don't actually sync to disk.

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: Final(?) proposal for wal_sync_method changes

From
Christophe Pettus
Date:
On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
> Because nobody sane uses OSX on the server?

The XServe running 10.5 server and 9.0.1 at the other end of the office takes your remark personally. :)

--
-- Christophe Pettus  xof@thebuild.com



Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> I am unclear as to the reason why there is a test for
>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
>> leftover from a previous vision of how this all works?  Or does an
>> fsync() call actually fail on Windows?

> No, fsync responds fine.  It just don't actually sync to disk.

Right, which is also an accurate description of its behavior on OS X,
as well as Linux (if you didn't change hdparm settings).  So the real
question here is what's the point of treating Windows differently.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Josh Berkus
Date:
> Right, which is also an accurate description of its behavior on OS X,
> as well as Linux (if you didn't change hdparm settings).  So the real
> question here is what's the point of treating Windows differently.

So, sounds like we should continue treating fsync_writethrough the same
as we have been, and maybe add a doc patch covering some of the above?

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Right, which is also an accurate description of its behavior on OS X,
>> as well as Linux (if you didn't change hdparm settings).  So the real
>> question here is what's the point of treating Windows differently.

> So, sounds like we should continue treating fsync_writethrough the same
> as we have been, and maybe add a doc patch covering some of the above?

Yeah, this patch is shaping up to be about five lines of code change
and a hundred of docs ...
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Chris Browne
Date:
xof@thebuild.com (Christophe Pettus) writes:
> On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
>> Because nobody sane uses OSX on the server?
>
> The XServe running 10.5 server and 9.0.1 at the other end of the
> office takes your remark personally. :)

I'd heard that Apple had cancelled XServe.  [Poking back at that...]

Yep, they won't be carrying anything for sale that's particularly
rack-mountable after next January.  Not precisely "dead," but definitely
moving on smelling funny...
-- 
(format nil "~S@~S" "cbbrowne" "gmail.com")
http://www3.sympatico.ca/cbbrowne/emacs.html
Photons have mass? I didn't know they were catholic! 


Re: Final(?) proposal for wal_sync_method changes

From
"Joshua D. Drake"
Date:
On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote:
> xof@thebuild.com (Christophe Pettus) writes:
> > On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
> >> Because nobody sane uses OSX on the server?
> >
> > The XServe running 10.5 server and 9.0.1 at the other end of the
> > office takes your remark personally. :)
> 
> I'd heard that Apple had cancelled XServe.  [Poking back at that...]
> 
> Yep, they won't be carrying anything for sale that's particularly
> rack-mountable after next January.  Not precisely "dead," but definitely
> moving on smelling funny...

A bit off topic but Apple is actually marketing OSX Server on the mini
(with RAID 1). Which honestly for 95% of the businesses out there would
make a very nice, reasonably performant database server for say....
PostBooks, or Drupal.

JD

> -- 
> (format nil "~S@~S" "cbbrowne" "gmail.com")
> http://www3.sympatico.ca/cbbrowne/emacs.html
> Photons have mass? I didn't know they were catholic! 
> 

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt



Re: Final(?) proposal for wal_sync_method changes

From
Andrew Dunstan
Date:

On 12/07/2010 07:32 PM, Joshua D. Drake wrote:
> On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote:
>> xof@thebuild.com (Christophe Pettus) writes:
>>> On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote:
>>>> Because nobody sane uses OSX on the server?
>>> The XServe running 10.5 server and 9.0.1 at the other end of the
>>> office takes your remark personally. :)
>> I'd heard that Apple had cancelled XServe.  [Poking back at that...]
>>
>> Yep, they won't be carrying anything for sale that's particularly
>> rack-mountable after next January.  Not precisely "dead," but definitely
>> moving on smelling funny...
> A bit off topic but Apple is actually marketing OSX Server on the mini
> (with RAID 1). Which honestly for 95% of the businesses out there would
> make a very nice, reasonably performant database server for say....
> PostBooks, or Drupal.
>

Given the constant overheating I have had on my mini, I wouldn't use it 
to host anything much ;-) But I guess YMMV.

cheers

andrerw


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>>> I am unclear as to the reason why there is a test for
>>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
>>> leftover from a previous vision of how this all works?  Or does an
>>> fsync() call actually fail on Windows?

>> No, fsync responds fine.  It just don't actually sync to disk.

Sigh ... The closer I look at the Windows code path here, the more of an
inconsistent, badly documented spaghetti-heap it appears to be.  So far
as a quick Google search unearths, there is no fsync() primitive on
Windows.  What we have actually got is this gem in port/win32.h:

/**    Even though we don't support 'fsync' as a wal_sync_method,*    we do fsync() a few other places where _commit()
isjust fine.*/
 
#define fsync(fd) _commit(fd)

So actually, there is no difference between selecting fsync and
fsync_writethrough on Windows, this comment and the SGML documentation
to the contrary.  Both settings result in invoking _commit() and
presumably are safe.  One wonders why we bothered to invent a separate
fsync_writethrough setting on Windows.

What this means is that switching to a simple preference order
"fdatasync, then fsync" will result in choosing fsync on Windows (since
it hasn't got fdatasync), meaning _commit, meaning Windows users see
a behavioral change after all.

I'm afraid that if we don't want a major behavioral change, there's
no option except having a Windows-specific rule for the choice of
default.  It'll have to be "fdatasync, then fsync, except on Windows
where open_datasync is the default".  Grumble.  But it's not like
Windows hasn't got a hundred other special cases already.

Would someone verify via pgbench or similar test (*not* test_fsync) that
on Windows, wal_sync_method = fsync or fsync_writethrough perform the
same (ie tps ~= disk rotation rate) while open_datasync is too fast to
be real?  I'm losing confidence that I've found all the spaghetti ends
here, and I don't have a Windows setup to try it myself.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Greg Smith
Date:
Tom Lane wrote:
> So actually, there is no difference between selecting fsync and
> fsync_writethrough on Windows, this comment and the SGML documentation
> to the contrary.  Both settings result in invoking _commit() and
> presumably are safe.  One wonders why we bothered to invent a separate
> fsync_writethrough setting on Windows.
>   

Quite; I documented some the details about mapping to _commit() long ago 
at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm but 
forgot to suggest fixing the mistakes in the docs afterwards (Windows is 
not exactly my favorite platform). 
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php 
explains some of the history I think you're looking for here.

> Would someone verify via pgbench or similar test (*not* test_fsync) that
> on Windows, wal_sync_method = fsync or fsync_writethrough perform the
> same (ie tps ~= disk rotation rate) while open_datasync is too fast to
> be real?  I'm losing confidence that I've found all the spaghetti ends
> here, and I don't have a Windows setup to try it myself.
>   

I can look into this tomorrow.  The laptop I posted Ubuntu/RHEL6 
test_fsync numbers from before also boots into Vista, so I can compare 
all those platforms on the same hardware.  I just need to be aware of 
the slightly different sequential speeds on each partition of the drive.

As far as your major battle plan goes, I think what we should do is find 
the simplest possible patch that just fixes the newer Linux kernel 
problem, preferrably without changing any other platform, then commit 
that to HEAD and appropriate backports.  Then the larger O_DIRECT 
remapping can proceed forward after that, along with cleanup to the 
writethrough choices and unifying test_fsync against the server.  I 
wrote a patch that shuffled around a lot of this code last night, but 
the first thing I coded was junk because of some mistaken assumptions.  
Have been coming to same realizations about how messy this really is you 
have.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Final(?) proposal for wal_sync_method changes

From
Magnus Hagander
Date:
On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>>> I am unclear as to the reason why there is a test for
>>>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync().  Perhaps that is also
>>>> leftover from a previous vision of how this all works?  Or does an
>>>> fsync() call actually fail on Windows?
>
>>> No, fsync responds fine.  It just don't actually sync to disk.

First of all a warning - I'm writing this on way too little sleep :-)
Blame pgday.eu...


> Sigh ... The closer I look at the Windows code path here, the more of an
> inconsistent, badly documented spaghetti-heap it appears to be.  So far
> as a quick Google search unearths, there is no fsync() primitive on
> Windows.  What we have actually got is this gem in port/win32.h:

Correct.


> /*
>  *      Even though we don't support 'fsync' as a wal_sync_method,
>  *      we do fsync() a few other places where _commit() is just fine.
>  */
> #define fsync(fd) _commit(fd)
>
> So actually, there is no difference between selecting fsync and
> fsync_writethrough on Windows, this comment and the SGML documentation
> to the contrary.  Both settings result in invoking _commit() and
> presumably are safe.  One wonders why we bothered to invent a separate
> fsync_writethrough setting on Windows.

IIRC, using _commit(fd) *is* fsync_writethrough. That's what we
shipped with. It even writes through the cache on a RAID controller
that has BBU'ed write-cache. We had to implement the *other* options
in order to "lower" the safety (it doesn't actually lower the safety
*if* you have a BBU, which is a very good use-case for those options)


> What this means is that switching to a simple preference order
> "fdatasync, then fsync" will result in choosing fsync on Windows (since
> it hasn't got fdatasync), meaning _commit, meaning Windows users see
> a behavioral change after all.

_commit() != fsync()

I think this is the discussion and subsequent changes:

http://archives.postgresql.org/pgsql-patches/2005-03/msg00230.php



> Would someone verify via pgbench or similar test (*not* test_fsync) that
> on Windows, wal_sync_method = fsync or fsync_writethrough perform the
> same (ie tps ~= disk rotation rate) while open_datasync is too fast to
> be real?  I'm losing confidence that I've found all the spaghetti ends
> here, and I don't have a Windows setup to try it myself.

Please note that if you're re-verifying this, verify both on crappy
disk *and* on a proper BBU'ed RAID-controller. The reason for this
originally was that we performed about the same in those two, wihch
made no sense...

Merlin, IIRC you did a lot of the testing around this - do you recall
any more details?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ win32.h says ]
>> #define fsync(fd) _commit(fd)

>> What this means is that switching to a simple preference order
>> "fdatasync, then fsync" will result in choosing fsync on Windows (since
>> it hasn't got fdatasync), meaning _commit, meaning Windows users see
>> a behavioral change after all.

> _commit() != fsync()

Um, the macro quoted above makes them the same, no?  One of us
is confused.
        regards, tom lane


Re: Final(?) proposal for wal_sync_method changes

From
Tom Lane
Date:
Given my concerns around exactly what is going on in the Windows code,
I'm now afraid to mess with an all-platforms change to fdatasync as the
preferred default; if we do that it should probably just be in HEAD not
the back branches.  So I've come around to the idea that Marti's
proposal of a PLATFORM_DEFAULT_SYNC_METHOD symbol is the best way.
(One reason for adopting that rather than some other way is that it
seems quite likely we'll end up needing it for Windows.)

I haven't touched the documentation yet, but attached is a proposed
code patch against HEAD.  This forces the default to fdatasync on Linux,
and makes some cosmetic cleanups around the HAVE_FSYNC_WRITETHROUGH_ONLY
confusion.

            regards, tom lane

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index fd5ec78..4f7dc39 100644
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
*************** static bool looks_like_temp_rel_name(con
*** 260,271 ****
  int
  pg_fsync(int fd)
  {
! #ifndef HAVE_FSYNC_WRITETHROUGH_ONLY
!     if (sync_method != SYNC_METHOD_FSYNC_WRITETHROUGH)
!         return pg_fsync_no_writethrough(fd);
      else
  #endif
!         return pg_fsync_writethrough(fd);
  }


--- 260,272 ----
  int
  pg_fsync(int fd)
  {
!     /* #if is to skip the sync_method test if there's no need for it */
! #if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
!     if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
!         return pg_fsync_writethrough(fd);
      else
  #endif
!         return pg_fsync_no_writethrough(fd);
  }


diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 18b214e..072096d 100644
*** a/src/include/access/xlogdefs.h
--- b/src/include/access/xlogdefs.h
*************** typedef uint32 TimeLineID;
*** 123,134 ****
  #endif
  #endif

! #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN_DSYNC
  #elif defined(HAVE_FDATASYNC)
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FDATASYNC
- #elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)
- #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FSYNC_WRITETHROUGH
  #else
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FSYNC
  #endif
--- 123,134 ----
  #endif
  #endif

! #if defined(PLATFORM_DEFAULT_SYNC_METHOD)
! #define DEFAULT_SYNC_METHOD        PLATFORM_DEFAULT_SYNC_METHOD
! #elif defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN_DSYNC
  #elif defined(HAVE_FDATASYNC)
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FDATASYNC
  #else
  #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FSYNC
  #endif
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index b9498b2..bcaa42d 100644
*** a/src/include/port/linux.h
--- b/src/include/port/linux.h
***************
*** 12,14 ****
--- 12,22 ----
   * to have a kernel version test here.
   */
  #define HAVE_LINUX_EIDRM_BUG
+
+ /*
+  * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
+  * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
+  * perform better and (b) causes outright failures on ext4 data=journal
+  * filesystems, because those don't support O_DIRECT.
+  */
+ #define PLATFORM_DEFAULT_SYNC_METHOD    SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/win32.h b/src/include/port/win32.h
index 3417ab5..9c2ae4d 100644
*** a/src/include/port/win32.h
--- b/src/include/port/win32.h
***************
*** 34,47 ****
  /* Must be here to avoid conflicting with prototype in windows.h */
  #define mkdir(a,b)    mkdir(a)

- #define HAVE_FSYNC_WRITETHROUGH
- #define HAVE_FSYNC_WRITETHROUGH_ONLY
  #define ftruncate(a,b)    chsize(a,b)
  /*
!  *    Even though we don't support 'fsync' as a wal_sync_method,
!  *    we do fsync() a few other places where _commit() is just fine.
   */
! #define fsync(fd) _commit(fd)

  #define USES_WINSOCK

--- 34,51 ----
  /* Must be here to avoid conflicting with prototype in windows.h */
  #define mkdir(a,b)    mkdir(a)

  #define ftruncate(a,b)    chsize(a,b)
+
+ /* Windows doesn't have fsync() as such, use _commit() */
+ #define fsync(fd) _commit(fd)
+
  /*
!  * For historical reasons, we allow setting wal_sync_method to
!  * fsync_writethrough on Windows, even though it's really identical to fsync
!  * (both code paths wind up at _commit()).
   */
! #define HAVE_FSYNC_WRITETHROUGH
! #define FSYNC_WRITETHROUGH_IS_FSYNC

  #define USES_WINSOCK


Re: Final(?) proposal for wal_sync_method changes

From
Greg Smith
Date:
Since any Windows refactoring has been postponed for now (I'll get back
to performance checks on that platform later), during my testing time
this week instead I did a round of pre-release review of the change Tom
has now committed.  All looks good to me, including the docs changes.

I confirmed that:

-Ubuntu system with an older kernel still has the same wal_sync_method
(fdatasync) and performance after pulling the update
-RHEL6 system changes as planned from using open_datasync to fdatasync
once I updated to a HEAD after the commit

On the RHEL6 system, I also checked the commit rate using pgbench with
the attached INSERT only script, rather than relying on test_fsync.
This is 7200 RPM drive, so theoretical max of 120 commits/second, on
ext4; this is the same test setup I described in more detail back in
http://archives.postgresql.org/message-id/4CE2EBF8.4040602@2ndquadrant.com

$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 fdatasync
(1 row)

$ pgbench -i -s 10 pgbench

[gsmith@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6733
tps = 112.208795 (including connections establishing)
tps = 112.216904 (excluding connections establishing)

And then manually switched over to test performance of the troublesome
old default:

[gsmith@meddle ~]$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync

[gsmith@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6672
tps = 111.185802 (including connections establishing)
tps = 111.195089 (excluding connections establishing)

This is interesting, because test_fsync consistently reported a rate of
about half this when using open_datasync instead of the equal
performance I'm getting from the database.  I'll see if I can reproduce
that further, but it's no reason to be concerned about the change that's
been made I think.  Just more evidence that test_fsync has quirks left
to be sorted out.  But that's not backbranch material, it should be part
of 9.1 only refactoring, already in progress via the patch Josh
submitted.  There's a bit of time left to get that done.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);

Re: Final(?) proposal for wal_sync_method changes

From
Josh Berkus
Date:
Greg,

> This is interesting, because test_fsync consistently reported a rate of
> about half this when using open_datasync instead of the equal
> performance I'm getting from the database.  I'll see if I can reproduce
> that further, but it's no reason to be concerned about the change that's
> been made I think.  Just more evidence that test_fsync has quirks left
> to be sorted out.  But that's not backbranch material, it should be part
> of 9.1 only refactoring, already in progress via the patch Josh
> submitted.  There's a bit of time left to get that done.

Did you rerun test_sync with O_DIRECT entabled, using my patch?  The
figures you had from test_fsync earlier were without O_DIRECT.

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com
 


Re: Final(?) proposal for wal_sync_method changes

From
Greg Smith
Date:
Josh Berkus wrote:
> Did you rerun test_sync with O_DIRECT entabled, using my patch?  The
> figures you had from test_fsync earlier were without O_DIRECT.
>   

No--I was just focused on testing the changes that had already been 
committed.  The use of O_DIRECT in the server but not test_fsync could 
very well be the reason for the difference; don't know yet.  I'm trying 
to get through this CF before I start getting distracted by newer 
patches, I'll get to yours soon I hope.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Final(?) proposal for wal_sync_method changes

From
Magnus Hagander
Date:
On Wed, Dec 8, 2010 at 15:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> [ win32.h says ]
>>> #define fsync(fd) _commit(fd)
>
>>> What this means is that switching to a simple preference order
>>> "fdatasync, then fsync" will result in choosing fsync on Windows (since
>>> it hasn't got fdatasync), meaning _commit, meaning Windows users see
>>> a behavioral change after all.
>
>> _commit() != fsync()
>
> Um, the macro quoted above makes them the same, no?  One of us
> is confused.

Uh, yeah. Sorry, that was the unclear:ness from being too preoccupied
with the conference.. Pretty sure I'm the confused one.
.
_commit() is definitely the same as fsync() on the API level. And this
correspond to fsync_writethrough, not fsync, when you talk about the
wal_sync_method parameter. It will always sync through the write
cache, even if it's hardware BBU'ed cache.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


New test_fsync messages for direct I/O

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Greg,
>
> > This is interesting, because test_fsync consistently reported a rate of
> > about half this when using open_datasync instead of the equal
> > performance I'm getting from the database.  I'll see if I can reproduce
> > that further, but it's no reason to be concerned about the change that's
> > been made I think.  Just more evidence that test_fsync has quirks left
> > to be sorted out.  But that's not backbranch material, it should be part
> > of 9.1 only refactoring, already in progress via the patch Josh
> > submitted.  There's a bit of time left to get that done.
>
> Did you rerun test_sync with O_DIRECT entabled, using my patch?  The
> figures you had from test_fsync earlier were without O_DIRECT.

I have modified test_fsync with the attached, applied patch to report
cases where we are testing without O_DIRECT when only O_DIRECT would be
used by the server, and cases where O_DIRECT fails because of the file
system type.   Josh Berkus wanted the first case kept in case we decide
to offer non-direct-io options on machines that support direct i/o.

The new messages are:

    * This non-direct I/O option is not used by Postgres.

    ** This file system and its mount options do not support direct
    I/O, e.g. ext4 in journaled mode.

You can see the first one below in my output from Ubuntu:

    $ ./test_fsync
    Ops-per-test = 2000

    Simple non-sync'ed write:
            8k write                           58.175 ops/sec

    Compare file sync methods using one write:
    (in wal_sync_method preference order, except fdatasync
    is Linux's default)
            open_datasync                                 n/a
            8k write, fdatasync                68.425 ops/sec
            8k write, fsync                    63.932 ops/sec
            fsync_writethrough                            n/a
            open_sync 8k write*                73.785 ops/sec
            open_sync 8k direct I/O write      82.929 ops/sec
    * This non-direct I/O option is not used by Postgres.

    Compare file sync methods using two writes:
    (in wal_sync_method preference order, except fdatasync
    is Linux's default)
            open_datasync                                 n/a
            8k write, 8k write, fdatasync      42.728 ops/sec
            8k write, 8k write, fsync          43.625 ops/sec
            fsync_writethrough                            n/a
            2 open_sync 8k writes*             37.150 ops/sec
            2 open_sync 8k direct I/O writes   43.722 ops/sec
    * This non-direct I/O option is not used by Postgres.

    Compare open_sync with different sizes:
    (This is designed to compare the cost of one large
    sync'ed write and two smaller sync'ed writes.)
            open_sync 16k write                46.428 ops/sec
            2 open_sync 8k writes              38.703 ops/sec

    Test if fsync on non-write file descriptor is honored:
    (If the times are similar, fsync() can sync data written
    on a different descriptor.)
            8k write, fsync, close             65.744 ops/sec
            8k write, close, fsync             63.077 ops/sec

I believe test_fsync now matches the backend code.  If we decide to
change things, it can be adjusted.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index cd2b1f2..562da66 100644
*** /tmp/8wSNdd_test_fsync.c    Sat Jan 15 18:42:58 2011
--- src/tools/fsync/test_fsync.c    Sat Jan 15 18:38:57 2011
*************** void
*** 163,169 ****
  test_sync(int writes_per_op)
  {
      int            tmpfile, ops, writes;
!
      if (writes_per_op == 1)
          printf("\nCompare file sync methods using one write:\n");
      else
--- 163,170 ----
  test_sync(int writes_per_op)
  {
      int            tmpfile, ops, writes;
!     bool        fs_warning = false;
!
      if (writes_per_op == 1)
          printf("\nCompare file sync methods using one write:\n");
      else
*************** test_sync(int writes_per_op)
*** 176,184 ****
       */
  #ifdef OPEN_DATASYNC_FLAG
      if (writes_per_op == 1)
!         printf(LABEL_FORMAT, "open_datasync 8k write");
      else
!          printf(LABEL_FORMAT, "2 open_datasync 8k writes");
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
--- 177,193 ----
       */
  #ifdef OPEN_DATASYNC_FLAG
      if (writes_per_op == 1)
!         printf(LABEL_FORMAT, "open_datasync 8k write"
! #if PG_O_DIRECT != 0
!         "*"
! #endif
!         );
      else
!          printf(LABEL_FORMAT, "2 open_datasync 8k writes"
! #if PG_O_DIRECT != 0
!         "*"
! #endif
!         );
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
*************** test_sync(int writes_per_op)
*** 201,207 ****
       */
  #if PG_O_DIRECT != 0
      if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
!         printf(NA_FORMAT, "o_direct", "n/a on this filesystem\n");
      else
      {
          if (writes_per_op == 1)
--- 210,219 ----
       */
  #if PG_O_DIRECT != 0
      if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
!     {
!         printf(NA_FORMAT, "o_direct", "n/a**\n");
!         fs_warning = true;
!     }
      else
      {
          if (writes_per_op == 1)
*************** test_sync(int writes_per_op)
*** 321,329 ****
   */
  #ifdef OPEN_SYNC_FLAG
      if (writes_per_op == 1)
!         printf(LABEL_FORMAT, "open_sync 8k write");
      else
!         printf(LABEL_FORMAT, "2 open_sync 8k writes");
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG, 0)) == -1)
--- 333,349 ----
   */
  #ifdef OPEN_SYNC_FLAG
      if (writes_per_op == 1)
!         printf(LABEL_FORMAT, "open_sync 8k write"
! #if PG_O_DIRECT != 0
!         "*"
! #endif
!         );
      else
!         printf(LABEL_FORMAT, "2 open_sync 8k writes"
! #if PG_O_DIRECT != 0
!         "*"
! #endif
!         );
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG, 0)) == -1)
*************** test_sync(int writes_per_op)
*** 352,358 ****
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
!         printf(NA_FORMAT, "o_direct", "n/a on this filesystem\n");
      else
      {
          gettimeofday(&start_t, NULL);
--- 372,381 ----
      fflush(stdout);

      if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
!     {
!         printf(NA_FORMAT, "o_direct", "n/a**\n");
!         fs_warning = true;
!     }
      else
      {
          gettimeofday(&start_t, NULL);
*************** test_sync(int writes_per_op)
*** 375,380 ****
--- 398,414 ----
  #else
      printf(NA_FORMAT, "open_sync", "n/a\n");
  #endif
+
+ #if defined(OPEN_DATASYNC_FLAG) || defined(OPEN_SYNC_FLAG)
+     if (PG_O_DIRECT != 0)
+         printf("* This non-direct I/O option is not used by Postgres.\n");
+ #endif
+
+     if (fs_warning)
+     {
+         printf("** This file system and its mount options do not support direct\n");
+         printf("I/O, e.g. ext4 in journaled mode.\n");
+     }
  }

  void
*************** test_open_syncs(void)
*** 389,394 ****
--- 423,430 ----
      printf("(This is designed to compare the cost of one large\n");
      printf("sync'ed write and two smaller sync'ed writes.)\n");

+     /* XXX no PG_O_DIRECT */
+
  /*
   * Test open_sync with different size files
   */