Thread: O_DSYNC broken on MacOS X?
Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG came
out 5 times faster.  The benchmark isn't very thoroughly described, but
it turns out not to matter.

http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html

Apparently, the reason we're faster is that wal_sync_method =
open_datasync, which is the default on MacOS X, doesn't actually work.

[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 1292.258304 (including connections establishing)
tps = 1292.281493 (excluding connections establishing)

Clearly we're not getting 1292 (or even 1292/4) fsyncs per second out of
whatever HD is in my laptop.  So what happens if we change to
fsync_writethrough, which is the equivalent of what InnoDB apparently
does out of the box?

[rhaas pgsql]$ pg_ctl reload
server signaled
LOG:  received SIGHUP, reloading configuration files
LOG:  parameter "wal_sync_method" changed to "fsync_writethrough"
[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 27.845797 (including connections establishing)
tps = 27.845809 (excluding connections establishing)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
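The sanity check behind that "clearly we're not getting 1292 fsyncs per
second" remark can be reproduced with a throwaway write-plus-fsync loop
like the one below.  This is just a sketch (the helper name is invented,
and it is not any tool mentioned in this thread):

```python
import os
import tempfile
import time

def fsyncs_per_second(seconds=2.0):
    """Time a write+fsync loop on a scratch file.

    If fsync() really waits for the platter, a 7200rpm drive can
    complete at most roughly 120 of these cycles per second (about
    one per revolution).  Four-digit results mean some cache is
    absorbing the flush instead.
    """
    fd, path = tempfile.mkstemp()
    try:
        deadline = time.time() + seconds
        count = 0
        while time.time() < deadline:
            os.write(fd, b"commit record\n")
            os.fsync(fd)  # may only reach a volatile cache, per this thread
            count += 1
        return count / seconds
    finally:
        os.close(fd)
        os.remove(path)
```

Comparing the number this reports against the drive's rotational speed
is the same plausibility argument applied to the pgbench tps figures
above.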
On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
> came out 5 times faster.  The benchmark isn't very thoroughly
> described, but it turns out not to matter.
>
> http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html
>
> Apparently, the reason we're faster is that wal_sync_method =
> open_datasync, which is the default on MacOS X, doesn't actually work.

That might be true, but if you check the comments, Jayant replied to say:

@Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
does not support fsync_writethrough
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

So I don't think that invalidates his benchmark.  Something else might,
of course...

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Sep 30, 2010 at 4:09 PM, Dave Page <dpage@pgadmin.org> wrote:
> On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
>> came out 5 times faster.  The benchmark isn't very thoroughly
>> described, but it turns out not to matter.
>>
>> http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html
>>
>> Apparently, the reason we're faster is that wal_sync_method =
>> open_datasync, which is the default on MacOS X, doesn't actually work.
>
> That might be true, but if you check the comments, Jayant replied to say:
>
> @Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
> does not support fsync_writethrough
> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
>
> So I don't think that invalidates his benchmark.  Something else might,
> of course...

Oh, I missed that.  Actually, I wasn't really so concerned with whether
his benchmark is correct.  I *am* concerned about being broken out of
the box on MacOS X.

(I also suspect problems with the benchmark.  It's hard to believe we're
5x faster than InnoDB on an apples-to-apples comparison on trivial
queries.  I'd believe 20% either way, but 5x is a lot.  But that's a
question for another day.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, I missed that.  Actually, I wasn't really so concerned with
> whether his benchmark is correct.  I *am* concerned about being broken
> out of the box on MacOS X.

Actually, the problem with OSX is that OSX is broken out of the box, at
least by that standard.  The system's normal configuration is that
fsync() does nothing, so it's hardly surprising that O_DSYNC is no
better.  You have to use wal_sync_method = fsync_writethrough to get
actual bits-to-the-platter behavior.

I'm not sure whether we should select fsync_writethrough as the default
on OSX.  We don't make an equivalent attempt to prevent OS or storage
malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
ahead of the game in that you *can* force writethrough without resorting
to arcane hacks with hdparm or some such.

We could definitely stand to be a bit more verbose about documenting
the platform-specific issues in this area.

			regards, tom lane
On Sep 30, 2010, at 5:02 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh, I missed that.  Actually, I wasn't really so concerned with
>> whether his benchmark is correct.  I *am* concerned about being broken
>> out of the box on MacOS X.
>
> Actually, the problem with OSX is that OSX is broken out of the box,
> at least by that standard.  The system's normal configuration is that
> fsync() does nothing,

That is not correct.  fsync and friends on Darwin synchronize I/O and
flush dirty kernel caches to the disk, which meets the specification and
is distinctly different from doing nothing.

"The fsync() function can be used by an application to indicate that all
data for the open file description named by fildes is to be transferred
to the storage device associated with the file described by fildes in an
implementation-dependent manner."
http://opengroup.org/onlinepubs/007908799/xsh/fsync.html

"On MacOS X, fsync() always has and always will flush all file data from
host memory to the drive on which the file resides."
http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html

> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.
>
> We could definitely stand to be a bit more verbose about documenting
> the platform-specific issues in this area.

Not only is this issue platform-specific, it is also bus-, controller-
and disk-specific.  Luckily, hardware that ships from Apple responds
properly to F_FULLFSYNC.  It's too bad there is no cross-platform way to
ask what level of hardware syncing is available.

Cheers,
M
On Thu, Sep 30, 2010 at 5:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh, I missed that.  Actually, I wasn't really so concerned with
>> whether his benchmark is correct.  I *am* concerned about being broken
>> out of the box on MacOS X.
>
> Actually, the problem with OSX is that OSX is broken out of the box,
> at least by that standard.  The system's normal configuration is that
> fsync() does nothing, so it's hardly surprising that O_DSYNC is no
> better.  You have to use wal_sync_method = fsync_writethrough to get
> actual bits-to-the-platter behavior.
>
> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.
>
> We could definitely stand to be a bit more verbose about documenting
> the platform-specific issues in this area.

I think some documentation is definitely in order, at the least.  It's
certainly astonishing that the default settings aren't crash-safe, and
I'd really like to understand how this shakes out on different
platforms.

Whether we should try to work around the problem is a trickier question,
but I'm somewhat inclined to say yes.  If we're trying to have the
system be performant in the default config, turning off
synchronous_commit would be saner than failing to make use of a system
call which we know absolutely for sure to be necessary to avoid the
possibility of database corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Sep 30, 2010 at 2:22 PM, A.M. <agentm@themactionfaction.com> wrote:
> That is not correct.  fsync and friends on Darwin synchronize I/O and
> flush dirty kernel caches to the disk, which meets the specification
> and is distinctly different from doing nothing.

How exactly is it different from doing nothing?  That is, in what
situation does doing this change the behaviour from the user's point of
view versus doing nothing?

People keep saying it "meets the specification", but it's a useless
interpretation of the specification.  And it requires a pretty malicious
reading of "transferred to the storage device" to read it as
"transferred from one set of ram buffers to another more closely
associated with the actual persistent storage".

> It's too bad there is no cross-platform way to ask what level of
> hardware syncing is available.

Why would the user want to ask this?  As far as the user is concerned
there are only two "levels": synced or not synced.  If it's not
guaranteed to persist after a power failure, it's not synced.  It
doesn't matter whether it's in kernel buffers, drive buffers, or
anywhere else -- they're all the same from the user's point of view --
they're non-persistent.

The other useful syncing behaviour would be to get write barriers.  But
that's a whole different api, not just a behaviour that can be attached
to fsync.

--
greg
Tom Lane wrote:
> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.

The same issue exists on Windows too, with the same workaround: it's
normally unsafe by default, and you have to switch to fsync_writethrough
to get commit safety along with write caching for regular writes.  It's
still ahead of most Unixes because at least the switch is that easy.

The main downside to switching the default on either OS X or Windows is
that developers using those platforms for test deployments will suffer a
big performance drop for data they don't really care about.  As those
two in particular are much more likely to be client development
platforms, that's a scary thing to consider.

As for the documentation, I wrote
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm a
while ago trying to address this better, but never got around to merging
some of those notes into the official docs.  At the time, such heavy
linking to external URLs was particularly frowned upon in the official
docs.  While that's eased up a bit, what I think I could do now is
convert that article into something that lives on the wiki instead, and
we could point the "Reliability" section of the docs there, as something
that can stay live as platform-specific changes happen.  I'm thinking of
the whole ext4 behavior change when I say that; stuff like that will
only be addressable in a live document, since changes were happening in
between major releases in that area.

More recently, I've written a few things covering this area in what
approaches excruciating detail, for this little document you might have
been spammed about in somebody's signature.
I'm going to ask the publisher to make that specific material the public
sample chapter for the book, in hopes of making it easier for people to
find a detailed discussion of this topic.

Regardless of whether that works out, I have to clean up the
documentation in this whole area for the checkpoint changes I'm working
on for 9.1 anyway.  That work tinkers with the timing of fsync calls,
and whether it helps or not is very filesystem-specific.  I'm basically
stuck with documenting exactly what happens for most possibilities in
order to provide a comprehensive manual section covering that.

I have a deadline for getting my work in progress organized to share
with the world now:
https://www.postgresqlconference.org/content/righting-your-writes
so after I get back from that conference in November I'll see what I can
do about the docs too.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
A.M. wrote:
> That is not correct.  fsync and friends on Darwin synchronize I/O and
> flush dirty kernel caches to the disk, which meets the specification
> and is distinctly different from doing nothing...
>
> "On MacOS X, fsync() always has and always will flush all file data
> from host memory to the drive on which the file resides."
> http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html

You didn't quote the next part of that, which says "fsync() is not
sufficient to guarantee that your data is on stable storage and on MacOS
X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to flush
all buffered data to stable storage."  That's exactly what turning on
fsync_writethrough does in PostgreSQL.  See
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as the
first post on this topic that ultimately led to that behavior being
implemented.

From the perspective of the database, whether or not the behavior is
standards compliant isn't the issue.  Whether pages make it to physical
disk when fsync is called, or when O_DSYNC writes are done on platforms
that support them, is the important part.  If the OS doesn't do that, it
is doing nothing useful from the perspective of the database's
expectations.  And that's not true on Darwin unless you specify
F_FULLFSYNC, which doesn't happen by default in PostgreSQL; it only does
that when you switch wal_sync_method=fsync_writethrough.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
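The shape of the call being discussed can be sketched in a few lines of
Python.  This is only an illustration of the F_FULLFSYNC fcntl quoted
above, with an invented helper name; PostgreSQL's real implementation of
fsync_writethrough is in C, and on platforms without F_FULLFSYNC this
sketch can only fall back to a plain fsync():

```python
import fcntl
import os

def flush_to_stable_storage(fd):
    """Ask for a flush all the way to stable storage.

    On Darwin, the F_FULLFSYNC fcntl tells the drive itself to flush
    its write cache; a plain fsync() there only guarantees the data
    has left host memory.  Python's fcntl module exposes F_FULLFSYNC
    only where the OS defines it, so elsewhere we fall back to
    fsync() and rely on the OS/drive stack.
    """
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:       # Darwin (and little else)
        fcntl.fcntl(fd, full_fsync)
    else:
        os.fsync(fd)                 # best available fallback
```

As the thread notes, even F_FULLFSYNC only helps when the drive honors
the flush request, which is why hardware behavior matters as much as the
syscall used.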
Greg Smith wrote:
> You didn't quote the next part of that, which says "fsync() is not
> sufficient to guarantee that your data is on stable storage and on
> MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to
> flush all buffered data to stable storage."  That's exactly what
> turning on fsync_writethrough does in PostgreSQL.  See
> http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as
> the first post on this topic that ultimately led to that behavior
> being implemented.
>
> From the perspective of the database, whether or not the behavior is
> standards compliant isn't the issue.  Whether pages make it to physical
> disk when fsync is called, or when O_DSYNC writes are done on platforms
> that support them, is the important part.  If the OS doesn't do that,
> it is doing nothing useful from the perspective of the database's
> expectations.  And that's not true on Darwin unless you specify
> F_FULLFSYNC, which doesn't happen by default in PostgreSQL; it only
> does that when you switch wal_sync_method=fsync_writethrough.

Greg Smith also wrote:
> The main downside to switching the default on either OS X or Windows
> is that developers using those platforms for test deployments will
> suffer a big performance drop for data they don't really care about.
> As those two in particular are much more likely to be client
> development platforms, that's a scary thing to consider.

I think that, bottom line, Postgres should default to whatever the
safest and most reliable behavior is on each platform, because data
integrity is the most important thing: ensuring that a commit which
returns has actually written the data to disk.  If performance is worse,
then so what?  Code that does nothing has the best performance of all,
and is also generally useless.  Whenever there is a tradeoff to be made,
reliability for speed, users should have to explicitly choose the less
reliable option, which would demonstrate they know what they're doing.
Let the testers explicitly choose a faster and less reliable option for
the data they don't care about; by default, users who don't know better
should get the safest option, for the data they likely do care about.
That is a DBMS priority.

This matter reminds me of a discussion on the SQLite list years ago
about whether pragma synchronous=normal or synchronous=full should be
the default, and thankfully 'full' won.

-- Darren Duncan
Darren Duncan wrote:
> This matter reminds me of a discussion on the SQLite list years ago
> about whether pragma synchronous=normal or synchronous=full should be
> the default, and thankfully 'full' won.

Right now, when I see deployments in the field, serious database servers
set up by professional DBAs tend to use the right hardware and setup to
do the correct thing with PostgreSQL.  And trivial installs done for
testing purposes cheat, but many of those users don't really care,
because they know they are not running a real server and expect that
their desktop is neither reliable nor fast at database work.

The SQLite situation has a slightly different context, because the
places it's put into don't so regularly have a DBA involved at all in
situations where the data is important.  It's often just system software
sitting in the background that nobody is even aware of.  I also remember
when SQLite did come out of the background: when it was crucified for
being the cause of Firefox slowdowns actually linked to changed kernel
fsync behavior.  That's the sort of bad press this project really
doesn't need right now, over something that actually doesn't matter on
so many production database servers.

You may not be aware that there's already such a change floating around
out there.  PostgreSQL installs on Linux kernel 2.6.32 or later using
ext4 are dramatically slower out of the box than they used to be,
because the OS started doing the right thing by default; no change in
the database code.  I remain in mild terror that this news is going to
break in a bad way and push this community into damage control.  So far
I've only seen it reported on Phoronix, and that report included
testimony from a kernel developer explaining that the "regression" was
introduced deliberately, which softened the story.  The next such
publicized report may not be so informed.

Some of this works out to when to change things rather than what to
change.  PostgreSQL is at a somewhat critical spot right now.
If people grab a new version and performance sucks compared to earlier
ones, they're not going to think "oh, maybe they changed an option and
the new version is tuned better for safety".  They're going to say
"performance sucks on this database now" and give up on it.  Many evals
are done on hardware that isn't representative of a real database
server, and a change that only hurts those people, while not actually
impacting production-quality hardware, needs to be made carefully.  And
that's exactly what I think would happen here if this were just changed
all of a sudden.

I don't think anyone is seriously opposed to changing the defaults for
safety instead of performance.  The problem is that said change would
need to be *preceded* by a major update to the database documentation,
and perhaps even some code changes to issue warnings when you create a
cluster with what is now going to be a slow configuration.  We'd need to
make it really obvious to people who upgrade and notice that performance
tanks that it's because of a configuration change made for safety
reasons, one that they can undo for test deployments.

That particular area, giving people better advice about how to properly
tune a new install for its intended workload, is something that's been
making slow progress but still needs a lot of work.  I think if some
better tools come along there, so that most people are expected to
follow a path that involves a tuning tool, it will be much easier to
stomach the idea of changing the default, knowing that the user is
likely to be shown something that can undo the change.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
>> It's too bad there is no cross-platform way to ask what level of
>> hardware syncing is available.
>
> Why would the user want to ask this?  As far as the user is concerned
> there are only two "levels": synced or not synced.  If it's not
> guaranteed to persist after a power failure, it's not synced.  It
> doesn't matter whether it's in kernel buffers, drive buffers, or
> anywhere else -- they're all the same from the user's point of view --
> they're non-persistent.

Well, it's not really useful, but that's how it works "everywhere".  On
Linux, fsync carries the stuff from the kernel's RAM to the disk
controller's RAM, and then it depends on some hdparm magic or something
what happens next.
On Fri, Oct 1, 2010 at 3:50 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
>>> It's too bad there is no cross-platform way to ask what level of
>>> hardware syncing is available.
>>
>> Why would the user want to ask this?  As far as the user is concerned
>> there are only two "levels": synced or not synced.  If it's not
>> guaranteed to persist after a power failure, it's not synced.  It
>> doesn't matter whether it's in kernel buffers, drive buffers, or
>> anywhere else -- they're all the same from the user's point of view --
>> they're non-persistent.
>
> Well, it's not really useful, but that's how it works "everywhere".  On
> Linux, fsync carries the stuff from the kernel's RAM to the disk
> controller's RAM, and then it depends on some hdparm magic or something
> what happens next.

That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
ensure that the log is written before database records are altered, but
this can be subverted by disk drives that falsely report a successful
write to the kernel, when in fact they have only cached the data and not
yet stored it on the disk.  A power failure in such a situation might
lead to irrecoverable data corruption.  Administrators should try to
ensure that disks holding PostgreSQL's WAL log files do not make such
false reports."  This leaves open the question of how they should
attempt to do this; we should say what we know about that.

I also notice the following sentence in our documentation, which now
appears to me to be flat-out wrong: "The wal_sync_method parameter
determines how PostgreSQL will ask the kernel to force WAL updates out
to disk.  All the options should be the same in terms of reliability,
but it's quite platform-specific which one will be the fastest."
Obviously, we know now (if we didn't before) that this isn't the case,
per my OP.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
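For readers following along, the wal_sync_method settings under
discussion differ mainly in which syscalls end up doing the flushing.
The sketch below is illustrative only (the function name is invented and
this is not PostgreSQL source); it shows the rough mapping, assuming
POSIX-style flags:

```python
import os

# Rough mapping of wal_sync_method values to syscalls, per this thread:
#   open_datasync      -> open(..., O_DSYNC); every write implies a sync
#   fdatasync / fsync  -> plain open; explicit f(data)sync after writing
#   fsync_writethrough -> fsync plus a drive-cache flush (F_FULLFSYNC on
#                         Darwin, a write-through request on Windows)

def append_wal_record(path, record, method="fsync"):
    """Append one record, flushing in the style named by `method`."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if method == "open_datasync":
        flags |= getattr(os, "O_DSYNC", 0)   # flag absent on some OSes
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, record)
        if method == "fdatasync" and hasattr(os, "fdatasync"):
            os.fdatasync(fd)                 # metadata may be skipped
        elif method != "open_datasync":
            os.fsync(fd)                     # fsync / writethrough base
    finally:
        os.close(fd)
```

The thread's point is that none of these calls, by itself, guarantees
the data survived a power failure; that depends on whether the drive
honors the flush.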
On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
>> Well, it's not really useful, but that's how it works "everywhere".  On
>> Linux, fsync carries the stuff from the kernel's RAM to the disk
>> controller's RAM, and then it depends on some hdparm magic or something
>> what happens next.
>
> That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
> ensure that the log is written before database records are altered,
> but this can be subverted by disk drives that falsely report a
> successful write to the kernel, when in fact they have only cached the
> data and not yet stored it on the disk.  A power failure in such a
> situation might lead to irrecoverable data corruption.  Administrators
> should try to ensure that disks holding PostgreSQL's WAL log files do
> not make such false reports."  This leaves open the question of how
> they should attempt to do this; we should say what we know about that.

That is explained in section 29.1 "Reliability".

> I also notice the following sentence in our documentation, which now
> appears to me to be flat-out wrong: "The wal_sync_method parameter
> determines how PostgreSQL will ask the kernel to force WAL updates
> out to disk.  All the options should be the same in terms of
> reliability, but it's quite platform-specific which one will be the
> fastest."  Obviously, we know now (if we didn't before) that this
> isn't the case, per my OP.

Right.  It was true before fsync_writethrough was invented.
On Tue, Oct 5, 2010 at 8:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
>>> Well, it's not really useful, but that's how it works "everywhere".  On
>>> Linux, fsync carries the stuff from the kernel's RAM to the disk
>>> controller's RAM, and then it depends on some hdparm magic or something
>>> what happens next.
>>
>> That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
>> ensure that the log is written before database records are altered,
>> but this can be subverted by disk drives that falsely report a
>> successful write to the kernel, when in fact they have only cached the
>> data and not yet stored it on the disk.  A power failure in such a
>> situation might lead to irrecoverable data corruption.  Administrators
>> should try to ensure that disks holding PostgreSQL's WAL log files do
>> not make such false reports."  This leaves open the question of how
>> they should attempt to do this; we should say what we know about that.
>
> That is explained in section 29.1 "Reliability".
>
>> I also notice the following sentence in our documentation, which now
>> appears to me to be flat-out wrong: "The wal_sync_method parameter
>> determines how PostgreSQL will ask the kernel to force WAL updates
>> out to disk.  All the options should be the same in terms of
>> reliability, but it's quite platform-specific which one will be the
>> fastest."  Obviously, we know now (if we didn't before) that this
>> isn't the case, per my OP.
>
> Right.  It was true before fsync_writethrough was invented.

Proposed doc patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
> Proposed doc patch attached.

Looks accurate to me.  I like the additional linking to the Reliability
page you put in there too.  Heavily referencing that important page from
related areas is a good thing, particularly now that it's got a lot more
details than it used to.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Robert Haas <robertmhaas@gmail.com> writes:
> Proposed doc patch attached.

"discusesed"?  Otherwise +1

			regards, tom lane
On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Proposed doc patch attached.
>
> "discusesed"?  Otherwise +1

Woops, thanks.  Committed with that change.  I back-patched it back to
8.3, which is as far as it applied with only minor conflicts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Proposed doc patch attached.
>
> Looks accurate to me.  I like the additional linking to the Reliability
> page you put in there too.  Heavily referencing that important page from
> related areas is a good thing, particularly now that it's got a lot more
> details than it used to.

Cool, thanks for the fast review.  I suspect there are more details that
could stand to be added to the WAL reliability page as well, but I don't
know what they are, so I can't add them.

I still have the feeling that we have not put quite a large enough red,
blinking light around this issue, but I don't have a concrete
suggestion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Oct 7, 2010, at 12:26 PM, Robert Haas wrote:
> On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>> Proposed doc patch attached.
>>
>> Looks accurate to me.  I like the additional linking to the Reliability
>> page you put in there too.  Heavily referencing that important page from
>> related areas is a good thing, particularly now that it's got a lot more
>> details than it used to.
>
> Cool, thanks for the fast review.  I suspect there are more details
> that could stand to be added to the WAL reliability page as well, but
> I don't know what they are, so I can't add them.
>
> I still have the feeling that we have not put quite a large enough
> red, blinking light around this issue, but I don't have a concrete
> suggestion.

I think the general problem is that there is no simple way to verify
that a PostgreSQL commit is pushing the bits to persistent storage.  It
would be helpful if there were a platform-specific, volume-specific tool
to deduce this.  Currently, there is no warning light that goes on when
commits are not persistent.

On Linux, a tool could check filesystem parameters, hdparm (if
relevant), and hard drive and controller specs (possibly against a
blacklist of known liars).

Perhaps a simpler tool could run a basic fsyncs-per-second test and
prompt the DBA to check that the numbers are within the realm of
possibility.

How else can a DBA today ensure that a commit is a commit?

Cheers,
M
A.M. wrote:
> Perhaps a simpler tool could run a basic fsyncs-per-second test and
> prompt the DBA to check that the numbers are within the realm of
> possibility.

This is what the test_fsync utility that already ships with the database
should be useful for.  The way Bruce changed it to report numbers in
commits/second for 9.0 makes it a lot easier to use for this purpose
than it used to be.  I think there are still some additional
improvements that could be made there, but it's a tricky test to run
accurately.  The current code is probably too detailed in some ways (it
delivers a lot of output not relevant to this use-case) and not detailed
enough in others.  Providing a summary that understands things like
fsync_writethrough on platforms that support it was the first
refactoring I had in mind.  If that thing came back and said
"fsync_writethrough works for you, so don't even consider the other
possibilities if you want reliability, even though they are faster",
that would be nice, for example.

> How else can a DBA today ensure that a commit is a commit?

You can't ensure a commit is a commit without running a pull-the-plug
test.  And I think the best way to do that accurately is using a "remote
witness" server focused on finding this particular problem, looking for
glitches, rather than using the database as your test program and seeing
if you happen to hit corruption or not.

The documentation for 9.0 now suggests running the diskchecker.pl
program for this exact purpose.  I've seen enough reports of it finding
even subtle cache-loss situations to believe that encouraging heavier
use of it would be enough to make people much safer than they typically
are today.  What we probably need to do next is provide people with an
exact walkthrough of setting up and using the program, showing what a
passing result looks like and what a failing one looks like.
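The "remote witness" idea boils down to a small invariant, sketched
below with invented names (this has nothing to do with diskchecker.pl's
actual protocol): a second machine records each block id only after the
test machine's fsync() returns, so after the plug is pulled, everything
the witness recorded must still be on disk.

```python
class CommitWitness:
    """Runs on a second machine, so its memory survives the test
    box's power pull.  An id is recorded only after the writer
    reports that fsync() returned successfully for that block."""

    def __init__(self):
        self.acked = set()

    def record_ack(self, block_id):
        """Called over the network once a block's fsync() returns."""
        self.acked.add(block_id)

    def lost_commits(self, survived_on_disk):
        """After restart: any acked-but-missing block proves that
        something in the write path lied about durability."""
        return sorted(self.acked - set(survived_on_disk))
```

For example, if the witness acked blocks 0 through 4 but only 0 through
2 survive the power pull, lost_commits reports [3, 4]: two commits the
hardware claimed were durable but were not.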
--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
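[Editor's note: for readers wondering what fsync_writethrough actually adds on the platform that started this thread: on macOS it issues the F_FULLFSYNC fcntl after the ordinary fsync, which asks the drive itself to empty its write cache. A minimal sketch of that flush step follows; the helper name is invented for illustration.]

```python
import fcntl
import os

def flush_through_cache(fd):
    """Flush fd to stable storage. On macOS, plain fsync() may stop at the
    drive's volatile write cache; F_FULLFSYNC asks the drive to flush too."""
    os.fsync(fd)
    if hasattr(fcntl, "F_FULLFSYNC"):       # only defined on Darwin
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
```

On other platforms the F_FULLFSYNC branch is simply skipped, which mirrors why the option list and its reliability trade-offs are so platform-specific.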
Robert Haas wrote:
> On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Proposed doc patch attached.
> >
> > "discusesed"?  Otherwise +1
>
> Woops, thanks. Committed with that change. I back-patched it back to
> 8.3, which is as far as it applied with only minor conflicts.

I have applied the attached patch which mentions tools/fsync for testing
fsync method performance, and clarified the new paragraph about sync
methods.

I am glad to see we are beefing up this area of the docs.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 87d182c..f178835 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1569,13 +1569,13 @@ SET ENABLE_SEQSCAN TO OFF;
       </itemizedlist>
       <para>
        Not all of these choices are available on all platforms.
-       The default is the first method in the above list that is supported
-       by the platform. The default is not necessarily best; it may be
-       necessary to change this setting, or other aspects of your system
-       configuration, in order to create a crash-safe configuration, as
-       discussed in <xref linkend="wal-reliability">, or to achieve best
-       performance.
        The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
+       The default is the first method in the above list that is supported
+       by the platform. The default is not necessarily ideal; it might be
+       necessary to change this setting or other aspects of your system
+       configuration in order to create a crash-safe configuration or
+       achieve optimal performance.
+       These aspects are discussed in <xref linkend="wal-reliability">.
+       The utility <filename>src/tools/fsync</> in the PostgreSQL source
+       tree can do performance testing of various fsync methods.
        This parameter can only be set in the <filename>postgresql.conf</>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 5678d6e..7b50bdd 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,11 +530,13 @@
   <para>
    The <xref linkend="guc-wal-sync-method"> parameter determines how
    <productname>PostgreSQL</productname> will ask the kernel to force
-   <acronym>WAL</acronym> updates out to disk.
-   With the exception of <literal>fsync_writethrough</>, which can sometimes
-   force a flush of the disk cache even when other options do not do so,
-   all the options should be the same in terms of reliability.
-   However, it's quite platform-specific which one will be the fastest.
+   <acronym>WAL</acronym> updates out to disk.
+   All the options should be the same in terms of reliability, with
+   the exception of <literal>fsync_writethrough</>, which can sometimes
+   force a flush of the disk cache even when other options do not do so.
+   However, it's quite platform-specific which one will be the fastest;
+   you can test option speeds using the utility <filename>src/tools/fsync</>
+   in the PostgreSQL source tree.
    Note that this parameter is irrelevant if <varname>fsync</varname>
    has been turned off.
   </para>
Greg Smith wrote:
> A.M. wrote:
> > Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
> >
>
> This is what the test_fsync utility that already ships with the database
> should be useful for. The way Bruce changed it to report numbers in
> commits/second for 9.0 makes it a lot easier to use for this purpose
> than it used to be. I think there's still some additional improvements
> that could be made there, but it's a tricky test to run accurately.

The test_fsync program was designed to test various things, like whether
several open-sync writes are better than two writes and an fsync, and
whether you can fsync data written on a different file descriptor. It is
really a catch-all test right now, not one specific to choosing sync
methods.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:

> Greg Smith wrote:
>> A.M. wrote:
>>> Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
>>>
>>
>> This is what the test_fsync utility that already ships with the database
>> should be useful for. The way Bruce changed it to report numbers in
>> commits/second for 9.0 makes it a lot easier to use for this purpose
>> than it used to be. I think there's still some additional improvements
>> that could be made there, but it's a tricky test to run accurately.
>
> The test_fsync was designed to test various things like whether several
> open-sync writes are better than two write and an fsync, and whether you
> can fsync data written on a different file descriptor. It is really a
> catch-all test right now, not one specific for choosing sync methods.

I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM - your fsync method may be wrong."

Currently, the test tool is not even compiled by default.

Thoughts?

Cheers,
M
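[Editor's note: the "rates between a 5400 RPM and a 7200 RPM drive" notes proposed above amount to comparing the measured rate against rotational-latency ceilings. One hypothetical way to phrase that check; the thresholds and wording are illustrative, not from any committed tool:]

```python
def rate_verdict(fsyncs_per_sec):
    """Classify a measured commit rate against rotational ceilings: a
    drive spinning at R RPM can complete at most R/60 cache-flushed
    commits per second, since each one costs at least a rotation."""
    ceilings = [
        (5400 / 60.0, "slower than a 5400 RPM SATA drive"),
        (7200 / 60.0, "between a 5400 RPM and a 7200 RPM SATA drive"),
        (15000 / 60.0, "between a 7200 RPM and a 15k RPM drive"),
    ]
    for ceiling, verdict in ceilings:
        if fsyncs_per_sec <= ceiling:
            return verdict
    return "as high as RAM - your fsync method may be wrong"

print(rate_verdict(100))   # -> between a 5400 RPM and a 7200 RPM SATA drive
print(rate_verdict(5000))  # -> as high as RAM - your fsync method may be wrong
```

The point of the final branch is exactly the situation that started this thread: a laptop drive reporting 1292 commits/sec can only mean the sync method is not flushing.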
A.M. wrote:
>
> On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:
>
> > Greg Smith wrote:
> >> A.M. wrote:
> >>> Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
> >>>
> >>
> >> This is what the test_fsync utility that already ships with the database
> >> should be useful for. The way Bruce changed it to report numbers in
> >> commits/second for 9.0 makes it a lot easier to use for this purpose
> >> than it used to be. I think there's still some additional improvements
> >> that could be made there, but it's a tricky test to run accurately.
> >
> > The test_fsync was designed to test various things like whether several
> > open-sync writes are better than two write and an fsync, and whether you
> > can fsync data written on a different file descriptor. It is really a
> > catch-all test right now, not one specific for choosing sync methods.
>
> I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM - your fsync method may be wrong."
>
> Currently, the test tool is not even compiled by default.
>
> Thoughts?

Agreed. Let me know if you have any questions.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Tue, Oct 19, 2010 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> Proposed doc patch attached.
>> >
>> > "discusesed"?  Otherwise +1
>>
>> Woops, thanks. Committed with that change. I back-patched it back to
>> 8.3, which is as far as it applied with only minor conflicts.
>
> I have applied the attached patch which mentions tools/fsync for testing
> fsync method performance, and clarified the new paragraph about sync
> methods.
>
> I am glad to see we are beefing up this area of the docs.

It seems we're still missing some relevant details, because hdparm
doesn't seem to work on SCSI devices. Is sdparm the right utility in
that case? Does anyone know what the correct incantations look like?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
> It seems we're still missing some relevant details, because hdparm
> doesn't seem to work on SCSI devices. Is sdparm the right utility in
> that case? Does anyone know what the correct incantations look like?

Search the sdparm man page for "Writeback Cache". It has detailed
examples.
On Mon, Oct 25, 2010 at 12:51 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
>> It seems we're still missing some relevant details, because hdparm
>> doesn't seem to work on SCSI devices. Is sdparm the right utility in
>> that case? Does anyone know what the correct incantations look like?
>
> Search the sdparm man page for "Writeback Cache". It has detailed
> examples.

Here's a patch. This adds a few more details about sdparm and makes it
clear that it applies to both FreeBSD and Linux. But, perhaps more
significantly, it rearranges what is currently a fairly long paragraph
into a bulleted list, which I think is more readable.

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
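[Editor's note: for the archives, the incantations in question look roughly like the following. The device name is a placeholder, and option spellings should be checked against your system's man pages; this is a sketch of the Linux commands, not text from the committed patch.]

```shell
DEV=/dev/sda   # placeholder: substitute the device holding your WAL

# ATA/SATA drives (Linux): query and disable the volatile write cache
hdparm -W "$DEV"      # with no argument, -W reports the current setting
hdparm -W0 "$DEV"     # turn drive write caching off

# SCSI/SAS drives: hdparm generally does not apply; use sdparm instead
sdparm --get=WCE "$DEV"     # WCE=1 means write-back caching is enabled
sdparm --clear=WCE "$DEV"   # clear the WCE bit to disable write-back caching
```

On FreeBSD the equivalent knob is reached through camcontrol rather than these tools, which is presumably part of what the patch's FreeBSD/Linux clarification covers.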