Thread: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
Hi pgsql-performance,

I was doing mass insertions on my desktop machine and getting at most
1 MB/s disk writes (apart from occasional bursts of 16MB). Inserting 1
million rows with a single integer (data+index 56 MB total) took over
2 MINUTES! The only tuning I had done was shared_buffers=256MB. So I
got around to tuning the WAL writer and found that wal_buffers=16MB
works MUCH better. wal_sync_method=fdatasync also got similar results.

First of all, I'm running PostgreSQL 9.0.1 on Arch Linux
* Linux kernel 2.6.36 (also tested with 2.6.35)
* Quad-core Phenom II
* a single Seagate 7200RPM SATA drive (write caching on)
* ext4 FS over LVM, with noatime, data=writeback

I am creating a table like: create table foo(id integer primary key);
Then measuring performance with the query: insert into foo (id) select
generate_series(1, 1000000);

130438,011 ms    wal_buffers=64kB, wal_sync_method=open_datasync  (all defaults)
29306,847 ms     wal_buffers=1MB, wal_sync_method=open_datasync
4641,113 ms      wal_buffers=16MB, wal_sync_method=open_datasync
^ from 130s to 4.6 seconds by just changing wal_buffers.

5528,534 ms     wal_buffers=64kB, wal_sync_method=fdatasync
4856,712 ms     wal_buffers=16MB, wal_sync_method=fdatasync
^ fdatasync works well even with small wal_buffers

2911,265 ms    wal_buffers=16MB, fsync=off
^ Not bad, getting 60% of ideal throughput
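
To put the timings above side by side, here is a small Python sketch (not part of the original post) that parses the comma-decimal millisecond figures and reports each run's speedup over the all-defaults run. The labels and function names are shorthand introduced for this example only.

```python
# Timings copied from the table above (milliseconds, comma as decimal separator).
RUNS = [
    ("64kB open_datasync", "130438,011"),
    ("1MB open_datasync", "29306,847"),
    ("16MB open_datasync", "4641,113"),
    ("64kB fdatasync", "5528,534"),
    ("16MB fdatasync", "4856,712"),
    ("16MB fsync=off", "2911,265"),
]


def parse_ms(text):
    """Convert a '130438,011'-style figure to float milliseconds."""
    return float(text.replace(",", "."))


def speedups(runs):
    """Return {label: baseline_time / run_time}; the baseline is the first run."""
    baseline = parse_ms(runs[0][1])
    return {label: baseline / parse_ms(ms) for label, ms in runs}


if __name__ == "__main__":
    for label, factor in speedups(RUNS).items():
        print(f"{label:20s} {factor:6.1f}x faster than the defaults")
```

Read this way, raising wal_buffers to 16MB and switching to fdatasync both land within the same order of magnitude of each other, which is the point the post makes.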

These defaults are not just hurting bulk-insert performance, but also
everyone who uses synchronous_commit=off.

Unless fdatasync is unsafe, I'd very much want to see it as the
default for 9.1 on Linux (I don't know about other platforms).  I
can't see any reasons why each write would need to be sync-ed if I
don't commit that often. Increasing wal_buffers probably has the same
effect wrt data safety.

Also, the tuning guide on wiki is understating the importance of these
tunables. Reading it I got the impression that some people change
wal_sync_method but it's dangerous and it even literally claims about
wal_buffers that "1MB is enough for some large systems"

But the truth is that if you want any write throughput AT ALL on a
regular Linux desktop, you absolutely have to change one of these. If
the defaults were better, it would be enough to set
synchronous_commit=off to get all that your hardware has to offer.

I was reading mailing list archives and didn't find anything against
it either. Can anyone clarify the safety of wal_sync_method=fdatasync?
Are there any reasons why it shouldn't be the default?

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Marti Raudsepp wrote:
> Unless fdatasync is unsafe, I'd very much want to see it as the
> default for 9.1 on Linux (I don't know about other platforms).  I
> can't see any reasons why each write would need to be sync-ed if I
> don't commit that often. Increasing wal_buffers probably has the same
> effect wrt data safety.
>

Writes only are sync'd out when you do a commit, or the database does a
checkpoint.

This issue is a performance difference introduced by a recent change to
Linux.  open_datasync support was just added to Linux itself very
recently.  It may be more safe than fdatasync on your platform.  As new
code it may have bugs so that it doesn't really work at all under heavy
load.  No one has really run those tests yet.  See
http://wiki.postgresql.org/wiki/Reliable_Writes for some background, and
welcome to the fun of being an early adopter.  The warnings in the
tuning guide are there for a reason--you're in untested territory now.
I haven't finished validating whether I consider 2.6.32 safe for
production use or not yet, and 2.6.36 is a solid year away from being on
my list for even considering it as a production database kernel.  You
should proceed presuming that all writes are unreliable until proven
otherwise.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
> Writes only are sync'd out when you do a commit, or the database does a
> checkpoint.
Hm?  WAL is written out to disk after the space provided by wal_buffers (def
8) * XLOG_BLCKSZ (def 8192) is used. The default is 64kB, which you reach
pretty quickly - especially after a checkpoint. With O_D?SYNC that will
synchronously get written out during a normal XLogInsert if it hits a page
boundary.
*Additionally* it gets written out at a commit if sync commit is on.
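
As a rough illustration of the arithmetic above (a sketch, not part of the original message): the default buffer space is 8 * 8192 bytes, and a bulk load cycles through it many times. The 56 MB figure is an assumption borrowed from the data+index size in the first post; the thread does not state the actual WAL volume, and one "fill" is not necessarily exactly one synchronous write.

```python
XLOG_BLCKSZ = 8192        # default WAL block size in bytes, per the message above
DEFAULT_WAL_BUFFERS = 8   # default number of WAL buffers, per the message above


def wal_buffer_bytes(n_buffers=DEFAULT_WAL_BUFFERS, block_size=XLOG_BLCKSZ):
    """Total WAL buffer space in bytes."""
    return n_buffers * block_size


def buffer_fills(wal_bytes, buffer_bytes):
    """How many times the buffer space fills while wal_bytes pass through it
    (ceiling division)."""
    return -(-wal_bytes // buffer_bytes)


if __name__ == "__main__":
    assumed_wal = 56 * 1024 * 1024  # assumption: stand-in for the real WAL volume
    print(wal_buffer_bytes())                                 # default buffer space
    print(buffer_fills(assumed_wal, wal_buffer_bytes()))      # fills at the default
    print(buffer_fills(assumed_wal, 16 * 1024 * 1024))        # fills at 16MB
```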

Not having a real O_DSYNC on linux until recently makes it even more dubious
to have it as a default...


Andres

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Sun, Oct 31, 2010 at 21:59, Greg Smith <greg@2ndquadrant.com> wrote:
> open_datasync support was just added to Linux itself very recently.

Oh I didn't realize it was a new feature. Indeed O_DSYNC support was
added in 2.6.33

It seems like bad behavior on PostgreSQL's part to default to new,
untested features.

I have updated the tuning wiki page with my understanding of the problem:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server#wal_sync_method_wal_buffers

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Mark Kirkwood
Date:
On 01/11/10 08:59, Greg Smith wrote:
> Marti Raudsepp wrote:
>> Unless fdatasync is unsafe, I'd very much want to see it as the
>> default for 9.1 on Linux (I don't know about other platforms).  I
>> can't see any reasons why each write would need to be sync-ed if I
>> don't commit that often. Increasing wal_buffers probably has the same
>> effect wrt data safety.
>
> Writes only are sync'd out when you do a commit, or the database does a
> checkpoint.
>
> This issue is a performance difference introduced by a recent change to
> Linux.  open_datasync support was just added to Linux itself very
> recently.  It may be more safe than fdatasync on your platform.  As new
> code it may have bugs so that it doesn't really work at all under heavy
> load.  No one has really run those tests yet.  See
> http://wiki.postgresql.org/wiki/Reliable_Writes for some background, and
> welcome to the fun of being an early adopter.  The warnings in the
> tuning guide are there for a reason--you're in untested territory now.
> I haven't finished validating whether I consider 2.6.32 safe for
> production use or not yet, and 2.6.36 is a solid year away from being on
> my list for even considering it as a production database kernel.  You
> should proceed presuming that all writes are unreliable until proven
> otherwise.


Greg,

Your reply is possibly a bit confusingly worded - Marti was suggesting that fdatasync be the default - so he wouldn't be a new adopter, since this call has been implemented in the kernel for ages. I guess you were wanting to stress that *open_datasync* is the new kid, so watch out to see if he bites...

Cheers

Mark

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Andres Freund wrote:
> On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
>
>> Writes only are sync'd out when you do a commit, or the database does a
>> checkpoint.
>>
> Hm?  WAL is written out to disk after the space provided by wal_buffers(def
> 8) * XLOG_BLCKSZ (def 8192) is used. The default is 64kb which you reach
> pretty quickly - especially after a checkpoint.

Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
I forget sometimes that people actually run with the default where this
becomes an important consideration.


> Not having a real O_DSYNC on linux until recently makes it even more dubious
> to have it as a default...
>

If Linux is now defining O_DSYNC, and it's buggy, that's going to break
more software than just PostgreSQL.  It wasn't defined before because it
didn't work.  If the kernel developers have made changes to claim it's
working now, but it doesn't really, I would think they'd consider any
reports of actual bugs here as important to fix.  There's only so much
the database can do in the face of incorrect information reported by the
operating system.

Anyway, I haven't actually seen reports that prove there's any problem
here, I was just pointing out that we haven't seen any positive reports
about database stress testing on these kernel versions yet either.  The
changes here are theoretically the right ones, and defaulting to safe
writes that flush out write caches is a long-term good thing.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Fri, Nov 5, 2010 at 23:10, Greg Smith <greg@2ndquadrant.com> wrote:
>> Not having a real O_DSYNC on linux until recently makes it even more
>> dubious to have it as a default...
>>
>
> If Linux is now defining O_DSYNC

Well, Linux always defined both O_SYNC and O_DSYNC, but they used to
have the same value. The defaults changed due to an unfortunate
heuristic in PostgreSQL, which boils down to:

#if O_DSYNC != O_SYNC
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_OPEN_DSYNC
#else
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FDATASYNC
#endif

(see src/include/access/xlogdefs.h for details)

In fact, I was wrong in my earlier post. Linux always offered O_DSYNC
behavior. What's new is POSIX-compliant O_SYNC, and the fact that
these flags are now distinguished.

Here's the change in Linux:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3
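
To illustrate what "now distinguished" means at the bit level, here is a short Python sketch (not part of the original message). The octal values are the ones quoted from the older and newer fcntl.h elsewhere in this thread; the helper name is made up for the example.

```python
# Octal flag values as quoted from the older and newer fcntl.h in this thread.
OLD_O_SYNC = 0o10000
OLD_O_DSYNC = OLD_O_SYNC   # older header: O_DSYNC is just an alias for O_SYNC
NEW_O_SYNC = 0o4010000
NEW_O_DSYNC = 0o10000


def extra_sync_bits(o_sync, o_dsync):
    """Bits that are set in O_SYNC but are not part of O_DSYNC."""
    return o_sync & ~o_dsync


if __name__ == "__main__":
    print(oct(extra_sync_bits(OLD_O_SYNC, OLD_O_DSYNC)))   # no extra bits
    print(oct(extra_sync_bits(NEW_O_SYNC, NEW_O_DSYNC)))   # one additional bit
    print(NEW_O_DSYNC == OLD_O_SYNC)                       # same numeric value
```

With these values, the newer O_DSYNC keeps the number the older O_SYNC had, and the newer O_SYNC is that number plus one extra bit. That is consistent with the statement above that the old behavior was effectively O_DSYNC all along.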

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Friday 05 November 2010 22:10:36 Greg Smith wrote:
> Andres Freund wrote:
> > On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
> >> Writes only are sync'd out when you do a commit, or the database does a
> >> checkpoint.
> >
> > Hm?  WAL is written out to disk after the space provided by
> > wal_buffers(def 8) * XLOG_BLCKSZ (def 8192) is used. The default is 64kb
> > which you reach pretty quickly - especially after a checkpoint.
> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
> I forget sometimes that people actually run with the default where this
> becomes an important consideration.
If you have relatively frequent checkpoints (quite sensible in some
environments given the burstiness/response time problems you can get) even a
16MB wal_buffers can cause significantly more synchronous writes with O_DSYNC
because of the amount of WAL traffic due to full_page_writes. For one, the
background WAL writer won't keep up, and for another, all its writes will be
synchronous...

It's simply a pointless setting.

> > Not having a real O_DSYNC on linux until recently makes it even more
> > dubious to have it as a default...
> If Linux is now defining O_DSYNC, and it's buggy, that's going to break
> more software than just PostgreSQL.  It wasn't defined before because it
> didn't work.  If the kernel developers have made changes to claim it's
> working now, but it doesn't really, I would think they'd consider any
> reports of actual bugs here as important to fix.  There's only so much
> the database can do in the face of incorrect information reported by the
> operating system.
I don't see it being buggy so far. It's just doing what it should, which is
simply a terrible thing for our implementation. Generally. Independent of
Linux.

> Anyway, I haven't actually seen reports that prove there's any problem
> here, I was just pointing out that we haven't seen any positive reports
> about database stress testing on these kernel versions yet either.  The
> changes here are theoretically the right ones, and defaulting to safe
> writes that flush out write caches is a long-term good thing.
I have seen several databases which run under 2.6.33 with moderate to high load
for some time now. And two under 2.6.35.
Loads of problems, but none kernel related so far ;-)

Andres

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Marti Raudsepp wrote:
> In fact, I was wrong in my earlier post. Linux always offered O_DSYNC
> behavior. What's new is POSIX-compliant O_SYNC, and the fact that
> these flags are now distinguished.
>

While I appreciate that you're trying to help here, I'm unconvinced
you've correctly diagnosed a couple of components of what's going on
here yet.  Please refrain from making changes to popular
documents like the tuning guide on the wiki based on speculation about
what's happening.  There's definitely at least one mistake in what you
wrote there, and I just reverted the whole set of changes you made
accordingly until this is sorted out better.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Josh Berkus
Date:
> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
> I forget sometimes that people actually run with the default where this
> becomes an important consideration.

Do you have any testing in favor of 16mb vs. lower/higher?

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Sat, Nov 6, 2010 at 00:06, Greg Smith <greg@2ndquadrant.com> wrote:
>  Please refrain from making changes to popular documents like the
> tuning guide on the wiki based on speculation about what's happening.

I will grant you that the details were wrong, but I stand by the conclusion.

I can state for a fact that PostgreSQL's default wal_sync_method
varies depending on the <fcntl.h> header.
I have two PostgreSQL 9.0.1 builds, one with older
/usr/include/bits/fcntl.h and one with newer.

When I run "show wal_sync_method;" on one instance, I get fdatasync.
On the other one I get open_datasync.

So let's get down to code.

Older fcntl.h has:
#define O_SYNC         010000
# define O_DSYNC    O_SYNC    /* Synchronize data.  */

Newer has:
#define O_SYNC           04010000
# define O_DSYNC    010000    /* Synchronize data.  */

So you can see that in the older header, O_DSYNC and O_SYNC are equal.

src/include/access/xlogdefs.h does:

#if defined(O_SYNC)
#define OPEN_SYNC_FLAG      O_SYNC
...
#if defined(OPEN_SYNC_FLAG)
/* O_DSYNC is distinct? */
#if O_DSYNC != OPEN_SYNC_FLAG
#define OPEN_DATASYNC_FLAG      O_DSYNC

^ it's comparing O_DSYNC != O_SYNC

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_OPEN_DSYNC
#elif defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FDATASYNC

^ depending on whether O_DSYNC and O_SYNC were equal, the default
wal_sync_method will change.
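
The preprocessor chain quoted above can be mimicked in a few lines of Python to see both outcomes at once. This is a sketch written for this thread, not PostgreSQL code: the function name is invented, only the quoted branches are modeled, and the final fallback to "fsync" is an assumption about the part of the header that is elided here.

```python
def default_sync_method(o_sync, o_dsync, have_fdatasync=True):
    """Mimic the quoted #if chain for the given flag values.

    Pass None for a flag that the platform headers do not define.
    """
    open_sync_flag = o_sync                     # #define OPEN_SYNC_FLAG O_SYNC
    open_datasync_flag = None
    if open_sync_flag is not None and o_dsync is not None:
        if o_dsync != open_sync_flag:           # "O_DSYNC is distinct?"
            open_datasync_flag = o_dsync        # #define OPEN_DATASYNC_FLAG O_DSYNC
    if open_datasync_flag is not None:
        return "open_datasync"
    if have_fdatasync:
        return "fdatasync"
    return "fsync"  # assumption: stands in for the elided remaining branches


if __name__ == "__main__":
    print(default_sync_method(0o10000, 0o10000))     # older fcntl.h values
    print(default_sync_method(0o4010000, 0o10000))   # newer fcntl.h values
```

Feeding in the older header values selects fdatasync, and the newer ones select open_datasync, matching the two "show wal_sync_method;" results reported above.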

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
"Pierre C"
Date:
>> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
>> I forget sometimes that people actually run with the default where this
>> becomes an important consideration.
>
> Do you have any testing in favor of 16mb vs. lower/higher?

From some tests I had done some time ago, using separate spindles (RAID1)
for xlog, no battery, on 8.4, with stuff that generates lots of xlog
(INSERT INTO SELECT):

When using a small wal_buffers, there was a problem when switching from
one xlog file to the next. Basically an fsync was issued, but most of the
previous log segment was still not written. So, postgres was waiting for
the fsync to finish. Of course, the default 64 kB of wal_buffers is
quickly filled up, and all writes wait for the end of this fsync. This
caused hiccups in the xlog traffic, and xlog throughput wasn't nearly as
high as the disks would allow. Sticking a stethoscope on the xlog
hard drives revealed a lot more random accesses than I would have liked
(this is a much simpler solution than tracing the IOs, lol).

I set wal_writer_delay to a very low setting (I don't remember which,
perhaps 1 ms) so the walwriter was in effect constantly flushing the wal
buffers to disk. I also used fdatasync instead of fsync. Then I set
wal_buffers to a rather high value, like 32-64 MB. Throughput and
performance were a lot better, and the xlog drives made a much more
"linear-access" noise.

What happened is that, since wal_buffers was larger than what the drives
can write in 1-2 rotations, it could absorb wal traffic during the time
postgres waits for fdatasync / wal segment change, so the inserts would
not have to wait. And lowering the walwriter delay made it write something
on each disk rotation, so that when a COMMIT or segment switch came, most
of the time, the WAL was already synced and there was no wait.
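
For a sense of scale on "what the drives can write in 1-2 rotations", here is a back-of-the-envelope sketch in Python (not part of the original message). The 7200 RPM and 72 MB/s figures are illustrative assumptions, not the hardware used in the tests described above.

```python
def bytes_per_rotation(rpm, throughput_mb_s):
    """Bytes a drive can write sequentially during one platter rotation."""
    seconds_per_rotation = 60.0 / rpm
    return throughput_mb_s * 1_000_000 * seconds_per_rotation


def rotations_absorbed(wal_buffers_mb, rpm, throughput_mb_s):
    """How many rotations' worth of sequential writes fit in wal_buffers."""
    return wal_buffers_mb * 1024 * 1024 / bytes_per_rotation(rpm, throughput_mb_s)


if __name__ == "__main__":
    rpm, mb_s = 7200, 72   # assumed drive characteristics
    print(bytes_per_rotation(rpm, mb_s))
    print(rotations_absorbed(64 / 1024, rpm, mb_s))   # default 64 kB
    print(rotations_absorbed(32, rpm, mb_s))          # 32 MB
```

Under those assumptions the default 64 kB covers only a fraction of one rotation, while 32 MB covers dozens, which fits the explanation that a large buffer can absorb traffic while a sync is in progress.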

Just my 2 c ;)

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Marti Raudsepp wrote:
> I will grant you that the details were wrong, but I stand by the conclusion.
> I can state for a fact that PostgreSQL's default wal_sync_method
> varies depending on the <fcntl.h> header.
>

Yes; it's supposed to, and that logic works fine on some other
platforms.  The question is exactly what the new Linux O_DSYNC behavior
is doing, in regards to whether it flushes drive caches out or not.
Until you've quantified which of the cases do that--which is required
for reliable operation of PostgreSQL--and which don't, you don't have
any data that can be used to draw a conclusion from.  If some setups are
faster because they write less reliably, that doesn't automatically make
them the better choice.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Monday 08 November 2010 00:35:29 Greg Smith wrote:
> Marti Raudsepp wrote:
> > I will grant you that the details were wrong, but I stand by the
> > conclusion. I can state for a fact that PostgreSQL's default
> > wal_sync_method varies depending on the <fcntl.h> header.
>
> Yes; it's supposed to, and that logic works fine on some other
> platforms.  The question is exactly what the new Linux O_DSYNC behavior
> is doing, in regards to whether it flushes drive caches out or not.
> Until you've quantified which of the cases do that--which is required
> for reliable operation of PostgreSQL--and which don't, you don't have
> any data that can be used to draw a conclusion from.  If some setups are
> faster because they write less reliably, that doesn't automatically make
> them the better choice.
I think that's FUD. Sorry.

Can you explain to me why fsync() may/should/could be *any* less reliable than
O_DSYNC? On *any* platform. Or fdatasync() in the special way it's used with
pg, namely completely preallocated files.

I think the reasons why O_DSYNC is slow in most circumstances, especially,
but not only, in combination with a small wal_buffers setting, are pretty clear.

Making a setting which is only supported on a small range of systems highest
in the preferences list is even more doubtful than the already strange choice
of making O_DSYNC the default given the way it works (i.e. no reordering,
synchronous writes in the bgwriter, synchronous writes on wal_buffers pressure
etc).
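
The difference between the two methods at the system-call level can be sketched as follows (Python, Linux assumed; not part of the original message). With O_DSYNC every write() waits for the data to be reported durable; with fdatasync() the writes are buffered and flushed together. The file name and helper names are invented for the example, and whether "reported durable" includes flushing the drive's write cache is exactly the question under dispute in this thread, so the sketch says nothing about that.

```python
import os
import tempfile

BLOCK = b"\0" * 8192   # one block of the default XLOG_BLCKSZ size


def preallocate(path, n_blocks):
    """Fill the file with zeros up front, similar in spirit to a preallocated WAL segment."""
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())


def write_open_datasync(path, n_blocks):
    """Open with O_DSYNC: each write() is synchronous."""
    fd = os.open(path, os.O_WRONLY | os.O_DSYNC)
    try:
        return sum(os.write(fd, BLOCK) for _ in range(n_blocks))
    finally:
        os.close(fd)


def write_then_fdatasync(path, n_blocks):
    """Plain writes go to the page cache; a single fdatasync() flushes them."""
    fd = os.open(path, os.O_WRONLY)
    try:
        written = sum(os.write(fd, BLOCK) for _ in range(n_blocks))
        os.fdatasync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "segment")
        preallocate(path, 16)
        print(write_open_datasync(path, 16), write_then_fdatasync(path, 16))
```

Timing the two functions on a real disk with larger block counts would be one simple way to observe the per-write cost described above, though it is no substitute for a database-level benchmark.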

Greetings,

Andres

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Andres Freund wrote:
> I think that's FUD. Sorry.
>

Yes, there's plenty of uncertainty and doubt here, but not from me.  The
test reports given so far have been so riddled with errors I don't trust
any of them.

As a counter example showing my expectations here, the "Testing
Sandforce SSD" tests done by Yeb Havinga:
http://archives.postgresql.org/message-id/4C4A9452.9070100@gmail.com
followed the right method for confirming both write integrity and
performance including pull the plug situations.  Those I trusted.  What
Marti had posted, and what Phoronix investigated, just aren't that thorough.

> Can you explain to me why fsync() may/should/could be *any* less reliable than
> O_DSYNC? On *any* platform. Or fdatasync() in the special way it's used with
> pg, namely completely preallocated files.
>

If the Linux kernel has done extra work so that O_DSYNC writes are
forced to disk including a cache flush, but that isn't done for just
fdatasync() calls, there could be a difference here.  The database still
wouldn't work right in that case, because checkpoint writes are still
going to be using fdatasync.

I'm not sure what the actual behavior is supposed to be, but ultimately
it doesn't matter.  The history of the Linux kernel developers in this
area has been so completely full of bugs and incomplete implementations
that I am working from the assumption that we know nothing about what
actually works and what doesn't without doing careful real-world testing.

> I think the reasons why O_DSYNC is, especially, but not only, in combination
> with a small wal_buffers setting, slow in most circumstances are pretty clear.
>

Where are your benchmarks proving it then?  If you're right about this,
and I'm not saying you aren't, it should be obvious in simple benchmarks
by stepping through various sizes for wal_buffers and seeing the
throughput/latency situation improve.  But since I haven't seen that
done, this one is still in the uncertainty & doubt bucket too.  You're
assuming one of the observed problems corresponds to this theorized
cause.  But you can't prove a performance change on theory.  You have to
isolate it and then you'll know.  So long as there are multiple
uncertainties going on here, I don't have any conclusion yet, just a
list of things to investigate that's far longer than the list of what's
been looked at so far.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Mon, Nov 8, 2010 at 01:35, Greg Smith <greg@2ndquadrant.com> wrote:
> Yes; it's supposed to, and that logic works fine on some other platforms.

No, the logic was broken to begin with. Linux technically supported
O_DSYNC all along. PostgreSQL used fdatasync as the default. Now,
because Linux added proper O_SYNC support, PostgreSQL suddenly prefers
O_DSYNC over fdatasync?

> Until you've
> quantified which of the cases do that--which is required for reliable
> operation of PostgreSQL--and which don't, you don't have any data that can
> be used to draw a conclusion from.  If some setups are faster because they
> write less reliably, that doesn't automatically make them the better choice.

I don't see your point. If fdatasync worked on Linux, AS THE DEFAULT,
all the time until recently, then how does it all of a sudden need
proof NOW?

If anything, the new open_datasync should be scrutinized because it
WASN'T the default before and it hasn't gotten as much testing on
Linux.

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Mon, Nov 8, 2010 at 02:05, Greg Smith <greg@2ndquadrant.com> wrote:
> Where are your benchmarks proving it then?  If you're right about this, and
> I'm not saying you aren't, it should be obvious in simple benchmarks by
> stepping through various sizes for wal_buffers and seeing the
> throughput/latency situation improve.

Since benchmarking is the easy part, I did that. I plotted the time
taken by inserting 2 million rows to a table with a single integer
column and no indexes (total 70MB). Entire script is attached. If you
don't agree with something in this benchmark, please suggest
improvements.

Chart: http://ompldr.org/vNjNiNQ/wal_sync_method1.png
Spreadsheet: http://ompldr.org/vNjNiNg/wal_sync_method1.ods (the 2nd
worksheet has exact measurements)

This is a different machine from the original post, but similar
configuration. One 1TB 7200RPM Seagate Barracuda, no disk controller
cache, 4G RAM, Phenom X4, Linux 2.6.36, PostgreSQL 9.0.1, Arch Linux.

This time I created a separate 20GB ext4 partition specially for
PostgreSQL, with all default settings (shared_buffers=32MB). The
partition is near the end of the disk, so hdparm gives a sequential
read throughput of ~72 MB/s. I'm getting frequent checkpoint warnings,
should I try larger checkpoint_segments too?

The partition is re-created and 'initdb' is re-run for each test, to
prevent file system allocation from affecting results. I did two runs
of all benchmarks. The points on the graph show a sum of INSERT time +
COMMIT time in seconds.
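
The attached script is not reproduced in this archive. As a hedged sketch of how such a sweep could be laid out (an illustration, not the script that was actually used), the following Python generates one settings combination per run. The power-of-two steps from 64kB to 128MB and all names are assumptions made for the example.

```python
from itertools import product

SYNC_METHODS = ["open_datasync", "fdatasync"]   # the two methods compared in this thread


def wal_buffer_sizes(start_kb=64, stop_kb=128 * 1024):
    """Powers of two from start_kb to stop_kb inclusive, in kB (assumed step pattern)."""
    sizes = []
    kb = start_kb
    while kb <= stop_kb:
        sizes.append(kb)
        kb *= 2
    return sizes


def format_size(kb):
    """Render a size in kB the way postgresql.conf accepts it, e.g. 64kB or 16MB."""
    return f"{kb // 1024}MB" if kb >= 1024 else f"{kb}kB"


def run_matrix():
    """Yield one dict of settings per benchmark run."""
    for method, kb in product(SYNC_METHODS, wal_buffer_sizes()):
        yield {"wal_sync_method": method, "wal_buffers": format_size(kb)}


if __name__ == "__main__":
    for settings in run_matrix():
        print(settings)
```

Each combination would then be applied to a freshly initialized cluster before timing the INSERT and COMMIT, following the procedure described above.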

One surprising thing on the graph is a "plateau", where open_datasync
performs almost equally with wal_buffers=128kB and 256kB.

Another noteworthy difference (not visible on the graph) is that with
open_datasync -- but not fdatasync -- and wal_buffers=128M, INSERT
time keeps shrinking, but COMMIT takes longer. The total INSERT+COMMIT
time remains the same, however.

----

I have a few expendable hard drives here so I can test reliability by
pulling the SATA cable as well. Is this kind of testing useful? What
workloads do you suggest?

Regards,
Marti

Attachment

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Scott Carey
Date:
On Nov 7, 2010, at 6:35 PM, Marti Raudsepp wrote:

> On Mon, Nov 8, 2010 at 01:35, Greg Smith <greg@2ndquadrant.com> wrote:
>> Yes; it's supposed to, and that logic works fine on some other platforms.
>
> No, the logic was broken to begin with. Linux technically supported
> O_DSYNC all along. PostgreSQL used fdatasync as the default. Now,
> because Linux added proper O_SYNC support, PostgreSQL suddenly prefers
> O_DSYNC over fdatasync?
>
>> Until you've
>> quantified which of the cases do that--which is required for reliable
>> operation of PostgreSQL--and which don't, you don't have any data that can
>> be used to draw a conclusion from.  If some setups are faster because they
>> write less reliably, that doesn't automatically make them the better choice.
>
> I don't see your point. If fdatasync worked on Linux, AS THE DEFAULT,
> all the time until recently, then how does it all of a sudden need
> proof NOW?
>
> If anything, the new open_datasync should be scrutinized because it
> WASN'T the default before and it hasn't gotten as much testing on
> Linux.
>

I agree.  In my opinion, the burden of proof lies with those contending that the default value should _change_ from
fdatasync to O_DSYNC on linux.  If the default changes, all power-fail testing and other reliability tests done prior on
a hardware configuration may become invalid without users even knowing.

Unfortunately, a code change in postgres is required to _prevent_ the default from changing when compiled and run
against the latest kernels.

Summary:
Until recently, there was code with a code comment in the Linux kernel that said "For now, when the user asks for
O_SYNC, we'll actually give O_DSYNC".  Linux has had O_DSYNC forever and ever, but not O_SYNC.
If O_DSYNC is preferred over fdatasync for Postgres xlog (as the code indicates), it should have been the preferred for
years on Linux as well.  If fdatasync has been the preferred method on Linux, and the O_SYNC = O_DSYNC test was for
that, then the purpose behind the test has broken.

No matter how you slice it, the default on Linux is implicitly changing and the choice is to either:
 * Return the default to fdatasync
 * Let it implicitly change to O_DSYNC

The latter choice is the one that requires testing to prove that it is the proper and preferred default from the
performance and data reliability POV.  The former is the status quo -- but requires a code change.








Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Scott Carey <scott@richrelevance.com> writes:
> No matter how you slice it, the default on Linux is implicitly changing and the choice is to either:
>  * Return the default to fdatasync
>  * Let it implicitly change to O_DSYNC

> The latter choice is the one that requires testing to prove that it is the proper and preferred default from the
> performance and data reliability POV.

And, in fact, the game plan is to do that testing and see which default
we want.  I think it's premature to argue further about this until we
have some test results.

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Scott Carey wrote:
> In my opinion, the burden of proof lies with those contending that the default value should _change_ from fdatasync
> to O_DSYNC on linux.  If the default changes, all power-fail testing and other reliability tests done prior on a
> hardware configuration may become invalid without users even knowing.
>

This seems to be ignoring the fact that unless you either added a
non-volatile cache or specifically turned off all write caching on your
drives, the results of all power-fail testing done on earlier versions
of Linux was that it failed.  The default configuration of PostgreSQL on
Linux has been that any user who has a simple SATA drive gets unsafe
writes, unless they go out of their way to prevent them.

Whatever newer kernels do by default cannot be worse.  The open question
is whether it's still broken, in which case we might as well favor the
known buggy behavior rather than the new one, or whether everything has
improved enough to no longer be unsafe with the new defaults.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
Hi,

On Monday 08 November 2010 23:12:57 Greg Smith wrote:
> This seems to be ignoring the fact that unless you either added a
> non-volatile cache or specifically turned off all write caching on your
> drives, the results of all power-fail testing done on earlier versions
> of Linux was that it failed.  The default configuration of PostgreSQL on
> Linux has been that any user who has a simple SATA drive gets unsafe
> writes, unless they go out of their way to prevent them.
Which is about *no* argument in favor of any of the options, right?

> Whatever newer kernels do by default cannot be worse.  The open question
> is whether it's still broken, in which case we might as well favor the
> known buggy behavior rather than the new one, or whether everything has
> improved enough to no longer be unsafe with the new defaults.
Either I majorly misunderstand you, or ... I don't know.

There simply *is* no new implementation relevant for this discussion. Full
Stop. What changed is that O_DSYNC is defined differently from O_SYNC these days
and O_SYNC actually does what it should. Which causes pg to move open_datasync
first in the preference list doing what the option with the lowest preference
did up to now.

That does not *at all* change the earlier fdatasync() or fsync()
implementations/tests. It simply makes open_datasync the default doing what
open_sync did earlier.
For that note that open_sync was the method of *least* preference till now...
And that fdatasync() thus was the default till now. Which it is not anymore.

I don't argue *at all* that we have to test the change moving fdatasync before
open_datasync on the *other* operating systems. What I completely don't get is
all that talking about data consistency on Linux. It's simply irrelevant in
that context.

Andres




Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Mon, Nov 8, 2010 at 20:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The latter choice is the one that requires testing to prove that it is
>> the proper and preferred default from the performance and data
>> reliability POV.
>
> And, in fact, the game plan is to do that testing and see which default
> we want.  I think it's premature to argue further about this until we
> have some test results.

Who will be doing that testing? You said you're relying on Greg Smith
to manage the testing, but he's obviously uninterested, so it seems
unlikely that this will go anywhere.

I posted my results with the simple INSERT test, but nobody cared. I
could do some pgbench runs, but I have no idea what parameters would
give useful results.

Meanwhile, PostgreSQL performance is regressing and there's still no
evidence that open_datasync is any safer.

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Marti Raudsepp <marti@juffo.org> writes:
> On Mon, Nov 8, 2010 at 20:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> And, in fact, the game plan is to do that testing and see which default
>> we want.  I think it's premature to argue further about this until we
>> have some test results.

> Who will be doing that testing? You said you're relying on Greg Smith
> to manage the testing, but he's obviously uninterested, so it seems
> unlikely that this will go anywhere.

What's your basis for asserting he's uninterested?  Please have a little
patience.

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Sat, Nov 13, 2010 at 20:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What's your basis for asserting he's uninterested?  Please have a little
> patience.

My apologies, I was under the impression that he hadn't answered your
request, but he did in the -hackers thread.

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Time for a deeper look at what's going on here...I installed RHEL6 Beta
2 yesterday, on the presumption that since the release version just came
out this week it was likely the same version Marti tested against.
Also, it was the one I already had a DVD to install for.  This was on a
laptop with 7200 RPM hard drive, already containing an Ubuntu
installation for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to
get a rough idea of which configurations were likely flushing data to
disk correctly, and in which that could not possibly be true.  7200 RPM
= 120 rotations/second, which puts an upper
limit of 120 true fsync executions per second.  The test_fsync released
with PostgreSQL 9.0 now reports its values on a scale that you can
compare directly against that limit (earlier versions reported
seconds/commit, not commits/second).

First I built test_fsync from inside of an existing PostgreSQL 9.1 HEAD
checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started with looking at the Ubuntu system running ext3, which
represents the status quo we've been seeing the past few years.
Initially the drive write cache was turned on:

Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC
2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      88476.784/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write             1192.135/second
    8k write, fdatasync            1222.158/second
    8k write, fsync                1097.980/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes           527.361/second
    8k write, 8k write, fdatasync  1105.204/second
    8k write, 8k write, fsync      1084.050/second

Compare open_sync with different sizes:
    open_sync 16k write             966.047/second
    2 open_sync 8k writes           529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close         1064.177/second
    8k write, close, fsync         1042.337/second

Two notable things here.  One, there is no open_datasync defined in this
older kernel.  Two, all methods of commit give equally inflated commit
rates, far faster than the drive is capable of.  This proves this setup
isn't flushing the drive's write cache after commit.
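The reasoning above can be written down as a small check.  This is only a sketch: the function names and the 10% slack factor are made up for illustration, and the single assumption is the one already stated, that a drive can complete at most one honest flush per platter rotation.

```python
def max_commits_per_second(rpm: float) -> float:
    """Upper bound on true sync-to-platter operations: one per rotation."""
    return rpm / 60.0


def looks_cached(measured_per_second: float, rpm: float, slack: float = 1.1) -> bool:
    """True when a measured sync rate exceeds what the platter can deliver.

    `slack` is an arbitrary allowance for measurement noise (an assumption).
    """
    return measured_per_second > max_commits_per_second(rpm) * slack


bound = max_commits_per_second(7200)
print(bound)                          # 120.0
print(looks_cached(1222.158, 7200))   # True: the fdatasync rate reported above
print(looks_cached(108.106, 7200))    # False: a rate just under the bound
```

A rate ten times over the bound, as in the ext3 results above, cannot be explained by run-to-run noise, which is why it points at the drive's write cache.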

You can get safe behavior out of the old kernel by disabling its write
cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

Loops = 10000

Simple write:
    8k write                      89023.413/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write              106.968/second
    8k write, fdatasync             108.106/second
    8k write, fsync                 104.238/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes            51.637/second
    8k write, 8k write, fdatasync   109.256/second
    8k write, 8k write, fsync       103.952/second

Compare open_sync with different sizes:
    open_sync 16k write             109.562/second
    2 open_sync 8k writes            52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          107.179/second
    8k write, close, fsync          106.923/second

And now results are as expected:  just under 120/second.

Onto RHEL6.  Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to see a straight comparison
against the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104194.886/second

Compare file sync methods using one write:
    open_datasync 8k write           97.828/second
    open_sync 8k write              109.158/second
    8k write, fdatasync             109.838/second
    8k write, fsync                  20.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    2 open_sync 8k writes            53.721/second
    8k write, 8k write, fdatasync   109.731/second
    8k write, 8k write, fsync        20.918/second

Compare open_sync with different sizes:
    open_sync 16k write             109.552/second
    2 open_sync 8k writes            54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           20.800/second
    8k write, close, fsync           20.868/second

A few changes then.  open_datasync is available now.  It looks slightly
slower than the alternatives on this test, but I didn't see that on the
later tests so I'm thinking that's just occasional run to run
variation.  For some reason regular fsync is dramatically slower in this
kernel than earlier ones.  Perhaps a lot more metadata being flushed all
the way to the disk in that case now?

The issue that I think Marti has been concerned about is highlighted in
this interesting subset of the data:

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    8k write, 8k write, fdatasync   109.731/second

The results here aren't surprising; if you do two dsync writes, that
will take two disk rotations, while two writes followed by a single sync
only takes one.  But that does mean that in the case of small values for
wal_buffers, like the default, you could easily end up paying a rotation
sync penalty more than once per commit.
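For readers who want to see the two access patterns side by side, here is a minimal sketch of them at the system-call level.  It assumes a Linux system where Python exposes os.O_DSYNC and os.fdatasync; the file names are arbitrary, and this is not how test_fsync or PostgreSQL itself is implemented.

```python
import os
import tempfile

BLOCK = b"\0" * 8192  # one 8 kB block, the write size test_fsync uses


def two_dsync_writes(path: str) -> None:
    """Open with O_DSYNC: each write() returns only once its data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.write(fd, BLOCK)  # first wait for the disk
        os.write(fd, BLOCK)  # second wait for the disk
    finally:
        os.close(fd)


def two_writes_one_fdatasync(path: str) -> None:
    """Plain writes land in the OS cache; one fdatasync() flushes both."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, BLOCK)
        os.write(fd, BLOCK)
        os.fdatasync(fd)     # single wait for the disk
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "dsync.out")
    b = os.path.join(tmp, "fdatasync.out")
    two_dsync_writes(a)
    two_writes_one_fdatasync(b)
    sizes = (os.path.getsize(a), os.path.getsize(b))

print(sizes)  # (16384, 16384)
```

Both functions leave the same 16 kB on disk; the difference is only in how many times the caller has to wait for the drive, which is what the two-write rows of the test output measure.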

Next question is what happens if I turn the drive's write cache back on:

$ sudo hdparm -W1 /dev/sda

/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

$ ./test_fsync

[gsmith@meddle fsync]$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104198.143/second

Compare file sync methods using one write:
    open_datasync 8k write          110.707/second
    open_sync 8k write              110.875/second
    8k write, fdatasync             110.794/second
    8k write, fsync                  28.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        55.731/second
    2 open_sync 8k writes            55.618/second
    8k write, 8k write, fdatasync   110.551/second
    8k write, 8k write, fsync        28.843/second

Compare open_sync with different sizes:
    open_sync 16k write             110.176/second
    2 open_sync 8k writes            55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           28.779/second
    8k write, close, fsync           28.855/second

This is nice to see from a reliability perspective.  On all three of the
viable sync methods here, the speed seen suggests the drive's volatile
write cache is being flushed after every commit.  This is going to be
bad for people who have gotten used to doing development on systems
where that's not honored and they don't care, because this looks like a
90% drop in performance on those systems.  But since the new behavior is
safe and the earlier one was not, it's hard to get mad about it.
Developers probably just need to be taught to turn synchronous_commit
off to speed things up when playing with test data.

test_fsync writes to /var/tmp/test_fsync.out by default, not paying
attention to what directory you're in.  So to use it to test another
filesystem, you have to make sure to give it an explicit full path.
Next I tested against the old Ubuntu partition that was formatted with
ext3, with the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
    8k write                      100943.825/second

Compare file sync methods using one write:
    open_datasync 8k write          106.017/second
    open_sync 8k write              108.318/second
    8k write, fdatasync             108.115/second
    8k write, fsync                 105.270/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.313/second
    2 open_sync 8k writes            54.045/second
    8k write, 8k write, fdatasync    55.291/second
    8k write, 8k write, fsync        53.243/second

Compare open_sync with different sizes:
    open_sync 16k write              54.980/second
    2 open_sync 8k writes            53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          105.032/second
    8k write, close, fsync          103.987/second

Strange...it looks like ext3 is executing cache flushes, too.  Note that
all of the "Compare file sync methods using two writes" results are half
speed now; it's as if ext3 is flushing the first write out immediately?
This result was unexpected, and I don't trust it yet; I want to validate
this elsewhere.

What about XFS?  That's a first class filesystem on RHEL6 too:

[root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
    8k write                      71878.324/second

Compare file sync methods using one write:
    open_datasync 8k write           36.303/second
    open_sync 8k write               35.714/second
    8k write, fdatasync              35.985/second
    8k write, fsync                  35.446/second

I stopped that there, sick of waiting for it, as there's obviously some
serious work (mounting options or such at a minimum) that needs to be
done before XFS matches the other two.  Will return to that later.

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data
out through the drive write caches correctly.

2) On single writes, there's no performance difference between the main
three methods you might use, with the straight fsync method having a
serious regression in this use case.

3) WAL writes that are forced by wal_buffers filling will turn into a
commit-length write when using the new, default open_datasync.  Using
the older default of fdatasync avoids that problem, in return for
causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC
writes over fdatasync ones is avoiding the OS cache.

I want to next go through and replicate some of the actual database
level tests before giving a full opinion on whether this data proves
it's worth changing the wal_sync_method detection.  So far I'm torn
between whether that's the right approach, or if we should just increase
the default value for wal_buffers to something more reasonable.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Robert Haas
Date:
On Tue, Nov 16, 2010 at 3:39 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I want to next go through and replicate some of the actual database level
> tests before giving a full opinion on whether this data proves it's worth
> changing the wal_sync_method detection.  So far I'm torn between whether
> that's the right approach, or if we should just increase the default value
> for wal_buffers to something more reasonable.

How about both?

open_datasync seems problematic for a number of reasons - you get an
immediate write-through whether you need it or not, including, as you
point out, the case where you want to write several blocks at once
and then force them all out together.

And 64kB for a ring buffer just seems awfully small.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Josh Berkus
Date:
On 11/16/10 12:39 PM, Greg Smith wrote:
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection.  So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.

We'd love to, but wal_buffers uses sysV shmem.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>> I want to next go through and replicate some of the actual database
>> level tests before giving a full opinion on whether this data proves
>> it's worth changing the wal_sync_method detection.  So far I'm torn
>> between whether that's the right approach, or if we should just increase
>> the default value for wal_buffers to something more reasonable.

> We'd love to, but wal_buffers uses sysV shmem.

Well, we're not going to increase the default to gigabytes, but we could
very probably increase it by a factor of 10 or so without anyone
squawking.  It's been awhile since I heard of anyone trying to run PG in
4MB shmmax.  How much would a change of that size help?

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Marti Raudsepp
Date:
On Wed, Nov 17, 2010 at 01:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, we're not going to increase the default to gigabytes, but we could
> very probably increase it by a factor of 10 or so without anyone
> squawking.  It's been awhile since I heard of anyone trying to run PG in
> 4MB shmmax.  How much would a change of that size help?

In my testing, when running a large bulk insert query with fdatasync
on ext4, changing wal_buffers has very little effect:
http://ompldr.org/vNjNiNQ/wal_sync_method1.png

(More details at
http://archives.postgresql.org/pgsql-performance/2010-11/msg00094.php
)

It would take some more testing to say this conclusively, but looking
at the raw data, there only seems to be an effect when moving from 8
to 16MB. Could be different on other file systems though.

Regards,
Marti

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Mladen Gogala
Date:
Josh Berkus wrote:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>
>> I want to next go through and replicate some of the actual database
>> level tests before giving a full opinion on whether this data proves
>> it's worth changing the wal_sync_method detection.  So far I'm torn
>> between whether that's the right approach, or if we should just increase
>> the default value for wal_buffers to something more reasonable.
>>
>
> We'd love to, but wal_buffers uses sysV shmem.
>
>
Speaking of the SYSV SHMEM, is it possible to use huge pages?

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > On 11/16/10 12:39 PM, Greg Smith wrote:
> >> I want to next go through and replicate some of the actual database
> >> level tests before giving a full opinion on whether this data proves
> >> it's worth changing the wal_sync_method detection.  So far I'm torn
> >> between whether that's the right approach, or if we should just increase
> >> the default value for wal_buffers to something more reasonable.
> >
> > We'd love to, but wal_buffers uses sysV shmem.
>
> Well, we're not going to increase the default to gigabytes
Especially not as I don't think it will have any effect after wal_segment_size
as that will force a write-out anyway. Or am I misremembering the
implementation?

Andres

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
>> Well, we're not going to increase the default to gigabytes

> Especially not as I don't think it will have any effect after wal_segment_size
> as that will force a write-out anyway. Or am I misremembering the
> implementation?

Well, there's a forced fsync after writing the last page of an xlog
file, but I don't believe that proves that more than 16MB of xlog
buffers is useless.  Other processes could still be busy filling the
buffers.

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
> >> Well, we're not going to increase the default to gigabytes
> >
> > Especially not as I don't think it will have any effect after
> > wal_segment_size as that will force a write-out anyway. Or am I
> > misremembering the implementation?
>
> Well, there's a forced fsync after writing the last page of an xlog
> file, but I don't believe that proves that more than 16MB of xlog
> buffers is useless.  Other processes could still be busy filling the
> buffers.
Maybe I am missing something, but I think the relevant AdvanceXLInsertBuffer()
is currently called with WALInsertLock held?

Andres


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
>> Well, there's a forced fsync after writing the last page of an xlog
>> file, but I don't believe that proves that more than 16MB of xlog
>> buffers is useless.  Other processes could still be busy filling the
>> buffers.

> Maybe I am missing something, but I think the relevant AdvanceXLInsertBuffer()
> is currently called with WALInsertLock held?

The fsync is associated with the write, which is not done with insert
lock held.  We're not quite that dumb.

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Andres Freund
Date:
On Wednesday 17 November 2010 02:04:28 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
> >> Well, there's a forced fsync after writing the last page of an xlog
> >> file, but I don't believe that proves that more than 16MB of xlog
> >> buffers is useless.  Other processes could still be busy filling the
> >> buffers.
> >
> > Maybe I am missing something, but I think the relevant
> > AdvanceXLInsertBuffer() is currently called with WALInsertLock held?
>
> The fsync is associated with the write, which is not done with insert
> lock held.  We're not quite that dumb.
Ah, I see. The XLogWrite in AdvanceXLInsertBuffer is only happening if the head
of the buffer gets to the tail - which is more likely if the wal buffers are
small...

Andres


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Josh Berkus
Date:
> Well, we're not going to increase the default to gigabytes, but we could
> very probably increase it by a factor of 10 or so without anyone
> squawking.  It's been awhile since I heard of anyone trying to run PG in
> 4MB shmmax.  How much would a change of that size help?

Last I checked, though, this comes out of the allocation available to
shared_buffers.  And there definitely are several OSes (several linuxes,
OSX) still limited to 32MB by default.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
I wrote:
> The fsync is associated with the write, which is not done with insert
> lock held.  We're not quite that dumb.

But wait --- are you thinking of the call path where a write (and
possible fsync) is forced during AdvanceXLInsertBuffer because there's
no WAL buffer space left?  If so, that's *exactly* the scenario that
can be expected to be less common with more buffer space.
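A toy model makes the scaling concrete.  It is deliberately crude and is not PostgreSQL's actual buffer management: it assumes a single inserter that never commits, 8 kB WAL pages, and that every time the buffer is found full, one synchronous write empties it completely.  The 64 MB WAL volume is a made-up figure for illustration.

```python
PAGE = 8192  # bytes per WAL page (assumed default block size)


def forced_writes(wal_bytes: int, wal_buffers_bytes: int) -> int:
    """Count buffer-full events while inserting wal_bytes without a commit.

    Toy assumption: each buffer-full event triggers one synchronous write
    that frees the entire buffer.
    """
    buffer_pages = wal_buffers_bytes // PAGE
    total_pages = -(-wal_bytes // PAGE)  # ceiling division
    if total_pages <= buffer_pages:
        return 0  # everything fits; nothing is forced out early
    return (total_pages - 1) // buffer_pages


KB, MB = 1024, 1024 * 1024
wal = 64 * MB  # hypothetical WAL volume for one bulk insert

print(forced_writes(wal, 64 * KB))   # 1023
print(forced_writes(wal, 640 * KB))  # 102
print(forced_writes(wal, 16 * MB))   # 3
```

Under these assumptions the number of forced writes falls roughly in proportion to the buffer size, which is the direction of the effect described above; the real numbers depend on concurrency and on how much each forced write actually flushes.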

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Well, we're not going to increase the default to gigabytes, but we could
>> very probably increase it by a factor of 10 or so without anyone
>> squawking.  It's been awhile since I heard of anyone trying to run PG in
>> 4MB shmmax.  How much would a change of that size help?

> Last I checked, though, this comes out of the allocation available to
> shared_buffers.  And there definitely are several OSes (several linuxes,
> OSX) still limited to 32MB by default.

Sure, but the current default is a measly 64kB.  We could increase that
10x for a relatively small percentage hit in the size of shared_buffers,
if you suppose that there's 32MB available.  The current default is set
to still work if you've got only a couple of MB in SHMMAX.

What we'd want is for initdb to adjust the setting as part of its
probing to see what SHMMAX is set to.
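The "relatively small percentage" can be checked with simple arithmetic.  The sketch below only compares wal_buffers against a 32MB SHMMAX; it ignores every other consumer of shared memory, so treat the percentages as a rough guide rather than an exact accounting.

```python
KB = 1024
MB = 1024 * KB


def share_of_shmmax(wal_buffers: int, shmmax: int) -> float:
    """Percentage of the shared memory limit taken by wal_buffers alone."""
    return 100.0 * wal_buffers / shmmax


current = share_of_shmmax(64 * KB, 32 * MB)
tenfold = share_of_shmmax(640 * KB, 32 * MB)
print(current)  # 0.1953125
print(tenfold)  # 1.953125
```

So a tenfold increase would still take under 2% of a 32MB segment.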

            regards, tom lane

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Robert Haas
Date:
On Tue, Nov 16, 2010 at 6:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>> I want to next go through and replicate some of the actual database
>> level tests before giving a full opinion on whether this data proves
>> it's worth changing the wal_sync_method detection.  So far I'm torn
>> between whether that's the right approach, or if we should just increase
>> the default value for wal_buffers to something more reasonable.
>
> We'd love to, but wal_buffers uses sysV shmem.

<places tongue firmly in cheek>

Gee, too bad there's not some other shared-memory implementation we could use...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Scott Carey
Date:
On Nov 16, 2010, at 4:05 PM, Mladen Gogala wrote:

> Josh Berkus wrote:
>> On 11/16/10 12:39 PM, Greg Smith wrote:
>>
>>> I want to next go through and replicate some of the actual database
>>> level tests before giving a full opinion on whether this data proves
>>> it's worth changing the wal_sync_method detection.  So far I'm torn
>>> between whether that's the right approach, or if we should just increase
>>> the default value for wal_buffers to something more reasonable.
>>>
>>
>> We'd love to, but wal_buffers uses sysV shmem.
>>
>>
> Speaking of the SYSV SHMEM, is it possible to use huge pages?

RHEL 6 and friends have transparent hugepage support.  I'm not sure if
they yet transparently do it for SYSV SHMEM, but they do for most
everything else.  Sequential traversal of a process heap is several
times faster with hugepages.  Unfortunately, postgres doesn't organize
its blocks in its shared_mem to be sequential for a relation.  So it
might not matter much.



Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Scott Carey
Date:
On Nov 16, 2010, at 12:39 PM, Greg Smith wrote:
>
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>    8k write                      88476.784/second
>
> Compare file sync methods using one write:
>    (unavailable: open_datasync)
>    open_sync 8k write             1192.135/second
>    8k write, fdatasync            1222.158/second
>    8k write, fsync                1097.980/second
>
> Compare file sync methods using two writes:
>    (unavailable: open_datasync)
>    2 open_sync 8k writes           527.361/second
>    8k write, 8k write, fdatasync  1105.204/second
>    8k write, 8k write, fsync      1084.050/second
>
> Compare open_sync with different sizes:
>    open_sync 16k write             966.047/second
>    2 open_sync 8k writes           529.565/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>    8k write, fsync, close         1064.177/second
>    8k write, close, fsync         1042.337/second
>
> Two notable things here.  One, there is no open_datasync defined in this
> older kernel.  Two, all methods of commit give equally inflated commit
> rates, far faster than the drive is capable of.  This proves this setup
> isn't flushing the drive's write cache after commit.

Nit: there is no open_sync, only open_dsync.  Prior to recent kernels,
only (semantically) open_dsync exists, labeled as open_sync.  New
kernels move that code to open_datasync and have a NEW open_sync that
supposedly flushes metadata properly.

>
> You can get safe behavior out of the old kernel by disabling its write
> cache:
>
> $ sudo /sbin/hdparm -W0 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 0 (off)
> write-caching =  0 (off)
>
> Loops = 10000
>
> Simple write:
>    8k write                      89023.413/second
>
> Compare file sync methods using one write:
>    (unavailable: open_datasync)
>    open_sync 8k write              106.968/second
>    8k write, fdatasync             108.106/second
>    8k write, fsync                 104.238/second
>
> Compare file sync methods using two writes:
>    (unavailable: open_datasync)
>    2 open_sync 8k writes            51.637/second
>    8k write, 8k write, fdatasync   109.256/second
>    8k write, 8k write, fsync       103.952/second
>
> Compare open_sync with different sizes:
>    open_sync 16k write             109.562/second
>    2 open_sync 8k writes            52.752/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>    8k write, fsync, close          107.179/second
>    8k write, close, fsync          106.923/second
>
> And now results are as expected:  just under 120/second.
>
> Onto RHEL6.  Setup for this initial test was:
>
> $ uname -a
> Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
> x86_64 x86_64 x86_64 GNU/Linux
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
> $ mount
> /dev/sda7 on / type ext4 (rw)
>
> And I started with the write cache off to see a straight comparison
> against the above:
>
> $ sudo hdparm -W0 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 0 (off)
> write-caching =  0 (off)
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>    8k write                      104194.886/second
>
> Compare file sync methods using one write:
>    open_datasync 8k write           97.828/second
>    open_sync 8k write              109.158/second
>    8k write, fdatasync             109.838/second
>    8k write, fsync                  20.872/second

fsync is working now!  flushing metadata properly reduces performance.
However, shouldn't open_sync slow down vs open_datasync too and be similar to fsync?

Did you recompile your test on the RHEL6 system?
Code compiled on newer kernels will see O_DSYNC and O_SYNC as two
separate sentinel values, let's call them 1 and 2 respectively.  Code
compiled against earlier kernels will see both O_DSYNC and O_SYNC as the
same value, 1.  So code compiled against older kernels, asking for
O_SYNC on a newer kernel, will actually get O_DSYNC behavior!  This was
intended.  I can't find the link to the mail, but it was Linus' idea to
make old code that expected the 'faster but incorrect' behavior retain
it on newer kernels.  Only a recompile with newer header files will
trigger the new behavior and expose the 'correct' open_sync behavior.
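To make the "1 and 2" description concrete, here is a sketch using the flag values as I understand them from the generic Linux fcntl header on x86.  Those octal constants are my assumption and are not quoted anywhere in this thread; the function name is made up, and the real kernel logic is of course not written like this.

```python
# Assumed values from Linux's generic fcntl header (x86); not from this thread.
OLD_O_SYNC = 0o10000                   # what older headers called O_SYNC
NEW_O_DSYNC = 0o10000                  # the same bit under its new name
NEW_O_SYNC = 0o4000000 | NEW_O_DSYNC   # a new bit plus the O_DSYNC bit


def effective_mode(flags: int) -> str:
    """Classify open() flags the way a kernel with split flags would see them."""
    if (flags & NEW_O_SYNC) == NEW_O_SYNC:
        return "O_SYNC"
    if flags & NEW_O_DSYNC:
        return "O_DSYNC"
    return "none"


print(effective_mode(OLD_O_SYNC))  # O_DSYNC: old binary asking for "O_SYNC"
print(effective_mode(NEW_O_SYNC))  # O_SYNC: binary rebuilt with new headers
```

With these values, a binary built against old headers passes only the shared bit and so gets data-only sync semantics, while a rebuilt binary passes both bits.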

This will be 'fun' for postgres packagers and users -- data reliability
behavior differs based on what kernel it is compiled against.  Luckily,
the xlogs only need open_datasync semantics.

>
> Compare file sync methods using two writes:
>    2 open_datasync 8k writes        53.902/second
>    2 open_sync 8k writes            53.721/second
>    8k write, 8k write, fdatasync   109.731/second
>    8k write, 8k write, fsync        20.918/second
>
> Compare open_sync with different sizes:
>    open_sync 16k write             109.552/second
>    2 open_sync 8k writes            54.116/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>    8k write, fsync, close           20.800/second
>    8k write, close, fsync           20.868/second
>
> A few changes then.  open_datasync is available now.

Again, noting the detail that it is open_sync that is new (depending on where it is compiled).  The old open_sync is
relabeled to the new open_datasync.

> It looks slightly
> slower than the alternatives on this test, but I didn't see that on the
> later tests so I'm thinking that's just occasional run to run
> variation.  For some reason regular fsync is dramatically slower in this
> kernel than earlier ones.  Perhaps a lot more metadata being flushed all
> the way to the disk in that case now?
>
> The issue that I think Marti has been concerned about is highlighted in
> this interesting subset of the data:
>
> Compare file sync methods using two writes:
>    2 open_datasync 8k writes        53.902/second
>    8k write, 8k write, fdatasync   109.731/second
>
> The results here aren't surprising; if you do two dsync writes, that
> will take two disk rotations, while two writes followed by a single sync
> only takes one.  But that does mean that in the case of small values for
> wal_buffers, like the default, you could easily end up paying a rotation
> sync penalty more than once per commit.
>
> Next question is what happens if I turn the drive's write cache back on:
>
> $ sudo hdparm -W1 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 1 (on)
> write-caching =  1 (on)
>
> $ ./test_fsync
>
> [gsmith@meddle fsync]$ ./test_fsync
> Loops = 10000
>
> Simple write:
>    8k write                      104198.143/second
>
> Compare file sync methods using one write:
>    open_datasync 8k write          110.707/second
>    open_sync 8k write              110.875/second
>    8k write, fdatasync             110.794/second
>    8k write, fsync                  28.872/second
>
> Compare file sync methods using two writes:
>    2 open_datasync 8k writes        55.731/second
>    2 open_sync 8k writes            55.618/second
>    8k write, 8k write, fdatasync   110.551/second
>    8k write, 8k write, fsync        28.843/second
>
> Compare open_sync with different sizes:
>    open_sync 16k write             110.176/second
>    2 open_sync 8k writes            55.785/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>    8k write, fsync, close           28.779/second
>    8k write, close, fsync           28.855/second
>
> This is nice to see from a reliability perspective.  On all three of the
> viable sync methods here, the speed seen suggests the drive's volatile
> write cache is being flushed after every commit.  This is going to be
> bad for people who have gotten used to doing development on systems
> where that's not honored and they don't care, because this looks like a
> 90% drop in performance on those systems.
>  But since the new behavior is
> safe and the earlier one was not, it's hard to get mad about it.

I would love to see the same tests in this detail for RHEL 5.5 (which has ext3, ext4, and xfs).  I think this data
reliability issue that requires turning off write cache was in the kernel ~2.6.26 to 2.6.31 range.  Ubuntu doesn't
really care about this stuff, which is one reason I avoid it for a prod db.  I know that xfs with the right settings on
RHEL5.5 does not require disabling the write cache.

> Developers probably just need to be taught to turn synchronous_commit
> off to speed things up when playing with test data.
>

Absolutely.

> test_fsync writes to /var/tmp/test_fsync.out by default, not paying
> attention to what directory you're in.  So to use it to test another
> filesystem, you have to make sure to give it an explicit full path.
> Next I tested against the old Ubuntu partition that was formatted with
> ext3, with the write cache still on:
>
> # mount | grep /ext3
> /dev/sda5 on /ext3 type ext3 (rw)
> # ./test_fsync -f /ext3/test_fsync.out
> Loops = 10000
>
> Simple write:
>    8k write                      100943.825/second
>
> Compare file sync methods using one write:
>    open_datasync 8k write          106.017/second
>    open_sync 8k write              108.318/second
>    8k write, fdatasync             108.115/second
>    8k write, fsync                 105.270/second
>
> Compare file sync methods using two writes:
>    2 open_datasync 8k writes        53.313/second
>    2 open_sync 8k writes            54.045/second
>    8k write, 8k write, fdatasync    55.291/second
>    8k write, 8k write, fsync        53.243/second
>
> Compare open_sync with different sizes:
>    open_sync 16k write              54.980/second
>    2 open_sync 8k writes            53.563/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>    8k write, fsync, close          105.032/second
>    8k write, close, fsync          103.987/second
>
> Strange...it looks like ext3 is executing cache flushes, too.  Note that
> all of the "Compare file sync methods using two writes" results are half
> speed now; it's as if ext3 is flushing the first write out immediately?
> This result was unexpected, and I don't trust it yet; I want to validate
> this elsewhere.
>
> What about XFS?  That's a first class filesystem on RHEL6 too:
and available on later RHEL 5's.
>
> [root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
> Loops = 10000
>
> Simple write:
>    8k write                      71878.324/second
>
> Compare file sync methods using one write:
>    open_datasync 8k write           36.303/second
>    open_sync 8k write               35.714/second
>    8k write, fdatasync              35.985/second
>    8k write, fsync                  35.446/second
>
> I stopped that there, sick of waiting for it, as there's obviously some
> serious work (mounting options or such at a minimum) that needs to be
> done before XFS matches the other two.  Will return to that later.
>

Yes, XFS requires some fiddling.  Its metadata operations are also very slow.

> So, what have we learned so far:
>
> 1) On these newer kernels, both ext4 and ext3 seem to be pushing data
> out through the drive write caches correctly.
>

I suspect that some older kernels are partially OK here too.  The kernel not flushing properly appeared around
2.6.25 or so.

> 2) On single writes, there's no performance difference between the main
> three methods you might use, with the straight fsync method having a
> serious regression in this use case.

I'll ask again -- did you compile the test on RHEL6 for the RHEL6 tests?  The behavior in later kernels for this
depends on what kernel it was compiled against for open_sync.  For fsync, it's not a regression, it's actually flushing
metadata properly and therefore actually robust if there is a power failure during a write.  Even the write cache
disabled case on the Ubuntu kernel could leave a filesystem with corrupt data if the power failed in a metadata
intensive write situation.

>
> 3) WAL writes that are forced by wal_buffers filling will turn into a
> commit-length write when using the new, default open_datasync.  Using
> the older default of fdatasync avoids that problem, in return for
> causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC
> writes over fdatasync ones is avoiding the OS cache.
>
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection.  So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Scott Carey wrote:
> Did you recompile your test on the RHEL6 system?

On both systems I showed, I checked out a fresh copy of the PostgreSQL
9.1 HEAD from the git repo, and compiled that on the server, to make
sure I was pulling in the appropriate kernel headers.  I wasn't aware of
exactly how the kernel sync stuff was refactored though, thanks for the
concise update on that.  I can do similar tests on a RHEL5 system, but
not on the same hardware.  Can only make my laptop boot so many
operating systems at a time usefully.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Jon Nelson
Date:
On Wed, Nov 17, 2010 at 3:24 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Carey wrote:
>>
>> Did you recompile your test on the RHEL6 system?
>
> On both systems I showed, I checked out a fresh copy of the PostgreSQL 9.1
> HEAD from the git repo, and compiled that on the server, to make sure I was
> pulling in the appropriate kernel headers.  I wasn't aware of exactly how
> the kernel sync stuff was refactored though, thanks for the concise update
> on that.  I can do similar tests on a RHEL5 system, but not on the same
> hardware.  Can only make my laptop boot so many operating systems at a time
> usefully.

One thing to note is that where on a disk things sit can make a /huge/
difference - depending on if Ubuntu is /here/ and RHEL is /there/ and
so on can make a factor of 2 or more difference.  The outside tracks
of most modern SATA disks can do around 120MB/s. The inside tracks
aren't even half of that.

--
Jon

Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Greg Smith
Date:
Jon Nelson wrote:
> One thing to note is that where on a disk things sit can make a /huge/
> difference - depending on if Ubuntu is /here/ and RHEL is /there/ and
> so on can make a factor of 2 or more difference.  The outside tracks
> of most modern SATA disks can do around 120MB/s. The inside tracks
> aren't even half of that.
>

You're talking about changes in sequential read and write speed due to
Zone Bit Recording (ZBR) AKA Zone Constant Angular Velocity (ZCAV).
What I was measuring was commit latency time on small writes.  That
doesn't change as you move around the disk, since it's tied to the raw
rotation speed of the drive rather than density of storage in any zone.
If I get to something that's impacted by sequential transfers rather
than rotation time, I'll be sure to use the same section of disk for
that.  It wasn't really necessary to get these initial gross numbers
anyway.  What I was looking for is the about 10:1 speedup seen on this
hardware when the write cache is used, which could easily be seen even
were there ZBR differences involved.
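
The rotation-bound commit rate can be put into rough numbers.  A minimal sketch, assuming a 7200 RPM drive (the spindle
speed is an assumption, it is not stated in this message) and that every synchronous flush costs one full rotation:

```python
def max_commits_per_second(rpm, syncs_per_commit=1):
    """Upper bound on commits/second when each sync waits one full rotation.

    rpm              -- spindle speed of the drive (assumed, not measured)
    syncs_per_commit -- how many separate synchronous flushes a commit needs
    """
    rotations_per_second = rpm / 60.0
    return rotations_per_second / syncs_per_commit


if __name__ == "__main__":
    # One flush per commit, e.g. two writes followed by a single fdatasync.
    print(max_commits_per_second(7200, 1))
    # Two flushes per commit, e.g. two O_DSYNC writes.
    print(max_commits_per_second(7200, 2))
```

Under those assumptions the bounds are 120 and 60 per second, in the same neighborhood as the roughly 110/second and
55/second figures in the test_fsync output, and neither depends on which zone of the disk the file sits in.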

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Scott Carey
Date:
On Nov 17, 2010, at 1:24 PM, Greg Smith wrote:

> Scott Carey wrote:
>> Did you recompile your test on the RHEL6 system?
>
> On both systems I showed, I checked out a fresh copy of the PostgreSQL
> 9.1 HEAD from the git repo, and compiled that on the server, to make
> sure I was pulling in the appropriate kernel headers.  I wasn't aware of
> exactly how the kernel sync stuff was refactored though, thanks for the
> concise update on that.

Thanks!

So this could be another bug in Linux.  Not entirely surprising.
Since fsync/fdatasync relative performance isn't similar to open_sync/open_datasync relative performance on this test,
there is probably a bug that either hurts fsync, or one that is preventing open_sync from dealing with metadata
properly.  Luckily for the xlog, both of those can be avoided -- the real choice is fdatasync vs open_datasync.  And
both work in newer kernels or break in certain older ones.
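
For anyone wanting to compare just those two methods without building test_fsync, here is a rough sketch of the "two
writes" case.  It is not the test_fsync code, only an approximation of the idea: two 8 kB writes made durable either by
O_DSYNC on each write or by a single fdatasync() afterwards.  The function name and loop count are made up, and it
assumes a Linux system where os.O_DSYNC and os.fdatasync are available.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 8192  # 8 kB, matching the write size in the test_fsync output


def two_writes_per_second(path, use_o_dsync, loops=50):
    """Time `loops` rounds of two 8 kB writes and return rounds per second.

    use_o_dsync=True  -> file opened with O_DSYNC, each write() is synchronous
    use_o_dsync=False -> plain writes followed by one fdatasync() per round
    """
    flags = os.O_RDWR | os.O_CREAT
    if use_o_dsync:
        flags |= os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(loops):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same 16 kB each round
            os.write(fd, BLOCK)
            os.write(fd, BLOCK)
            if not use_o_dsync:
                os.fdatasync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return loops / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sync_test.out")
        print("open_datasync:", two_writes_per_second(target, True))
        print("fdatasync:    ", two_writes_per_second(target, False))
```

As with test_fsync, the file has to live on the filesystem being tested; a temporary directory on tmpfs will not say
anything about the disk.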


> I can do similar tests on a RHEL5 system, but
> not on the same hardware.  Can only make my laptop boot so many
> operating systems at a time usefully.

Yeah, I understand.  I might throw this at a RHEL5 system if I get a chance but I need one without a RAID card that is
not in use.  Hopefully it doesn't turn out that fdatasync is write-cache safe but open_sync/open_datasync isn't on that
platform.  It could impact the choice of a default value.

>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>


Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From
Jignesh Shah
Date:
On Tue, Nov 16, 2010 at 8:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>> Well, we're not going to increase the default to gigabytes, but we could
>>> very probably increase it by a factor of 10 or so without anyone
>>> squawking.  It's been awhile since I heard of anyone trying to run PG in
>>> 4MB shmmax.  How much would a change of that size help?
>
>> Last I checked, though, this comes out of the allocation available to
>> shared_buffers.  And there definitely are several OSes (several linuxes,
>> OSX) still limited to 32MB by default.
>
> Sure, but the current default is a measly 64kB.  We could increase that
> 10x for a relatively small percentage hit in the size of shared_buffers,
> if you suppose that there's 32MB available.  The current default is set
> to still work if you've got only a couple of MB in SHMMAX.
>
> What we'd want is for initdb to adjust the setting as part of its
> probing to see what SHMMAX is set to.
>
>                        regards, tom lane
>
>

In all the performance tests that I have done, generally I get a good
bang for the buck with wal_buffers set to 512kB in low memory cases
and mostly I set it to 1MB which is probably enough for most of the
cases even with high memory.

That 1/2 MB won't make a drastic change on shared_buffers anyway (except
for edge cases) but will relieve the stress quite a bit on wal
buffers.
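
To put numbers on that, here is a small sketch of how much of a 32MB shared memory limit the various wal_buffers
settings mentioned in this thread would take.  The 32MB figure is the default SHMMAX quoted above; the helper name is
made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def share_of_shmmax(wal_buffers_bytes, shmmax_bytes=32 * MB):
    """Return wal_buffers as a percentage of the shared memory limit."""
    return 100.0 * wal_buffers_bytes / shmmax_bytes


if __name__ == "__main__":
    settings = [
        ("64kB (current default)", 64 * KB),
        ("512kB", 512 * KB),
        ("640kB (10x the default)", 640 * KB),
        ("1MB", 1 * MB),
    ]
    for label, size in settings:
        print(f"{label:26s} {share_of_shmmax(size):6.2f}% of 32MB")
```

Even the 1MB setting works out to a little over 3% of a 32MB limit.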

Regards,
Jignesh