Re: We really ought to do something about O_DIRECT and data=journalled on ext4 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: We really ought to do something about O_DIRECT and data=journalled on ext4
Msg-id 5875.1291685673@sss.pgh.pa.us
In response to Re: We really ought to do something about O_DIRECT and data=journalled on ext4  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: We really ought to do something about O_DIRECT and data=journalled on ext4
List pgsql-hackers
Greg Smith <greg@2ndquadrant.com> writes:
> So my guess is that some small percentage of Windows users might notice 
> a change here, and some testing on FreeBSD would be useful too.  That's 
> about it for platforms that I think anybody needs to worry about.

To my mind, O_DIRECT is not really the key issue here; the question is
whether to prefer O_DSYNC or fdatasync.  I looked back in the archives,
and I think that the main reason we prefer O_DSYNC when available is the
results I got here:

http://archives.postgresql.org/pgsql-hackers/2001-03/msg00381.php

which demonstrated a performance benefit on HPUX 10.20, though with a
test tool much more primitive than test_fsync.  I still have that
machine, although the disk that was in it at the time died a while back.
What's in there now is a Seagate ST336607LW spinning at 10000 RPM (166
rev/sec), and today I get numbers like this from test_fsync:

Simple write:
        8k write                      28331.020/second

Compare file sync methods using one write:
        open_datasync 8k write          161.190/second
        open_sync 8k write              156.478/second
        8k write, fdatasync              54.302/second
        8k write, fsync                  51.810/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes        81.702/second
        2 open_sync 8k writes            80.172/second
        8k write, 8k write, fdatasync    40.829/second
        8k write, 8k write, fsync        39.836/second

Compare open_sync with different sizes:
        open_sync 16k write              80.192/second
        2 open_sync 8k writes            78.018/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close           52.527/second
        8k write, close, fsync           54.092/second
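
For readers who want to see roughly what an "8k write, fdatasync" figure
measures, here is a minimal sketch in C.  It is not the test_fsync source:
the function name, the block-size macro, and the choice to rewrite the same
block on every iteration are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192            /* "8k write" in the output above */

/*
 * Hypothetical helper: run `iterations` rounds of "8k write, fdatasync"
 * against `path` and return the observed rate in operations per second,
 * or -1.0 on any error.
 */
double fdatasync_ops_per_second(const char *path, int iterations)
{
    char buf[BLOCK_SIZE];
    struct timespec start, stop;
    double elapsed;
    int fd, i;

    memset(buf, 'a', sizeof buf);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++) {
        /* Rewrite the same block each time, then force it to disk. */
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, buf, sizeof buf) != (ssize_t) sizeof buf ||
            fdatasync(fd) != 0) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    close(fd);

    elapsed = (double) (stop.tv_sec - start.tv_sec)
            + (double) (stop.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed > 0.0 ? (double) iterations / elapsed : -1.0;
}
```

The result depends heavily on the drive, the filesystem, and whether a
write cache sits in the path, so numbers from such a loop are only
comparable on the same machine.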

So *on that rather ancient platform* there's a measurable performance
benefit to O_DSYNC, but this seems to be largely because fdatasync is
stubbed to fsync in userspace rather than because fdatasync wouldn't
be a better idea in the abstract.  Also, a lot of the argument against
fsync at the time was that it forced the kernel to iterate through all
the buffers for the WAL file to see if any were dirty.  I would imagine
that modern kernels are a tad smarter about that; and even if they
aren't, the CPU speed versus disk speed tradeoff has changed enough
since 2001 that iterating through 16MB of buffers isn't as interesting
as it was then.
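
The two approaches being compared can be sketched in C as follows.  This
is only an illustration under stated assumptions, not PostgreSQL's WAL
code: both function names are hypothetical, and the platform is assumed
to provide both O_DSYNC and fdatasync().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, "open_datasync" style: the file is opened with
 * O_DSYNC, so each write() returns only after the data is on stable
 * storage.  No separate sync call is made.
 */
int write_open_datasync(const char *path, const void *buf, size_t len)
{
    int rc = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper, "fdatasync" style: an ordinary buffered write()
 * followed by an explicit fdatasync() on the same descriptor.
 */
int write_then_fdatasync(const char *path, const void *buf, size_t len)
{
    int rc = 0;
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len)
        rc = -1;
    else if (fdatasync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

With O_DSYNC the durability cost is paid inside every write(); with
fdatasync() several writes can be issued first and flushed together.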

So to my mind, switching to the preference order fdatasync,
fsync_writethrough, fsync seems like the thing to do.  Since we assume
fsync is always available, that means that O_DSYNC/O_SYNC will not be
the defaults on any platform.
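
The proposed preference order can be written down as a small selection
function.  This is a sketch only: the enum, the capability struct, and the
function name are hypothetical and do not reflect how PostgreSQL actually
detects platform features.

```c
/* Hypothetical names for the three candidates in the proposed order. */
typedef enum {
    SYNC_FDATASYNC,
    SYNC_FSYNC_WRITETHROUGH,
    SYNC_FSYNC
} sync_method;

/* Hypothetical description of what the platform offers. */
typedef struct {
    int has_fdatasync;
    int has_fsync_writethrough;
} platform_caps;

/*
 * Pick the default in the order fdatasync, fsync_writethrough, fsync.
 * fsync is assumed to exist everywhere, so it is the unconditional
 * fallback, and the O_DSYNC/O_SYNC open modes are never chosen here.
 */
sync_method default_sync_method(platform_caps caps)
{
    if (caps.has_fdatasync)
        return SYNC_FDATASYNC;
    if (caps.has_fsync_writethrough)
        return SYNC_FSYNC_WRITETHROUGH;
    return SYNC_FSYNC;
}
```

Because the last branch always succeeds, no platform falls through to an
open-flag method by default; those would have to be selected explicitly.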
        regards, tom lane

