possible new option for wal_sync_method - Mailing list pgsql-hackers

From: Dan Scales
Subject: possible new option for wal_sync_method
Date:
Msg-id: 1258397887.1997975.1329412703998.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Responses:
  Re: possible new option for wal_sync_method
  Re: possible new option for wal_sync_method
  Re: possible new option for wal_sync_method
List: pgsql-hackers
When running Postgres on a single ext3 filesystem on Linux, we find that
the attached simple patch gives a significant performance benefit (7-8%
in the numbers below).  The patch adds a new option for wal_sync_method,
"open_direct".  With this option, the WAL is always opened with O_DIRECT
(but not O_SYNC or O_DSYNC).  On Linux, using O_DIRECT alone should be
correct: all WAL files are fully allocated before being used, and the
WAL buffers are 8K-aligned, so all direct writes are guaranteed to
complete before returning.  (See http://lwn.net/Articles/348739/)
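The safety argument above rests on two properties: WAL files are fully allocated before use, and every direct write is block-aligned. The C sketch below illustrates both, assuming an 8K block size (WAL_BLOCK_SIZE). The helpers alloc_wal_buffer(), direct_write_ok() and preallocate_segment() are hypothetical illustrations, not PostgreSQL functions, and the alignment rule checked here (buffer address, file offset and length all multiples of the block size) is a conservative assumption; the exact O_DIRECT requirements depend on the kernel and filesystem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed WAL block size (the post's "8K-aligned" buffers). */
#define WAL_BLOCK_SIZE 8192

/* Hypothetical helper: allocate nblocks zeroed WAL blocks, aligned to
 * WAL_BLOCK_SIZE so the buffer could be passed to write() on an
 * O_DIRECT file descriptor. Returns NULL on failure. */
static char *alloc_wal_buffer(size_t nblocks)
{
    void *p = NULL;
    size_t len = nblocks * WAL_BLOCK_SIZE;

    if (len == 0 || posix_memalign(&p, WAL_BLOCK_SIZE, len) != 0)
        return NULL;
    memset(p, 0, len);
    return p;
}

/* Hypothetical check: buffer address, offset and length are all
 * multiples of WAL_BLOCK_SIZE (a conservative O_DIRECT rule). */
static bool direct_write_ok(const void *buf, off_t off, size_t len)
{
    return len > 0
        && (uintptr_t) buf % WAL_BLOCK_SIZE == 0
        && off % WAL_BLOCK_SIZE == 0
        && len % WAL_BLOCK_SIZE == 0;
}

/* Hypothetical helper: fully allocate a segment by writing zero blocks,
 * so later WAL writes overwrite existing blocks instead of extending the
 * file. A short write is treated as an error. Returns 0 on success. */
static int preallocate_segment(int fd, size_t seg_size)
{
    char *zeros = alloc_wal_buffer(1);
    off_t off;

    assert(seg_size % WAL_BLOCK_SIZE == 0);
    if (zeros == NULL)
        return -1;
    for (off = 0; (size_t) off < seg_size; off += WAL_BLOCK_SIZE) {
        if (pwrite(fd, zeros, WAL_BLOCK_SIZE, off) != WAL_BLOCK_SIZE) {
            free(zeros);
            return -1;
        }
    }
    free(zeros);
    /* Make the allocation itself durable before the segment is used. */
    return fsync(fd);
}
```

The file is written without O_DIRECT here so the sketch also works on filesystems that reject O_DIRECT; the point is only that every block exists before WAL writes reach it.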

The advantage of using O_DIRECT is that no fsync()/fdatasync() call is
needed.  All of the other wal_sync_methods use fsync/fdatasync, either
explicitly or implicitly (via the O_SYNC and O_DSYNC options).
fsync/fdatasync can be very slow on ext3, because it appears to always
wait for the current filesystem metadata transaction to complete, even
if that metadata operation is completely unrelated to the file being
fsync'ed.  There can be many metadata operations happening on the data
files, so an fsync of the WAL can end up waiting for metadata operations
on the data files.  Since O_DIRECT does not do any fsync/fdatasync
operation, it avoids this bottleneck and can finish more quickly on
average.  The open_sync and open_dsync options do not have this benefit,
because they do the equivalent of an fsync/fdatasync after every WAL
write.
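To make the contrast concrete, here is a hedged C sketch of what follows each WAL write under the different methods. The WalSyncMethod enum and the needs_explicit_sync() and wal_write() helpers are hypothetical and are not the patch's code; the sketch only encodes the distinction drawn above: fsync/fdatasync issue a separate system call, open_sync/open_dsync move the equivalent work into write() through open() flags, and open_direct issues no sync at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical enum mirroring the wal_sync_method values in the post. */
typedef enum {
    WAL_SYNC_FSYNC,
    WAL_SYNC_FDATASYNC,
    WAL_SYNC_OPEN_SYNC,
    WAL_SYNC_OPEN_DSYNC,
    WAL_SYNC_OPEN_DIRECT
} WalSyncMethod;

/* Only fsync/fdatasync need a separate call after the write; the open_*
 * methods rely on flags that were passed to open(). */
static bool needs_explicit_sync(WalSyncMethod m)
{
    return m == WAL_SYNC_FSYNC || m == WAL_SYNC_FDATASYNC;
}

/* Hypothetical helper: write one WAL chunk at offset off, then sync
 * according to method m. With open_sync/open_dsync the kernel does the
 * equivalent of fsync/fdatasync inside write(); with open_direct nothing
 * follows the write, which the post argues is sufficient on Linux for
 * preallocated segments and aligned buffers. Returns 0 on success. */
static int wal_write(int fd, const void *buf, size_t len, off_t off,
                     WalSyncMethod m)
{
    assert(buf != NULL);
    if (pwrite(fd, buf, len, off) != (ssize_t) len)
        return -1;
    if (!needs_explicit_sync(m))
        return 0;
    return m == WAL_SYNC_FSYNC ? fsync(fd) : fdatasync(fd);
}
```

On ext3, the post's point is that the fsync()/fdatasync() branch is the one that can stall behind unrelated metadata work, and open_sync/open_dsync pay the same cost implicitly inside write().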

For the open_sync and open_dsync options, O_DIRECT is used for writes
only if the xlog will not need to be read by the archiver or hot
standby.  I am not making the open_direct behavior depend on whether
XLogIsNeeded() is true, because we see a performance gain even when
archiving is enabled (using a simple script that copies and compresses
the log segments).  For a 2-processor, 50-warehouse DBT2 run on SLES 11,
I get the following NOTPM results:

                      wal_sync_method
                 fdatasync   open_direct  open_sync

archiving off:     17076       18481       17094
archiving on:      15704       16923       15898
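The flag selection described just before the table can be sketched in C as follows. wal_sync_flags() and the enum are hypothetical, not PostgreSQL's actual code. The xlog_needed argument stands in for XLogIsNeeded() (archiving or hot standby in use), and the sketch ignores other settings, for example fsync = off, that a real implementation would also have to consider.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical enum mirroring the wal_sync_method values in the post. */
typedef enum {
    WAL_SYNC_FSYNC,
    WAL_SYNC_FDATASYNC,
    WAL_SYNC_OPEN_SYNC,
    WAL_SYNC_OPEN_DSYNC,
    WAL_SYNC_OPEN_DIRECT
} WalSyncMethod;

/* Hypothetical: extra open() flags for a WAL segment under method m.
 * xlog_needed plays the role of XLogIsNeeded(). */
static int wal_sync_flags(WalSyncMethod m, bool xlog_needed)
{
    /* open_sync/open_dsync bypass the page cache only when the archiver
     * or a standby will not read the WAL back soon afterwards. */
    int direct = xlog_needed ? 0 : O_DIRECT;

    assert(m <= WAL_SYNC_OPEN_DIRECT);
    switch (m) {
    case WAL_SYNC_OPEN_SYNC:
        return O_SYNC | direct;
    case WAL_SYNC_OPEN_DSYNC:
        return O_DSYNC | direct;
    case WAL_SYNC_OPEN_DIRECT:
        /* Proposed method: always O_DIRECT, never O_SYNC/O_DSYNC,
         * regardless of xlog_needed. */
        return O_DIRECT;
    case WAL_SYNC_FSYNC:
    case WAL_SYNC_FDATASYNC:
    default:
        /* Durability comes from an explicit fsync()/fdatasync() call. */
        return 0;
    }
}
```

The key difference is that open_direct returns O_DIRECT whether or not xlog_needed is set, matching the post's choice not to key open_direct on XLogIsNeeded().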


Do folks have any interest in this change, or comments on its
usefulness/correctness?  It would simply be an extra option for
wal_sync_method that users can try out, and it benefits certain
configurations.

Dan

Attachment
