When running Postgres on a single ext3 filesystem on Linux, we find that
the attached simple patch gives a significant performance benefit (7-8% in
the numbers below). The patch adds a new option for wal_sync_method, which
is "open_direct". With this option, the WAL is always opened with
O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
O_DIRECT should be correct. All WAL logs are fully allocated before
being used, and the WAL buffers are 8K-aligned, so all direct writes are
guaranteed to complete before returning. (See
http://lwn.net/Articles/348739/)
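
As a rough illustration of the write pattern described above (this is not
code from the patch), the sketch below overwrites one 8K block of an
already-allocated file through a descriptor opened with O_DIRECT only. The
helper name write_block_direct and the WAL_BLOCK constant are made up for
this example, and it assumes Linux with glibc and a filesystem that accepts
O_DIRECT. A call such as write_block_direct(path, 0, 0) returns 0 on
success and -1 on failure.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WAL_BLOCK 8192        /* 8K block, matching the alignment mentioned above */

/*
 * Hypothetical helper (not from the patch): overwrite one block of an
 * already-allocated file through O_DIRECT, with no O_SYNC/O_DSYNC flag
 * and no fsync()/fdatasync() call afterwards.
 * Returns 0 on success, -1 on failure (for example when the file does
 * not exist or the filesystem rejects O_DIRECT).
 */
int write_block_direct(const char *path, off_t offset, int fill)
{
    void *buf = NULL;
    int rc = -1;

    /* O_DIRECT needs the buffer, file offset and length to be aligned. */
    assert(offset % WAL_BLOCK == 0);
    if (posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK) != 0)
        return -1;
    memset(buf, fill, WAL_BLOCK);

    /* No O_CREAT: the file is expected to be fully allocated already. */
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd >= 0) {
        if (pwrite(fd, buf, WAL_BLOCK, offset) == WAL_BLOCK)
            rc = 0;
        if (close(fd) != 0)
            rc = -1;
    }
    free(buf);
    return rc;
}
```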

The advantage of using O_DIRECT is that no fsync()/fdatasync() is
used. All of the other wal_sync_methods use fsync/fdatasync(), either
explicitly or implicitly (via the O_SYNC and O_DSYNC options).
fsync/fdatasync can be very slow on ext3, because it seems to always have
to wait for the current filesystem metadata transaction to complete,
even if that metadata operation is completely unrelated to the file
being fsync'ed. There can be many metadata operations happening on the
data files, so the WAL fsync can end up waiting for metadata operations
on the data files. Since O_DIRECT does not do any fsync/fdatasync operation,
it avoids this bottleneck, and can finish more quickly on average.

The open_sync and open_datasync options do not have this benefit, because
they do the equivalent of an fsync/fdatasync after every WAL write.
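
The open(2) flag choices discussed in this message can be summarized in a
small sketch. This is only an illustration, not the patch's code: the
function name wal_open_flags and its xlog_needed parameter are hypothetical,
and open_datasync is the existing setting that requests O_DSYNC. For
example, wal_open_flags("open_direct", 1) yields O_DIRECT alone, even with
archiving enabled.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <string.h>

/*
 * Hypothetical summary of the open(2) flags implied by each
 * wal_sync_method value as described in this message.
 * Returns -1 for an unrecognized name.
 *
 * xlog_needed: nonzero if the archiver or hot standby will read the WAL.
 */
int wal_open_flags(const char *method, int xlog_needed)
{
    assert(method != NULL);

    /* The existing open_* methods add O_DIRECT only when nothing else reads the WAL. */
    int maybe_direct = xlog_needed ? 0 : O_DIRECT;

    if (strcmp(method, "open_direct") == 0)
        return O_DIRECT;                 /* always direct, never O_SYNC/O_DSYNC */
    if (strcmp(method, "open_sync") == 0)
        return O_SYNC | maybe_direct;
    if (strcmp(method, "open_datasync") == 0)
        return O_DSYNC | maybe_direct;
    if (strcmp(method, "fsync") == 0 || strcmp(method, "fdatasync") == 0)
        return 0;                        /* flushed by an explicit call after the write */
    return -1;
}
```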

For the open_sync and open_datasync options, O_DIRECT is used for writes
only if the xlog will not need to be consumed by the archiver or
hot standby. I am not keying the open_direct behavior on whether
XLogIsNeeded() is true, because we see a performance gain even when
archiving is enabled (using a simple script that copies and compresses
the log segments).

For a 2-processor, 50-warehouse DBT2 run on SLES 11, I get the following
NOTPM results:

                         wal_sync_method
                 fdatasync   open_direct   open_sync
archiving off:       17076         18481       17094
archiving on:        15704         16923       15898
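
For anyone who wants to check the percentages against the NOTPM figures
above, the arithmetic is just the relative difference from the fdatasync
baseline. The helper name pct_gain is made up for this note. With the
numbers above, pct_gain(17076, 18481) comes to roughly 8.2 and
pct_gain(15704, 16923) to roughly 7.8.

```c
#include <assert.h>

/* Relative gain of `candidate` over `baseline`, in percent. */
double pct_gain(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```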

Do folks have any interest in this change, or comments on its
usefulness/correctness? It would be just an extra option for
wal_sync_method that users can try out, and it has benefits for certain
configurations.

Dan