Potential Large Performance Gain in WAL synching - Mailing list pgsql-hackers

From Curtis Faith
Subject Potential Large Performance Gain in WAL synching
Date
Msg-id DMEEJMCDOJAKPPFACMPMCEBOCEAA.curtis@galtair.com
Responses Re: Potential Large Performance Gain in WAL synching
List pgsql-hackers
I've been looking at the TODO lists and caching issues and think there may
be a way to greatly improve the performance of the WAL.

I've made the following assumptions based on my reading in the manual and
the WAL archives since about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure
that the D in ACID is satisfied.
2) The wait on fsync is the biggest time cost for inserts or updates.
3) fsync itself probably increases contention for file i/o on the same file
since some OS file system cache structures must be locked as part of fsync.
Depending on the file system this could be a significant choke on total i/o
throughput.
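
For reference, the commit path assumed in 1) and 2) looks roughly like the sketch below. This is not PostgreSQL's actual code; log_commit_sync is a made-up helper name and the error handling is minimal. It only shows where a backend would sit blocked in fsync.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): append one log record and
 * force it to disk before reporting success.
 * Returns 0 on success, -1 on error with errno set.
 */
static int log_commit_sync(int fd, const void *rec, size_t len)
{
    const char *p = rec;

    assert(rec != NULL);

    /* write() may be short or interrupted, so loop until all bytes are out */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /* The caller is blocked here until the OS reports the data on disk */
    return fsync(fd);
}
```

Every committing transaction pays the full fsync wait in this pattern, which is the cost assumption 2) refers to.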

The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded.

I'm not familiar with the exact WAL implementation in PostgreSQL, though I
am familiar with others, including ARIES II. It seems to come down to
making sure that each write to the WAL log has definitely been written to
disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
of its log writing using aio_write and then block until its last log aio
request has completed, using aio_waitcomplete. The call to aio_waitcomplete
won't return until the log record has been written to the disk. Opening with
O_DSYNC ensures that when the i/o completes the data has been written to the
disk, and aio_write on files opened with O_APPEND ensures that writes append
in the order they are received. Hence, when the aio_write for the last log
entry of a transaction completes, the transaction can be sure that its log
records are in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
3) Obviate the need for commit_delay, since there is no blocking and the
file system and the disk controller can put multiple writes to the log
together as the drive is waiting for the end of the log file to come under
one of the heads.

Here are the relevant TODOs:
   Delay fsync() when other backends are about to commit too [fsync]
   Determine optimal commit_delay value
   Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
   Allow multiple blocks to be written to WAL with one write()


Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

------------------------------------------------------------------
Galt Capital                            http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                           voice: 340.776.0144
Charlotte Amalie,  St. Thomas                    fax: 340.776.0244
United States Virgin Islands  00801             cell: 340.643.5368


