Re: Potential Large Performance Gain in WAL synching - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Potential Large Performance Gain in WAL synching
Msg-id DMEEJMCDOJAKPPFACMPMCECOCEAA.curtis@galtair.com
In response to Re: Potential Large Performance Gain in WAL synching  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Potential Large Performance Gain in WAL synching  (Neil Conway <neilc@samurai.com>)
List pgsql-hackers
I wrote:
> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.
>

Tom Lane replies:
> Oh?  That would be a serious problem, but I've never heard that asserted
> before.  Please provide some evidence.

Well, I'm basing this on past empirical testing and on having read some man
pages that describe fsync under this exact scenario. I'll have to write
a test to prove this one way or the other. I'll also try to look into
the Linux/BSD source for the common file systems used for PostgreSQL.

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files?  Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).
>

No, they're not exactly the same thing. Consider:

Process A                       File System
---------                       -----------
Writes index buffer             .idling...
Writes entry to log cache       .
Writes another index buffer     .
Writes another log entry        .
Writes tuple buffer             .
Writes another log entry        .
Index scan                      .
Large table sort                .
Writes tuple buffer             .
Writes another log entry        .
Writes                          .
Writes another index buffer     .
Writes another log entry        .
Writes another index buffer     .
Writes another log entry        .
Index scan                      .
Large table sort                .
Commit                          .
File Write Log Entry            .
.idling...                      Write to cache
File Write Log Entry            .idling...
.idling...                      Write to cache
File Write Log Entry            .idling...
.idling...                      Write to cache
File Write Log Entry            .idling...
.idling...                      Write to cache
Write Commit Log Entry          .idling...
.idling...                      Write to cache
Call fsync                      .idling...
.idling...                      Write all buffers to device.
.DONE.

In this case, Process A waits at the end of the transaction for all
the buffers to be written.
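
A minimal sketch of the pattern in the table above, assuming plain POSIX calls. The helper names (log_append, commit_with_fsync) and the file path are hypothetical and are not taken from the PostgreSQL source. Each write() only hands data to the OS cache; the single fsync() at commit is where the process waits for the device.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one log record with a plain write().
 * The data normally lands in the OS cache; nothing is forced to the
 * device yet. */
static int log_append(int fd, const char *rec)
{
    size_t len = strlen(rec);
    return write(fd, rec, len) == (ssize_t) len ? 0 : -1;
}

/* Hypothetical commit path: hand every record to the OS, then pay for
 * all of the device I/O in one fsync() at the end.
 * Returns the number of records written, or -1 on error. */
static int commit_with_fsync(const char *path, const char **recs, int nrecs)
{
    assert(path != NULL && recs != NULL && nrecs >= 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (int i = 0; i < nrecs; i++) {
        if (log_append(fd, recs[i]) != 0) {
            close(fd);
            return -1;
        }
    }

    /* The caller blocks here until the cached writes reach the device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd) == 0 ? nrecs : -1;
}
```

Calling commit_with_fsync() with an array of record strings returns the record count on success; all of the waiting is concentrated in the final fsync().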

With asynchronous I/O this becomes:

Process A                       File System
---------                       -----------
Writes index buffer             .idling...
Writes entry to log cache       Queue up write - move head to cylinder
Writes another index buffer     Write log entry to media
Writes another log entry        Immediate write to cylinder since head is
                                still there.
Writes tuple buffer             .
Writes another log entry        Queue up write - move head to cylinder
Index scan                      .busy with scan...
Large table sort                Write log entry to media
Writes tuple buffer             .
Writes another log entry        Queue up write - move head to cylinder
Writes                          .
Writes another index buffer     Write log entry to media
Writes another log entry        Queue up write - move head to cylinder
Writes another index buffer     .
Writes another log entry        Write log entry to media
Index scan                      .
Large table sort                Write log entry to media
Commit                          .
Write Commit Log Entry          Immediate write to cylinder since head is
                                still there.
.DONE.

Effectively, the real work of writing the cache is done while the CPU
for the process is busy doing index scans, sorts, etc. With the WAL
log on another device and SCSI I/O, the log writing should almost always
already be done, except for the final commit write.
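
A rough sketch of the overlap described above, using the POSIX aio_write, aio_error, aio_suspend and aio_return calls. The helper names (log_queue, log_wait, txn_with_aio), the record text and the busy loop standing in for scans and sorts are all hypothetical. The sketch does not force data to the media; a real log would presumably also need O_DSYNC or aio_fsync, which is left out here. On older glibc versions the program may need to be linked with -lrt.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Queue one log record for writing and return immediately.
 * The caller must keep cb and buf alive until log_wait() returns. */
static int log_queue(struct aiocb *cb, int fd, const char *buf, off_t off)
{
    assert(cb != NULL && buf != NULL);
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = (void *) buf;
    cb->aio_nbytes = strlen(buf);
    cb->aio_offset = off;
    return aio_write(cb);
}

/* Block until a queued write has finished.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t log_wait(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS) {
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            return -1;
    }
    return aio_return(cb);
}

/* Hypothetical transaction: queue the log record, do unrelated CPU
 * work while the write proceeds, and wait only at commit time. */
static long txn_with_aio(const char *path)
{
    static const char rec[] = "log entry\n";
    struct aiocb cb;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (log_queue(&cb, fd, rec, 0) != 0) {
        close(fd);
        return -1;
    }

    /* Stand-in for index scans and sorts. */
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;

    ssize_t n = log_wait(&cb);   /* usually already complete by now */
    close(fd);
    return (long) n;
}
```

The point of the design is that log_wait() is the only place the process can stall, and by the time it is called the write has had the whole CPU-bound phase in which to complete.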

> > Whether by threads or multiple processes, there is the same contention on
> > the file through multiple writers. The file system can decide to reorder
> > writes before they start but not after. If a write comes after a
> > fsync starts it will have to wait on that fsync.
>
> AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
> on safety grounds (we want to be sure we have a consistent WAL up to the
> end of what we've written).  Even if we can allow some reordering when a
> single transaction puts out a large volume of WAL data, I fail to see
> where any large gain is going to come from.  We're going to be issuing
> those writes sequentially and that ought to match the disk layout about
> as well as can be hoped anyway.

My comment applied to the reads and writes of other processes, not to the
WAL log. Recall that in my original email I mentioned using the O_APPEND
open flag, which ensures that all log entries are written sequentially.
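
A small illustration of the O_APPEND behaviour referred to above, assuming a local POSIX file system. The function names and the record strings are hypothetical. Two descriptors are opened on the same file; because both carry O_APPEND, every write() lands at the current end of the file regardless of which descriptor issued it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open the (hypothetical) log file so that every write is appended. */
static int open_log(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Interleave writes from two descriptors on the same file.
 * Returns the final file size, or -1 on error. */
static long append_from_two_writers(const char *path)
{
    static const char *recs[] = { "A1\n", "B1\n", "A2\n", "B2\n" };
    long result = -1;

    unlink(path);                       /* start from an empty log */
    int a = open_log(path);
    int b = open_log(path);
    if (a >= 0 && b >= 0) {
        int ok = 1;
        for (int i = 0; i < 4 && ok; i++) {
            int fd = (i % 2 == 0) ? a : b;
            size_t len = strlen(recs[i]);
            ok = write(fd, recs[i], len) == (ssize_t) len;
        }
        struct stat st;
        if (ok && fstat(a, &st) == 0)
            result = (long) st.st_size; /* no record overwrote another */
    }
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```

Without O_APPEND the second descriptor would start at offset 0 and overwrite the first writer's records; with it, the four 3-byte records end up back to back.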

> > Likewise a given process's writes can NEVER be reordered if they are
> > submitted synchronously, as is done in the calls to flush the log as
> > well as the dirty pages in the buffer in the current code.
>
> We do not fsync buffer pages; in fact a transaction commit doesn't write
> buffer pages at all.  I think the above is just a misunderstanding of
> what's really happening.  We have synchronous WAL writing, agreed, but
> we want that AFAICS.  Data block writes are asynchronous (between
> checkpoints, anyway).

Hmm, I keep hearing that buffer block writes are asynchronous, but I don't
read that in the code at all. There are simple "write" calls with files
that are not opened with O_NONBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the
storage manager abstraction), so I don't see what I might be missing.

It's true that data blocks are not required to be written before the
transaction commits, so they are in some sense asynchronous to the
transactions. However, they can still block a process later on: when a
process requests a new block and the buffer chosen for it happens to be
dirty, that process is forced to write the dirty block out of the cache.

It looks to me like BufferAlloc will simply result in a call to
BufferReplace -> smgrblindwrt -> write for md storage manager objects.

This means that a process will block while the write of dirty cache
buffers takes place.
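
A toy model of the replacement path described above. This is not PostgreSQL code: ToyBuffer, toy_buffer_replace and the 8192-byte block size are assumptions made up for the sketch. It only shows that the process asking for a new block is the one that issues the write when the victim buffer is dirty. How long that write takes is not shown, since write-family calls normally return once the kernel has copied the data.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define TOY_BLCKSZ 8192          /* block size assumed for this sketch */

typedef struct {
    long blockno;                /* file block held in this slot, -1 if empty */
    int  dirty;                  /* nonzero if data differs from the file */
    char data[TOY_BLCKSZ];
} ToyBuffer;

/* Reuse buf for new_blockno. If the old contents are dirty, the calling
 * process writes them out first and does not continue until the write
 * call returns.
 * Returns 1 if a write was needed, 0 if not, -1 on error. */
static int toy_buffer_replace(int fd, ToyBuffer *buf, long new_blockno)
{
    assert(buf != NULL);
    int wrote = 0;

    if (buf->blockno >= 0 && buf->dirty) {
        off_t off = (off_t) buf->blockno * TOY_BLCKSZ;
        if (pwrite(fd, buf->data, TOY_BLCKSZ, off) != TOY_BLCKSZ)
            return -1;
        buf->dirty = 0;
        wrote = 1;
    }

    buf->blockno = new_blockno;
    memset(buf->data, 0, TOY_BLCKSZ);   /* a real pool would read the block */
    return wrote;
}
```

Replacing a clean or empty buffer returns 0 immediately, while replacing a dirty one returns 1 only after the requesting process has issued the write itself.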

I'm happy to be wrong on this, but I don't see any hard evidence
of asynchronous file calls anywhere in the code. Unless I am missing
something, this is a huuuuge problem.

> There is one thing in the current WAL code that I don't like: if the WAL
> buffers fill up then everybody who would like to make WAL entries is
> forced to wait while some space is freed, which means a write, which is
> synchronous if you are using O_DSYNC.  It would be nice to have a
> background process whose only task is to issue write()s as soon as WAL
> pages are filled, thus reducing the probability that foreground
> processes have to wait for WAL writes (when they're not committing that
> is).  But this could be done portably with one more postmaster child
> process; I see no real need to dabble in aio_write.

Hmm, well, another process writing the log would accomplish the same thing,
but isn't that what a file system is? ISTM that aio_write is quite a bit
easier and gives higher performance. This is especially true for those OSes
that have KAIO support.

- Curtis


