Re: Potential Large Performance Gain in WAL synching - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Potential Large Performance Gain in WAL synching
Date
Msg-id DMEEJMCDOJAKPPFACMPMKECMCEAA.curtis@galtair.com
In response to Re: Potential Large Performance Gain in WAL synching  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Potential Large Performance Gain in WAL synching
List pgsql-hackers
Bruce Momjian wrote:
> I am again confused.  When we do write(), we don't have to lock
> anything, do we?  (Multiple processes can write() to the same file just
> fine.)  We do block the current process, but we have nothing else to do
> until we know it is written/fsync'ed.  Does aio more easily allow the
> kernel to order those writes?  Is that the issue?  Well, certainly the
> kernel already orders the writes.  Just because we write() doesn't mean
> it goes to disk.  Only fsync() or the kernel do that.

"We" don't have to lock anything, but most file systems can't process
fsyncs simultaneously with other writes, so those writes block because
the file system grabs its own internal locks. The fsync call is more
contentious than a typical write because it usually runs longer, so it
holds those locks longer and over more pages and structures. That is the
real issue: the contention caused by fsync'ing very frequently, which
blocks other writers and readers.
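
One rough way to check this on a given file system is to time ordinary
writes while another thread fsyncs the same file in a tight loop. The
sketch below is only a measurement harness, not anything from the
PostgreSQL source: the function name, block size and write count are
hypothetical, and the numbers it reports will vary a lot by OS, file
system and disk.

```python
import os
import tempfile
import threading
import time


def write_latencies(path, n_writes=200, block=8192, concurrent_fsync=True):
    """Return per-write latencies in seconds, optionally while another
    thread calls fsync() on the same file in a loop."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    stop = threading.Event()

    def syncer():
        # Hypothetical stand-in for a backend that commits very frequently.
        while not stop.is_set():
            os.fsync(fd)

    thread = threading.Thread(target=syncer) if concurrent_fsync else None
    if thread is not None:
        thread.start()

    payload = b"x" * block
    latencies = []
    try:
        for i in range(n_writes):
            start = time.perf_counter()
            os.pwrite(fd, payload, i * block)
            latencies.append(time.perf_counter() - start)
    finally:
        stop.set()
        if thread is not None:
            thread.join()
        os.close(fd)
    return latencies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        quiet = write_latencies(os.path.join(d, "a"), concurrent_fsync=False)
        noisy = write_latencies(os.path.join(d, "b"), concurrent_fsync=True)
        print("max write latency without fsync:", max(quiet))
        print("max write latency with fsync:   ", max(noisy))
```

If the claim above holds on the system under test, the second figure
should come out noticeably larger than the first; on a file system that
handles this well the two may be close.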

For the buffer manager, the blocking of readers is probably even more
problematic when the cache is a small percentage (say < 10% to 15%) of
the total database size because most leaf node accesses will result in
a read. Each of these reads will have to wait on the fsync as well. Again,
a very well written file system probably can minimize this but I've not
seen any.

Further comment on:
> We do block the current process, but we have nothing else to do
> until we know it is written/fsync'ed.

Issuing a batch of writes at the end, after having consumed a lot of
CPU cycles, and then waiting is not as efficient as issuing them while
those CPU cycles are being used. We are currently wasting the time it
takes for a given process to write.
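
A minimal sketch of that overlap, under stated assumptions: a single
writer thread stands in for real asynchronous I/O (this is not aio and
not how the current code works), and produce_record, log_overlapped and
the 64-byte record size are hypothetical names and values chosen for
illustration. Each record is handed off as soon as it is produced, and
the process waits and fsyncs only once at the end.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 64  # hypothetical fixed record size


def produce_record(i):
    # Hypothetical CPU-bound work that yields one fixed-size log record.
    body = str(sum(k * k for k in range(1000 + i))).encode()
    return body.ljust(RECORD_SIZE, b".")[:RECORD_SIZE]


def log_overlapped(path, n_records):
    """Submit each record to a writer thread as soon as it exists, so the
    write can proceed while the next record is being computed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=1) as writer:
            pending = []
            for i in range(n_records):
                rec = produce_record(i)
                # pwrite takes an explicit offset, so it does not depend
                # on the shared file position.
                pending.append(
                    writer.submit(os.pwrite, fd, rec, i * RECORD_SIZE)
                )
            # Only now do we wait: most writes have already been issued.
            for fut in pending:
                if fut.result() != RECORD_SIZE:
                    raise OSError("short write")
        os.fsync(fd)  # one sync at the end, after all writes were issued
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log_overlapped(os.path.join(d, "log"), 100)
```

The point of the design is only that the final wait covers whatever
writes are still outstanding rather than all of them.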

The thinking probably has been that this is no big deal because other
processes, say B, C and D can use the CPU cycles while process A blocks.
This is true UNLESS the other processes are blocking on reads or
writes caused by process A doing the final writes and fsync.

> Yes, but Oracle is threaded, right, so, yes, they clearly could win with
> it.  I read the second URL and it said we could issue separate writes
> and have them be done in an optimal order.  However, we use the file
> system, not raw devices, so don't we already have that in the kernel
> with fsync()?

Whether by threads or multiple processes, there is the same contention on
the file from multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after an
fsync starts, it will have to wait on that fsync.

Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.
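
A toy model of why this matters, with everything in it assumed for
illustration: the offsets are made up, sorting by offset is just one
plausible thing a scheduler might do when it can see several requests at
once, and seek_distance is a deliberately crude cost measure.

```python
def synchronous_order(writes):
    """Each write is issued and awaited before the next one is submitted,
    so the device sees them in exactly the order the process chose."""
    return [offset for offset, _ in writes]


def batched_order(writes):
    """All writes are outstanding at once, so a scheduler is free to
    reorder them; here it sorts by offset (an assumed policy)."""
    return [offset for offset, _ in sorted(writes, key=lambda w: w[0])]


def seek_distance(order, start=0):
    """Total head movement for visiting the offsets in the given order."""
    pos, total = start, 0
    for offset in order:
        total += abs(offset - pos)
        pos = offset
    return total


if __name__ == "__main__":
    # Hypothetical (offset, data) pairs in the order the process makes them.
    writes = [(40, b"d"), (8, b"a"), (32, b"c"), (16, b"b")]
    print(seek_distance(synchronous_order(writes)))  # prints 112
    print(seek_distance(batched_order(writes)))      # prints 40
```

With synchronous submission only the first ordering is ever possible,
because the second write does not exist, as far as the kernel is
concerned, until the first has returned.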

> Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
> into the trap where we add stuff that just makes things faster on
> SMP/threaded systems when it makes our code _slower_ on single CPU
> systems, which is exactly what Informix did in Informix 7, and we know
> how that ended (lost customers, bought by IBM).  I don't think that's
> going to happen to us, but I thought I would mention it.

Yes, I hate "improvements" that make things worse for most people. Any
changes I'd contemplate would be simply another configuration-driven
optimization that could be turned off very easily.

- Curtis


