Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Analysis of ganged WAL writes
Msg-id DMEEJMCDOJAKPPFACMPMAEFJCEAA.curtis@galtair.com
In response to Re: Analysis of ganged WAL writes  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Analysis of ganged WAL writes
Re: Analysis of ganged WAL writes
List pgsql-hackers
> Well, too bad.  If you haven't gotten your commit record down to disk,
> then *you have not committed*.  This is not negotiable.  (If you think
> it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been
unclear. I'm just assuming that the disk can get a confirmation back to the
INSERTing process in much less than one rotation. This would allow that
process to start working again, perhaps in time to complete another
transaction.

> An application that is willing to have multiple transactions in flight
> at the same time can open up multiple backend connections to issue those
> transactions, and thereby perhaps beat the theoretical limit.  But for
> serial transactions, there is not anything we can do to beat that limit.
> (At least not with the log structure we have now.  One could imagine
> dropping a commit record into the nearest one of multiple buckets that
> are carefully scattered around the disk.  But exploiting that would take
> near-perfect knowledge about disk head positioning; it's even harder to
> solve than the problem we're considering now.)

Consider the following scenario:

Time measured in disk rotations.

Time 1.00 - Process A commits, issuing an aio_write to the log, and waits
Time 1.03 - aio_write completes for Process A's write - wakes Process A
Time 1.05 - Process A starts another transaction.
Time 1.08 - Process A commits
etc.

I agree that a process can't proceed from a commit until it receives
confirmation of the write, but if the write has hit the disk before a full
rotation then the process should be able to continue processing new
transactions.
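
To put rough numbers on the timeline above, here is a back-of-the-envelope
sketch in Python. The 7200 RPM figure is an assumption for illustration (no
drive speed is given in this thread), and the function name is made up. It
only does the arithmetic of the scenario; it says nothing about whether a
real drive confirms writes that quickly.

```python
# Arithmetic for the timeline above: one process committing back to back.
# ASSUMPTION: a 7200 RPM drive; the thread does not name a drive speed.
RPM = 7200
ROTATIONS_PER_SEC = RPM / 60  # 120 rotations per second


def serial_commits_per_sec(rotations_per_commit,
                           rotations_per_sec=ROTATIONS_PER_SEC):
    """Commits per second when each commit cycle takes the given
    number of disk rotations (hypothetical helper)."""
    return rotations_per_sec / rotations_per_commit


# Case 1: every commit has to wait a full rotation.
full_rotation = serial_commits_per_sec(1.0)

# Case 2: the timeline above, with commits at Time 1.00 and Time 1.08.
timeline_cycle = 1.08 - 1.00
from_timeline = serial_commits_per_sec(timeline_cycle)

print(f"one commit per rotation: {full_rotation:.0f} commits/sec")
print(f"timeline above:          {from_timeline:.0f} commits/sec")
```

Under those assumptions the difference is roughly 120 versus 1500 serial
commits per second, which is why the confirmation latency matters so much
for a single process.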

> You're failing to distinguish total throughput to the WAL drive from
> response time seen by any one transaction.  Yes, a policy of writing
> each WAL block once when it fills would maximize potential throughput,
> but it would also mean a potentially very large delay for a transaction
> waiting to commit.  The lower the system load, the worse the performance
> on that scale.

You are assuming fsync or fdatasync behavior; I am not. There would be no
delay under the scenario I describe. The transaction would exit commit as
soon as the confirmation of the write is received from the aio system. I
would hope that with a decent aio implementation this would generally be
much less than one rotation.
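
Whether confirmation really arrives in much less than one rotation is an
empirical question. The Python sketch below is one rough way to look at it
on a Unix-like system. It is not aio_write: Python's standard library has no
binding for it, so this substitutes synchronous O_DSYNC appends, which is
closer to the fdatasync case. The file name, record size, and 7200 RPM
figure are all assumptions, and a drive with its write cache enabled may
report completion before the data is on the platter.

```python
import os
import statistics
import tempfile
import time


def sync_write_latencies(path, n_writes=20, record=b"x" * 512):
    """Append `record` n_writes times with synchronous writes and
    return the wall-clock latency of each write in seconds."""
    # O_DSYNC asks that write() return only once the data is reported
    # written; fall back to O_SYNC (or no flag) where it is not defined.
    sync_flag = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | sync_flag
    fd = os.open(path, flags, 0o600)
    latencies = []
    try:
        for _ in range(n_writes):
            start = time.perf_counter()
            os.write(fd, record)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    ROTATION_SEC = 60 / 7200  # ASSUMPTION: 7200 RPM drive
    with tempfile.TemporaryDirectory() as tmp:
        lat = sync_write_latencies(os.path.join(tmp, "fake_wal"))
    median = statistics.median(lat)
    print(f"median write latency: {median * 1000:.3f} ms "
          f"({median / ROTATION_SEC:.2f} rotations)")
```

The temporary directory may not live on the drive of interest; pointing the
path at a file on the actual WAL device would give a more meaningful number.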

I think that the single transaction response time is very important and
that's one of the chief problems I sought to solve when I proposed
aio_writes for logging in my original email many moons ago.

> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel.  I haven't heard anything to make me think that that feature
> actually exists.  AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as it
gets it, bypassing the system buffers. The code on the disk's hardware is
very good at knowing when the disk head is coming. IMHO, bypassing the
kernel's less-than-enlightened writing system is the main point of using
aio_write.

- Curtis


