Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers
From | Curtis Faith |
---|---|
Subject | Re: Analysis of ganged WAL writes |
Date | |
Msg-id | DMEEJMCDOJAKPPFACMPMAEFJCEAA.curtis@galtair.com |
In response to | Re: Analysis of ganged WAL writes (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
> Well, too bad. If you haven't gotten your commit record down to disk,
> then *you have not committed*. This is not negotiable. (If you think
> it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been
unclear. I'm just assuming that the disk can get a confirmation back to
the INSERTing process in much less than one rotation. This would allow
that process to start working again, perhaps in time to complete another
transaction.

> An application that is willing to have multiple transactions in flight
> at the same time can open up multiple backend connections to issue those
> transactions, and thereby perhaps beat the theoretical limit. But for
> serial transactions, there is not anything we can do to beat that limit.
> (At least not with the log structure we have now. One could imagine
> dropping a commit record into the nearest one of multiple buckets that
> are carefully scattered around the disk. But exploiting that would take
> near-perfect knowledge about disk head positioning; it's even harder to
> solve than the problem we're considering now.)

Consider the following scenario:

Time measured in disk rotations.

Time 1.00 - Process A commits, causing an aio_write to the log, and waits
Time 1.03 - aio completes for Process A's write and wakes Process A
Time 1.05 - Process A starts another transaction
Time 1.08 - Process A commits
etc.

I agree that a process can't proceed from a commit until it receives
confirmation of the write, but if the write has hit the disk before a
full rotation has passed, then the process should be able to continue
processing new transactions.

> You're failing to distinguish total throughput to the WAL drive from
> response time seen by any one transaction. Yes, a policy of writing
> each WAL block once when it fills would maximize potential throughput,
> but it would also mean a potentially very large delay for a transaction
> waiting to commit. The lower the system load, the worse the performance
> on that scale.

You are assuming fsync or fdatasync behavior; I am not. There would be
no delay under the scenario I describe. The transaction would exit
commit as soon as the confirmation of the write is received from the aio
system. I would hope that with a decent aio implementation this would
generally be much less than one rotation.

I think that single-transaction response time is very important, and
that's one of the chief problems I sought to solve when I proposed
aio_writes for logging in my original email many moons ago.

> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel. I haven't heard anything to make me think that that feature
> actually exists. AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as
aio_write gets it, bypassing the system buffers. The code on the disk's
hardware is very good at knowing when the disk head is coming. IMHO,
bypassing the kernel's less-than-enlightened writing system is the main
point of using aio_write.

- Curtis
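
To make the arithmetic behind the scenario in this message concrete, here is a minimal sketch in Python. It only works out what the four timestamps would imply if they held; it says nothing about whether a real disk or aio implementation can confirm a write that quickly. The function names and the 7200 RPM figure are assumptions added for illustration and do not come from the thread.

```python
# Timestamps from the scenario in the message, in units of disk rotations.
COMMIT_ISSUED = 1.00       # Process A commits and issues the log write
WRITE_CONFIRMED = 1.03     # write is confirmed, Process A wakes up
NEXT_TXN_STARTED = 1.05    # Process A starts another transaction
NEXT_COMMIT_ISSUED = 1.08  # Process A commits again


def commits_per_rotation(cycle_in_rotations):
    """Serial commits that fit in one rotation for a given commit-to-commit cycle."""
    return 1.0 / cycle_in_rotations


def commits_per_second(cycle_in_rotations, rpm):
    """Convert to commits per second for a disk spinning at `rpm` (an assumed value)."""
    rotations_per_second = rpm / 60.0
    return commits_per_rotation(cycle_in_rotations) * rotations_per_second


# Commit-to-commit cycle in the scenario: roughly 0.08 rotations.
scenario_cycle = NEXT_COMMIT_ISSUED - COMMIT_ISSUED

# Baseline for comparison: one serial commit per full rotation.
baseline_cycle = 1.0

ASSUMED_RPM = 7200  # hypothetical drive speed, not taken from the thread

print(f"scenario: {commits_per_rotation(scenario_cycle):.1f} commits/rotation")
print(f"baseline: {commits_per_rotation(baseline_cycle):.1f} commits/rotation")
print(f"scenario at {ASSUMED_RPM} RPM: "
      f"{commits_per_second(scenario_cycle, ASSUMED_RPM):.0f} commits/s")
print(f"baseline at {ASSUMED_RPM} RPM: "
      f"{commits_per_second(baseline_cycle, ASSUMED_RPM):.0f} commits/s")
```

Under these numbers a single serial process would manage about 12.5 commits per rotation instead of one. The whole difference rests on the assumed 0.03-rotation confirmation time, which is exactly the point under dispute in the thread.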