Re: Group commit, revised - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Group commit, revised
Msg-id: 4F19EAEE.2050501@enterprisedb.com
In response to: Re: Group commit, revised (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Group commit, revised; Re: Group commit, revised
List: pgsql-hackers
I spent some time cleaning this up. Details below, but here are the
highlights:

* Reverted the removal of wal_writer_delay
* Doesn't rely on WAL writer. Any process can act as the "leader" now.
* Removed heuristic on big flushes
* Uses PGSemaphoreLock/Unlock instead of latches

On 20.01.2012 17:30, Robert Haas wrote:
> On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> + if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
>>> + !ProcGlobal->groupCommitAvailable)
>>>
>>> That seems like a gigantic value. I would think that we'd want to
>>> forget about group commit any time we're behind by more than one
>>> segment (maybe less).
>>
>> I'm sure that you're right - I myself described it as horrible in my
>> original mail. I think that whatever value we set this to ought to be
>> well reasoned. Right now, your magic number doesn't seem much better
>> than my bogus heuristic (which only ever served as a placeholder
>> implementation that hinted at a well-principled formula). What's your
>> basis for suggesting that that limit would always be close to optimal?
>
> It's probably not - I suspect even a single WAL segment is still too
> large. I'm pretty sure that it would be safe to always flush up to the
> next block boundary, because we always write the whole block anyway,
> even if it's only partially filled. So there's no possible regression
> there. Anything larger than "the end of the current 8kB block" is
> going to be a trade-off between latency and throughput, and it seems
> to me that there's probably only one way to figure out what's
> reasonable: test a bunch of different values and see which ones
> perform well in practice. Maybe we could answer part of the question
> by writing a test program which does repeated sequential writes of
> increasingly-large chunks of data. We might find out, for example,
> that 64kB is basically the same as 8kB because most of the overhead
> is in the system call anyway, but that beyond that the curve kinks.
> Or it may be that 16kB is already twice as slow as 8kB, or that you
> can go up to 1MB without a problem. I don't see a way to know that
> without testing it on a couple of different pieces of hardware and
> seeing what happens.

I ripped out that heuristic for a flush that's "too large". After
pondering it for a while, I came to the conclusion that, as implemented
in the patch, it was pointless. The thing is, if the big flush doesn't
go through the group commit machinery, it's going to grab WALWriteLock
straight away. Before any smaller commits can proceed, they need to
grab that lock anyway, so the effect is the same as if the big flush
had simply joined the queue.

Maybe we should have a heuristic to split a large flush into smaller
chunks. The WAL segment boundary would be a natural split point,
because when you cross a file boundary you have to issue separate
fsync()s for the two files anyway. But I'm happy to leave that out for
now; it's no worse than it was without group commit.
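For what it's worth, the test program Robert suggests above could be
about as simple as the toy below. Completely untested sketch; the file
name, chunk sizes, and total volume are arbitrary picks, not anything
from the patch. Run it against the same disk that holds pg_xlog, of
course:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define TOTAL_BYTES (16 * 1024 * 1024)

int
main(void)
{
    static const size_t chunks[] = {8192, 16384, 65536, 262144, 1048576};
    char       *buf = malloc(1048576);
    int         i;

    if (buf == NULL)
        return 1;
    memset(buf, 'x', 1048576);

    for (i = 0; i < 5; i++)
    {
        size_t      chunk = chunks[i];
        size_t      written = 0;
        struct timeval start, stop;
        double      secs;
        int         fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return 1;
        gettimeofday(&start, NULL);
        while (written < TOTAL_BYTES)
        {
            if (write(fd, buf, chunk) != (ssize_t) chunk)
                return 1;
            if (fsync(fd) != 0)     /* one fsync per chunk, like a commit */
                return 1;
            written += chunk;
        }
        gettimeofday(&stop, NULL);
        secs = (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("%7zu bytes/chunk: %.1f MB/s\n",
               chunk, TOTAL_BYTES / secs / (1024 * 1024));
        close(fd);
    }
    free(buf);
    unlink("bench.dat");
    return 0;
}

I'd expect the smallest sizes to be dominated by the fsync latency
rather than the write itself, which is really the question we're trying
to answer here.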
> Ugh. Our current system doesn't even require this code to be
> interruptible, I think: you can't receive cancel or die interrupts
> either while waiting for an LWLock or while holding it.

Right. Nor within HOLD/RESUME_INTERRUPTS blocks. The patch added some
CHECK_FOR_INTERRUPTS() calls in various places in the commit/abort code
paths, but that's all dead code; they're all within
HOLD/RESUME_INTERRUPTS blocks.

I replaced the usage of latches with the more traditional
PGSemaphoreLock/Unlock. The semaphore model works just fine in this
case: we have an lwlock to guard the wait queue, and once a process is
waiting, we know it will be woken up, or something has gone wrong at a
pretty low level. We don't need a timeout, and we don't need to wake up
on other signals while waiting. Furthermore, the WAL writer didn't have
a latch before this patch; it's not many lines of code to initialize a
latch and set up the signal handler for it, but the process already has
a semaphore that's ready to use.
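In sketch form, the wait/wake protocol boils down to something like
this. GroupCommitLock, groupCommitQueue, and the groupCommitLinks field
are invented names for the sake of the example, not what the patch
actually calls them, and the real code naturally has more to it:

/*
 * Backend-internal sketch: relies on storage/lwlock.h, storage/proc.h
 * and storage/shmem.h.  In reality the queue lives in shared memory
 * and GroupCommitLock would be a new LWLockId.
 */
static SHM_QUEUE groupCommitQueue;

static void
GroupCommitWait(XLogRecPtr waitLSN)
{
    /* Queue insertions are protected by the lwlock. */
    LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
    MyProc->waitLSN = waitLSN;
    SHMQueueInsertBefore(&groupCommitQueue, &MyProc->groupCommitLinks);
    LWLockRelease(GroupCommitLock);

    /*
     * Sleep on our own semaphore.  No timeout and no latch needed:
     * whoever performs the flush is guaranteed to wake us, or
     * something is broken at a much lower level.  interruptOK = false,
     * since the commit path holds off interrupts anyway.
     */
    PGSemaphoreLock(&MyProc->sem, false);
}

static void
GroupCommitWakeup(XLogRecPtr flushedUpTo)
{
    PGPROC     *proc;

    LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
    proc = (PGPROC *) SHMQueueNext(&groupCommitQueue, &groupCommitQueue,
                                   offsetof(PGPROC, groupCommitLinks));
    while (proc != NULL)
    {
        PGPROC     *thisproc = proc;

        proc = (PGPROC *) SHMQueueNext(&groupCommitQueue,
                                       &thisproc->groupCommitLinks,
                                       offsetof(PGPROC, groupCommitLinks));
        /* Wake everyone whose flush request has been satisfied. */
        if (XLByteLE(thisproc->waitLSN, flushedUpTo))
        {
            SHMQueueDelete(&thisproc->groupCommitLinks);
            PGSemaphoreUnlock(&thisproc->sem);
        }
    }
    LWLockRelease(GroupCommitLock);
}

The point compared to the latch version is just that every PGPROC
already carries a semaphore, so there's no extra per-process state to
set up.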
I wonder if we should rename the file to "xlogflush.c" or something
like that, to make it clear that this works with any XLOG flush, not
just commits? Group commit is the usual term for this feature, so we
should definitely mention it in the comments, but it might be better to
name the files and functions to emphasize that this is about WAL
flushing in general.

This probably needs some further cleanup to fix things I've broken, and
I haven't done any performance testing, but it's progress. Do you have
a shell script or something that you used for the performance tests
that I could run? Or would you like to re-run the tests you did earlier
with this patch?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com