From: Heikki Linnakangas
Subject: Re: Group commit, revised
Msg-id: 4F19EAEE.2050501@enterprisedb.com
In response to: Re: Group commit, revised (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

I spent some time cleaning this up. Details below, but here are the
highlights:

* Reverted the removal of wal_writer_delay
* Doesn't rely on WAL writer. Any process can act as the "leader" now.
* Removed heuristic on big flushes
* Uses PGSemaphoreLock/Unlock instead of latches

On 20.01.2012 17:30, Robert Haas wrote:
> On Thu, Jan 19, 2012 at 10:46 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> +       if (delta > XLOG_SEG_SIZE * CheckPointSegments ||
>>> +                       !ProcGlobal->groupCommitAvailable)
>>>
>>> That seems like a gigantic value.  I would think that we'd want to
>>> forget about group commit any time we're behind by more than one
>>> segment (maybe less).
>>
>> I'm sure that you're right - I myself described it as horrible in my
>> original mail. I think that whatever value we set this to ought to be
>> well reasoned. Right now, your magic number doesn't seem much better
>> than my bogus heuristic (which only ever served as a placeholder
>> implementation that hinted at a well-principled formula). What's your
>> basis for suggesting that that limit would always be close to optimal?
>
> It's probably not - I suspect even a single WAL segment is still too
> large.  I'm pretty sure that it would be safe to always flush up to
> the next block boundary, because we always write the whole block
> anyway even if it's only partially filled.  So there's no possible
> regression there.  Anything larger than "the end of the current 8kB
> block" is going to be a trade-off between latency and throughput, and
> it seems to me that there's probably only one way to figure out what's
> reasonable: test a bunch of different values and see which ones
> perform well in practice.  Maybe we could answer part of the question
> by writing a test program which does repeated sequential writes of
> increasingly-large chunks of data.  We might find out for example that
> 64kB is basically the same as 8kB because most of the overhead is in
> the system call anyway, but beyond that the curve kinks.  Or it may be
> that 16kB is already twice as slow as 8kB, or that you can go up to
> 1MB without a problem.  I don't see a way to know that without testing
> it on a couple different pieces of hardware and seeing what happens.
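
(As an aside, a minimal sketch of that kind of test program could look
like the following. It assumes a POSIX system, fdatasync()s after every
write to mimic a WAL flush, and the file name, chunk sizes and
iteration count are arbitrary.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 1024

int
main(void)
{
	static const size_t sizes[] = {8192, 16384, 65536, 1048576};
	char	   *buf = calloc(1, 1048576);	/* largest chunk, zero-filled */
	int			s;

	if (buf == NULL)
	{
		perror("calloc");
		return 1;
	}

	for (s = 0; s < 4; s++)
	{
		struct timeval start, end;
		double		elapsed;
		int			fd, i;

		fd = open("walsize.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
		if (fd < 0)
		{
			perror("open");
			return 1;
		}

		gettimeofday(&start, NULL);
		for (i = 0; i < ITERATIONS; i++)
		{
			/* write one chunk sequentially, then force it to disk */
			if (write(fd, buf, sizes[s]) != (ssize_t) sizes[s] ||
				fdatasync(fd) != 0)
			{
				perror("write/fdatasync");
				return 1;
			}
		}
		gettimeofday(&end, NULL);

		elapsed = (end.tv_sec - start.tv_sec) +
			(end.tv_usec - start.tv_usec) / 1000000.0;
		printf("%7zu bytes: %.3f ms per write+sync\n",
			   sizes[s], elapsed * 1000.0 / ITERATIONS);

		close(fd);
		unlink("walsize.tmp");
	}

	free(buf);
	return 0;
}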

I ripped out that heuristic for flushes that are "too large". After
pondering it for a while, I came to the conclusion that as implemented
in the patch, it was pointless. The thing is, if the big flush doesn't
go through the group commit machinery, it's going to grab the
WALWriteLock straight away. Before any smaller commits can proceed, they
will need to grab that lock anyway, so the effect is the same as if the
big flush had just joined the queue.
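
Schematically, with all the details elided, both paths funnel into the
same lock:

/* Illustrative only: both paths serialize on WALWriteLock. */

/* Big flush bypassing the group commit queue: */
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
XLogWrite(...);					/* the actual write/fsync */
LWLockRelease(WALWriteLock);

/*
 * A small commit arriving meanwhile blocks on that same
 * LWLockAcquire(WALWriteLock, ...) call, exactly as if the big flush
 * were ahead of it in the group commit queue.
 */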

Maybe we should have a heuristic to split a large flush into smaller
chunks. The WAL segment boundary would be a natural split point, for
example, because when crossing the file boundary you have to issue
separate fsync()s for the files anyway. But I'm happy to leave that out
for now; it's no worse than it was without group commit anyway.
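
Just for illustration, a split like that might look something like
this (plain byte offsets instead of XLogRecPtrs for simplicity;
XLogFlushChunk is a made-up helper standing in for the write+fsync of
one chunk):

/*
 * Illustrative sketch only: flush up to 'target' in segment-sized
 * steps, so no single WALWriteLock hold spans more than one segment.
 */
static void
FlushInChunks(uint64 written, uint64 target)
{
	while (written < target)
	{
		/* next WAL segment boundary after 'written' */
		uint64		boundary = (written / XLOG_SEG_SIZE + 1) * XLOG_SEG_SIZE;
		uint64		upto = (boundary < target) ? boundary : target;

		XLogFlushChunk(written, upto);
		written = upto;
	}
}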

> Ugh.  Our current system doesn't even require this code to be
> interruptible, I think: you can't receive cancel or die interrupts
> either while waiting for an LWLock or while holding it.

Right. Or within HOLD/RESUME_INTERRUPTS blocks.

The patch added some CHECK_FOR_INTERRUPTS() calls in various places in
the commit/abort codepaths, but that's all dead code; they're all within
HOLD/RESUME_INTERRUPTS blocks.
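
Schematically (simplified from miscadmin.h):

HOLD_INTERRUPTS();			/* InterruptHoldoffCount++ */

CHECK_FOR_INTERRUPTS();		/* dead code here: ProcessInterrupts()
							 * refuses to act on cancel/die while
							 * InterruptHoldoffCount > 0 */

RESUME_INTERRUPTS();		/* InterruptHoldoffCount-- */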

I replaced the usage of latches with the more traditional
PGSemaphoreLock/Unlock. The semaphore model works just fine in this
case, where we have an lwlock to guard the wait queue, and when a
process is waiting we know it will be woken up, or something has gone
wrong at a pretty low level. We don't need a timeout or to wake up on
other signals while waiting. Furthermore, the WAL writer didn't have a
latch before this patch; it's not many lines of code to initialize the
latch and set up the signal handler for it, but it already has a
semaphore that's ready to use.
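
In outline, the waiter and leader sides look something like this (the
lock name and the queue helpers are illustrative, not the patch's
actual identifiers):

PGPROC	   *proc;

/*
 * Waiter: join the queue under the lwlock, then sleep on our own
 * semaphore, with interruptOK = false since we don't want to wake up
 * for anything else.
 */
LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);	/* illustrative lock */
PushWaiter(MyProc, flushUpTo);					/* illustrative helper */
LWLockRelease(GroupCommitLock);
PGSemaphoreLock(&MyProc->sem, false);

/*
 * Leader: after flushing WAL, pop every waiter whose request the
 * flush satisfied and wake it.
 */
LWLockAcquire(GroupCommitLock, LW_EXCLUSIVE);
while ((proc = PopSatisfiedWaiter(flushedUpTo)) != NULL)
	PGSemaphoreUnlock(&proc->sem);
LWLockRelease(GroupCommitLock);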

I wonder if we should rename the file to "xlogflush.c" or something
like that, to make it clear that this works with any XLOG flushes, not
just commits? Group commit is the usual term for this feature, so we
should definitely mention that in the comments, but it might be better
to rename the files/functions to emphasize that this is about WAL
flushing in general.

This probably needs some further cleanup to fix things I've broken, and
I haven't done any performance testing, but it's progress. Do you have a
shell script or something that you used for the performance tests that I
could run? Or would you like to re-run the tests you did earlier with
this patch?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
