Re: Group commit, revised - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Group commit, revised
Date
Msg-id 4F13DBC9.3070603@enterprisedb.com
Whole thread Raw
In response to Group commit, revised  (Peter Geoghegan <peter@2ndquadrant.com>)
Responses Re: Group commit, revised
List pgsql-hackers
On 16.01.2012 00:42, Peter Geoghegan wrote:
> I've also attached the results of a pgbench-tools driven benchmark,
> which are quite striking (Just the most relevant image - e-mail me
> privately if you'd like a copy of the full report, as I don't want to
> send a large PDF file to the list as a courtesy to others). Apart from
> the obvious improvement in throughput, there is also a considerable
> improvement in average and worst latency at all client counts.

Impressive results. How about uploading the PDF to the community wiki?

> The main way that I've added value here is by refactoring and fixing
> bugs. There were some tricky race conditions that caused the
> regression tests to fail for that early draft patch, but there were a
> few other small bugs too. There is an unprecedented latch pattern
> introduced by this patch: Two processes (the WAL Writer and any given
> connection backend) have a mutual dependency. Only one may sleep, in
> which case the other is responsible for waking it. Much of the
> difficulty in debugging this patch, and much of the complexity that
> I've added, came from preventing both from simultaneously sleeping,
> even in the face of various known failure modes like postmaster death.
> If this happens, it does of course represent a denial-of-service, so
> that's something that reviewers are going to want to heavily
> scrutinise. I suppose that sync rep should be fine here, as it waits
> on walsenders, but I haven't actually comprehensively tested the
> patch, so there may be undiscovered unpleasant interactions with other
> xlog related features. I can report that it passes the regression
> tests successfully, and on an apparently consistently basis - I
> battled with intermittent failures for a time.

I think it might be simpler if it wasn't the background writer that's 
responsible for "driving" the group commit queue, but the backends 
themselves. When a flush request comes in, you join the queue, and if 
someone else is already doing the flush, sleep until the driver wakes 
you up. If no-one is doing the flush yet (ie. the queue is empty), start 
doing it yourself. You'll need a state variable to keep track who's 
driving the queue, but otherwise I think it would be simpler as there 
would be no dependency on WAL writer.

I tend think of the group commit facility as a bus. Passengers can hop 
on board at any time, and they take turns on who drives the bus. When 
the first passengers hops in, there is no driver so he takes the driver 
seat. When the next passenger hops in, he sees that someone is driving 
the bus already, so he sits down, and places a big sign on his forehead 
stating the bus stop where he wants to get off, and goes to sleep. When 
the driver has reached his own bus stop, he wakes up all the passengers 
who wanted to get off at the same stop or any of the earlier stops [1]. 
He also wakes up the passenger whose bus stop is the farthest from the 
current stop, and gets off the bus. The woken-up passengers who have 
already reached their stops can immediately get off the bus, and the one 
who has not notices that no-one is driving the bus anymore, so he takes 
the driver seat.

[1] in a real bus, a passenger would not be happy if he's woken up too 
late and finds himself at the next stop instead of the one where he 
wanted to go, but for group commit, that is fine.

In this arrangement, you could use the per-process semaphore for the 
sleep/wakeups, instead of latches. I'm not sure if there's any 
difference, but semaphores are more tried and tested, at least.

> The benefits (and, admittedly, the controversies) of this patch go
> beyond mere batching of commits: it also largely, though not entirely,
> obviates the need for user backends to directly write/flush WAL, and
> the WAL Writer may never sleep if it continually finds work to do -
> wal_writer_delay is obsoleted, as are commit_siblings and
> commit_delay.

wal_writer_delay is still needed for controlling how often asynchronous 
commits are flushed to disk.

> Auxiliary processes cannot use group commit. The changes made prevent
> them from availing of commit_siblings/commit_delay parallelism,
> because it doesn't exist anymore.

Auxiliary processes have PGPROC entries too. Why can't they participate?

> Group commit is sometimes throttled, which seems appropriate - if a
> backend requests that the WAL Writer flush an LSN deemed too far from
> the known flushed point, that request is rejected and the backend goes
> through another path, where XLogWrite() is called.

Hmm, if the backend doing the big flush gets the WALWriteLock before a 
bunch of group committers, the group committers will have to wait until 
the big flush is finished, anyway. I presume the idea of the throttling 
is to avoid the situation where a bunch of small commits need to wait 
for a huge flush to finish.

Perhaps the big flusher should always join the queue, but use some 
heuristic to first flush up to the previous commit request, to wake up 
others quickly, and do another flush to flush its own request after that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Ilya Kosmodemiansky
Date:
Subject: Re: SKIP LOCKED DATA
Next
From: Fujii Masao
Date:
Subject: Re: Standalone synchronous master