Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Spread checkpoint sync
Date
Msg-id AANLkTikj76AzK4yEQ7Sn-guaXpCUEuAons23mnU3NSV1@mail.gmail.com
Whole thread Raw
In response to Re: Spread checkpoint sync  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Spread checkpoint sync  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Spread checkpoint sync  (Jim Nasby <jim@nasby.net>)
List pgsql-hackers
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
>> Robert Haas wrote:
>> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> >
>> > > One of the ideas Simon and I had been considering at one point was adding
>> > > some better de-duplication logic to the fsync absorb code, which I'm
>> > > reminded by the pattern here might be helpful independently of other
>> > > improvements.
>> > >
>> >
>> > Hopefully I'm not stepping on any toes here, but I thought this was an
>> > awfully good idea and had a chance to take a look at how hard it would
>> > be today while en route from point A to point B.  The answer turned
>> > out to be "not very", so PFA a patch that seems to work.  I tested it
>> > by attaching gdb to the background writer while running pgbench, and
>> > it eliminate the backend fsyncs without even breaking a sweat.
>> >
>>
>> No toe damage, this is great, I hadn't gotten to coding for this angle
>> yet at all.  Suffering from an overload of ideas and (mostly wasted)
>> test data, so thanks for exploring this concept and proving it works.
>
> No toe damage either, but are we sure we want the de-duplication logic
> and in this place?
>
> I was originally of the opinion that de-duplicating the list would save
> time in the bgwriter, but that guess was wrong by about two orders of
> magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

Well, the point of this is not to save time in the bgwriter - I'm not
surprised to hear that wasn't noticeable.  The point is that when the
fsync request queue fills up, backends start performing an fsync *for
every block they write*, and that's about as bad for performance as
it's possible to be.  So it's worth going to a little bit of trouble
to try to make sure it doesn't happen.  It didn't happen *terribly*
frequently before, but it does seem to be common enough to worry about
- e.g. on one occasion, I was able to reproduce it just by running
pgbench -i -s 25 or something like that on a laptop.

With this patch applied, there's no performance impact vs. current
code in the very, very common case where space remains in the queue -
999 times out of 1000, writing to the fsync queue will be just as fast
as ever.  But in the unusual case where the queue has been filled up,
compacting the queue is much much faster than performing an fsync, and
the best part is that the compaction is generally massive.  I was
seeing things like "4096 entries compressed to 14".  So clearly even
if the compaction took as long as the fsync itself it would be worth
it, because the next 4000+ guys who come along again go through the
fast path.  But in fact I think it's much faster than an fsync.

In order to get pathological behavior even with this patch applied,
you'd need to have NBuffers pending fsync requests and they'd all have
to be different.  I don't think that's theoretically impossible, but
Greg's research seems to indicate that even on busy systems we don't
come even a little bit close to the circumstances that would cause it
to occur in practice.  Every other change we might make in this area
will further improve this case, too: for example, doing an absorb
after each fsync would presumably help, as would the more drastic step
of splitting the bgwriter into two background processes (one to do
background page cleaning, and the other to do checkpoints, for
example).  But even without those sorts of changes, I think this is
enough to effectively eliminate the full fsync queue problem in
practice, which seems worth doing independently of anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: ALTER TYPE 0: Introduction; test cases
Next
From: Robert Haas
Date:
Subject: Re: Bug in pg_describe_object, patch v2