Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Spread checkpoint sync
Date
Msg-id 1295124798.3282.5.camel@ebony
Whole thread Raw
In response to Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote:
> On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
> >> Robert Haas wrote:
> >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> >> >
> >> > > One of the ideas Simon and I had been considering at one point was adding
> >> > > some better de-duplication logic to the fsync absorb code, which I'm
> >> > > reminded by the pattern here might be helpful independently of other
> >> > > improvements.
> >> > >
> >> >
> >> > Hopefully I'm not stepping on any toes here, but I thought this was an
> >> > awfully good idea and had a chance to take a look at how hard it would
> >> > be today while en route from point A to point B.  The answer turned
> >> > out to be "not very", so PFA a patch that seems to work.  I tested it
> >> > by attaching gdb to the background writer while running pgbench, and
> >> > it eliminate the backend fsyncs without even breaking a sweat.
> >> >
> >>
> >> No toe damage, this is great, I hadn't gotten to coding for this angle
> >> yet at all.  Suffering from an overload of ideas and (mostly wasted)
> >> test data, so thanks for exploring this concept and proving it works.
> >
> > No toe damage either, but are we sure we want the de-duplication logic
> > and in this place?
> >
> > I was originally of the opinion that de-duplicating the list would save
> > time in the bgwriter, but that guess was wrong by about two orders of
> > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.
> 
> Well, the point of this is not to save time in the bgwriter - I'm not
> surprised to hear that wasn't noticeable.  The point is that when the
> fsync request queue fills up, backends start performing an fsync *for
> every block they write*, and that's about as bad for performance as
> it's possible to be.  So it's worth going to a little bit of trouble
> to try to make sure it doesn't happen.  It didn't happen *terribly*
> frequently before, but it does seem to be common enough to worry about
> - e.g. on one occasion, I was able to reproduce it just by running
> pgbench -i -s 25 or something like that on a laptop.
> 
> With this patch applied, there's no performance impact vs. current
> code in the very, very common case where space remains in the queue -
> 999 times out of 1000, writing to the fsync queue will be just as fast
> as ever.  But in the unusual case where the queue has been filled up,
> compacting the queue is much much faster than performing an fsync, and
> the best part is that the compaction is generally massive.  I was
> seeing things like "4096 entries compressed to 14".  So clearly even
> if the compaction took as long as the fsync itself it would be worth
> it, because the next 4000+ guys who come along again go through the
> fast path.  But in fact I think it's much faster than an fsync.
> 
> In order to get pathological behavior even with this patch applied,
> you'd need to have NBuffers pending fsync requests and they'd all have
> to be different.  I don't think that's theoretically impossible, but
> Greg's research seems to indicate that even on busy systems we don't
> come even a little bit close to the circumstances that would cause it
> to occur in practice.  Every other change we might make in this area
> will further improve this case, too: for example, doing an absorb
> after each fsync would presumably help, as would the more drastic step
> of splitting the bgwriter into two background processes (one to do
> background page cleaning, and the other to do checkpoints, for
> example).  But even without those sorts of changes, I think this is
> enough to effectively eliminate the full fsync queue problem in
> practice, which seems worth doing independently of anything else.

You've persuaded me.

-- Simon Riggs           http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: LOCK for non-tables
Next
From: Magnus Hagander
Date:
Subject: Include WAL in base backup