Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Spread checkpoint sync
Msg-id AANLkTikjC24drpRYriJr=roQ_bBNJ+BOJnUndVaQ4wrr@mail.gmail.com
In response to Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
List pgsql-hackers
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Once upon a time we got a patch from Itagaki Takahiro whose purpose was to
> sort writes before sending them out:
>
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

> Which has very low odds of the sync on "a" finishing quickly, we'd get this
> one:
>
> table block
> a 1
> a 2
> b 1
> b 2
> c 1
> c 2
> sync a
> sync b
> sync c
>
> Which sure seems like a reasonable way to improve the odds data has been
> written before the associated sync comes along.

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.  It seems to me that it's
going to be a bear to figure out what fraction of the checkpoint
you've completed if you put all of the syncs at the end, and this
whole problem appears to be predicated on the assumption that the OS
*isn't* writing out in a timely fashion.  Are we sure that postponing
the fsync relative to the writes is anything more than wishful
thinking?
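
To make the interleaved ordering concrete, here is a minimal Python sketch. It is not PostgreSQL code; the function name `interleaved_schedule` and the (file, block) tuple layout are assumptions made for illustration. It sorts the pending writes so that each file's blocks are adjacent, then emits a sync for a file immediately after that file's last write.

```python
from itertools import groupby


def interleaved_schedule(writes):
    """Return a list of operations with each file synced right after its writes.

    `writes` is an iterable of (file, block) pairs (a hypothetical layout).
    It is sorted here so that all blocks of one file are adjacent.
    """
    ops = []
    for name, group in groupby(sorted(writes), key=lambda w: w[0]):
        for _, block in group:
            ops.append(("write", name, block))
        # All writes for this file have been issued, so sync it now
        # instead of waiting until the end of the checkpoint.
        ops.append(("sync", name, None))
    return ops


schedule = interleaved_schedule(
    [("b", 2), ("a", 1), ("c", 1), ("a", 2), ("c", 2), ("b", 1)]
)
for op, name, block in schedule:
    print(f"{name} {block}" if op == "write" else f"sync {name}")
```

The "certain number of seconds after the last block" variant would replace the immediate sync with a timer per file; that part is left out of the sketch.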

> Also, I could just traverse the sorted list with some simple logic to count
> the number of unique files, and then set the delay between fsync writes
> based on it.  In the above, once the list was sorted, easy to just see how
> many times the table name changes on a linear scan of the sorted data.  3
> files, so if the checkpoint target gives me, say, a minute of time to sync
> them, I can delay 20 seconds between.  Simple math, and exactly the sort I

How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.
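
For reference, the arithmetic in the quoted proposal amounts to something like the following Python sketch. The names `count_files` and `sync_delay` are hypothetical, and the sync window is simply passed in as a parameter; where that window would actually come from is exactly the open question above.

```python
def count_files(sorted_writes):
    """Count distinct files by watching for name changes on one linear scan.

    Assumes `sorted_writes` is a list of (file, block) pairs already sorted
    by file, so equal names are adjacent.
    """
    count = 0
    previous = object()  # sentinel that never equals a real file name
    for name, _block in sorted_writes:
        if name != previous:
            count += 1
            previous = name
    return count


def sync_delay(sorted_writes, sync_window_seconds):
    """Spread the assumed sync window evenly across the files to sync."""
    files = count_files(sorted_writes)
    return sync_window_seconds / files if files else 0.0


writes = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("c", 2)]
print(count_files(writes), sync_delay(writes, 60))
```

With the three files from the example and an assumed one-minute window, this gives the 20-second spacing mentioned in the quote.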

> So I fixed the bitrot on the old sorted patch, which was fun as it came from
> before the 8.3 changes.  It seemed to work.  I then moved the structure it
> uses to hold the list of buffers to write, the thing that's sorted, into
> shared memory.  It's got a predictable maximum size, relying on palloc in
> the middle of the checkpoint code seems bad, and there's some potential gain
> from not reallocating it every time through.

Well, you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.

> There's good bits in the patch I submitted for the last CF and in the patch
> you wrote earlier this week.  This unfinished patch may be a valuable idea
> to fit in there too once I fix it, or maybe it's fundamentally flawed and
> one of the other ideas you suggested (or I have sitting on the potential
> design list) will work better.  There's a patch integration problem that
> needs to be solved here, but I think almost all the individual pieces are
> available.  I'd hate to see this fail to get integrated now just for lack of
> time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either.  I
think we're in a position to crush the full-fsync-queue problem flat
(my patch should do that, and there are several other obvious things
we can do for extra certainty) but the problem of spreading out the
fsyncs looks to me like something we don't completely know how to
solve.  If we can find something that's a modest improvement on the
status quo, and that we can quickly become confident in, good, but I'd
rather have 9.1 go out the door on time without fully fixing this than
delay the release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

