Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Spread checkpoint sync
Date
Msg-id AANLkTimM2Pn1pHyy574DXG_E4r2-dd7WU=r1pU+D8F=L@mail.gmail.com
In response to Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
Re: Spread checkpoint sync  (Marti Raudsepp <marti@juffo.org>)
Re: Spread checkpoint sync  (Greg Smith <greg@2ndquadrant.com>)
List pgsql-hackers
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> No toe damage, this is great, I hadn't gotten to coding for this angle yet
> at all.  Suffering from an overload of ideas and (mostly wasted) test data,
> so thanks for exploring this concept and proving it works.

Yeah - obviously I want to make sure that someone reviews the logic
carefully, since a loss of fsyncs or a corruption of the request queue
could affect system stability - though only very rarely, since you'd
need a full fsync queue plus a crash to hit it.  But the code is
pretty simple, so it should be possible to convince ourselves as to
its correctness (or otherwise).  Obviously, major credit to you and
Simon for identifying the problem and coming up with a proposed fix.

> I'm not sure what to do with the rest of the work I've been doing in this
> area here, so I'm tempted to just combine this new bit from you with the
> older patch I submitted, streamline, and see if that makes sense.  Expected
> to be there already, then "how about spending 5 minutes first checking out
> that autovacuum lock patch again" turned out to be a wild underestimate.

I'd rather not combine the patches, because this one is pretty simple
and just does one thing, but feel free to write something that applies
over top of it.  Looking through your old patch (sync-spread-v3),
there seem to be a couple of components there:

- Compact the fsync queue based on percentage fill rather than number
of writes per absorb.  If we apply my queue-compacting logic, do we
still need this?  The queue compaction may hold the BgWriterCommLock
for slightly longer than AbsorbFsyncRequests() would, but I'm not
inclined to jump to the conclusion that this is worth getting excited
about.  The whole idea of accessing BgWriterShmem->num_requests
without the lock gives me the willies anyway - sure, it'll probably
work OK most of the time, especially on x86, but it seems hard to
predict whether there will be occasional bad behavior on platforms
with weak memory ordering.  (There's a sketch of the locked
alternative just after this list.)

- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests().
Not sure what the motivation for this is.

- CheckpointSyncDelay(), to make sure that we absorb fsync requests
and free up buffers during a long checkpoint.  I think this part is
clearly valuable, although I'm not sure we've satisfactorily solved
the problem of how to spread out the fsyncs and still complete the
checkpoint on schedule.
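
To be concrete about the locked alternative for that first point:
something like the fragment below, where the fill fraction is sampled
under the lock.  FsyncQueueMoreThanHalfFull() is a made-up name, the
fields are just what BgWriterShmemStruct has today, and this is an
untested sketch rather than anything from either patch:

/*
 * Hypothetical helper: sample the request queue fill fraction while
 * holding BgWriterCommLock, instead of peeking at num_requests
 * unlocked.  The lock is held only long enough to read two integers,
 * so the extra lock traffic should be noise.
 */
static bool
FsyncQueueMoreThanHalfFull(void)
{
    bool    result;

    LWLockAcquire(BgWriterCommLock, LW_SHARED);
    result = (BgWriterShmem->num_requests * 2 > BgWriterShmem->max_requests);
    LWLockRelease(BgWriterCommLock);

    return result;
}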

As to that, I have a couple of half-baked ideas I'll throw out so you
can laugh at them.  Some of these may be recycled versions of ideas
you've already had/mentioned, so, again, credit to you for getting the
ball rolling.

Idea #1: When we absorb fsync requests, don't just remember that there
was an fsync request; also remember the time of said fsync request.
If a new fsync request arrives for a segment for which we're already
remembering an fsync request, update the timestamp on the request.
Periodically scan the fsync request queue for requests older than,
say, 30 s, and perform one such request.  The idea is - if we wrote a
bunch of data to a relation and then haven't touched it for a while,
force it out to disk before the checkpoint actually starts so that the
volume of work required by the checkpoint is lessened.
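
Just to pin down the bookkeeping, here's a simplified, standalone
sketch of what I mean - a plain array and time(NULL) standing in for
the real pending-ops hash table, and every name in it (PendingFsync,
remember_fsync_request, sync_one_stale_request) invented for the
example:

#include <stdio.h>
#include <time.h>

#define MAX_PENDING    1024
#define STALE_SECONDS  30

typedef struct
{
    int     segment_id;     /* stand-in for the (relation, segment) key */
    time_t  last_request;   /* time of the most recent fsync request */
    int     in_use;
} PendingFsync;

static PendingFsync pending[MAX_PENDING];

/* Remember an fsync request; if we already have one for this segment,
 * just freshen its timestamp. */
static void
remember_fsync_request(int segment_id)
{
    int     i;
    int     free_slot = -1;

    for (i = 0; i < MAX_PENDING; i++)
    {
        if (pending[i].in_use && pending[i].segment_id == segment_id)
        {
            pending[i].last_request = time(NULL);
            return;
        }
        if (!pending[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0)
    {
        pending[free_slot].segment_id = segment_id;
        pending[free_slot].last_request = time(NULL);
        pending[free_slot].in_use = 1;
    }
}

/* Called periodically: sync at most one segment whose newest request is
 * more than 30 s old, so quiescent relations get pushed out before the
 * checkpoint ever starts. */
static void
sync_one_stale_request(void)
{
    time_t  now = time(NULL);
    int     i;

    for (i = 0; i < MAX_PENDING; i++)
    {
        if (pending[i].in_use &&
            now - pending[i].last_request >= STALE_SECONDS)
        {
            printf("would fsync segment %d ahead of the checkpoint\n",
                   pending[i].segment_id);
            pending[i].in_use = 0;
            return;             /* just one per call */
        }
    }
}

int
main(void)
{
    remember_fsync_request(42);
    sync_one_stale_request();   /* nothing is stale yet, so no output */
    return 0;
}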

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs.  Start by
performing any fsyncs that need to happen but have no buffers to sync
(i.e. everything that must be written to that relation has already
been written).  Then, start performing the writes, decrementing the
pending-write counters as you go.  If the pending-write count for a
relation hits zero, you can add it to the list of fsyncs that can be
performed before the writes are finished.  If it doesn't hit zero
(perhaps because a non-bgwriter process wrote a buffer that we were
going to write anyway), then we'll do it at the end.  One problem with
this - aside from complexity - is that most likely most fsyncs would
either happen at the beginning or very near the end, because there's
no reason to assume that buffers for the same relation would be
clustered together in shared_buffers.  But I'm inclined to think that
at least the idea of performing fsyncs for which no dirty buffers
remain in shared_buffers at the beginning of the checkpoint rather
than at the end might have some value.
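
Here's a toy, self-contained version of that interleaving, with
made-up arrays standing in for shared_buffers and the pending-fsync
hash table.  It only shows the counting and syncing as soon as a
count hits zero, not the leftover-at-the-end case:

#include <stdio.h>

#define NRELS  4                /* toy number of relations */
#define NBUFS  16               /* toy number of shared buffers */

int
main(void)
{
    /* Which relation each buffer belongs to, and whether it's dirty;
     * filled in arbitrarily for the example.  Relation 2 has queued
     * fsyncs but no dirty buffers left. */
    int     buf_rel[NBUFS]   = {0, 1, 1, 3, 0, 1, 3, 3, 0, 1, 1, 3, 0, 0, 1, 3};
    int     buf_dirty[NBUFS] = {1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0};
    int     fsync_pending[NRELS] = {1, 1, 1, 1};
    int     dirty_count[NRELS] = {0, 0, 0, 0};
    int     i, rel;

    /* Checkpoint start: count the dirty buffers per relation. */
    for (i = 0; i < NBUFS; i++)
        if (buf_dirty[i])
            dirty_count[buf_rel[i]]++;

    /* Relations with pending fsyncs but nothing left to write can be
     * synced immediately. */
    for (rel = 0; rel < NRELS; rel++)
        if (fsync_pending[rel] && dirty_count[rel] == 0)
            printf("fsync rel %d up front\n", rel);

    /* Write phase: decrement the counters, syncing each relation as soon
     * as its count hits zero instead of waiting for the end. */
    for (i = 0; i < NBUFS; i++)
    {
        if (!buf_dirty[i])
            continue;
        printf("write buffer %d (rel %d)\n", i, buf_rel[i]);
        if (--dirty_count[buf_rel[i]] == 0 && fsync_pending[buf_rel[i]])
            printf("fsync rel %d, interleaved with the writes\n",
                   buf_rel[i]);
    }
    return 0;
}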

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
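
The arithmetic I'm picturing is nothing more than this (variable
names invented, and the interaction with checkpoint_completion_target
is just my guess at where the adjustment would land):

#include <stdio.h>

int
main(void)
{
    double  checkpoint_timeout = 300.0;     /* seconds */
    double  completion_target = 0.5;        /* checkpoint_completion_target */
    int     nfsyncs = 20;                   /* counted at checkpoint start */
    double  secs_per_fsync = 3.0;           /* the fixed inter-fsync delay */

    /* Time budget for the write phase as the scheduling works today. */
    double  write_budget = checkpoint_timeout * completion_target;

    /* Back the write-phase deadline up by 3 s per expected fsync, so the
     * whole checkpoint (writes + spread-out fsyncs) still ends on time. */
    double  adjusted_budget = write_budget - nfsyncs * secs_per_fsync;

    printf("write phase gets %.0f s instead of %.0f s\n",
           adjusted_budget, write_budget);
    return 0;
}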

Idea #4: For ext3 filesystems that like to dump the entire buffer
cache instead of only the requested file, write a little daemon that
runs alongside of (and completely independently of) PostgreSQL.  Every
30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
closes the file, thus dumping the cache and preventing a ridiculous
growth in the amount of data to be sync'd at checkpoint time.
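
That daemon really is only a dozen lines or so; a rough sketch, with
the path and interval hard-coded and no real error handling:

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char    byte = 0;

    for (;;)
    {
        int     fd = open("/var/tmp/fsync-nudge", O_WRONLY | O_CREAT, 0600);

        if (fd >= 0)
        {
            byte ^= 1;                      /* actually change the byte */
            (void) write(fd, &byte, 1);
            (void) fsync(fd);               /* the point: on ext3 this can
                                             * flush far more than this file */
            close(fd);
        }
        sleep(30);
    }
}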

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

