Re: Load Distributed Checkpoints, revised patch - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Load Distributed Checkpoints, revised patch
Date
Msg-id 4674E807.3050805@enterprisedb.com
In response to Re: Load Distributed Checkpoints, revised patch  ("Simon Riggs" <simon@2ndquadrant.com>)
Responses Re: Load Distributed Checkpoints, revised patch  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-patches
Simon Riggs wrote:
> On Fri, 2007-06-15 at 11:34 +0100, Heikki Linnakangas wrote:
>
>> - What units should we use for the new GUC variables? From an
>> implementation point of view, it would be simplest if
>> checkpoint_write_rate is given as pages/bgwriter_delay, similarly to
>> bgwriter_*_maxpages. I never liked those *_maxpages settings, though; a
>> more natural unit from the user's perspective would be KB/s.
>
> checkpoint_maxpages would seem like a better name; we've already had
> those _maxpages settings for 3 releases, so changing that is not really
> an option (at so late a stage).

As Tom pointed out, we don't promise compatibility of conf-files over
major releases. I wasn't actually thinking of changing any of the
existing parameters, just thinking about the best name and behavior for
the new ones.

> We don't really care about units because
> the way you use it is to nudge it up a little and see if that works
> etc..

Not necessarily. If it's given in KB/s, you might very well have an idea
of how much I/O your hardware is capable of, and set aside a fraction of
that for checkpoints.
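
To illustrate how a KB/s setting could map onto the per-round unit the
implementation prefers, here is a rough sketch of the conversion. The
function name is hypothetical and not part of the patch, and the 8 KB page
size assumes the default BLCKSZ.

```c
#include <assert.h>

/* Assumption for illustration: 8 KB pages, the default BLCKSZ. */
#define PAGE_SIZE_KB 8

/*
 * Hypothetical helper (not from the patch): convert a write rate given in
 * KB/s into a number of pages to write per bgwriter round, where a round
 * lasts bgwriter_delay_ms milliseconds. Rounds up, so that any nonzero
 * rate writes at least one page per round.
 */
int
kb_per_sec_to_pages_per_round(int rate_kb_per_sec, int bgwriter_delay_ms)
{
    assert(rate_kb_per_sec >= 0 && bgwriter_delay_ms > 0);

    /* KB written per round, scaled by 1000 to stay in integer math */
    long scaled_kb = (long) rate_kb_per_sec * bgwriter_delay_ms;
    long denom = 1000L * PAGE_SIZE_KB;

    return (int) ((scaled_kb + denom - 1) / denom);
}
```

With bgwriter_delay at its default of 200 ms, a budget of 4000 KB/s works
out to 100 pages per round.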

> Can we avoid having another parameter? There must be some protection in
> there to check that a checkpoint lasts for no longer than
> checkpoint_timeout, so it makes most sense to vary the checkpoint in
> relation to that parameter.

Sure, that's what checkpoint_write_percent is for. checkpoint_rate can
be used to finish the checkpoint faster, if there's not much work to do.
For example, if there are only 10 pages to flush in a checkpoint,
checkpoint_timeout is 30 minutes and checkpoint_write_percent = 50%, you
don't want to spread out those 10 writes over 15 minutes; that would be
just silly. checkpoint_rate sets the *minimum* rate used to write. If
writing at that minimum rate isn't enough to finish the checkpoint in
time, as defined by checkpoint interval * checkpoint_write_percent,
we write more aggressively.
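
As a rough sketch of that rule (not the patch's actual code), the rate
could be chosen as the larger of the configured minimum and the rate needed
to meet the deadline. The function name, the pages-per-second unit, and the
one-second clamp for an overdue checkpoint are all assumptions made for
illustration.

```c
#include <assert.h>

/*
 * Hypothetical sketch (not from the patch): pick the checkpoint write rate
 * in pages per second.
 *
 *   min_rate      floor set by checkpoint_rate, in pages/s (assumed unit)
 *   pages_left    dirty pages still to be written
 *   seconds_left  time remaining until the target, i.e.
 *                 checkpoint interval * checkpoint_write_percent
 *                 minus the time already spent
 */
double
checkpoint_pages_per_sec(double min_rate, long pages_left, double seconds_left)
{
    assert(min_rate >= 0.0 && pages_left >= 0);

    if (pages_left == 0)
        return 0.0;             /* nothing left to write */

    /* Assumption: if we are at or past the deadline, aim to finish
     * within one second rather than dividing by zero. */
    if (seconds_left < 1.0)
        seconds_left = 1.0;

    double required = (double) pages_left / seconds_left;

    /* Never write slower than the configured minimum rate. */
    return (required > min_rate) ? required : min_rate;
}
```

In the example above, 10 pages over 900 seconds would need only about 0.01
pages/s, so a minimum of, say, 100 pages/s wins and the checkpoint finishes
almost immediately. With 180000 pages left in the same 900 seconds, the
required 200 pages/s exceeds that minimum and is used instead.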

I'm more interested in checkpoint_write_percent myself as well, but Greg
Smith said he wanted the checkpoint to use a constant I/O rate and let
the length of the checkpoint vary.

>> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
>> Bgwriter now needs to deal with immediate checkpoint requests, like those
>> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
>> from those triggered by checkpoint_segments. I'm afraid there might be
>> race conditions when a CHECKPOINT is issued at the same instant as
>> checkpoint_segments triggers one. What might happen then is that the
>> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
>> command has to wait for that to finish, which might take a long time. I
>> have not been able to convince myself either that the race condition
>> exists or that it doesn't.
>
> Is there a mechanism for requesting immediate/non-immediate checkpoints?

No, CHECKPOINT requests an immediate one. Is there a use case for
CHECKPOINT LAZY?

> pg_start_backup() should be a normal checkpoint I think. No need for
> backup to be an intrusive process.

Good point. A spread-out checkpoint can take a long time to finish,
though. Is there a risk of running into a timeout or something if a call
to pg_start_backup takes, say, 10 minutes to finish?

>> - To coordinate the writes with checkpoint_segments, we need to
>> read the WAL insertion location. To do that, we need to acquire the
>> WALInsertLock. That means that in the worst case, WALInsertLock is
>> acquired every bgwriter_delay when a checkpoint is in progress. I don't
>> think that's a problem, it's only held for a very short duration, but I
>> thought I'd mention it.
>
> I think that is a problem.

Why?

> Do we need to know it so exactly that we look
> at WALInsertLock? Maybe use info_lck to request the latest page, since
> that is less heavily contended and we need never wait across I/O.

Is there such a value available that's protected by just info_lck? I
can't see one.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
