From: Tom Lane
Subject: Re: Load Distributed Checkpoints, take 3
Msg-id: 9651.1182471272@sss.pgh.pa.us
In response to: Re: Load Distributed Checkpoints, take 3 (Heikki Linnakangas <heikki@enterprisedb.com>)
List: pgsql-patches
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> So the question is, why in the heck would anyone want the behavior that
>> "checkpoints take exactly X time"??

> Because it's easier to tune. You don't need to know how much checkpoint
> I/O you can tolerate. The system will use just enough I/O bandwidth to
> meet the deadline, but not more than that.

Uh, not really.  After consuming more caffeine and reading the patch
more carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

    writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.  (BTW, the patch also
seems inconsistent about whether checkpoint_rate is an int or a float.)
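
For contrast, here is a toy sketch of the scaling I would expect, on
the assumption that checkpoint_rate means pages per second
(BgWriterDelay is in milliseconds; the default shown is only an
illustration, not the patch's):

    #include <stdio.h>

    #define Max(a,b)    ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
        double  checkpoint_rate = 100.0;    /* pages/sec; assumed default */
        int     BgWriterDelay = 200;        /* ms per bgwriter nap */
        int     writes_per_nap;

        /*
         * pages/sec times seconds-per-nap gives pages per nap, so the
         * quota has to grow, not shrink, as the naps get longer.
         */
        writes_per_nap = Max(1, (int) (checkpoint_rate * BgWriterDelay / 1000.0));
        printf("writes_per_nap = %d\n", writes_per_nap);    /* 20 */

        /*
         * Doubling the nap doubles the quota, holding the physical write
         * rate constant; the patch's division would halve it instead.
         */
        BgWriterDelay = 400;
        writes_per_nap = Max(1, (int) (checkpoint_rate * BgWriterDelay / 1000.0));
        printf("writes_per_nap = %d\n", writes_per_nap);    /* 40 */

        return 0;
    }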

2. checkpoint_smoothing is used thusly:

    /* scale progress according to CheckPointSmoothing */
    progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This
is then compared against estimates of the total fraction of the
time-between-checkpoints that has elapsed: if the scaled progress is
less, we are behind schedule and should not nap; if it is more, we are
ahead of schedule and may nap.
(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)
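
To make that equivalence concrete, a minimal sketch of the comparison
as I read it (the names follow the patch; the CheckPointSmoothing value
is illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    static double CheckPointSmoothing = 0.5;    /* illustrative value */

    /* true => ahead of schedule, so the bgwriter may nap */
    static bool
    ahead_of_schedule(double progress, double elapsed_fraction)
    {
        /* scale progress according to CheckPointSmoothing ... */
        return progress * CheckPointSmoothing > elapsed_fraction;

        /*
         * ... which tests exactly the same thing as
         *     progress > elapsed_fraction / CheckPointSmoothing
         * i.e. the writes are scheduled to complete when a
         * CheckPointSmoothing fraction of the interval has elapsed.
         */
    }

    int
    main(void)
    {
        /*
         * 40% of the dirty pages written, 10% of the interval gone:
         * 0.4 * 0.5 = 0.2 > 0.1, so we are ahead and may nap.
         */
        printf("%s\n", ahead_of_schedule(0.4, 0.1) ? "nap" : "write");
        return 0;
    }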

What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not "smoothing" to me, that's introducing some pretty bursty
behavior.  ISTM that actual "smoothing" would involve adjusting
writes_per_nap up or down according to whether we are ahead or behind
schedule, so as to have a finer degree of control over the I/O rate.
(I'd also consider saving the last writes_per_nap value across
checkpoints so as to have a more nearly accurate starting value next
time.)
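
In code, the sort of feedback loop I have in mind might look about
like this; every name and constant below is illustrative, none of it
is from the patch:

    #include <stdio.h>

    /* kept across checkpoints so the next one starts near the right rate */
    static int writes_per_nap = 1;

    static void
    adjust_writes_per_nap(double progress, double elapsed_fraction)
    {
        if (progress < elapsed_fraction)
            writes_per_nap++;       /* behind schedule: write more per nap */
        else if (progress > elapsed_fraction && writes_per_nap > 1)
            writes_per_nap--;       /* ahead of schedule: ease off gradually */
    }

    int
    main(void)
    {
        /*
         * Behind at first, then ahead: the quota ramps up and back down
         * instead of snapping between the minimum rate and flat out.
         */
        adjust_writes_per_nap(0.10, 0.30);
        adjust_writes_per_nap(0.20, 0.35);
        printf("writes_per_nap = %d\n", writes_per_nap);    /* 3 */
        adjust_writes_per_nap(0.60, 0.40);
        printf("writes_per_nap = %d\n", writes_per_nap);    /* 2 */
        return 0;
    }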

In any case I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter.  Something along the
lines of "checkpoint_completion_target" would convey more about what it
does, I think.
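
Under that rename, the setting might read like so in postgresql.conf
(the value shown is only an illustration):

    # fraction of the checkpoint interval within which the
    # checkpoint's writes should be finished
    checkpoint_completion_target = 0.5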

And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1.  This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate
that we probably need not expose such a parameter at all; just hardwire
the minimum writes_per_nap at 1.
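
Spelled out as a toy check (reading the quoted Min as the Max floor
the surrounding logic evidently intends, and assuming the proposed
default checkpoint_rate is 100):

    #include <stdio.h>

    #define Max(a,b)    ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
        int     BgWriterDelay = 200;    /* default, in ms */

        /*
         * Integer division keeps the result pinned at the floor of 1
         * until checkpoint_rate reaches 400, i.e. 4x the default.
         */
        for (int checkpoint_rate = 100; checkpoint_rate <= 400; checkpoint_rate += 100)
            printf("checkpoint_rate %3d -> writes_per_nap %d\n",
                   checkpoint_rate,
                   Max(1, checkpoint_rate / BgWriterDelay));
        return 0;
    }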

            regards, tom lane
