Re: Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Load Distributed Checkpoints, take 3
Date
Msg-id 467B85ED.4040207@enterprisedb.com
Whole thread Raw
In response to Re: Load Distributed Checkpoints, take 3  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Load Distributed Checkpoints, take 3
List pgsql-patches
Tom Lane wrote:
> 1. checkpoint_rate is used thusly:
>
>     writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);
>
> where writes_per_nap is the max number of dirty blocks to write before
> taking a bgwriter nap.  Now surely this is completely backward: if
> BgWriterDelay is increased, the number of writes to allow per nap
> decreases?  If you think checkpoint_rate is expressed in some kind of
> physical bytes/sec unit, that cannot be right; the number of blocks
> per nap has to increase if the naps get longer.

Uh, you're right, that's just plain wrong. checkpoint_rate is in
pages/s, so that line should be

writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000)

> (BTW, the patch seems
> a bit schizoid about whether checkpoint_rate is int or float.)

Yeah, I've gone back and forth on the data type. I wanted it to be a
float, but guc code doesn't let you specify a float in KB, so I switched
it to int.

> 2. checkpoint_smoothing is used thusly:
>
>     /* scale progress according to CheckPointSmoothing */
>     progress *= CheckPointSmoothing;
>
> where the progress value being scaled is the fraction so far completed
> of the total number of dirty pages we expect to have to write.  This
> is then compared against estimates of the total fraction of the time-
> between-checkpoints that has elapsed; if less, we are behind schedule
> and should not nap, if more, we are ahead of schedule and may nap.

> (This is a bit odd, but I guess it's all right because it's equivalent
> to dividing the elapsed-time estimate by CheckPointSmoothing, which
> seems a more natural way of thinking about what needs to happen.)

Yeah, it's a bit unnatural. I did it that way so we don't have to divide
all three of the estimates: cached_elapsed, progress_in_time and
progress_in_xlog. Maybe it's not worth micro-optimizing that.

> What's bugging me about this is that we are either going to be writing
> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
> that's not "smoothing" to me, that's introducing some pretty bursty
> behavior.  ISTM that actual "smoothing" would involve adjusting
> writes_per_nap up or down according to whether we are ahead or behind
> schedule, so as to have a finer degree of control over the I/O rate.
> (I'd also consider saving the last writes_per_nap value across
> checkpoints so as to have a more nearly accurate starting value next
> time.)

That sounds a lot more complex. The burstiness at that level shouldn't
matter much. The OS is still going to cache the writes, and should even
out the bursts.

Assuming time/xlogs elapse at a steady rate, we will write some multiple
of writes_per_nap pages between each sleep. With a small writes_per_nap,
writing just writes_per_nap isn't enough to catch up after we fall
behind, so we'll write more than that between each sleep. That means
that on each iteration, we'll write either N*writes_per_nap, or
(N+1)*writes_per_nap. At worst, that means either writes_per_nap or
2*writes_per_nap pages on each iteration. That's not too bad, I think.

> In any case I still concur with Takahiro-san that "smoothing" doesn't
> seem the most appropriate name for the parameter.  Something along the
> lines of "checkpoint_completion_target" would convey more about what it
> does, I think.

Ok, I'm not wedded to smoothing.

> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum.  However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1.

Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s,
  using the non-broken formula above, we get:

(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

So I think that's fine. (looking at the patch, I see that the default in
guc.c is actually 100 pages / s, contrary to documentation; needs to be
fixed)

> This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.

I've set checkpoint_rate to a small value in my tests on purpose to
control the feature with the other parameter. That must be why I haven't
noticed the bogus calculation of it before.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

pgsql-patches by date:

Previous
From: Neil Conway
Date:
Subject: Re: psql: flush output in cursor-fetch mode
Next
From: Magnus Hagander
Date:
Subject: Re: Preliminary GSSAPI Patches