Re: Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches
From | Heikki Linnakangas |
---|---|
Subject | Re: Load Distributed Checkpoints, take 3 |
Date | |
Msg-id | 467B85ED.4040207@enterprisedb.com Whole thread Raw |
In response to | Re: Load Distributed Checkpoints, take 3 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Load Distributed Checkpoints, take 3
|
List | pgsql-patches |
Tom Lane wrote: > 1. checkpoint_rate is used thusly: > > writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); > > where writes_per_nap is the max number of dirty blocks to write before > taking a bgwriter nap. Now surely this is completely backward: if > BgWriterDelay is increased, the number of writes to allow per nap > decreases? If you think checkpoint_rate is expressed in some kind of > physical bytes/sec unit, that cannot be right; the number of blocks > per nap has to increase if the naps get longer. Uh, you're right, that's just plain wrong. checkpoint_rate is in pages/s, so that line should be writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000) > (BTW, the patch seems > a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. > 2. checkpoint_smoothing is used thusly: > > /* scale progress according to CheckPointSmoothing */ > progress *= CheckPointSmoothing; > > where the progress value being scaled is the fraction so far completed > of the total number of dirty pages we expect to have to write. This > is then compared against estimates of the total fraction of the time- > between-checkpoints that has elapsed; if less, we are behind schedule > and should not nap, if more, we are ahead of schedule and may nap. > (This is a bit odd, but I guess it's all right because it's equivalent > to dividing the elapsed-time estimate by CheckPointSmoothing, which > seems a more natural way of thinking about what needs to happen.) Yeah, it's a bit unnatural. I did it that way so we don't have to divide all three of the estimates: cached_elapsed, progress_in_time and progress_in_xlog. Maybe it's not worth micro-optimizing that. > What's bugging me about this is that we are either going to be writing > at checkpoint_rate if ahead of schedule, or max possible rate if behind; > that's not "smoothing" to me, that's introducing some pretty bursty > behavior. ISTM that actual "smoothing" would involve adjusting > writes_per_nap up or down according to whether we are ahead or behind > schedule, so as to have a finer degree of control over the I/O rate. > (I'd also consider saving the last writes_per_nap value across > checkpoints so as to have a more nearly accurate starting value next > time.) That sounds a lot more complex. The burstiness at that level shouldn't matter much. The OS is still going to cache the writes, and should even out the bursts. Assuming time/xlogs elapse at a steady rate, we will write some multiple of writes_per_nap pages between each sleep. With a small writes_per_nap, writing just writes_per_nap isn't enough to catch up after we fall behind, so we'll write more than that between each sleep. That means that on each iteration, we'll write either N*writes_per_nap, or (N+1)*writes_per_nap. At worst, that means either writes_per_nap or 2*writes_per_nap pages on each iteration. That's not too bad, I think. > In any case I still concur with Takahiro-san that "smoothing" doesn't > seem the most appropriate name for the parameter. Something along the > lines of "checkpoint_completion_target" would convey more about what it > does, I think. Ok, I'm not wedded to smoothing. > And checkpoint_rate really needs to be named checkpoint_min_rate, if > it's going to be a minimum. However, I question whether we need it at > all, because as the code stands, with the default BgWriterDelay you > would have to increase checkpoint_rate to 4x its proposed default before > writes_per_nap moves off its minimum of 1. Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. So I think that's fine. (looking at the patch, I see that the default in guc.c is actually 100 pages / s, contrary to documentation; needs to be fixed) > This says to me that the > system's tested behavior has been so insensitive to checkpoint_rate > that we probably need not expose such a parameter at all; just hardwire > the minimum writes_per_nap at 1. I've set checkpoint_rate to a small value in my tests on purpose to control the feature with the other parameter. That must be why I haven't noticed the bogus calculation of it before. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
pgsql-patches by date: