Re: Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches

From: Tom Lane
Subject: Re: Load Distributed Checkpoints, take 3
Msg-id: 9651.1182471272@sss.pgh.pa.us
In response to: Re: Load Distributed Checkpoints, take 3
    (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses:
    Re: Load Distributed Checkpoints, take 3
    Re: Load Distributed Checkpoints, take 3
List: pgsql-patches
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> So the question is, why in the heck would anyone want the behavior that
>> "checkpoints take exactly X time"??

> Because it's easier to tune. You don't need to know how much checkpoint
> I/O you can tolerate. The system will use just enough I/O bandwidth to
> meet the deadline, but not more than that.

Uh, not really.  After consuming more caffeine and reading the patch
more carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

        writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.

(BTW, the patch seems a bit schizoid about whether checkpoint_rate is
int or float.)

2. checkpoint_smoothing is used thusly:

        /* scale progress according to CheckPointSmoothing */
        progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This is
then compared against estimates of the total fraction of the time-
between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap.

(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)

What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if
behind; that's not "smoothing" to me, that's introducing some pretty
bursty behavior.
ISTM that actual "smoothing" would involve adjusting writes_per_nap up
or down according to whether we are ahead or behind schedule, so as to
have a finer degree of control over the I/O rate.  (I'd also consider
saving the last writes_per_nap value across checkpoints so as to have
a more nearly accurate starting value next time.)

In any case I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter.  Something along the
lines of "checkpoint_completion_target" would convey more about what
it does, I think.

And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default
before writes_per_nap moves off its minimum of 1.  This says to me
that the system's tested behavior has been so insensitive to
checkpoint_rate that we probably need not expose such a parameter at
all; just hardwire the minimum writes_per_nap at 1.

			regards, tom lane