Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From           | Heikki Linnakangas
Subject        | Re: Redesigning checkpoint_segments
Date           |
Msg-id         | 558D329D.9040702@iki.fi
In response to | Re: Redesigning checkpoint_segments (Jeff Janes <jeff.janes@gmail.com>)
Responses      | Re: Redesigning checkpoint_segments
               | Re: Redesigning checkpoint_segments
List           | pgsql-hackers
On 05/27/2015 12:26 AM, Jeff Janes wrote:
> On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>> On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> One of the points of max_wal_size and its predecessor is to limit how
>>> big pg_xlog can grow. But running out of disk space on pg_xlog is no
>>> more fun during archive recovery than it is during normal operations.
>>> So why shouldn't max_wal_size be active during recovery?
>>
>> The following message of commit 7181530 explains why.
>>
>> In standby mode, respect checkpoint_segments in addition to
>> checkpoint_timeout to trigger restartpoints. We used to deliberately
>> only do time-based restartpoints, because if checkpoint_segments is
>> small we would spend time doing restartpoints more often than really
>> necessary. But now that restartpoints are done in bgwriter, they're
>> not as disruptive as they used to be. Secondly, because streaming
>> replication stores the streamed WAL files in pg_xlog, we want to clean
>> it up more often to avoid running out of disk space when
>> checkpoint_timeout is large and checkpoint_segments small.
>>
>> Previously users were more likely to fall into this trouble (i.e., too
>> frequent occurrence of restartpoints) because the default value of
>> checkpoint_segments was very small, I guess. But we increased the
>> default of max_wal_size, so now the risk of that trouble seems to be
>> smaller than before, and maybe we can allow max_wal_size to trigger
>> restartpoints.
>
> I see. The old behavior was present for the same reason we decided to
> split checkpoint_segments into max_wal_size and min_wal_size.
>
> That is, the default checkpoint_segments was small, and it had to be
> small because increasing it would cause more space to be used even when
> that extra space was not helpful.
>
> So perhaps we can consider this change a completion of the max_wal_size
> work, rather than a new feature?

Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments
made sense when we initially did that, but it has gradually become less and
less sensible after that, as we got streaming replication, and as we started
to keep all restored segments in pg_xlog even in archive recovery.

> It seems to be a trivial change to implement that, although I might be
> overlooking something subtle (pasted below, also attached)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
>           * Request a restartpoint if we've replayed too much xlog since the
>           * last one.
>           */
> -        if (StandbyModeRequested && bgwriterLaunched)
> +        if (bgwriterLaunched)
>          {
>              if (XLogCheckpointNeeded(readSegNo))
>              {
>
> This keeps pg_xlog at about 67% of max_wal_size during archive recovery
> (because checkpoint_completion_target is accounted for but goes unused)

Hmm. checkpoint_completion_target is used when determining progress against
checkpoint_timeout just fine, but the problem is that if you do just the
above, IsCheckpointOnSchedule() still won't consider consumed WAL when it
determines whether the restartpoint is "on time". So the error is in the
other direction: if you set max_wal_size to a small value, and
checkpoint_timeout to a large value, the restartpoint would think that it
has plenty of time to complete, and exceed max_wal_size.
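In other words, the WAL-progress check in IsCheckpointOnSchedule() (in
src/backend/postmaster/checkpointer.c) needs a recovery branch, roughly
along the lines of the sketch below. This is only a sketch, not the attached
patch: it assumes the existing checkpointer-local variables
ckpt_start_recptr and ckpt_cached_elapsed, and measures replayed WAL with
GetXLogReplayRecPtr() where the non-recovery case uses GetInsertRecPtr():

    /*
     * Fragment inside IsCheckpointOnSchedule(): compare checkpoint progress
     * against the WAL consumed since the checkpoint/restartpoint started.
     * During recovery no WAL is inserted, so measure replayed WAL instead.
     */
    if (RecoveryInProgress())
        recptr = GetXLogReplayRecPtr(NULL);   /* WAL replayed so far */
    else
        recptr = GetInsertRecPtr();           /* WAL inserted so far */

    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) /
        CheckPointSegments;

    if (progress < elapsed_xlogs)
    {
        /* more WAL consumed than the schedule allows: not on time */
        ckpt_cached_elapsed = elapsed_xlogs;
        return false;
    }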
We need to fix IsCheckpointOnSchedule() to also track progress against
max_wal_size during recovery. I came up with the attached patch as a first
attempt. It enables the same logic that calculates whether the checkpoint is
on schedule to be used in recovery. But there's a little problem (also
explained in a comment in the patch):

There is a large gap between a checkpoint's redo-pointer and the checkpoint
record itself (determined by checkpoint_completion_target). When we're not
in recovery, we set the redo-pointer for the current checkpoint first, then
start flushing data, and finally write the checkpoint record. The logic in
IsCheckpointOnSchedule() calculates a) how much WAL has been generated since
the beginning of the checkpoint, i.e. its redo-pointer, and b) what fraction
of shared_buffers has been flushed to disk. But in recovery, we only start
the restartpoint after replaying the checkpoint record, so at the beginning
of a restartpoint, we're actually already behind schedule by the amount of
WAL between the redo-pointer and the record itself.

I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints, but
you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty checkpoint
I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to recovery
that there's going to be a checkpoint record later whose redo-pointer will
point to that record. We could then start the restartpoint at that record
already, before seeing the checkpoint record itself.

I think the attached is better than nothing, but I'll take a look at that
beginning-of-checkpoint idea. It might be too big a change to make at this
point, but I'd really like to fix this properly for 9.5, since we've changed
the way checkpoints are scheduled anyway.

- Heikki
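For concreteness, the beginning-of-checkpoint idea would boil down to
something like the sketch below. This is hypothetical: XLOG_CHECKPOINT_START
is an invented info code, only the insertion side is shown, and the real
change would also need the recovery side to act on the record.

    /*
     * Hypothetical sketch: in CreateCheckPoint(), emit an otherwise empty
     * record at what will become the checkpoint's redo location, so that a
     * standby replaying it knows a checkpoint record will follow whose
     * redo-pointer points back here, and can start the restartpoint early.
     * XLOG_CHECKPOINT_START does not exist in the sources; it is invented
     * for illustration.
     */
    XLogBeginInsert();
    (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_START);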
Attachment