Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Redesigning checkpoint_segments
Date
Msg-id 558D329D.9040702@iki.fi
Whole thread Raw
In response to Re: Redesigning checkpoint_segments  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: Redesigning checkpoint_segments  (Robert Haas <robertmhaas@gmail.com>)
Re: Redesigning checkpoint_segments  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On 05/27/2015 12:26 AM, Jeff Janes wrote:
> On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>> On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> One of the points of max_wal_size and its predecessor is to limit how big
>>> pg_xlog can grow.  But running out of disk space on pg_xlog is no more
>> fun
>>> during archive recovery than it is during normal operations.  So why
>>> shouldn't max_wal_size be active during recovery?
>>
>> The following message of commit 7181530 explains why.
>>
>>      In standby mode, respect checkpoint_segments in addition to
>>      checkpoint_timeout to trigger restartpoints. We used to deliberately
>> only
>>      do time-based restartpoints, because if checkpoint_segments is small we
>>      would spend time doing restartpoints more often than really necessary.
>>      But now that restartpoints are done in bgwriter, they're not as
>>      disruptive as they used to be. Secondly, because streaming replication
>>      stores the streamed WAL files in pg_xlog, we want to clean it up more
>>      often to avoid running out of disk space when checkpoint_timeout is
>> large
>>      and checkpoint_segments small.
>>
>> Previously users were more likely to fall into this trouble (i.e., too
>> frequent
>> occurrence of restartpoints) because the default value of
>> checkpoint_segments
>> was very small, I guess. But we increased the default of max_wal_size, so
>> now
>> the risk of that trouble seems to be smaller than before, and maybe we can
>> allow max_wal_size to trigger restartpoints.
>
> I see.  The old behavior was present for the same reason we decided to split
> checkpoint_segments into max_wal_size and min_wal_size.
>
> That is, the default checkpoint_segments was small, and it had to be small
> because increasing it would cause more space to be used even when that
> extra space was not helpful.
>
> So perhaps we can consider this change a completion of the max_wal_size
> work, rather than a new feature?

Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments
made sense when we initially did that, but it has gradually become less
and less sensible after that, as we got streaming replication, and as we
started to keep all restored segments in pg_xlog even in archive recovery.

> It seems to be a trivial change to implement that, although I might be
> overlooking something subtle (pasted below, also attached)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader,
> XLogRecPtr targetPagePtr, int reqLen,
>                  * Request a restartpoint if we've replayed too much xlog
> since the
>                  * last one.
>                  */
> -               if (StandbyModeRequested && bgwriterLaunched)
> +               if (bgwriterLaunched)
>                 {
>                         if (XLogCheckpointNeeded(readSegNo))
>                         {
>
> This keeps pg_xlog at about 67% of max_wal_size during archive recovery
> (because checkpoint_completion_target is accounted for but goes unused)

Hmm. checkpoint_completion_target is used when determining progress
against checkpoint_timeout just fine, but the problem is that if you do
just the above, IsCheckpointOnSchedule() still won't consider consumed
WAL when it determines whether the restartpoint is "on time". So the
error is in the other direction: if you set max_wal_size to a small
value, and checkpoint_timeout to a large value, the restartpoint would
think that it has plenty of time to complete, and exceed max_wal_size.
We need to fix IsCheckpointOnSchedule() to also track progress against
max_wal_size during recovery.

I came up with the attached patch as a first attempt. It enables the
same logic to calculate if the checkpoint is on schedule to be used in
recovery. But there's a little problem (also explained in a comment in
the patch):

There is a large gap between a checkpoint's redo-pointer, and the
checkpoint record itself (determined by checkpoint_completion_target).
When we're not in recovery, we set the redo-pointer for the current
checkpoint first, then start flushing data, and finally write the
checkpoint record. The logic in IsCheckpointOnSchedule() calculates a)
how much WAL has been generated since the beginning of the checkpoint,
i.e its redo-pointer, and b) what fraction of shared_buffers has been
flushed to disk. But in recovery, we only start the restartpoint after
replaying the checkpoint record, so at the beginning of a restartpoint,
we're actually already behind schedule by the amount of WAL between the
redo-pointer and the record itself.

I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints,
but you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster then checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.

I think the attached is better than nothing, but I'll take a look at
that beginning-of-checkpoint idea. It might be too big a change to do at
this point, but I'd really like to fix this properly for 9.5, since
we've changed with the way checkpoints are scheduled anyway.

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Support for N synchronous standby servers - take 2
Next
From: Robert Haas
Date:
Subject: Re: Redesigning checkpoint_segments