Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Redesigning checkpoint_segments
Msg-id 51B04B36.5000708@vmware.com
In response to Re: Redesigning checkpoint_segments  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Redesigning checkpoint_segments
List pgsql-hackers
On 05.06.2013 23:16, Josh Berkus wrote:
>> For limiting the time required to recover after crash,
>> checkpoint_segments is awkward because it's difficult to calculate how
>> long recovery will take, given checkpoint_segments=X. A bulk load can
>> use up segments really fast, and recovery will be fast, while segments
>> full of random deletions can need a lot of random I/O to replay, and
>> take a long time. IMO checkpoint_timeout is a much better way to control
>> that, although it's not perfect either.
>
> This is true, but I don't see that your proposal changes this at all
> (for the better or for the worse).

Right, it doesn't. I brought this up to justify that it's OK to replace
checkpoint_segments with max_wal_size. If someone is trying to use
checkpoint_segments to limit the time required to recover after a crash,
they might find the current checkpoint_segments setting more intuitive
than my proposed max_wal_size. checkpoint_segments means "perform a
checkpoint every X segments", so you know that after a crash, you will
have to replay at most X segments (except that
checkpoint_completion_target complicates that already). With
max_wal_size, the relationship is not as clear.

What I tried to argue is that I don't think that's a serious concern.

>> I propose that we do something similar, but not exactly the same. Let's
>> have a setting, max_wal_size, to control the max. disk space reserved
>> for WAL. Once that's reached (or you get close enough, so that there are
>> still some segments left to consume while the checkpoint runs), a
>> checkpoint is triggered.
>
> Refinement of the proposal:
>
> 1. max_wal_size is a hard limit

I'd like to punt on that until later. Making it a hard limit would be a
much bigger patch, and would need a lot of discussion about how it should
behave (switch to read-only mode, progressively slow down WAL writes, or
what?) and how to implement it.

But I think there's a clear evolution path here. With the current
checkpoint_segments, it's not sensible to treat the setting as a hard
limit. Once we have something like max_wal_size, defined in MB, it's much
more sensible. So turning it into a hard limit could be a follow-up patch,
if someone wants to step up to the plate.

> 2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
>     to avoid lockup if checkpoint takes longer than expected.

That will also have to factor in checkpoint_completion_target.
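
As a rough sketch of that arithmetic (not taken from any patch): the
existing documentation says WAL normally stays under
(2 + checkpoint_completion_target) * checkpoint_segments + 1 files, so one
way to derive a trigger point is to invert that rule against the space left
after the retention reserve. The function name, the MB-based wal_keep_mb
parameter, and the choice of this divisor instead of a flat 50% are all
assumptions made for illustration.

```c
#include <assert.h>

#define WAL_SEGMENT_MB 16       /* default WAL segment size */

/*
 * Hypothetical helper: number of segments that may be written between
 * checkpoint starts so that total WAL stays within max_wal_size_mb.
 *
 * Assumption: we invert the documented rule of thumb
 *   files <= (2 + checkpoint_completion_target) * checkpoint_segments + 1
 * after subtracting the space reserved for retention (wal_keep_mb).
 */
int
checkpoint_trigger_segments(int max_wal_size_mb, int wal_keep_mb,
                            double completion_target)
{
    assert(completion_target >= 0.0 && completion_target <= 1.0);
    assert(max_wal_size_mb >= wal_keep_mb);

    /* Segments available to checkpoint cycles after the retention reserve. */
    int budget_segments = (max_wal_size_mb - wal_keep_mb) / WAL_SEGMENT_MB;

    /* Solve the rule of thumb for checkpoint_segments, rounding down. */
    int trigger = (int) ((budget_segments - 1) / (2.0 + completion_target));

    /* Always allow at least one segment between checkpoints. */
    return trigger < 1 ? 1 : trigger;
}
```

With max_wal_size = 1024 MB, no retention reserve, and
checkpoint_completion_target = 0.5, this gives a checkpoint after every 25
segments, somewhat less than half of the 64-segment budget.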

>> Hmm, haven't thought about that. I think a better unit to set
>> wal_keep_segments in would also be MB, not segments.
>
> Well, the ideal unit from the user's point of view is *time*, not space.
>   That is, the user wants the master to keep, say, "8 hours of
> transaction logs", not any amount of MB.  I don't want to complicate
> this proposal by trying to deliver that, though.

OTOH, if you specify it in terms of time, then you don't have any limit 
on the amount of disk space required.

>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles.
>
> "based on"; can you give me your algorithmic thinking here?  I'm
> thinking we should have some calculation of last cycle size and peak
> cycle size so that bursty workloads aren't compromised.

Yeah, something like that :-). I was thinking of letting the estimate 
decrease like a moving average, but react to any increases immediately. 
Same thing we do in bgwriter to track buffer allocations:

>     /*
>      * Track a moving average of recent buffer allocations.  Here, rather than
>      * a true average we want a fast-attack, slow-decline behavior: we
>      * immediately follow any increase.
>      */
>     if (smoothed_alloc <= (float) recent_alloc)
>         smoothed_alloc = recent_alloc;
>     else
>         smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
>             smoothing_samples;
>
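
Applied to WAL, the same fast-attack, slow-decline rule might look roughly
like the sketch below. It is only an illustration of the idea, not code
from a patch: the WalUsageEstimate struct, both function names, and the
round-up when converting the estimate to a preallocation count are
assumptions.

```c
#include <assert.h>

/* Hypothetical estimator state, updated once per checkpoint cycle. */
typedef struct
{
    double smoothed_segments;   /* current estimate of segments per cycle */
    int    smoothing_samples;   /* how slowly the estimate decays */
} WalUsageEstimate;

/*
 * Feed in the number of segments consumed by the cycle that just ended.
 * Increases are followed immediately; decreases are averaged in slowly.
 */
void
wal_estimate_update(WalUsageEstimate *est, int segments_used)
{
    assert(est->smoothing_samples > 0);

    if (est->smoothed_segments <= (double) segments_used)
        est->smoothed_segments = segments_used;
    else
        est->smoothed_segments +=
            ((double) segments_used - est->smoothed_segments) /
            est->smoothing_samples;
}

/* Segments to keep preallocated: the estimate rounded up to a whole file. */
int
wal_estimate_prealloc(const WalUsageEstimate *est)
{
    int n = (int) est->smoothed_segments;

    if ((double) n < est->smoothed_segments)
        n++;
    return n;
}
```

With smoothing_samples = 4, a burst of 10 segments raises the estimate to
10 at once; two quiet cycles of 2 segments each then lower it to 8 and 6.5,
so 7 segments would stay preallocated until the next burst.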

- Heikki


