Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Redesigning checkpoint_segments
Msg-id: 51B04B36.5000708@vmware.com
In response to: Re: Redesigning checkpoint_segments (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Redesigning checkpoint_segments
List: pgsql-hackers
On 05.06.2013 23:16, Josh Berkus wrote:
>> For limiting the time required to recover after crash,
>> checkpoint_segments is awkward because it's difficult to calculate how
>> long recovery will take, given checkpoint_segments=X. A bulk load can
>> use up segments really fast, and recovery will be fast, while segments
>> full of random deletions can need a lot of random I/O to replay, and
>> take a long time. IMO checkpoint_timeout is a much better way to control
>> that, although it's not perfect either.
>
> This is true, but I don't see that your proposal changes this at all
> (for the better or for the worse).

Right, it doesn't. I explained this to justify that it's OK to replace
checkpoint_segments with max_wal_size. If someone is trying to use
checkpoint_segments to limit the time required to recover after a crash,
he might find the current checkpoint_segments setting more intuitive than
my proposed max_wal_size. checkpoint_segments means "perform a checkpoint
every X segments", so you know that after a crash, you will have to
replay at most X segments (except that checkpoint_completion_target
complicates that already). With max_wal_size, the relationship is not as
clear.

What I tried to argue is that I don't think that's a serious concern.

>> I propose that we do something similar, but not exactly the same. Let's
>> have a setting, max_wal_size, to control the max. disk space reserved
>> for WAL. Once that's reached (or you get close enough, so that there are
>> still some segments left to consume while the checkpoint runs), a
>> checkpoint is triggered.
>
> Refinement of the proposal:
>
> 1. max_wal_size is a hard limit

I'd like to punt on that until later. Making it a hard limit would be a
much bigger patch, and it needs a lot of discussion of how it should
behave (switch to read-only mode, progressively slow down WAL writes, or
what?) and how to implement it.
But I think there's a clear evolution path here; with the current
checkpoint_segments, it's not sensible to treat it as a hard limit. Once
we have something like max_wal_size, defined in MB, it's much more
sensible. So turning it into a hard limit could be a follow-up patch, if
someone wants to step up to the plate.

> 2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
> to avoid lockup if checkpoint takes longer than expected.

Will also have to factor in checkpoint_completion_target.

>> Hmm, haven't thought about that. I think a better unit to set
>> wal_keep_segments in would also be MB, not segments.
>
> Well, the ideal unit from the user's point of view is *time*, not space.
> That is, the user wants the master to keep, say, "8 hours of
> transaction logs", not any amount of MB. I don't want to complicate
> this proposal by trying to deliver that, though.

OTOH, if you specify it in terms of time, then you don't have any limit
on the amount of disk space required.

>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles.
>
> "based on"; can you give me your algorithmic thinking here? I'm
> thinking we should have some calculation of last cycle size and peak
> cycle size so that bursty workloads aren't compromised.

Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:

> /*
>  * Track a moving average of recent buffer allocations. Here, rather than
>  * a true average we want a fast-attack, slow-decline behavior: we
>  * immediately follow any increase.
>  */
> if (smoothed_alloc <= (float) recent_alloc)
>     smoothed_alloc = recent_alloc;
> else
>     smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
>         smoothing_samples;

- Heikki