Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Joshua D. Drake
Subject Re: Redesigning checkpoint_segments
Msg-id 51B04B82.3010400@commandprompt.com
In response to Re: Redesigning checkpoint_segments  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Redesigning checkpoint_segments
List pgsql-hackers
On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
>
> (I'm sure you know this, but:) If you perform a checkpoint as fast and 
> short as possible, the sudden burst of writes and fsyncs will 
> overwhelm the I/O subsystem, and slow down queries. That's what we saw 
> before spread checkpoints: when a checkpoint happens, the response 
> times of queries jumped up.

That isn't quite right. Previously we had lock issues as well, and 
checkpoints would take considerable time to complete. What I am talking 
about is the background writer (and WAL writer where applicable) having 
done all the work before a checkpoint is even called. Consider that 
every one of my active clients sets checkpoint_completion_target to 0.9. 
With a proper bgwriter config this works.

>
>> 4. Bgwriter. We should be adjusting bgwriter so that it is writing
>> everything in a manner that allows any checkpoint to be in the range of
>> never noticed.
>
> Oh, I see where you're going.

O.k. good. I am not nuts :D

> Yeah, that would be one way to do it. However, spread checkpoints have 
> pretty much the same effect. Imagine that you tune your system like 
> this: disable bgwriter altogether, and set 
> checkpoint_completion_target=0.9. With that, there will be a 
> checkpoint in progress most of the time, because by the time one 
> checkpoint completes, it's almost time to begin the next one already. 
> In that case, the checkpointer will be slowly performing the writes, 
> all the time, in the background, without affecting queries. The effect 
> is the same as what you described above, except that it's the 
> checkpointer doing the writing, not bgwriter.

O.k. if that is true, then we have redundant systems and we need to 
remove one of them.
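
To make the spread-checkpoint pacing described in the quoted text concrete, here is a rough sketch. It is an illustration only, not PostgreSQL's checkpointer code: the function name `spread_write_rate` and its arguments are made up for this example, and it assumes checkpoints are driven purely by checkpoint_timeout.

```python
def spread_write_rate(dirty_buffers, checkpoint_timeout_s, completion_target):
    """Buffers per second needed to write `dirty_buffers` within
    completion_target * checkpoint_timeout_s seconds.

    Hypothetical helper for illustration; assumes time-driven checkpoints.
    """
    if not 0.0 < completion_target <= 1.0:
        raise ValueError("completion_target must be in (0, 1]")
    if checkpoint_timeout_s <= 0:
        raise ValueError("checkpoint_timeout_s must be positive")
    # The writes are spread over this fraction of the checkpoint interval.
    write_window_s = checkpoint_timeout_s * completion_target
    return dirty_buffers / write_window_s


# Example: 27,000 dirty buffers, a 5 minute timeout, target 0.9.
rate = spread_write_rate(27_000, 300, 0.9)
print(f"{rate:.1f} buffers/s spread over {300 * 0.9:.0f} s")
```

With a target of 0.9 the write window covers most of the interval, which is why a checkpoint is in progress nearly all the time in that configuration.
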
> Yeah, wal_keep_segments is a hack. We should replace it with something 
> else, like having a registry of standbys in the master, and how far 
> they've streamed. That way the master could keep around the amount of 
> WAL actually needed by them, no more, no less. But that's a different 
> story.
>
>> Other oddities:
>>
>> Yes, checkpoint_segments is awkward. We shouldn't have to set it at all.
>> It should be gone.
>
> The point of having checkpoint_segments or max_wal_size is to put a 
> limit (albeit a soft one) on the amount of disk space used. If you 
> don't care about that, I guess we could allow max_wal_size=-1 to mean 
> infinite, and checkpoints would be driven off purely based on time, 
> not WAL consumption.
>

I would not only agree with that, I would argue that max_wal_size 
doesn't need to be there at all, at least not by default. Perhaps as an 
"advanced" configuration option that only those in the know see.
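
A minimal sketch of the trigger logic under discussion, assuming the proposed semantics where max_wal_size=-1 means "no WAL limit" and checkpoints are then driven by time alone. The function `checkpoint_due` and its parameter names are hypothetical, and the sketch ignores the "get close enough" margin mentioned in the proposal.

```python
def checkpoint_due(wal_bytes_since_ckpt, secs_since_ckpt,
                   checkpoint_timeout_s, max_wal_size_bytes=-1):
    """Return why a checkpoint should start ("time" or "wal"), or None.

    max_wal_size_bytes == -1 means no WAL-based trigger (assumed semantics
    of the proposal in this thread, not an existing setting).
    """
    if secs_since_ckpt >= checkpoint_timeout_s:
        return "time"
    if max_wal_size_bytes != -1 and wal_bytes_since_ckpt >= max_wal_size_bytes:
        return "wal"
    return None


# With no WAL limit, even a large WAL volume does not trigger a checkpoint.
print(checkpoint_due(10**12, 10, 300))          # no limit configured
print(checkpoint_due(10**12, 10, 300, 10**9))   # soft limit exceeded
```

Returning the reason rather than a bare boolean also makes it easy to log exactly why a checkpoint started.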


>> Basically we start with X amount perhaps to be set at
>> initdb time. That X amount changes dynamically based on the amount of
>> data being written. In order to not suffer from recycling and creation
>> penalties we always keep X+N where N is enough to keep up with new data.
>
> To clarify, here you're referring to controlling the number of WAL 
> segments preallocated/recycled, rather than how often checkpoints are 
> triggered. Currently, both are derived from checkpoint_segments, but I 
> proposed to separate them. The above is exactly what I proposed to do 
> for the preallocation/recycling, it would be tuned automatically, but 
> you still need something like max_wal_size for the other thing, to 
> trigger a checkpoint if too much WAL is being consumed.

You think so? I agree with 90% of this paragraph, but it seems to me that 
we can find an algorithm that manages this without the idea of 
max_wal_size (at least as a user-settable option).
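
One way the "X+N" auto-tuning could look, purely as a sketch: the thread does not specify an algorithm, so the decaying average, the `headroom` value standing in for N, and the name `next_prealloc_segments` are all assumptions made for illustration.

```python
import math


def next_prealloc_segments(estimate, used_last_cycle, headroom=2, decay=0.9):
    """Update the running estimate X of segments used per checkpoint cycle
    and return (new_estimate, segments_to_keep), where segments_to_keep
    is X rounded up plus N (`headroom`).

    Assumed policy: jump up immediately on a burst, decay slowly afterwards.
    """
    if used_last_cycle > estimate:
        new_estimate = float(used_last_cycle)
    else:
        new_estimate = decay * estimate + (1 - decay) * used_last_cycle
    return new_estimate, math.ceil(new_estimate) + headroom


# A burst raises the estimate at once; quieter cycles lower it gradually.
estimate = 10.0
for used in (20, 10, 10):
    estimate, keep = next_prealloc_segments(estimate, used)
    print(f"used={used} estimate={estimate:.2f} keep={keep}")
```

Reacting quickly to growth and slowly to shrinkage is a deliberate bias here: creating segments under load is the penalty being avoided, while keeping a few extra segments around is cheap.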

>> Along with the above, I don't see any reason for checkpoint_timeout.
>> Because of bgwriter we should be able to rather indefinitely not worry
>> about checkpoints (with a few exceptions such as pg_start_backup()).
>> Perhaps a setting that causes a checkpoint to happen based on some
>> non-artificial threshold (timeout) such as amount of data currently in
>> need of a checkpoint?
>
> Either I'm not understanding what you said, or you're confused. The 
> point of checkpoint_timeout is to put a limit on the time it will take 
> to recover in case of a crash. The relation between the two, 
> checkpoint_timeout and how long it will take to recover after a crash, 
> is not straightforward, but that's the best we have.

I may be confused, but it is my understanding that bgwriter writes out 
dirty data from the shared buffer cache based on an interval and a 
maximum number of pages written. If we are writing data continuously, do 
we need checkpoints at all, except for special cases (like pg_start_backup())?
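
As a back-of-the-envelope illustration of "an interval and a max pages written": the sketch below computes the ceiling those two knobs put on bgwriter output. The defaults used (bgwriter_delay of 200 ms, bgwriter_lru_maxpages of 100, 8 kB blocks) are the stock values as I understand them, and `bgwriter_max_write_rate` is a made-up helper name. It gives an upper bound only and says nothing about which buffers bgwriter actually picks.

```python
def bgwriter_max_write_rate(bgwriter_delay_ms=200, bgwriter_lru_maxpages=100,
                            block_size_bytes=8192):
    """Upper bound, in bytes per second, on what bgwriter can write:
    at most `bgwriter_lru_maxpages` pages per round, one round every
    `bgwriter_delay_ms` milliseconds.
    """
    if bgwriter_delay_ms <= 0:
        raise ValueError("bgwriter_delay_ms must be positive")
    rounds_per_s = 1000 / bgwriter_delay_ms
    return rounds_per_s * bgwriter_lru_maxpages * block_size_bytes


rate = bgwriter_max_write_rate()
print(f"{rate / 1_000_000:.2f} MB/s at most")
```

If the workload dirties buffers faster than this ceiling, bgwriter alone cannot keep up, whatever the checkpoint settings are.
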
>
> Bgwriter does not worry about checkpoints. By "amount of data 
> currently in need of a checkpoint", do you mean the number of dirty 
> buffers in shared_buffers, or something else? I don't see how or why 
> that should affect when you perform a checkpoint.
>
>> Heikki said, "I propose that we do something similar, but not exactly
>> the same. Let's have a setting, max_wal_size, to control the max. disk
>> space reserved for WAL. Once that's reached (or you get close enough, so
>> that there are still some segments left to consume while the checkpoint
>> runs), a checkpoint is triggered.
>>
>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles. "
>>
>> This makes sense except I don't see a need for the parameter. Why not
>> just specify how the algorithm works and adhere to that without the need
>> for another GUC?
>
> Because you want to limit the amount of disk space used for WAL. It's 
> a soft limit, but still.
>

Why? This is the point that confuses me. Why do we care? We don't care 
how much disk space PGDATA takes... why do we all of a sudden care about 
pg_xlog?


>> Perhaps at any given point we save 10% of available
>> space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
>> and LOG EXACTLY WHY.
>
> Ah, but we don't know how much disk space is available. Even if we 
> did, there might be quotas or other constraints on the amount that we 
> can actually use. Or the DBA might not want PostgreSQL to use up all 
> the space, because there are other processes on the same system that 
> need it.
>

We could, however, find out how much disk space is available.
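
A sketch of what "knowing" could mean in practice, using Python's standard `shutil.disk_usage`. The 10% figure and the 16MB segment size come from the discussion above; the helper names are hypothetical. The number reported by the OS does not account for quotas or for other processes that need the space, which is the objection raised in the quoted text.

```python
import shutil

SEGMENT_BYTES = 16 * 1024 * 1024  # 16MB WAL segment, as discussed above


def wal_budget_segments(free_bytes, percent=10):
    """Whole 16MB segments that fit in `percent` of the free space.

    Integer arithmetic throughout, so the result is exact.
    """
    return (free_bytes * percent // 100) // SEGMENT_BYTES


def free_bytes_at(path):
    """Free bytes on the filesystem holding `path`, as reported by the OS."""
    return shutil.disk_usage(path).free


# The current directory stands in for the WAL directory in this example.
print(wal_budget_segments(free_bytes_at(".")), "segments")
```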

Sincerely,

JD

> - Heikki
>



