Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From:           Joshua D. Drake
Subject:        Re: Redesigning checkpoint_segments
Date:
Msg-id:         51B04B82.3010400@commandprompt.com
In response to: Re: Redesigning checkpoint_segments (Heikki Linnakangas <hlinnakangas@vmware.com>)
List:           pgsql-hackers
On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:

> (I'm sure you know this, but:) If you perform a checkpoint as fast and
> short as possible, the sudden burst of writes and fsyncs will
> overwhelm the I/O subsystem, and slow down queries. That's what we saw
> before spread checkpoints: when a checkpoint happened, the response
> times of queries jumped up.

That isn't quite right. Previously we had lock issues as well, and
checkpoints would take considerable time to complete. What I am talking
about is the background writer (and WAL writer, where applicable) having
done all the work before a checkpoint is even called. Consider that
every one of my active clients sets checkpoint_completion_target to 0.9.
With a proper bgwriter config, this works.

>> 4. Bgwriter. We should be adjusting bgwriter so that it is writing
>> everything in a manner that allows any checkpoint to be in the range
>> of never noticed.
>
> Oh, I see where you're going.

O.k., good. I am not nuts. :D

> Yeah, that would be one way to do it. However, spread checkpoints have
> pretty much the same effect. Imagine that you tune your system like
> this: disable bgwriter altogether, and set
> checkpoint_completion_target = 0.9. With that, there will be a
> checkpoint in progress most of the time, because by the time one
> checkpoint completes, it's almost time to begin the next one already.
> In that case, the checkpointer will be slowly performing the writes,
> all the time, in the background, without affecting queries. The effect
> is the same as what you described above, except that it's the
> checkpointer doing the writing, not bgwriter.

O.k., if that is true, then we have redundant systems and we need to
remove one of them.

> Yeah, wal_keep_segments is a hack. We should replace it with something
> else, like having a registry of standbys in the master, and how far
> they've streamed. That way the master could keep around the amount of
> WAL actually needed by them, not more, not less.
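[As a sketch of the tuning pattern discussed above, one might write a
postgresql.conf fragment like the following. The values are illustrative
only, not recommendations; the defaults noted in the comments are the
9.x-era defaults.]

```
# postgresql.conf sketch: spread checkpoints plus an aggressive bgwriter,
# so that few dirty buffers remain by the time a checkpoint runs.
checkpoint_completion_target = 0.9   # spread each checkpoint over ~90%
                                     # of the checkpoint interval
checkpoint_segments = 32             # the pre-9.5 GUC under discussion
bgwriter_delay = 100ms               # wake bgwriter more often (default 200ms)
bgwriter_lru_maxpages = 400          # buffers written per round (default 100)
bgwriter_lru_multiplier = 4.0        # scale writes to demand (default 2.0)
```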
> But that's a different story.
>
>> Other oddities:
>>
>> Yes, checkpoint_segments is awkward. We shouldn't have to set it at
>> all. It should be gone.
>
> The point of having checkpoint_segments or max_wal_size is to put a
> limit (albeit a soft one) on the amount of disk space used. If you
> don't care about that, I guess we could allow max_wal_size = -1 to
> mean infinite, and checkpoints would be driven purely by time, not WAL
> consumption.

I would not only agree with that, I would argue that max_wal_size
doesn't need to be there, at least not as a default. Perhaps it could be
an "advanced" configuration option that only those in the know see.

>> Basically we start with X amount, perhaps set at initdb time. That X
>> amount changes dynamically based on the amount of data being written.
>> In order not to suffer from recycling and creation penalties, we
>> always keep X+N, where N is enough to keep up with new data.
>
> To clarify, here you're referring to controlling the number of WAL
> segments preallocated/recycled, rather than how often checkpoints are
> triggered. Currently, both are derived from checkpoint_segments, but I
> proposed to separate them. The above is exactly what I proposed to do
> for the preallocation/recycling; it would be tuned automatically, but
> you still need something like max_wal_size for the other thing, to
> trigger a checkpoint if too much WAL is being consumed.

You think so? I agree with 90% of this paragraph, but it seems to me
that we can find an algorithm that manages this without the idea of
max_wal_size (at least as a user-settable parameter).

>> Along with the above, I don't see any reason for checkpoint_timeout.
>> Because of bgwriter, we should be able to more or less indefinitely
>> not worry about checkpoints (with a few exceptions, such as
>> pg_start_backup()). Perhaps a setting that causes a checkpoint to
>> happen based on some non-artificial threshold (timeout), such as the
>> amount of data currently in need of a checkpoint?
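[For reference, the soft disk-space limit implied by the old
checkpoint_segments GUC can be estimated from the formula in the 9.x
documentation: pg_xlog normally holds at most
(2 + checkpoint_completion_target) * checkpoint_segments + 1 segment
files of 16 MB each. A quick illustrative calculation; the function and
variable names here are mine, not anything in PostgreSQL:]

```python
# Approximate upper bound on pg_xlog disk usage under the pre-9.5
# checkpoint_segments scheme, per the formula in the 9.x documentation.

SEGMENT_SIZE_MB = 16  # default WAL segment size

def max_wal_mb(checkpoint_segments, checkpoint_completion_target):
    """Rough ceiling on steady-state WAL disk usage, in megabytes."""
    files = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    return files * SEGMENT_SIZE_MB

# With the settings discussed in this thread:
print(max_wal_mb(32, 0.9))  # roughly 1.5 GB of pg_xlog
```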
> Either I'm not understanding what you said, or you're confused. The
> point of checkpoint_timeout is to put a limit on the time it will take
> to recover in case of a crash. The relation between the two,
> checkpoint_timeout and how long it will take to recover after a crash,
> is not straightforward, but that's the best we have.

I may be confused, but it is my understanding that bgwriter writes out
dirty data from the shared buffer cache based on an interval and a
maximum number of pages written. If we are writing data continuously,
don't we only need checkpoints for special cases (like
pg_start_backup())?

> Bgwriter does not worry about checkpoints. By "amount of data
> currently in need of a checkpoint", do you mean the number of dirty
> buffers in shared_buffers, or something else? I don't see how or why
> that should affect when you perform a checkpoint.

>> Heikki said: "I propose that we do something similar, but not exactly
>> the same. Let's have a setting, max_wal_size, to control the max disk
>> space reserved for WAL. Once that's reached (or you get close enough,
>> so that there are still some segments left to consume while the
>> checkpoint runs), a checkpoint is triggered.
>>
>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous
>> checkpoint cycles."
>>
>> This makes sense, except I don't see a need for the parameter. Why
>> not just specify how the algorithm works and adhere to that, without
>> the need for another GUC?
>
> Because you want to limit the amount of disk space used for WAL. It's
> a soft limit, but still.

Why? This is the point that confuses me. Why do we care?
We don't care how much disk space PGDATA takes... why do we all of a
sudden care about pg_xlog?

>> Perhaps at any given point we reserve 10% of available space (within
>> a 16MB calculation) for pg_xlog; when you hit it, we checkpoint and
>> LOG EXACTLY WHY.
>
> Ah, but we don't know how much disk space is available. Even if we
> did, there might be quotas or other constraints on the amount that we
> can actually use. Or the DBA might not want PostgreSQL to use up all
> the space, because there are other processes on the same system that
> need it.

We could, however, know how much disk space is available.

Sincerely,

JD

> - Heikki