Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From: Jeff Janes
Subject: Re: Redesigning checkpoint_segments
Date:
Msg-id: CAMkU=1wT1NXLA=Bt9L1rnpA4cT3T_GNb1rRvKN-kFDw0QCNbcA@mail.gmail.com
In response to: Re: Redesigning checkpoint_segments ("Joshua D. Drake" <jd@commandprompt.com>)
List: pgsql-hackers
On Wed, Jun 5, 2013 at 8:20 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On 06/05/2013 05:37 PM, Robert Haas wrote:
>
> Alright, perhaps I am dense. I have read both this thread and the other
> one on better handling of archive command
> (http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
> I recognize there are brighter minds than mine on this thread, but I just
> honestly don't get it.
>
>> - If it looks like we're going to exceed limit #3 before the
>> checkpoint completes, we start exerting back-pressure on writers by
>> making them wait every time they write WAL, probably in proportion to
>> the number of bytes written. We keep ratcheting up the wait until
>> we've slowed down writers enough that the checkpoint will finish within
>> limit #3. As we reach limit #3, the wait goes to infinity; only
>> read-only operations can proceed until the checkpoint finishes.
> 1. WAL writes are already fast. They are the fastest writes we have, because they are sequential.
>
> 2. We don't want them to be slow. We want data written to disk as quickly as possible without adversely affecting production. That's the point.
If the speed of archiving is the fundamental bottleneck on the system, how does that bottleneck get communicated back to the user? PANICs are a horrible way of doing it; throttling the writing of WAL (and hence the acceptance of COMMITs) seems like a reasonable alternative. Maybe the speed of archiving is not the fundamental bottleneck on your systems, but...
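A minimal sketch of the kind of proportional back-pressure Robert describes upthread (the function name, constants, and interface here are hypothetical illustrations, not actual PostgreSQL code): the per-write delay grows with how far checkpoint progress has fallen behind WAL consumption, and goes unbounded as the WAL budget is exhausted.

```c
#include <assert.h>

/*
 * Hypothetical sketch of proportional WAL back-pressure.
 * wal_used_frac:  fraction of the WAL budget ("limit #3") already consumed.
 * ckpt_done_frac: fraction of the in-progress checkpoint that has completed.
 * Returns microseconds each writer should sleep per WAL write,
 * or -1 to mean "block until the checkpoint finishes".
 */
static long
wal_throttle_delay_usec(double wal_used_frac, double ckpt_done_frac)
{
    double lag = wal_used_frac - ckpt_done_frac; /* how far behind we are */

    if (lag <= 0.0)
        return 0;               /* checkpoint is keeping up: no throttling */
    if (wal_used_frac >= 1.0)
        return -1;              /* at the limit: only read-only work proceeds */

    /* ratchet the wait up as the remaining WAL headroom shrinks */
    return (long) (lag * 10000.0 / (1.0 - wal_used_frac));
}
```

The key property is the last line: the same lag produces a longer wait as headroom shrinks, so throttling ramps smoothly instead of hitting a cliff.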
> 3. The spread checkpoints have always confused me. If anything, we want a checkpoint to be fast and short, because:
>
> 4. Bgwriter. We should be adjusting bgwriter so that it is writing everything in a manner that allows any checkpoint to be in the range of never noticed.
They do different things. One writes buffers out to make room for incoming ones. The other writes them out (and fsyncs the underlying files) to allow the redo pointer to advance (limiting soft-crash recovery time) and xlogs to be recycled (limiting disk space).
> Now perhaps my customers' workloads are different, but for us:
>
> 1. Checkpoint timeout is set as high as reasonable, usually 30 minutes to an hour. I wish I could set them even further out.
Yeah, I think the limit of 1 hr is rather nanny-ish. I know what I'm doing, and I want the freedom to go longer if that is what I want to do.
> 2. Bgwriter is set to be aggressive but not obtrusive, usually adjusted based on the actual amount of IO bandwidth it may take per second given their IO constraints. (Note: I know that wal_writer comes into play here, but I honestly don't remember where and am reading up on it to refresh my memory.)
I find bgwriter to be almost worthless, at least since the fsync-queue compaction code went in. When IO is free-flowing, the kernel accepts writes almost instantaneously, so the backends can write out dirty buffers themselves very quickly and it is not worth off-loading to a background process. When IO is constipated, it would be worth off-loading, except that in those circumstances the bgwriter cannot possibly keep up.
> 3. The biggest issue we see with checkpoint segments is not running out of space, because really... 10GB is how many checkpoint segments? It is with wal_keep_segments. If we don't want to fill up the pg_xlog directory, put the wal logs that are for keep_segments elsewhere.
Which is what archiving does. But then you have to put a lot of thought into how to clean up the archive, assuming your policy is not to keep it forever. keep_segments can be a nice compromise.
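As a back-of-envelope answer to the "10GB is how many segments?" question above (assuming the default 16 MB WAL segment size, and ignoring the history files that also live in pg_xlog):

```c
#include <assert.h>

/* How many WAL segments fit in a given pg_xlog disk budget?
 * Assumes the default 16 MB segment size (configurable only at build time). */
static int
segments_in_budget(int budget_mb, int segment_mb)
{
    return budget_mb / segment_mb;
}

/* 10 GB = 10240 MB, so a 10 GB budget holds 640 segments of 16 MB each. */
```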
> Other oddities:
>
> Yes, checkpoint_segments is awkward. We shouldn't have to set it at all. It should be gone. Basically we start with X amount, perhaps set at initdb time. That X amount changes dynamically based on the amount of data being written. In order to not suffer from recycling and creation penalties, we always keep X+N, where N is enough to keep up with new data.
>
> Along with the above, I don't see any reason for checkpoint_timeout. Because of bgwriter, we should be able to go more or less indefinitely without worrying about checkpoints (with a few exceptions, such as pg_start_backup()). Perhaps a setting that causes a checkpoint to happen based on some non-artificial threshold (unlike a timeout), such as the amount of data currently in need of a checkpoint?
Without checkpoints, how would the redo pointer ever advance?
If the system is IO-limited during recovery, then checkpoint_segments is a fairly natural way to put a limit on how long recovery from a soft crash will take. If the system is CPU-limited during recovery, then checkpoint_timeout is a fairly natural way to put a limit on how long recovery will take. It is probably possible to come up with a single merged setting that is better than both of those in almost all circumstances, but how much work would that take to get right?
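Concretely, the two existing knobs already cap recovery time from those two directions; the values below are illustrative only, not recommendations:

```
# postgresql.conf (illustrative values)
checkpoint_segments = 64      # bounds WAL volume to replay -> caps IO-limited recovery
checkpoint_timeout  = 30min   # bounds time between checkpoints -> caps CPU-limited recovery
```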
...
> Instead of a "running out of disk space PANIC" we should just write to an emergency location within PGDATA and log very loudly that the SA isn't paying attention.
If the SA isn't paying attention, who is it that we are loudly saying these things to?
If whatever caused archiving to break also caused the archiving failure emails to not be delivered, about the only way you can get louder is by refusing new requests from the end user.
> Perhaps if that area starts to get to an unhappy place, we immediately bounce into read-only mode and log even more loudly that the SA should be fired. I would think read-only mode is safer and more polite than a PANIC crash.
Isn't that effectively what throttling WAL writing is?
Cheers,
Jeff