Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From | Joshua D. Drake |
---|---|
Subject | Re: Redesigning checkpoint_segments |
Date | |
Msg-id | 51AFFFF6.6090902@commandprompt.com |
In response to | Re: Redesigning checkpoint_segments (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Redesigning checkpoint_segments
Re: Redesigning checkpoint_segments |
List | pgsql-hackers |
On 06/05/2013 05:37 PM, Robert Haas wrote:

> - If it looks like we're going to exceed limit #3 before the
> checkpoint completes, we start exerting back-pressure on writers by
> making them wait every time they write WAL, probably in proportion to
> the number of bytes written. We keep ratcheting up the wait until
> we've slowed down writers enough that will finish within limit #3. As
> we reach limit #3, the wait goes to infinity; only read-only
> operations can proceed until the checkpoint finishes.

Alright, perhaps I am dense. I have read both this thread and the other one on better handling of archive command (http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com). I recognize there are brighter minds than mine on this thread but I just honestly don't get it.

1. WAL writes are already fast. They are the fastest write we have because it is sequential.

2. We don't want them to be slow. We want data written to disk as quickly as possible without adversely affecting production. That's the point.

3. The spread checkpoints have always confused me. If anything we want a checkpoint to be fast and short because:

4. Bgwriter. We should be adjusting bgwriter so that it is writing everything in a manner that allows any checkpoint to be in the range of never noticed.

Now perhaps my customers' workloads are different but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes to an hour. I wish I could set them even further out.

2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting based on an actual amount of IO bandwidth it may take per second based on their IO constraints. (Note I know that wal_writer comes into play here but I honestly don't remember where and am reading up on it to refresh my memory).

3. The biggest issue we see with checkpoint segments is not running out of space because really.... 10GB is how many checkpoint segments? It is with wal_keep_segments.
If we don't want to fill up the pg_xlog directory, put the wal logs that are for keep_segments elsewhere.

Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all. It should be gone. Basically we start with X amount perhaps to be set at initdb time. That X amount changes dynamically based on the amount of data being written. In order to not suffer from recycling and creation penalties we always keep X+N where N is enough to keep up with new data.

Along with the above, I don't see any reason for checkpoint_timeout. Because of bgwriter we should be able to rather indefinitely not worry about checkpoints (with a few exceptions such as pg_start_backup()). Perhaps a setting that causes a checkpoint to happen based on some non-artificial threshold (timeout) such as amount of data currently in need of a checkpoint?

Heikki said,

"I propose that we do something similar, but not exactly the same. Let's have a setting, max_wal_size, to control the max. disk space reserved for WAL. Once that's reached (or you get close enough, so that there are still some segments left to consume while the checkpoint runs), a checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled separately from max_wal_size, so that you can set max_wal_size high, without actually consuming that much space in normal operation. It's just a backstop, to avoid completely filling the disk, if there's a sudden burst of activity. The number of segments preallocated is auto-tuned, based on the number of segments used in previous checkpoint cycles."

This makes sense except I don't see a need for the parameter. Why not just specify how the algorithm works and adhere to that without the need for another GUC? Perhaps at any given point we save 10% of available space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint and LOG EXACTLY WHY.
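
To put rough numbers on that last idea, here is a small Python sketch. It is only an illustration of the arithmetic, not code from any patch: the function names, the integer 10% default, and the choice to pass "available space" in as a plain byte count are all assumptions made for the example, and 16MB is the default WAL segment size.

```python
from typing import Optional

SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (16MB)


def wal_reserve_bytes(available_bytes: int, reserve_percent: int = 10) -> int:
    """Reserve a percentage of the available space, rounded down to whole segments.

    Hypothetical helper: what counts as "available space" is left to the caller.
    """
    raw = available_bytes * reserve_percent // 100
    return (raw // SEGMENT_BYTES) * SEGMENT_BYTES


def checkpoint_reason(wal_bytes_in_use: int, available_bytes: int) -> Optional[str]:
    """Return a log line saying exactly why a checkpoint starts, or None if not needed."""
    reserve = wal_reserve_bytes(available_bytes)
    if wal_bytes_in_use < reserve:
        return None
    return (
        "checkpoint starting: pg_xlog holds "
        f"{wal_bytes_in_use // SEGMENT_BYTES} segments, "
        f"reserve is {reserve // SEGMENT_BYTES} segments "
        f"(10% of {available_bytes} bytes, rounded down to 16MB)"
    )


if __name__ == "__main__":
    hundred_gb = 100 * 1024**3
    print(wal_reserve_bytes(hundred_gb) // SEGMENT_BYTES)  # segments in the reserve
    print(checkpoint_reason(640 * SEGMENT_BYTES, hundred_gb))
```

With 100GB available, the reserve works out to 10GB, which is 640 segments of 16MB. That also answers the earlier question of how many segments 10GB is.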
Instead of "running out of disk space PANIC" we should just write to an emergency location within PGDATA and log very loudly that the SA isn't paying attention. Perhaps if that area starts to get to an unhappy place we immediately bounce into read-only mode and log even more loudly that the SA should be fired. I would think read-only mode is safer and more polite than a PANIC crash.

I do not think we should worry about filling up the hard disk except to protect against data loss in the event. It is not user unfriendly to assume that a user will pay attention to disk space. Really?

Open to people telling me I am off in left field. Sorry if it is noise.

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats