Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Joshua D. Drake
Subject Re: Redesigning checkpoint_segments
Date
Msg-id 51AFFFF6.6090902@commandprompt.com
In response to Re: Redesigning checkpoint_segments  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 06/05/2013 05:37 PM, Robert Haas wrote:

> - If it looks like we're going to exceed limit #3 before the
> checkpoint completes, we start exerting back-pressure on writers by
> making them wait every time they write WAL, probably in proportion to
> the number of bytes written.  We keep ratcheting up the wait until
> we've slowed down writers enough that will finish within limit #3.  As
> we reach limit #3, the wait goes to infinity; only read-only
> operations can proceed until the checkpoint finishes.
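For reference, the back-pressure idea quoted above could be sketched roughly as follows. This is only an illustration: the function name wal_write_delay, the base_delay_per_byte constant, and the particular scaling formula are assumptions, not anything specified in the thread. The only properties taken from the quote are that the wait is proportional to bytes written, grows as limit #3 approaches, and becomes infinite once the limit is reached.

```python
import math

def wal_write_delay(bytes_written, wal_used, projected_wal_at_finish, limit3,
                    base_delay_per_byte=1e-9):
    """Seconds a writer waits after writing `bytes_written` bytes of WAL.

    wal_used: WAL volume consumed so far (same unit as limit3).
    projected_wal_at_finish: estimated WAL volume when the checkpoint ends.
    limit3: the hard cap ("limit #3" in the quoted proposal), must be > 0.
    All names and the scaling rule are hypothetical.
    """
    if wal_used >= limit3:
        # At the cap: writers block until the checkpoint finishes.
        return math.inf
    if projected_wal_at_finish <= limit3:
        # The checkpoint is expected to finish in time: no throttling.
        return 0.0
    headroom = (limit3 - wal_used) / limit3        # shrinks toward 0 near the cap
    overshoot = projected_wal_at_finish / limit3   # > 1 when we expect to exceed it
    # Proportional to bytes written, ratcheting up as headroom disappears.
    return bytes_written * base_delay_per_byte * overshoot / headroom
```

Dividing by the remaining headroom is just one way to make the wait grow without bound as the limit approaches; any monotonic function with that property would fit the description.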

Alright, perhaps I am dense. I have read both this thread and the other 
one on better handling of archive command 
(http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com). 
I recognize there are brighter minds than mine on this thread but I just 
honestly don't get it.

1. WAL writes are already fast. They are the fastest writes we have 
because they are sequential.

2. We don't want them to be slow. We want data written to disk as 
quickly as possible without adversely affecting production. That's the 
point.

3. The spread checkpoints have always confused me. If anything we want a 
checkpoint to be fast and short, which brings me to:

4. Bgwriter. We should be adjusting bgwriter so that it writes 
everything in a manner that lets any checkpoint go essentially 
unnoticed.

Now perhaps my customers' workloads are different but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes 
to an hour. I wish I could set them even further out.

2. Bgwriter is set to be aggressive but not obtrusive. We usually adjust 
it based on the amount of IO bandwidth it may take per second, given 
their IO constraints. (Note I know that wal_writer comes into play 
here but I honestly don't remember where and am reading up on it to 
refresh my memory).

3. The biggest issue we see with checkpoint segments is not running out 
of space because really.... 10GB is how many checkpoint segments? The 
issue is with wal_keep_segments. If we don't want to fill up the pg_xlog 
directory, put the WAL files that are kept for wal_keep_segments 
elsewhere.
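
To put a number on the 10GB question: with the default 16MB segment size the arithmetic works out as below. The file-count estimate follows the formula given in the WAL configuration documentation for releases that have checkpoint_segments; the example settings (checkpoint_segments = 64, wal_keep_segments = 512) are made up purely for illustration.

```python
SEGMENT_MB = 16  # default WAL segment size

def segments_in(gigabytes):
    """Number of 16MB segments that fit in the given number of GB."""
    return gigabytes * 1024 // SEGMENT_MB

def max_pg_xlog_files(checkpoint_segments, completion_target, wal_keep_segments):
    """Normal upper bound on files in pg_xlog, per the documented formula:
    the larger of (2 + checkpoint_completion_target) * checkpoint_segments + 1
    and checkpoint_segments + wal_keep_segments + 1."""
    by_checkpoints = (2 + completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return int(max(by_checkpoints, by_keep))

print(segments_in(10))                     # 640 segments in 10GB
print(max_pg_xlog_files(64, 0.5, 0))       # 161 files without wal_keep_segments
print(max_pg_xlog_files(64, 0.5, 512))     # 577 files once wal_keep_segments dominates
```

With these illustrative numbers, wal_keep_segments rather than checkpoint_segments decides how much of pg_xlog is occupied.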

Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all. 
It should be gone. Basically we start with some amount X, perhaps set at 
initdb time. X then changes dynamically based on the amount of data 
being written. To avoid recycling and creation penalties we always keep 
X+N, where N is enough to keep up with new data.
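
A minimal sketch of the X+N idea, assuming X tracks the largest number of segments consumed in recent checkpoint cycles and N is a fixed fraction of headroom on top of it. The function name, the "largest recent cycle" rule, the initial value of 3, and the 25% headroom are all hypothetical choices for illustration.

```python
import math

def segments_to_keep(recent_usage, initial_x=3, headroom=0.25):
    """Return X + N, the number of WAL segments to keep around.

    recent_usage: segments consumed in each recent checkpoint cycle.
    initial_x: starting X (e.g. chosen at initdb time) before any history exists.
    headroom: fraction of X added as N so new data never waits on file creation.
    """
    x = max(recent_usage) if recent_usage else initial_x
    n = max(1, math.ceil(x * headroom))  # always keep at least one spare segment
    return x + n

print(segments_to_keep([]))            # 4: no history yet, X=3 plus one spare
print(segments_to_keep([10, 40, 20]))  # 50: X=40, N=10
```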

Along with the above, I don't see any reason for checkpoint_timeout. 
Because of bgwriter we should be able to go more or less indefinitely 
without worrying about checkpoints (with a few exceptions such as 
pg_start_backup()). Perhaps a setting that causes a checkpoint to happen 
based on some non-artificial threshold (unlike a timeout), such as the 
amount of data currently in need of a checkpoint?
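
As a sketch of what a volume-based trigger might look like in place of a timer: the function below is hypothetical, and the names pending_bytes, threshold_bytes, and forced are assumptions introduced here, with forced standing in for explicit requests such as pg_start_backup().

```python
def checkpoint_due(pending_bytes, threshold_bytes, forced=False):
    """Decide whether to start a checkpoint based on outstanding work.

    pending_bytes: amount of data currently in need of a checkpoint.
    threshold_bytes: the volume at which a checkpoint becomes worthwhile.
    forced: True for explicit requests (e.g. a base backup starting).
    No clock is consulted; an idle system never checkpoints on its own.
    """
    return forced or pending_bytes >= threshold_bytes
```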

Heikki said, "I propose that we do something similar, but not exactly 
the same. Let's have a setting, max_wal_size, to control the max. disk 
space reserved for WAL. Once that's reached (or you get close enough, so 
that there are still some segments left to consume while the checkpoint 
runs), a checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled 
separately from max_wal_size, so that you can set max_wal_size high, 
without actually consuming that much space in normal operation. It's 
just a backstop, to avoid completely filling the disk, if there's a 
sudden burst of activity. The number of segments preallocated is 
auto-tuned, based on the number of segments used in previous checkpoint 
cycles."
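
The two halves of the quoted proposal can be sketched separately: a trigger point somewhat below max_wal_size, and a preallocation count tuned from earlier cycles. Everything beyond the name max_wal_size is an assumption here, including the function names, the use of an exponentially weighted average, and the alpha value; the quote only says the count is auto-tuned from previous checkpoint cycles.

```python
SEGMENT_MB = 16  # default WAL segment size

def checkpoint_trigger_mb(max_wal_size_mb, expected_during_checkpoint_mb):
    """WAL volume at which to start a checkpoint so that the segments
    consumed while it runs still fit under max_wal_size."""
    return max(SEGMENT_MB, max_wal_size_mb - expected_during_checkpoint_mb)

def tune_preallocated(previous_estimate, segments_used_last_cycle,
                      max_wal_size_mb, alpha=0.5):
    """Smooth per-cycle usage into a preallocation count, capped by max_wal_size."""
    estimate = alpha * segments_used_last_cycle + (1 - alpha) * previous_estimate
    cap = max_wal_size_mb // SEGMENT_MB
    return min(estimate, cap)

print(checkpoint_trigger_mb(10240, 2048))   # 8192: start 2GB short of a 10GB cap
print(tune_preallocated(20, 40, 10240))     # 30.0: halfway between old and new usage
```

The point of keeping the two functions apart is the one made in the quote: the cap can be set high without the preallocated count following it.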

This makes sense except I don't see a need for the parameter. Why not 
just specify how the algorithm works and adhere to that without the need 
for another GUC? Perhaps at any given point we reserve 10% of available 
space (calculated in 16MB segments) for pg_xlog; when you hit it, we 
checkpoint and LOG EXACTLY WHY.
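
A rough sketch of that fixed rule, assuming "within a 16MB calculation" means rounding the reserve down to whole 16MB segments. The function names and the wording of the log message are hypothetical; a real implementation would get available_bytes from the filesystem rather than from a parameter.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # one 16MB WAL segment

def wal_reserve_bytes(available_bytes, reserve_percent=10):
    """Reserve for pg_xlog: a percentage of available space,
    rounded down to a whole number of 16MB segments."""
    segments = (available_bytes * reserve_percent // 100) // SEGMENT_BYTES
    return segments * SEGMENT_BYTES

def check_reserve(pg_xlog_bytes, available_bytes):
    """Return a log line explaining the forced checkpoint, or None if not needed."""
    reserve = wal_reserve_bytes(available_bytes)
    if pg_xlog_bytes >= reserve:
        return (f"LOG: checkpoint forced: pg_xlog uses {pg_xlog_bytes} bytes, "
                f"reserve is {reserve} bytes (10% of {available_bytes} available)")
    return None
```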

Instead of "running out of disk space PANIC" we should just write to an 
emergency location within PGDATA and log very loudly that the SA isn't 
paying attention. Perhaps if that area starts to get to an unhappy place 
we immediately bounce into read-only mode and log even more loudly that 
the SA should be fired. I would think read-only mode is safer and more 
polite than a PANIC crash.
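
The fallback order described above (normal, then emergency area, then read-only) could be modelled as a small state function. This is a sketch only: the state names, the parameters, and the idea of a low-water mark for the emergency area are assumptions made for illustration.

```python
from enum import Enum

class WalSpaceState(Enum):
    NORMAL = "normal"        # pg_xlog still has room
    EMERGENCY = "emergency"  # WAL goes to an emergency area inside PGDATA
    READ_ONLY = "read-only"  # writes refused instead of a PANIC

def next_state(pg_xlog_free, emergency_free, emergency_low_water):
    """Pick the operating state from free space in bytes.

    pg_xlog_free: bytes free where pg_xlog lives.
    emergency_free: bytes free in the emergency area.
    emergency_low_water: below this the emergency area counts as "unhappy".
    """
    if pg_xlog_free > 0:
        return WalSpaceState.NORMAL
    if emergency_free > emergency_low_water:
        return WalSpaceState.EMERGENCY   # log very loudly here
    return WalSpaceState.READ_ONLY       # log even more loudly here
```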

I do not think we should worry about filling up the hard disk except to 
protect against data loss in the event that it happens. It is not user 
unfriendly to assume that a user will pay attention to disk space. 
Really?

Open to people telling me I am off in left field. Sorry if it is noise.

Sincerely,

JD



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats


