Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Redesigning checkpoint_segments
Date
Msg-id 51AF9C85.5080700@agliodbs.com
Whole thread Raw
In response to Redesigning checkpoint_segments  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Redesigning checkpoint_segments
List pgsql-hackers
Heikki,

> We shouldn't impose that calculation on the user. It
> should be possible to just specify "checkpoint_segments=512MB", and the
> system would initiate checkpoints so that the total size of WAL in
> pg_xlog stays below 512MB.

Agreed.

> For limiting the time required to recover after crash,
> checkpoint_segments is awkward because it's difficult to calculate how
> long recovery will take, given checkpoint_segments=X. A bulk load can
> use up segments really fast, and recovery will be fast, while segments
> full of random deletions can need a lot of random I/O to replay, and
> take a long time. IMO checkpoint_timeout is a much better way to control
> that, although it's not perfect either.

This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).

> A third point is that even if you have 10 GB of disk space reserved for
> WAL, you don't want to actually consume all that 10 GB, if it's not
> required to run the database smoothly.

Agreed.

> I propose that we do something similar, but not exactly the same. Let's
> have a setting, max_wal_size, to control the max. disk space reserved
> for WAL. Once that's reached (or you get close enough, so that there are
> still some segments left to consume while the checkpoint runs), a
> checkpoint is triggered.

Refinement of the proposal:

1. max_wal_size is a hard limit
2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )  to avoid lockup if checkpoint takes longer than
expected.
3. wal_keep_segments is taken out of max_wal_size.a. it automatically defaults to 20% of max_wal_size if
max_wal_senders> 0b. for that reason, we don't allow it to be larger   than 80% of max_wal_size
 
4. preallocated WAL isn't allowed to shrink smaller than
wal_keep_segements + (max_wal_size * 0.1).

This would mean that I could set my server to:

max_wal_size = 2GB

and ...

* by default, 26 segments (416MB) would be kept for wal_keep_segments.
* checkpoint target would be 77 segments (1.2GB)
* preallocated WAL will always be at least 39 segments (624MB),
including keep_segments.

now, if I had a fairly low transaction database, but wanted to make sure
I could recover from an 8-hour break in replication, I might bump up
wal_keep_segments to 1GB.  In that case:

* 64 segments (1GB) would be kept.
* checkpoints would target 96 segments (1.5GB)
* preallocated WAL would always be at least 77 segments (1.2GB)

> Hmm, haven't thought about that. I think a better unit to set
> wal_keep_segments in would also be MB, not segments.

Well, the ideal unit from the user's point of view is *time*, not space.That is, the user wants the master to keep,
say,"8 hours of
 
transaction logs", not any amount of MB.  I don't want to complicate
this proposal by trying to deliver that, though.

> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high,
> without actually consuming that much space in normal operation. It's
> just a backstop, to avoid completely filling the disk, if there's a
> sudden burst of activity. The number of segments preallocated is
> auto-tuned, based on the number of segments used in previous checkpoint
> cycles.

"based on"; can you give me your algorithmic thinking here?  I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



pgsql-hackers by date:

Previous
From: Giovanni Mascellani
Date:
Subject: About large objects asynchronous and non-blocking support
Next
From: Robert Haas
Date:
Subject: Re: MVCC catalog access