Redesigning checkpoint_segments - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Redesigning checkpoint_segments
Msg-id: 51AF2C00.5020704@vmware.com
List: pgsql-hackers
checkpoint_segments is awkward. From an admin's point of view, it controls two things:

1. it limits the amount of disk space needed for pg_xlog (it's a soft limit, but still)
2. it limits the time required to recover after a crash

For limiting the disk space needed for pg_xlog, checkpoint_segments is awkward because it's defined in terms of 16MB segments between checkpoints. It takes a fair amount of arithmetic to calculate the disk space required to hold the specified number of segments. The manual gives the formula: (2 + checkpoint_completion_target) * checkpoint_segments + 1, which amounts to about 1GB per 20 segments as a rule of thumb. We shouldn't impose that calculation on the user. It should be possible to just specify "checkpoint_segments=512MB", and the system would initiate checkpoints so that the total size of WAL in pg_xlog stays below 512MB.

For limiting the time required to recover after a crash, checkpoint_segments is awkward because it's difficult to calculate how long recovery will take, given checkpoint_segments=X. A bulk load can use up segments really fast, and recovery will be fast, while segments full of random deletions can need a lot of random I/O to replay, and take a long time. IMO checkpoint_timeout is a much better way to control that, although it's not perfect either.

A third point is that even if you have 10 GB of disk space reserved for WAL, you don't want to actually consume all that 10 GB, if it's not required to run the database smoothly. There are several reasons for that:

* backups based on a filesystem-level snapshot are larger than necessary, if there are a lot of preallocated WAL segments
* in a virtualized or shared system, there might be other VMs or applications that could make use of the disk space

On the other hand, you don't want to run out of disk space while writing WAL - that can lead to a PANIC in the worst case.
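The arithmetic the manual asks of the user can be made concrete with a small sketch. This is just the formula quoted above turned into a function; the function name and the default checkpoint_completion_target of 0.5 are illustrative assumptions, not anything from the PostgreSQL source:

```python
SEGMENT_SIZE_MB = 16  # a WAL segment is 16MB

def pg_xlog_space_mb(checkpoint_segments, checkpoint_completion_target=0.5):
    """Approximate peak pg_xlog disk usage, per the manual's formula:
    (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments."""
    segments = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    return segments * SEGMENT_SIZE_MB

# With the default checkpoint_completion_target of 0.5, checkpoint_segments=20
# needs about (2.5 * 20 + 1) * 16 MB = 816 MB - roughly the "1GB per 20
# segments" rule of thumb.
print(pg_xlog_space_mb(20))  # 816.0
```

The point of the proposal is that nobody should have to run this calculation backwards to find the checkpoint_segments value that fits a disk budget.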
In VMware's vPostgres fork, we've hacked the way that works, so that there is a new setting, checkpoint_segments_max, that can be set by the user, but checkpoint_segments is adjusted automatically, on the fly. The system counts how many segments were consumed during the last checkpoint cycle, and that becomes the checkpoint_segments setting for the next cycle. That means that in a system with a steady load, checkpoints are triggered by checkpoint_timeout, and the effective checkpoint_segments value converges at the exact number of segments needed for that. That's simple but very effective. It doesn't behave too well with a bursty load, however; during quiet times, checkpoint_segments is dialed way down, and when the next burst comes along, you get several checkpoints in quick succession, until checkpoint_segments is dialed back up again.

I propose that we do something similar, but not exactly the same. Let's have a setting, max_wal_size, to control the maximum disk space reserved for WAL. Once that's reached (or you get close enough, so that there are still some segments left to consume while the checkpoint runs), a checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled separately from max_wal_size, so that you can set max_wal_size high without actually consuming that much space in normal operation. It's just a backstop, to avoid completely filling the disk if there's a sudden burst of activity. The number of segments preallocated is auto-tuned, based on the number of segments used in previous checkpoint cycles.

I'll write up a patch to do that, but before I do, does anyone disagree with those tuning principles? How do you typically tune checkpoint_segments on your servers? If the system were to tune it automatically, what formula should it use?

- Heikki