Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: Redesigning checkpoint_segments |
Date | |
Msg-id | CAHGQGwE62dvDWq+X7V1xhSrUu8FwVjBzvHf4v9X=BAbQArPpqw@mail.gmail.com |
In response to | Redesigning checkpoint_segments (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses | Re: Redesigning checkpoint_segments |
List | pgsql-hackers |
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> checkpoint_segments is awkward. From an admin's point of view, it
> controls two things:
>
> 1. it limits the amount of disk space needed for pg_xlog. (it's a soft
> limit, but still)
> 2. it limits the time required to recover after a crash.
>
> For limiting the disk space needed for pg_xlog, checkpoint_segments is
> awkward because it's defined in terms of 16MB segments between
> checkpoints. It takes a fair amount of arithmetic to calculate the disk
> space required to hold the specified number of segments. The manual
> gives the formula: (2 + checkpoint_completion_target) *
> checkpoint_segments + 1, which amounts to about 1GB per 20 segments as
> a rule of thumb. We shouldn't impose that calculation on the user. It
> should be possible to just specify "checkpoint_segments=512MB", and the
> system would initiate checkpoints so that the total size of WAL in
> pg_xlog stays below 512MB.
>
> For limiting the time required to recover after crash,
> checkpoint_segments is awkward because it's difficult to calculate how
> long recovery will take, given checkpoint_segments=X. A bulk load can
> use up segments really fast, and recovery will be fast, while segments
> full of random deletions can need a lot of random I/O to replay, and
> take a long time. IMO checkpoint_timeout is a much better way to
> control that, although it's not perfect either.
>
> A third point is that even if you have 10 GB of disk space reserved for
> WAL, you don't want to actually consume all that 10 GB, if it's not
> required to run the database smoothly. There are several reasons for
> that: backups based on a filesystem-level snapshot are larger than
> necessary, if there are a lot of preallocated WAL segments and in a
> virtualized or shared system, there might be other VMs or applications
> that could make use of the disk space. On the other hand, you don't
> want to run out of disk space while writing WAL - that can lead to a
> PANIC in the worst case.
>
> In VMware's vPostgres fork, we've hacked the way that works, so that
> there is a new setting, checkpoint_segments_max that can be set by the
> user, but checkpoint_segments is adjusted automatically, on the fly.
> The system counts how many segments were consumed during the last
> checkpoint cycle, and that becomes the checkpoint_segments setting for
> the next cycle. That means that in a system with a steady load,
> checkpoints are triggered by checkpoint_timeout, and the effective
> checkpoint_segments value converges at the exact number of segments
> needed for that. That's simple but very effective. It doesn't behave
> too well with bursty load, however; during quiet times,
> checkpoint_segments is dialed way down, and when the next burst comes
> along, you get several checkpoints in quick succession, until
> checkpoint_segments is dialed back up again.
>
> I propose that we do something similar, but not exactly the same. Let's
> have a setting, max_wal_size, to control the max. disk space reserved
> for WAL. Once that's reached (or you get close enough, so that there
> are still some segments left to consume while the checkpoint runs), a
> checkpoint is triggered.

What if max_wal_size is reached while the checkpoint is running? Should
we switch the checkpoint from spread mode to fast mode? Or, if
max_wal_size is a hard limit, should we make the allocation of new WAL
files wait until the checkpoint has finished and removed some old WAL
files?
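As a side note, the arithmetic behind the manual's formula is easy to
get wrong. The small standalone sketch below is purely illustrative
(it assumes the 16MB segment size and the 0.5 default for
checkpoint_completion_target; it is not code from any patch) and shows
both directions of the calculation: the expected pg_xlog size for a
given checkpoint_segments, and the checkpoint_segments implied by a
target size.

#include <stdio.h>

#define WAL_SEGMENT_SIZE_MB 16

static double
expected_xlog_mb(int checkpoint_segments, double completion_target)
{
    /* the manual's formula: (2 + target) * segments + 1, in 16MB segments */
    return ((2.0 + completion_target) * checkpoint_segments + 1) * WAL_SEGMENT_SIZE_MB;
}

static int
segments_for_limit(double max_wal_size_mb, double completion_target)
{
    /* invert the formula: largest checkpoint_segments that stays under the limit */
    return (int) ((max_wal_size_mb / WAL_SEGMENT_SIZE_MB - 1) / (2.0 + completion_target));
}

int
main(void)
{
    /* 0.5 is the default checkpoint_completion_target */
    printf("checkpoint_segments=3 -> ~%.0f MB of pg_xlog\n",
           expected_xlog_mb(3, 0.5));
    printf("pg_xlog under 512 MB  -> checkpoint_segments=%d\n",
           segments_for_limit(512.0, 0.5));
    return 0;
}

With those defaults it prints roughly 136MB for the default
checkpoint_segments=3 and suggests checkpoint_segments=12 to stay
under 512MB - exactly the kind of calculation a max_wal_size setting
would spare the user.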
> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high,
> without actually consuming that much space in normal operation. It's
> just a backstop, to avoid completely filling the disk, if there's a
> sudden burst of activity. The number of segments preallocated is
> auto-tuned, based on the number of segments used in previous checkpoint
> cycles.

How is wal_keep_segments handled in your approach?

> I'll write up a patch to do that, but before I do, does anyone disagree
> on those tuning principles?

No, at least not from me. I like your idea.

Regards,

--
Fujii Masao
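PS: for illustration only, the auto-tuning described upthread could
amount to bookkeeping along these lines. The names below
(CheckPointSegmentsMax, CheckPointSegmentsEstimate) are invented here;
this is neither vPostgres code nor a proposed patch.

#include <stdio.h>

static int CheckPointSegmentsMax = 64;      /* user-set cap, like checkpoint_segments_max */
static int CheckPointSegmentsEstimate = 3;  /* effective value, adjusted on the fly */

/*
 * Called once per checkpoint cycle with the number of WAL segments that
 * were consumed since the previous checkpoint.  Per the description
 * upthread, last cycle's usage becomes next cycle's setting, clamped to
 * the user-configured maximum.
 */
static void
update_checkpoint_segments(int segments_used_last_cycle)
{
    CheckPointSegmentsEstimate = segments_used_last_cycle;

    if (CheckPointSegmentsEstimate > CheckPointSegmentsMax)
        CheckPointSegmentsEstimate = CheckPointSegmentsMax;
    if (CheckPointSegmentsEstimate < 1)
        CheckPointSegmentsEstimate = 1;
}

int
main(void)
{
    /*
     * A steady load converges on its own segment count; the quiet cycles
     * (usage 2) collapse the estimate, so the following burst of 60 would
     * cause several checkpoints in quick succession until the estimate is
     * dialed back up.
     */
    int usage[] = {3, 20, 40, 40, 2, 2, 60};
    int i;

    for (i = 0; i < 7; i++)
    {
        update_checkpoint_segments(usage[i]);
        printf("cycle %d: used %d segments -> checkpoint_segments %d for next cycle\n",
               i + 1, usage[i], CheckPointSegmentsEstimate);
    }
    return 0;
}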