Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Redesigning checkpoint_segments
Date
Msg-id CAHGQGwE62dvDWq+X7V1xhSrUu8FwVjBzvHf4v9X=BAbQArPpqw@mail.gmail.com
In response to Redesigning checkpoint_segments  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Redesigning checkpoint_segments
List pgsql-hackers
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> checkpoint_segments is awkward. From an admin's point of view, it controls
> two things:
>
> 1. it limits the amount of disk space needed for pg_xlog. (it's a soft
> limit, but still)
> 2. it limits the time required to recover after a crash.
>
> For limiting the disk space needed for pg_xlog, checkpoint_segments is
> awkward because it's defined in terms of 16MB segments between checkpoints.
> It takes a fair amount of arithmetic to calculate the disk space required to
> hold the specified number of segments. The manual gives the formula: (2 +
> checkpoint_completion_target) * checkpoint_segments + 1, which amounts to
> about 1GB per 20 segments as a rule of thumb. We shouldn't impose that
> calculation on the user. It should be possible to just specify
> "checkpoint_segments=512MB", and the system would initiate checkpoints so
> that the total size of WAL in pg_xlog stays below 512MB.
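
To spell out that arithmetic (16MB segments, and assuming the default
checkpoint_completion_target = 0.5):

    (2 + 0.5) * 20 + 1 = 51 segments
    51 * 16MB          = 816MB

so with a higher completion target it indeed approaches the 1GB-per-20-segments
rule of thumb.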
>
> For limiting the time required to recover after crash, checkpoint_segments
> is awkward because it's difficult to calculate how long recovery will take,
> given checkpoint_segments=X. A bulk load can use up segments really fast,
> and recovery will be fast, while segments full of random deletions can need
> a lot of random I/O to replay, and take a long time. IMO checkpoint_timeout
> is a much better way to control that, although it's not perfect either.
>
> A third point is that even if you have 10 GB of disk space reserved for WAL,
> you don't want to actually consume all that 10 GB, if it's not required to
> run the database smoothly. There are several reasons for that: backups based
> on a filesystem-level snapshot are larger than necessary, if there are a lot
> of preallocated WAL segments, and in a virtualized or shared system, there
> might be other VMs or applications that could make use of the disk space. On
> the other hand, you don't want to run out of disk space while writing WAL -
> that can lead to a PANIC in the worst case.
>
>
> In VMware's vPostgres fork, we've hacked the way that works, so that there
> is a new setting, checkpoint_segments_max that can be set by the user, but
> checkpoint_segments is adjusted automatically, on the fly. The system counts
> how many segments were consumed during the last checkpoint cycle, and that
> becomes the checkpoint_segments setting for the next cycle. That means that
> in a system with a steady load, checkpoints are triggered by
> checkpoint_timeout, and the effective checkpoint_segments value converges at
> the exact number of segments needed for that. That's simple but very
> effective. It doesn't behave too well with bursty load, however; during
> quiet times, checkpoint_segments is dialed way down, and when the next burst
> comes along, you get several checkpoints in quick succession, until
> checkpoint_segments is dialed back up again.
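
A minimal sketch of that feedback loop, as I understand it (the names below are
made up for illustration, not actual vPostgres code):

    /*
     * Illustrative sketch only: after each checkpoint, the effective
     * checkpoint_segments for the next cycle is set to however many
     * segments the cycle that just ended consumed, capped by the
     * user-visible checkpoint_segments_max.
     */
    static int EffectiveCheckpointSegments = 3;

    static void
    AdjustCheckpointSegments(int segs_used_last_cycle, int checkpoint_segments_max)
    {
        int     target = segs_used_last_cycle;

        if (target > checkpoint_segments_max)
            target = checkpoint_segments_max;
        if (target < 1)
            target = 1;

        EffectiveCheckpointSegments = target;
    }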
>
>
> I propose that we do something similar, but not exactly the same. Let's have
> a setting, max_wal_size, to control the max. disk space reserved for WAL.
> Once that's reached (or you get close enough, so that there are still some
> segments left to consume while the checkpoint runs), a checkpoint is
> triggered.

What happens if max_wal_size is reached while the checkpoint is running? Should
we change the checkpoint from spread mode to fast mode? Or, if max_wal_size is
a hard limit, should the allocation of a new WAL file wait until the checkpoint
has finished and removed some old WAL files?
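
To make the question concrete, here is roughly the kind of size-based trigger I
imagine; every name below is hypothetical, not taken from your patch:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical sketch: request a checkpoint once the WAL written since
     * the last checkpoint's redo pointer leaves only enough headroom under
     * max_wal_size to absorb the WAL expected while the checkpoint itself
     * runs.  The open question above is what to do if even that headroom
     * turns out to be too small mid-checkpoint.
     */
    static bool
    CheckpointNeededBySize(uint64_t bytes_since_redo,
                           uint64_t max_wal_size_bytes,
                           uint64_t headroom_for_running_checkpoint)
    {
        return bytes_since_redo >=
               max_wal_size_bytes - headroom_for_running_checkpoint;
    }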

> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high, without
> actually consuming that much space in normal operation. It's just a
> backstop, to avoid completely filling the disk, if there's a sudden burst of
> activity. The number of segments preallocated is auto-tuned, based on the
> number of segments used in previous checkpoint cycles.
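
Just to illustrate one possible shape of that auto-tuning (a plain moving
average over recent cycles; I am not assuming this is exactly what you have in
mind):

    /*
     * Illustrative sketch only: keep roughly as many segments preallocated
     * as an average cycle has recently consumed, so a single quiet cycle
     * does not immediately dial the preallocation down to nothing.
     */
    #define PREALLOC_HISTORY 5

    static int
    PreallocTarget(const int segs_used[PREALLOC_HISTORY])
    {
        int     sum = 0;

        for (int i = 0; i < PREALLOC_HISTORY; i++)
            sum += segs_used[i];

        return (sum + PREALLOC_HISTORY - 1) / PREALLOC_HISTORY;    /* round up */
    }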

How is wal_keep_segments handled in your approach?
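
(For reference, with wal_keep_segments > 0 the current documented bound is
roughly checkpoint_segments + wal_keep_segments + 1 segment files, so
presumably max_wal_size would have to cover that as well.)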

> I'll write up a patch to do that, but before I do, does anyone disagree on
> those tuning principles?

No disagreement, at least from me. I like your idea.

Regards,

-- 
Fujii Masao


