Re: Load Distributed Checkpoints, final patch - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Load Distributed Checkpoints, final patch
Date
Msg-id 468A0B5E.9060304@enterprisedb.com
In response to Re: Load Distributed Checkpoints, final patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Load Distributed Checkpoints, final patch  (Gregory Stark <stark@enterprisedb.com>)
Re: Load Distributed Checkpoints, final patch  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Heikki Linnakangas wrote:
>>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>>> minutes there, and the graphs look very smooth. That suggests that
>>> spreading the writes over a longer time wouldn't make a difference, but
>>> smoothing the rush at the beginning of checkpoint might. I'm going to
>>> try the algorithm I posted, that uses the WAL consumption rate from
>>> previous checkpoint interval in the calculations.
>
>> One thing that concerns me is that checkpoint smoothing happening just
>> after the checkpoint is causing I/O at the same time that
>> full_page_writes is causing additional I/O.
>
> I'm tempted to just apply some sort of nonlinear correction to the
> WAL-based progress measurement.  Squaring it would be cheap but is
> probably too extreme.  Carrying over info from the previous cycle
> doesn't seem like it would help much; rather, the point is exactly
> that we *don't* want a constant write speed during the checkpoint.

While thinking about this, I made an observation on full_page_writes.
Currently, we perform a full page write whenever LSN < RedoRecPtr. If
we're clever, we can skip or defer some of the full page writes:

The rule is that on replay, we must always replay a full page image of a
page before applying any regular WAL records to it. When we begin a
checkpoint, there are two possible outcomes: we crash before the new
checkpoint finishes, and replay starts from the previous redo ptr; or we
finish the checkpoint successfully, and replay starts from the new redo
ptr (or we don't crash and don't need to recover at all).

To be able to recover from the previous redo ptr, we don't need to write
a full page image if we have already written one since the previous redo
ptr.

To be able to recover from the new redo ptr, we don't need to write a
full page image if we haven't flushed the page yet: it will be written
and fsync'd by the time the checkpoint finishes.

IOW, we can skip the full page image for any page that we have already
taken a full page image of since the previous checkpoint, and that we
haven't yet flushed during the current checkpoint.

This might reduce the overall WAL I/O a little bit, but more
importantly, it spreads the impact of taking full page images over the
checkpoint duration. That's a good thing on its own, but it also makes
it unnecessary to compensate for the full_page_writes rush in the
checkpoint smoothing.

I'm still trying to get my head around the bookkeeping required to get
that right; I think it's possible using the new BM_CHECKPOINT_NEEDED
flag and a new flag in the page header to mark pages for which we
skipped taking a full page image when they were last modified.

For 8.3, we should probably just do some simple compensation in the
checkpoint throttling code, if we want to do anything at all. But this
is something to think about in the future.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
