Re: Load distributed checkpoint V4 - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Load distributed checkpoint V4
Msg-id 46274C3B.2020804@iki.fi
In response to Load distributed checkpoint V4  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List pgsql-patches
ITAGAKI Takahiro wrote:
> Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

> - Progress of checkpoint is controlled not only based on checkpoint_timeout
>   but also checkpoint_segments. -- Now it works better with large
>   checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the
calculations. We might want to call GetCheckpointProgress something
else, though. It doesn't return the amount of progress made, but rather
the amount of progress we should've made by that point; if we've made
less than that, we're in danger of not completing the checkpoint in time.
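
Just to make sure I'm reading it right, here's the calculation I have in
mind, as a standalone sketch (the function and variable names are mine,
not the patch's):

#include <stdio.h>

/*
 * Fraction of the checkpoint interval we should already have completed,
 * taking whichever of the time budget or the WAL segment budget is
 * being used up faster.  Illustrative only.
 */
double
required_progress(double elapsed_secs, double checkpoint_timeout_secs,
                  double segs_consumed, double checkpoint_segments)
{
    double  time_progress = elapsed_secs / checkpoint_timeout_secs;
    double  seg_progress  = segs_consumed / checkpoint_segments;

    return (time_progress > seg_progress) ? time_progress : seg_progress;
}

int
main(void)
{
    /* 60s into a 300s timeout, but 2 of 3 segments already consumed:
     * segment consumption dominates, so we should be ~67% done. */
    printf("%.2f\n", required_progress(60, 300, 2, 3));
    return 0;
}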

> We can control the delay of checkpoints using three parameters:
> checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
> If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase,
however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed,
where enough is defined by checkpoint_nap_percent. However, if we're
already past checkpoint_write_percent at the beginning of the nap, I
think we should clamp the nap time so that we don't run out of time
before the next checkpoint because of sleeping.
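
Something like this is what I mean by clamping; a standalone sketch in
which all the names are mine and progress values are fractions of the
checkpoint interval:

/*
 * How much longer (as a fraction of the interval) the nap may last.
 * If the write phase already overran its slot, the nap is skipped so
 * that sleeping can't push us past the next checkpoint.
 */
double
clamped_nap_fraction(double progress_now, double write_pct, double nap_pct)
{
    double  nap_end = write_pct + nap_pct;      /* end of the nap's slot */

    if (progress_now >= nap_end)
        return 0.0;                 /* write phase overran: skip the nap */

    return nap_end - progress_now;  /* sleep only for what's left of it */
}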

In the sync phase, we sleep between each fsync until enough
time/segments have passed, assuming that the time to fsync is
proportional to the file length. I'm not sure that's a very good
assumption. We might have one huge file with only very little changed
data, for example a logging table that is just occasionally appended to.
If we begin by fsyncing that, it'll take a very short time to finish,
and we'll then sleep for a long time. If we then have another large file
to fsync, but that one has all pages dirty, we risk running out of time
because of the unnecessarily long sleep. The segmentation of relations
limits the risk of that, though, by limiting the max. file size, and I
don't really have any better suggestions.
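
To spell out my reading of the current behavior: the sync budget is in
effect split among the files in proportion to their size, roughly like
this (again, the names are mine, not the patch's):

/*
 * Sleep to take after fsyncing one file: the file's share of the sync
 * budget, based purely on its length, regardless of how much of the
 * file was actually dirty.
 */
double
sleep_after_fsync(double file_bytes, double total_bytes,
                  double sync_budget_secs)
{
    return sync_budget_secs * (file_bytes / total_bytes);
}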

In the write phase, bgwriter_all_maxpages is also factored into the
sleeps. On each iteration, we write bgwriter_all_maxpages pages and then
we sleep for bgwriter_delay msecs. checkpoint_write_percent only
controls the maximum amount of time we try to spend in the write phase:
we skip the sleeps if we're exceeding checkpoint_write_percent, but the
phase can very well finish earlier. IOW, bgwriter_all_maxpages is the
*minimum* number of pages to write between sleeps. If it's not set, we
use WRITES_PER_ABSORB, which is hardcoded to 1000.
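
Here's a toy, self-contained simulation of that loop as I read it (none
of this is the patch's code, and the numbers are made up):

#include <stdio.h>

int
main(void)
{
    int     dirty_pages = 5000;         /* pages to flush this checkpoint */
    int     pages_per_round = 1000;     /* bgwriter_all_maxpages, or the
                                         * WRITES_PER_ABSORB fallback */
    double  delay_secs = 0.2;           /* bgwriter_delay = 200ms */
    double  write_budget_secs = 1.0;    /* the checkpoint_write_percent slice */
    double  elapsed = 0.0;

    while (dirty_pages > 0)
    {
        int written = (dirty_pages < pages_per_round) ? dirty_pages
                                                      : pages_per_round;

        dirty_pages -= written;
        printf("wrote %4d pages, %4d left, elapsed %.1fs\n",
               written, dirty_pages, elapsed);

        if (dirty_pages == 0)
            break;                      /* finished early: no more sleeps */

        if (elapsed < write_budget_secs)
            elapsed += delay_secs;      /* "sleep" for bgwriter_delay */
        /* else: past the write budget, skip the sleep entirely */
    }
    return 0;
}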

The approach of writing min. N pages per iteration seems sound to me. By
setting N we can control the maximum impact of a checkpoint under normal
circumstances. If there's very little work to do, it doesn't make sense
to stretch the write of say 10 buffers across a 15 min period; it's
indeed better to finish the checkpoint earlier. It's similar to
vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it
doesn't feel right, we should at least name it differently. The default
of 1000 is a bit high as well, with the default bgwriter_delay that adds
up to 39MB/s. That's ok for a decent I/O subsystem, but the default
really should be something that will still leave room for other I/O on a
small single-disk server.
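
(That 39MB/s assumes 8kB pages and the default bgwriter_delay of 200ms:
1000 pages/round * 5 rounds/s * 8kB/page = 40000kB/s, or about 39MB/s.)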

Should we try doing something similar for the sync phase? If there are
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they
exceed 100%?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
