Re: Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Load Distributed Checkpoints, take 3
Date
Msg-id 467C3CFD.1070100@enterprisedb.com
In response to Re: Load Distributed Checkpoints, take 3  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-patches
Greg Smith wrote:
>> True, you'd have to replay 1.5 checkpoint intervals on average instead
>> of 0.5 (more or less, assuming checkpoints had been short).  I don't
>> think we're in the business of optimizing crash recovery time though.
>
> If you're not, I think you should be.  Keeping that replay interval time
> down was one of the reasons why the people I was working with were
> displeased with the implications of the very spread out style of some
> LDC tunings.  They were already unhappy with the implied recovery time
> of how high they had to set checkpoint_segments for good performance,
> and making it that much bigger aggravates the issue.  Given a knob where
> the LDC can be spread out a bit but not across the entire interval, that
> makes it easier to control how much expansion there is relative to the
> current behavior.

I agree on that one: we *should* optimize crash recovery time. It may
not be the most important thing on earth, but it's a significant
consideration for some systems.
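
To put rough numbers on the replay-distance point quoted above, here's a
back-of-the-envelope sketch of my own (not from any benchmark; the
"0.5 + spread" model is just the approximation being described, and all
names are made up):

/*
 * Rough model: with an instantaneous checkpoint, a crash lands on average
 * halfway through an interval, so we replay ~0.5 intervals of WAL.  If the
 * checkpoint is spread across a fraction "spread" of the interval, the redo
 * pointer lags by up to that much extra, so the average replay distance is
 * roughly 0.5 + spread intervals.
 */
#include <stdio.h>

static double
expected_replay_intervals(double spread)
{
    return 0.5 + spread;
}

int
main(void)
{
    double  spreads[] = {0.0, 0.5, 1.0};
    int     i;

    for (i = 0; i < 3; i++)
        printf("spread over %.0f%% of the interval -> ~%.1f intervals to replay\n",
               spreads[i] * 100, expected_replay_intervals(spreads[i]));
    return 0;
}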

However, I think shortening the checkpoint interval is a perfectly valid
solution to that. It does lead to more full page writes, but in 8.3 more
full page writes can actually make the recovery go faster, not slower,
because we no longer read in the previous contents of the page when
we restore it from a full page image. In any case, while people
sometimes complain that we have a large WAL footprint, it's not usually
a problem.
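
To illustrate what I mean (a made-up sketch, not the actual backend code;
every name below is invented, and BLCKSZ just matches the default 8 kB page
size), restoring a full page image is a plain overwrite of the buffer,
whereas redoing an ordinary record first needs a random read of the page's
old contents:

#include <string.h>

#define BLCKSZ 8192

typedef char Page[BLCKSZ];

/* Stand-in for the random read of the page's previous contents. */
static void
read_page_from_disk(Page buf)
{
    memset(buf, 0, BLCKSZ);     /* pretend we fetched the old page */
}

/* Ordinary record: read the old page, then apply the change on top. */
static void
redo_ordinary_record(Page buf, const char *rec, size_t off, size_t len)
{
    read_page_from_disk(buf);   /* extra random I/O during recovery */
    memcpy(buf + off, rec, len);
}

/* Full page image: overwrite the buffer, no read of the old page needed. */
static void
redo_full_page_image(Page buf, const Page image)
{
    memcpy(buf, image, BLCKSZ);
}

int
main(void)
{
    static Page buf, image;

    redo_full_page_image(buf, image);
    redo_ordinary_record(buf, "x", 10, 1);
    return 0;
}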

This is off-topic, but at PGCon in May, Itagaki-san and his colleagues,
whose names I can't remember, pointed out to me very clearly that our
recovery is *slow*. So slow that in the benchmarks they were running,
their warm stand-by slave couldn't keep up with the master generating
the WAL, even though both were running on the same kind of hardware.

The reason is simple: There can be tens of backends doing I/O and
generating WAL, but in recovery we serialize them. If you have decent
I/O hardware that could handle, for example, 10 concurrent random I/Os,
during recovery we'll be issuing them one at a time. That's a scalability
issue, and it doesn't show up on a laptop or a small server with a single disk.
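
Schematically, the problem looks like this (again purely illustrative C
with invented names, not the real recovery code): each record's random
page read is issued and waited on before the next record is even fetched,
so at most one of those 10 possible concurrent I/Os is ever in flight:

#include <stdio.h>

typedef struct
{
    int     id;
} FakeRecord;

static FakeRecord records[3] = {{1}, {2}, {3}};
static int  next_rec = 0;

/* Sequential read of the next WAL record (stub). */
static FakeRecord *
read_next_wal_record(void)
{
    return (next_rec < 3) ? &records[next_rec++] : NULL;
}

/* Random read of the data page the record touches (stub). */
static void
read_data_page_for(FakeRecord *rec)
{
    printf("waiting for random read for record %d\n", rec->id);
}

static void
apply_record(FakeRecord *rec)
{
    printf("applied record %d\n", rec->id);
}

int
main(void)
{
    FakeRecord *rec;

    /*
     * One record at a time: each random page read completes before the
     * next record is even looked at, no matter how many concurrent I/Os
     * the hardware could service.
     */
    while ((rec = read_next_wal_record()) != NULL)
    {
        read_data_page_for(rec);
        apply_record(rec);
    }
    return 0;
}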

That's one of the first things I'm planning to tackle when the 8.4 dev
cycle opens. And I'm planning to look at recovery times in general; I've
never even measured them before, so who knows what will come up.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
