
From: Simon Riggs
Subject: Re: Load Distributed Checkpoints, take 3
Msg-id: 1182677039.9276.382.camel@silverbirch.site
In response to: Re: Load Distributed Checkpoints, take 3 (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses: Re: Load Distributed Checkpoints, take 3 (Heikki Linnakangas <heikki@enterprisedb.com>)
           Re: Load Distributed Checkpoints, take 3 (Greg Smith <gsmith@gregsmith.com>)
List: pgsql-patches
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:

> However, I think shortening the checkpoint interval is a perfectly valid
> solution to that.

Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
checkpoint_segments as the limit and then complain about unbounded
recovery time, because that was clearly a conscious choice.
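
(To put numbers on it, a rough illustration with made-up values: to bound
recovery time by elapsed time rather than by WAL volume, set
checkpoint_timeout low and checkpoint_segments high enough that segments
rarely trigger a checkpoint, e.g.

  checkpoint_timeout  = 5min   # checkpoint at least every 5 minutes
  checkpoint_segments = 100    # large enough that it seldom forces one

so recovery never has to replay much more than roughly one timeout
interval's worth of WAL.)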

> In any case, while people
> sometimes complain that we have a large WAL footprint, it's not usually
> a problem.

IMHO it's a huge problem. Turning on full_page_writes means that the
amount of WAL generated varies fairly linearly with the number of
distinct blocks touched, which means large databases become a problem.
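
To put rough numbers on that (my arithmetic, assuming the default 8K
block size): the first modification to each block after a checkpoint
emits a full page image, so a workload that dirties a million distinct
blocks per checkpoint cycle writes on the order of

  1,000,000 blocks x 8 KB/block = ~8 GB

of page images per cycle, however small the individual row changes are.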

Suzuki-san's team had results that showed this was a problem also.

> This is off-topic, but at PGCon in May, Itagaki-san and his colleagues
> whose names I can't remember, pointed out to me very clearly that our
> recovery is *slow*. So slow, that in the benchmarks they were running,
> their warm stand-by slave couldn't keep up with the master generating
> the WAL, even though both are running on the same kind of hardware.

> The reason is simple: There can be tens of backends doing I/O and
> generating WAL, but in recovery we serialize them. If you have decent
> I/O hardware that could handle for example 10 concurrent random I/Os, at
> recovery we'll be issuing them one at a time. That's a scalability
> issue, and doesn't show up on a laptop or a small server with a single disk.

The results showed that the current recovery isn't scalable beyond a
certain point, not that it is slow per se. The effect isn't noticeable
on systems generating fewer writes (of any size), or on ones where the
cache hit ratio is high. It isn't accurate to say the effect is confined
to laptops and small servers, but it would be accurate to say that
high-volume, I/O-bound OLTP is the place where the scalability of
recovery does not match the performance scalability of the master
server.

Yes, we need to make recovery more scalable by de-serializing I/O.
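
To make the idea concrete, here is a minimal standalone C sketch (not
PostgreSQL code) of one way to de-serialize the reads: scan ahead in a
list of block numbers and hand the kernel prefetch hints via
posix_fadvise(), so the random I/Os can overlap instead of being issued
strictly one at a time. The LOOKAHEAD depth and the command-line
interface are invented for the example.

#define _XOPEN_SOURCE 600           /* for posix_fadvise() and pread() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ    8192              /* PostgreSQL's default block size */
#define LOOKAHEAD 16                /* hypothetical prefetch depth */

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s datafile blockno...\n", argv[0]);
        return 1;
    }

    int   fd = open(argv[1], O_RDONLY);
    int   nblocks = argc - 2;
    char  buf[BLCKSZ];

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    for (int i = 0; i < nblocks; i++)
    {
        /*
         * Hint the next LOOKAHEAD blocks before blocking on this read,
         * so an array with several spindles can service several seeks
         * at once.  (Re-hinting already-hinted blocks is harmless.)
         */
        for (int j = i; j < i + LOOKAHEAD && j < nblocks; j++)
            posix_fadvise(fd, (off_t) atol(argv[2 + j]) * BLCKSZ,
                          BLCKSZ, POSIX_FADV_WILLNEED);

        if (pread(fd, buf, BLCKSZ, (off_t) atol(argv[2 + i]) * BLCKSZ) < 0)
        {
            perror("pread");
            return 1;
        }
        /* ... apply the WAL record for this block here ... */
    }

    close(fd);
    return 0;
}

Something along these lines at the point where the replay loop reads
each referenced block would let a single recovery process keep all 10 of
those concurrent random I/Os in flight.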

Slony also serializes changes onto the slave nodes, so the general
problem of scalability of recovery solutions needs to be tackled.

> That's one of the first things I'm planning to tackle when the 8.4 dev
> cycle opens. And I'm planning to look at recovery times in general; I've
> never even measured it before so who knows what comes up.

I'm planning on working on recovery also, as is Florian. Let's make sure
we coordinate what we do to avoid patch conflicts.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


