
From: Simon Riggs
Subject: Re: Load Distributed Checkpoints, take 3
Msg-id: 1182677039.9276.382.camel@silverbirch.site
In response to: Re: Load Distributed Checkpoints, take 3 (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses: Re: Load Distributed Checkpoints, take 3 (Heikki Linnakangas <heikki@enterprisedb.com>)
           Re: Load Distributed Checkpoints, take 3 (Greg Smith <gsmith@gregsmith.com>)
List: pgsql-patches
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:

> However, I think shortening the checkpoint interval is a perfectly valid
> solution to that.

Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
checkpoint_segments as the limit and then complain about unbounded
recovery time, because that was clearly a conscious choice.
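
(To put numbers on it, a rough illustration with made-up values: to bound
recovery time by elapsed time rather than by WAL volume, set
checkpoint_timeout low and checkpoint_segments high enough that segments
rarely trigger a checkpoint, e.g.

  checkpoint_timeout  = 5min   # checkpoint at least every 5 minutes
  checkpoint_segments = 100    # large enough that it seldom forces one

so recovery never has to replay much more than roughly one timeout
interval's worth of WAL.)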

> In any case, while people
> sometimes complain that we have a large WAL footprint, it's not usually
> a problem.

IMHO it's a huge problem. Turning on full_page_writes means that the
amount of WAL generated varies fairly linearly with the number of
distinct blocks touched, which means large databases become a problem.
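
To put rough numbers on that (my arithmetic, assuming the default 8K
block size): the first modification to each block after a checkpoint
emits a full page image, so a workload that dirties a million distinct
blocks per checkpoint cycle writes on the order of

  1,000,000 blocks x 8 KB/block = ~8 GB

of page images per cycle, however small the individual row changes are.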

Suzuki-san's team had results that showed this was a problem also.

> This is off-topic, but at PGCon in May, Itagaki-san and his colleagues
> whose names I can't remember, pointed out to me very clearly that our
> recovery is *slow*. So slow, that in the benchmarks they were running,
> their warm stand-by slave couldn't keep up with the master generating
> the WAL, even though both are running on the same kind of hardware.

> The reason is simple: There can be tens of backends doing I/O and
> generating WAL, but in recovery we serialize them. If you have decent
> I/O hardware that could handle for example 10 concurrent random I/Os, at
> recovery we'll be issuing them one at a time. That's a scalability
> issue, and doesn't show up on a laptop or a small server with a single disk.

The results showed that the current recovery isn't scalable beyond a
certain point, not that it is slow per se. The effect isn't noticeable
on systems generating fewer writes (of any size), or on ones where the
cache hit ratio is high. It isn't accurate to say the effect is confined
to laptops and small servers, but it would be accurate to say that
high-volume, I/O-bound OLTP is the place where the scalability of
recovery does not match the performance scalability of the master
server.

Yes, we need to make recovery more scalable by de-serializing I/O.
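
To make the idea concrete, here is a minimal standalone C sketch (not
PostgreSQL code) of one way to de-serialize the reads: scan ahead in a
list of block numbers and hand the kernel prefetch hints via
posix_fadvise(), so the random I/Os can overlap instead of being issued
strictly one at a time. The LOOKAHEAD depth and the command-line
interface are invented for the example.

#define _XOPEN_SOURCE 600           /* for posix_fadvise() and pread() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ    8192              /* PostgreSQL's default block size */
#define LOOKAHEAD 16                /* hypothetical prefetch depth */

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s datafile blockno...\n", argv[0]);
        return 1;
    }

    int   fd = open(argv[1], O_RDONLY);
    int   nblocks = argc - 2;
    char  buf[BLCKSZ];

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    for (int i = 0; i < nblocks; i++)
    {
        /*
         * Hint the next LOOKAHEAD blocks before blocking on this read,
         * so an array with several spindles can service several seeks
         * at once.  (Re-hinting already-hinted blocks is harmless.)
         */
        for (int j = i; j < i + LOOKAHEAD && j < nblocks; j++)
            posix_fadvise(fd, (off_t) atol(argv[2 + j]) * BLCKSZ,
                          BLCKSZ, POSIX_FADV_WILLNEED);

        if (pread(fd, buf, BLCKSZ, (off_t) atol(argv[2 + i]) * BLCKSZ) < 0)
        {
            perror("pread");
            return 1;
        }
        /* ... apply the WAL record for this block here ... */
    }

    close(fd);
    return 0;
}

Something along these lines at the point where the replay loop reads
each referenced block would let a single recovery process keep all 10 of
those concurrent random I/Os in flight.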

Slony also serializes changes onto the slave nodes, so the general
problem of scalability of recovery solutions needs to be tackled.

> That's one of the first things I'm planning to tackle when the 8.4 dev
> cycle opens. And I'm planning to look at recovery times in general; I've
> never even measured it before so who knows what comes up.

I'm planning on working on recovery also, as is Florian. Let's make sure
we coordinate what we do to avoid patch conflicts.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


