Re: Controlling Load Distributed Checkpoints - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Controlling Load Distributed Checkpoints
Msg-id Pine.GSO.4.64.0706071602360.4005@westnet.com
In response to Re: Controlling Load Distributed Checkpoints  (Gregory Stark <stark@enterprisedb.com>)
Responses Re: Controlling Load Distributed Checkpoints
List pgsql-hackers

On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just 
making stuff up to entertain myself and sow confusion.

I observed some situations where delaying checkpoints too long ends up 
slowing down both transaction rate and response time, using earlier 
variants of the LDC patch and code with similar principles I wrote.  I'm 
trying to keep the approach used here out of the worst of the corner cases 
I ran into, or at least to make it possible for people in those situations to 
have some ability to tune out of the bad spots.  I am unfortunately not 
free to disclose all those test results, and since that project is over I 
can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the 
average case.  My concern here was that Heikki would end up optimizing in 
a direction where a really wide spread across the active checkpoint 
interval was strongly preferred.  I wanted to offer some suggestions on 
the type of situation where that might not be true, but where a different 
tuning of LDC would still be an improvement over the current behavior. 
There are some tuning knobs there that I don't want to see go away until 
there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high 
and low end. I collided with the issue in a situation of "maxing out your 
i/o bandwidth, couldn't buy a faster controller" at one point, which is 
what kicked off my working in this area.  It turned out there were still 
some software tunables left that pulled the worst case down to the 2-5 
second range instead.  With more checkpoint_segments to decrease the 
frequency, that was just enough to make the problem annoying rather than 
crippling.  But after that, I could easily imagine a different application 
scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database, 
one that merely changes instead of going away completely if you throw more 
hardware at it.  I'm perversely glad to hear this is torturing more people 
than just me as it improves the odds the situation will improve.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

