Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Spread checkpoint sync
Msg-id 4D31BDB9.9010602@2ndquadrant.com
In response to Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas wrote:
> I'll believe it when I see it.  How about this:
>
> a 1
> a 2
> sync a
> b 1
> b 2
> sync b
> c 1
> c 2
> sync c
>
> Or maybe some variant, where we become willing to fsync a file a
> certain number of seconds after writing the last block, or when all
> the writes are done, whichever comes first.

That's going to give worse performance than the current code in some 
cases.  The goal of what's in there now is that you get a sequence like 
this:

a1
b1
a2
[Filesystem writes a1]
b2
[Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so 
large that it becomes lazier about writing things.  The fundamental 
idea--push writes out some time before the sync, in hopes the filesystem 
will get to them before the sync arrives--is not unsound.  On some systems, 
doing the sync more aggressively than that will be a regression.  This 
approach just breaks down in some cases, and those cases are happening 
more now because their likelihood scales with total RAM.  I don't want 
to screw the people with smaller systems, who may be getting 
considerable benefit from the existing sequence.  Today's little 
systems--which are very similar to the high-end ones the spread 
checkpoint stuff was developed on during 8.3--do get some benefit from 
it as far as I know.
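
A toy model can make the two orderings concrete.  The sketch below is 
not PostgreSQL code, and it makes an assumption that is not in the 
discussion above: the filesystem flushes a dirty block by itself once a 
fixed number of later steps (the hypothetical `writeback_delay`) have 
passed.  The function and variable names are made up for illustration.

```python
def dirty_at_sync(schedule, writeback_delay):
    """Replay ('write', file, block) and ('sync', file) steps.

    Toy assumption: the OS flushes a dirty block on its own once
    `writeback_delay` later steps have elapsed since it was written.
    Returns, per file, how many blocks the sync itself had to flush.
    """
    dirty = {}   # (file, block) -> step index at which it was written
    forced = {}  # file -> number of blocks still dirty at sync time
    for step, op in enumerate(schedule):
        # Background writeback: blocks that have aged enough get flushed.
        for key, written_at in list(dirty.items()):
            if step - written_at >= writeback_delay:
                del dirty[key]
        if op[0] == "write":
            dirty[(op[1], op[2])] = step
        else:
            pending = [key for key in dirty if key[0] == op[1]]
            forced[op[1]] = len(pending)
            for key in pending:
                del dirty[key]
    return forced


# Interleaved writes, syncs at the end (the existing sequence).
interleaved = [("write", "a", 1), ("write", "b", 1),
               ("write", "a", 2), ("write", "b", 2),
               ("sync", "a"), ("sync", "b")]

# Write a whole file, then sync it right away (the quoted proposal).
per_file = [("write", "a", 1), ("write", "a", 2), ("sync", "a"),
            ("write", "b", 1), ("write", "b", 2), ("sync", "b")]

print(dirty_at_sync(interleaved, 3))    # eager filesystem
print(dirty_at_sync(per_file, 3))
print(dirty_at_sync(interleaved, 100))  # lazy filesystem, large cache
```

With a delay of 3 steps, each sync in the interleaved schedule has one 
block left to flush, while the write-then-sync schedule leaves both 
blocks of each file for the sync.  With a very large delay, standing in 
for a big and lazy write cache, the interleaved schedule loses that 
advantage and every sync has to flush both blocks.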

Anyway, now that the ability to get logging on all this stuff went in 
during the last CF, it's way easier to just set up a random system to run 
tests in this area than it used to be.  Whatever testing does happen 
should include, say, a 2GB laptop with a single hard drive in it.  I 
think that's the low end of what is reasonable to target when tweaking 
write performance, given the hardware 9.1 is likely to be deployed on.

> How does the checkpoint target give you any time to sync them?  Unless
> you squeeze the writes together more tightly, but that seems sketchy.
>   

Obviously the checkpoint target idea needs to be shuffled around some 
too.  I was thinking of making the new default 0.8, and having it split 
the time in half for write and sync.  That will make the write phase 
close to the speed people are seeing now, at the default of 0.5, while 
giving some window for spread sync too.  The exact way to redistribute 
that around I'm not so concerned about yet.  When I get to where that's 
the most uncertain thing left I'll benchmark the TPS vs. latency 
trade-off and see what happens.  If the rest of the code is good enough 
but this just needs to be tweaked, that's a perfect thing to finalize 
using beta feedback.
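
The arithmetic behind that split can be sketched as follows.  This is 
only an illustration: the 300 second interval assumes the default 
checkpoint_timeout of 5 minutes, and `phase_budget` and `sync_share` are 
hypothetical names, not existing settings or functions.

```python
def phase_budget(checkpoint_timeout_s, completion_target, sync_share):
    """Split the spread-checkpoint window into write and sync time.

    `sync_share` is a hypothetical knob: the fraction of the window
    reserved for spreading out the sync calls.
    """
    window = checkpoint_timeout_s * completion_target
    sync_time = window * sync_share
    write_time = window - sync_time
    return write_time, sync_time


# Current behavior: target 0.5, nothing reserved for sync.
current_write, current_sync = phase_budget(300, 0.5, 0.0)

# Proposal: target 0.8, split in half between write and sync.
new_write, new_sync = phase_budget(300, 0.8, 0.5)

print(f"current:  write {current_write:.0f}s, sync {current_sync:.0f}s")
print(f"proposed: write {new_write:.0f}s, sync {new_sync:.0f}s")
```

Under those assumptions the write phase goes from about 150 seconds to 
about 120 seconds, which is the sense in which it stays close to the 
current speed, and the sync phase gets a window of about 120 seconds 
where it has none today.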

> Well you don't have to put it in shared memory on account of any of
> that.  You can just hang it on a global variable.
>   

Hmm.  Because it's so similar to other things being allocated in shared 
memory, I just automatically pushed it over to there.  But you're right; 
it doesn't need to be that complicated.  Nobody is touching it but the 
background writer.

> If we can find something that's a modest improvement on the
> status quo and we can be confident in quickly, good, but I'd rather
> have 9.1 go out the door on time without fully fixing this than delay
> the release.
>   

I'm not somebody who needs to be convinced of that.  There are two 
near-commit-quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is 
happening, rather than starting it and ignoring everything else until 
it's done.

2) Compact fsync requests when the queue fills.
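
As a rough illustration of the second item, compaction can be thought of 
as dropping duplicate requests from a full queue, since one fsync of a 
file covers every earlier write to it.  The sketch below is not the 
patch itself; the queue layout, the function names, and the use of plain 
strings as file identifiers are all assumptions made for the example.

```python
def compact_requests(queue):
    """Return the queue with duplicate requests removed, order preserved."""
    seen = set()
    compacted = []
    for request in queue:
        if request not in seen:
            seen.add(request)
            compacted.append(request)
    return compacted


def enqueue(queue, request, capacity):
    """Append a request, compacting first if the queue is full.

    Returns False if the queue is still full after compaction, in which
    case the caller has to fall back to something else, such as doing
    the fsync itself.
    """
    if len(queue) >= capacity:
        queue[:] = compact_requests(queue)
        if len(queue) >= capacity:
            return False
    queue.append(request)
    return True


queue = ["a", "b", "a", "a"]
accepted = enqueue(queue, "c", capacity=4)
print(accepted, queue)
```

Here the full queue holds three requests for file "a"; compaction 
collapses them into one, which frees enough room to accept the request 
for "c".  A queue that is full of distinct requests cannot be compacted, 
so the enqueue is refused.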

If that's all we can get for 9.1, it will still be a major improvement.  
I realize I only have a very short period of time to complete a major 
integration breakthrough on the pieces floating around before the goal 
here has to drop to something less ambitious.  I head to the West Coast 
for a week on the 23rd; I'll be forced to throw in the towel at that 
point if I can't get the better ideas we have in pieces here all 
assembled well by then.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


