Design proposal: fsync absorb linear slider

From Greg Smith
Subject Design proposal: fsync absorb linear slider
Msg-id 51EDFD20.5060904@2ndQuadrant.com
List pgsql-hackers
Recently I've been dismissing a lot of suggested changes to checkpoint 
fsync timing without suggesting an alternative.  I have a simple one in 
mind that captures the biggest problem I see:  that the number of 
backend and checkpoint writes to a file and the timing of that file's 
fsync are not connected at all.

We know that a 1GB relation segment can take a really long time to write 
out.  At 8K per page that's up to 131,072 changed pages, and we allow all 
of them to get dirty before any are forced to disk with fsync.

Rather than second-guess the I/O scheduling, I'd like to take this on 
directly by recognizing that the size of the problem is proportional to 
the number of writes to a segment.  If you turned off fsync absorption 
altogether, you'd be at an extreme that allows only 1 write before 
fsync.  That's low latency for each write, but terrible throughput.  The 
maximum throughput case of 131,072 writes has the terrible latency we 
get reports about.  But what if that trade-off was just a straight, 
linear slider going from 1 to 131,072?  Just move it to the latency vs. 
throughput position you want, and see how that works out.
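
For illustration only, assuming the GUC ends up with the name I propose 
below (max_segment_absorb, which doesn't exist today), the whole knob 
would be a single postgresql.conf line:

    # Hypothetical setting; name and semantics from the proposal below.
    #   0 (default)  keep the current behavior
    #   small value  fsync a segment early: better latency, worse throughput
    #   near 131072  today's "dirty the whole 1GB segment" extreme
    max_segment_absorb = 4096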

The implementation I had in mind was this one:

-Add an absorption_count to each entry in the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb.  Its 
default value is -1 (or 0), which corresponds to ignoring this new 
behavior.

-Whenever the background writer absorbs an fsync call for a relation 
that's already in the queue, increment its absorption count.

-When max_segment_absorb > 0, have the background writer scan for 
relations where absorption_count > max_segment_absorb.  When it finds 
one, call fsync on that segment.
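
To make that concrete, here is a rough sketch of how those pieces might 
hang together.  None of these names match the actual md.c/checkpointer 
code; PendingFsyncEntry, foreach_pending_entry(), and fsync_segment() 
are placeholders, and a real patch would graft onto the existing fsync 
request queue instead:

    /* Sketch only: invented structures standing in for the fsync queue. */
    typedef struct PendingFsyncEntry
    {
        RelFileNode rnode;              /* relation the segment belongs to */
        int         segno;              /* which 1GB segment */
        int         absorption_count;   /* writes absorbed since last fsync */
    } PendingFsyncEntry;

    int max_segment_absorb = 0;         /* proposed GUC; <= 0 disables */

    /* Absorbing a request for a segment that's already in the queue. */
    static void
    absorb_fsync_request(PendingFsyncEntry *entry)
    {
        entry->absorption_count++;
    }

    /* Background writer pass, only active when the GUC is enabled. */
    static void
    force_saturated_segments(void)
    {
        PendingFsyncEntry *entry;

        if (max_segment_absorb <= 0)
            return;

        foreach_pending_entry(entry)        /* placeholder iteration */
        {
            if (entry->absorption_count > max_segment_absorb)
            {
                fsync_segment(entry);       /* push that file to disk early */
                entry->absorption_count = 0;
            }
        }
    }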

Note that it's possible for this simple scheme to be fooled when writes 
are actually touching a small number of pages.  A process that 
constantly overwrites the same page is the worst case here.  Overwrite 
it 131,072 times, and this method would assume you've dirtied every 
page, while only 1 will actually go to disk when you call fsync.  It's 
possible to track this better.  The count mechanism could be replaced 
with a bitmap of the segment's 131,072 blocks, so that absorbs set a bit 
instead of incrementing a count.  My gut feel is that this is more 
complexity than is really necessary here.  If in fact the fsync turns 
out to be cheaper than the count suggests, forcing it a little too early 
isn't the worst problem to have here.
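
If someone did want the more accurate version, the bookkeeping itself is 
small; the cost is memory, since 131,072 bits is 16KB of bitmap per 
queued segment.  Again with invented names, just to show the shape of it:

    #define BLOCKS_PER_SEGMENT  131072          /* 1GB / 8K */
    #define BITMAP_WORDS        (BLOCKS_PER_SEGMENT / 32)

    typedef struct SegmentWriteMap
    {
        uint32      written[BITMAP_WORDS];      /* 16KB per queued segment */
        int         distinct_blocks;            /* how many bits are set */
    } SegmentWriteMap;

    /* Absorb a write to one block; only distinct blocks raise the count. */
    static void
    mark_block_written(SegmentWriteMap *map, int blocknum)
    {
        uint32  mask = ((uint32) 1) << (blocknum % 32);
        int     word = blocknum / 32;

        if ((map->written[word] & mask) == 0)
        {
            map->written[word] |= mask;
            map->distinct_blocks++;
        }
    }

The threshold test would then compare distinct_blocks instead of 
absorption_count, so a page overwritten thousands of times still only 
counts once.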

I'd like to build this myself, but if someone else wants to take a shot 
at it I won't mind.  Just be aware the review is the big part here.  I 
should be honest about one thing: I have zero incentive to actually work 
on this.  The moderate amount of sponsorship money I've raised for 9.4 
so far isn't getting anywhere near this work.  The checkpoint patch 
review I have been doing recently is coming out of my weekend volunteer 
time.

And I can't get too excited about making this my volunteer effort 
when I consider what the resulting credit will look like.  Coding is by 
far the smallest part of work like this, well behind coming up with the 
design in the first place.  And both of those are way, way behind how 
long review benchmarking takes on something like this.  The way credit 
is distributed for this sort of feature puts coding first, design not 
credited at all, and maybe you'll see some small review credit for 
benchmarks.  That's completely backwards from the actual work ratio.  If 
all I'm getting out of something is credit, I'd at least like it to be 
an appropriate amount of it.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


