Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Spread checkpoint sync
Date
Msg-id 4D329E3A.2030803@2ndquadrant.com
Whole thread Raw
In response to Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Spread checkpoint sync  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Robert Haas wrote:
> What is the basis for thinking that the sync should get the same
> amount of time as the writes?  That seems pretty arbitrary.  Right
> now, you're allowing 3 seconds per fsync, which could be a lot more or
> a lot less than 40% of the total checkpoint time...

Just that it's where I ended up at when fighting with this for a month 
on the system I've seen the most problems at.  The 3 second number was 
reversed from a computation that said "aim for an internal of X minutes; 
we have Y relations on average involved in the checkpoint".  The 
direction my latest patch is strugling to go is computing a reasonable 
time automatically in the same way--count the relations, do a time 
estimate, add enough delay so the sync calls should be spread linearly 
over the given time range.


> the checkpoint activity is always going to be spikey if it does
> anything at all, so spacing it out *more* isn't obviously useful.
>   

One of the components to the write queue is some notion that writes that 
have been waiting longest should eventually be flushed out.  Linux has 
this number called dirty_expire_centiseconds which suggests it enforces 
just that, set to a default of 30 seconds.  This is why some 5-minute 
interval checkpoints with default parameters, effectively spreading the 
checkpoint over 2.5 minutes, can work under the current design.  
Anything you wrote at T+0 to T+2:00 *should* have been written out 
already when you reach T+2:30 and sync.  Unfortunately, when the system 
gets busy, there is this "congestion control" logic that basically 
throws out any guarantee of writes starting shortly after the expiration 
time.

It turns out that the only thing that really works are the tunables that 
block new writes from happening once the queue is full, but they can't 
be set low enough to work well in earlier kernels when combined with 
lots of RAM.  Using the terminology of 
http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point 
you hit a point where "a process generating disk writes will itself 
start writeback."  This is anologous to the PostgreSQL situation where 
backends do their own fsync calls.  The kernel will eventually move to 
where those trying to write new data are instead recruited into being 
additional sources of write flushing.  That's the part you just can't 
make aggressive enough on older kernels; dirty writers can always win.  
Ideally, the system never digs itself into a hole larger than you can 
afford to wait to write out.  It's a transacton speed vs. latency thing 
though, and the older kernels just don't consider the latency side well 
enough.

There is new mechanism in the latest kernels to control this much 
better:  dirty_bytes and dirty_background_bytes are the tunables.  I 
haven't had a chance to test yet.  As mentioned upthread, some of the 
bleding edge kernels that have this feature available in are showing 
such large general performance regressions in our tests, compared to the 
boring old RHEL5 kernel, that whether this feature works or not is 
irrelevant.  I haven't tracked down which new kernel distributions work 
well performance-wise and which don't yet for PostgreSQL.

I'm hoping that when I get there, I'll see results like 
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages 
, where the ideal setting for dirty_bytes  to keep latency under control 
with BBWC was 15MB.  To put that into perspective, the lowest useful 
setting you can set dirty_ratio to is 5% of RAM.  That's 410MB on my 
measly 8GB desktop, and 3.3GB on the 64GB production server I've been 
trying to tune.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Spread checkpoint sync
Next
From: Charles Pritchard
Date:
Subject: W3C Specs: Web SQL Revisit