On Fri, 2 Oct 2009, Scott Marlowe wrote:
> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?
Generally, but there are plenty of ways you can get into a state where a
short but not immediate checkpoint is better. For example, consider a
case where your buffer cache is filled with really random stuff. There's
a sorting horizon in effect, where your OS and/or controller makes
decisions about what order to write things based on the data it already
has around, not really knowing what's coming in the near future.
Let's say you've got 256MB of cache in the disk controller, you have 1GB
of buffer cache to write out, and there's 8GB of RAM in the server so it
can cache the whole write. If you wrote it out in a big burst, the OS
would elevator sort things and feed them to the controller in disk order.
Very efficient, one pass over the disk to write everything out.
But if you broke that up into 256MB write pieces instead on the database
side, pausing after each chunk was written, the OS would only be sorting
across 256MB at a time, and would basically fill the controller cache up
with that before it saw the larger picture. With a planning window that
small, the disk controller can end up making seek decisions that are not
really optimal, making more passes over the disk to write the same data
out. If the timing between the DB write cache and the OS is
pathologically out of sync here, the result can end up being slower than
had you just written out in bigger chunks instead. This is one reason I'd
like to see fsync calls happen earlier and more evenly than they do now,
to reduce these edge cases.
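To make the sorting horizon concrete, here's a toy model in Python. It's a
sketch under loud assumptions, not a model of any real controller: the disk
is a flat range of block numbers, cost is just head travel distance, and
each batch gets sorted on its own with no knowledge of the next one. The
head_travel name and the scaled-down sizes are made up for illustration.

```python
import random

def head_travel(blocks, window):
    """Total head movement when each `window`-sized batch is sorted on its own.

    Assumption: the head starts at block 0 and cost is plain distance moved.
    """
    position = 0
    travel = 0
    for start in range(0, len(blocks), window):
        # The sorter only sees this batch, never what is coming next.
        for block in sorted(blocks[start:start + window]):
            travel += abs(block - position)
            position = block
    return travel

rng = random.Random(42)
# Scaled-down stand-in for a buffer cache full of really random stuff.
dirty = [rng.randrange(1_000_000) for _ in range(4096)]

one_burst = head_travel(dirty, window=len(dirty))         # whole write visible
four_chunks = head_travel(dirty, window=len(dirty) // 4)  # 1/4 visible at a time

print("one burst:  ", one_burst)
print("four chunks:", four_chunks)
```

In this model the single burst is one pass across the disk, while each
smaller chunk costs its own sweep plus the seek back to where the next
chunk starts.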
The usual approach I take in this situation is to reduce the amount of
write caching the OS does, so at least things get more predictable. A
giant write cache always gives the best average performance, but the
worst-case behavior gets worse at the same time.
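For a rough sense of scale, here's some back-of-envelope arithmetic in
Python. It assumes a Linux box, where one such knob is the vm.dirty_ratio
sysctl, and it treats that percentage as applying to total RAM. That's a
simplification, since the kernel works from available memory rather than
total. The percentages are illustrative only, not recommendations.

```python
def dirty_limit_mb(ram_mb, dirty_percent):
    """Rough ceiling on dirty data the OS holds before forcing writes.

    Assumption: the percentage is taken against total RAM.
    """
    return ram_mb * dirty_percent // 100

ram_mb = 8 * 1024           # the 8GB server from the example
controller_cache_mb = 256   # the controller cache from the example

for percent in (40, 10, 2):  # illustrative settings only
    limit = dirty_limit_mb(ram_mb, percent)
    print(f"{percent:>2}% -> up to {limit}MB dirty, "
          f"{limit / controller_cache_mb:.1f}x the controller cache")
```

The smaller the ceiling, the less data can pile up behind a single flush,
which is the predictability being traded against average throughput.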
There was a patch floating around at one point that sorted all the
checkpoint writes by block order, which would reduce how likely it is
you'll end up in one of these odd cases. The benefit of that turned out to
be hard to nail down though, because in a typical case the OS caching here
trumps any I/O scheduling you try to do in user land, and it's hard to
repeatably generate scattered data in a benchmark situation.
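The idea behind sorting checkpoint writes can be sketched in a few lines of
Python. This is not the patch itself, just an illustration of the ordering:
DirtyBuffer, relfile, and checkpoint_write_order are hypothetical names,
and the file names are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a dirty buffer: the file it belongs to and
# the block within that file it maps to.
DirtyBuffer = namedtuple("DirtyBuffer", ["relfile", "block"])

def checkpoint_write_order(dirty_buffers):
    """Return buffers ordered by file, then by block number within the file."""
    return sorted(dirty_buffers, key=lambda b: (b.relfile, b.block))

dirty = [
    DirtyBuffer("16384", 907),
    DirtyBuffer("16390", 12),
    DirtyBuffer("16384", 3),
    DirtyBuffer("16390", 4),
]

# Issue writes file by file, in ascending block order.
for buf in checkpoint_write_order(dirty):
    print(buf.relfile, buf.block)
```

Writes handed to the OS in this order are already close to sequential
within each file, so less depends on what the OS and controller happen to
have in view when they make their own sorting decisions.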
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD