On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> >> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
> >> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
> >> The performance impact of that was inconclusive, but one thing that it
> >> allows nicely is to interleave the fsyncs, so that you write all the buffers
> >> for one file, then fsync it, then next file and so on.
>
> > ...
> > So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

Sure. And I'm not claiming Itagaki's/your old patch is immediately going
to be ready for commit. But our checkpoint performance has sucked for
years in the field. I don't think we can wave that away.

I think the primary reason the benefit wasn't easily visible back then
is that only throughput was looked at, not latency and the like.

> Also, it does not matter how good it looks in test cases if it causes
> outright failures due to OOM; unlike you, I am not prepared to just "wave
> away" that risk.

I'm not "waving away" any risks.

If the sort buffer is allocated once when the checkpointer is started,
rather than every time we sort as your version of the patch does, I
think that risk is pretty manageable. If we really want to be sure
nothing is allocated at runtime, even if the checkpointer is restarted,
we can put the sort array in shared memory.
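
To make the allocate-once idea concrete, here's a minimal sketch in plain
C. It is not code from either patch: SortEntry, SortArray,
sort_array_init() and sort_array_borrow() are hypothetical names, and
calloc() merely stands in for whatever allocator (or shared memory
segment) we'd actually use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-buffer sort key: which file, which block, which buffer. */
typedef struct SortEntry
{
    unsigned relfile;
    unsigned block;
    int      buf_id;
} SortEntry;

/* Hypothetical holder for the preallocated array. */
typedef struct SortArray
{
    SortEntry *entries;
    size_t     capacity;    /* number of entries, i.e. number of shared buffers */
} SortArray;

/*
 * Called once at checkpointer start. If this fails we find out at
 * startup, not in the middle of a checkpoint.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int
sort_array_init(SortArray *sa, size_t nbuffers)
{
    assert(sa != NULL);
    sa->entries = calloc(nbuffers, sizeof(SortEntry));
    sa->capacity = sa->entries ? nbuffers : 0;
    return sa->entries ? 0 : -1;
}

/*
 * Called for every checkpoint. Never allocates: it hands back the same
 * array each time, or NULL if the caller asks for more than was reserved.
 */
static SortEntry *
sort_array_borrow(SortArray *sa, size_t ndirty)
{
    assert(sa != NULL);
    return ndirty <= sa->capacity ? sa->entries : NULL;
}
```

Since the number of dirty buffers can never exceed the number of shared
buffers, sizing the array by shared_buffers means the per-checkpoint
path has nothing left that can run out of memory.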

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 %
overhead over shared_buffers here. If we decide to go that way, it's a
pretty darn small price to pay for not regularly having stalls that last
minutes.

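
For reference, the arithmetic behind that figure can be checked with a
few lines of standalone C. DemoBufferTag is a stand-in that I'm assuming
matches the current BufferTag layout (three OIDs, a fork number, a block
number); DEMO_BLCKSZ assumes the default 8kB block size, and the numbers
assume a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for BufferTag; layout is an assumption, not copied from the tree. */
typedef struct DemoBufferTag
{
    uint32_t spcNode;       /* tablespace */
    uint32_t dbNode;        /* database */
    uint32_t relNode;       /* relation */
    int32_t  forkNum;       /* fork within the relation */
    uint32_t blockNum;      /* block within the fork */
} DemoBufferTag;

#define DEMO_BLCKSZ 8192    /* assumed default block size */

/* Bytes of sort array needed to cover nbuffers shared buffers. */
static size_t
sort_array_bytes(size_t nbuffers)
{
    size_t per_entry = sizeof(DemoBufferTag) + sizeof(int);

    assert(nbuffers <= SIZE_MAX / per_entry);   /* guard against overflow */
    return nbuffers * per_entry;
}

/* Sort array size as a percentage of shared_buffers. */
static double
sort_array_overhead_pct(void)
{
    return 100.0 * (sizeof(DemoBufferTag) + sizeof(int)) / DEMO_BLCKSZ;
}
```

With those assumptions an entry is 24 bytes, so shared_buffers = 1GB
(131072 buffers) needs 3145728 bytes, i.e. 3MB, for the sort array.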
> A possible compromise is to sort a limited number of
> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> fsync them, repeat as needed.

Yeah, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after
having written out all its buffers. Doing so is rather beneficial because
fewer independently dirtied pages then have to be flushed out, which
reduces the impact of the fsync().
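
Roughly what I have in mind, as a toy sketch rather than anything from
the patch: DirtyBuf, dirtybuf_cmp() and checkpoint_sorted() are made-up
names, and the write and fsync are only recorded in a trace string
instead of being issued.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical description of one dirty buffer. */
typedef struct DirtyBuf
{
    unsigned relfile;   /* stand-in for the relfilenode */
    unsigned block;     /* block number within that file */
    int      buf_id;    /* index into shared buffers */
} DirtyBuf;

/* Order by file first, then by block, so each file's writes are sequential. */
static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->relfile != y->relfile)
        return x->relfile < y->relfile ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/*
 * Sort all dirty buffers, "write" them in order, and "fsync" each file
 * right after its last buffer. Appends "w<file>:<block> " and "f<file> "
 * to log; returns the number of fsyncs.
 */
static int
checkpoint_sorted(DirtyBuf *bufs, size_t n, char *log, size_t logsz)
{
    size_t used = 0;
    int    nfsyncs = 0;

    assert(log != NULL && logsz > 0);
    log[0] = '\0';
    qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);

    for (size_t i = 0; i < n; i++)
    {
        /* stand-in for writing the buffer out */
        if (used < logsz)
            used += (size_t) snprintf(log + used, logsz - used, "w%u:%u ",
                                      bufs[i].relfile, bufs[i].block);

        /* last dirty buffer of this file: sync it now, not at the very end */
        if (i + 1 == n || bufs[i + 1].relfile != bufs[i].relfile)
        {
            if (used < logsz)
                used += (size_t) snprintf(log + used, logsz - used, "f%u ",
                                          bufs[i].relfile);
            nfsyncs++;
        }
    }
    return nfsyncs;
}
```

Sorting only a few thousand buffers at a time can't do this, because
buffers of the same file may show up again in a later batch, so the file
would have to be fsynced more than once or only at the end.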

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services