On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales <scales@vmware.com> wrote:
> I've been prototyping the double-write buffer idea that Heikki and Simon
> had proposed (as an alternative to a previous patch that only batched up
> writes by the checkpointer). I think it is a good idea, and can help
> double-writes perform better in the case of lots of backend evictions.
> It also centralizes most of the code change in smgr.c. However, it is
> trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers. AIUI, you must retain every page in the double-write
buffer until it's been written and fsync'd to disk. That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB. Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction. That's barely enough for the system to do any
write reordering at all.
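
To make the arithmetic behind that ceiling explicit, here is a minimal sketch. It assumes the two double-write buffers hold 128 and 64 pages (the figures quoted above) and that pages are PostgreSQL's default 8kB blocks; the function and parameter names are made up for illustration and do not come from the patch.

```python
# Rough upper bound on dirty data that can sit in the OS cache when every
# page must stay in a double-write buffer until it is written and fsync'd.
# Assumption: two buffers of 128 and 64 pages, 8 kB per page (default BLCKSZ).

PAGE_KB = 8  # default PostgreSQL block size, in kB


def max_dirty_kb(first_buffer_pages, second_buffer_pages, page_kb=PAGE_KB):
    """Return the most dirty data (in kB) the two buffers can hold at once."""
    return (first_buffer_pages + second_buffer_pages) * page_kb


cap_kb = max_dirty_kb(128, 64)
print(f"{cap_kb} kB = {cap_kb / 1024:.1f} MB")  # 1536 kB = 1.5 MB
```

Doubling either buffer only moves the ceiling to a few MB, so the bound stays tiny next to what the OS would normally be allowed to cache.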

I am particularly worried about what happens when a ring buffer is in
use. I tried running "pgbench -i -s 10" with this patch applied,
full_page_writes=off, double_writes=on. It took 41.2 seconds to
complete. The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things. I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.
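
As a rough illustration of why the overall ratio understates the COPY slowdown, the sketch below backs out the implied COPY slowdown from the two totals. The share of the stock run spent in COPY was not measured here, so `copy_fraction` is purely hypothetical, and the model assumes everything outside COPY (index builds and so on) takes the same time with and without the patch.

```python
# Back-of-the-envelope model: total = COPY time + everything else, where
# "everything else" is assumed to be unaffected by the patch.
# copy_fraction is a hypothetical input, not a measured value.

STOCK_TOTAL = 14.3    # seconds, stock code
PATCHED_TOTAL = 41.2  # seconds, double_writes=on


def implied_copy_slowdown(stock_total, patched_total, copy_fraction):
    """Slowdown of COPY alone, if COPY is copy_fraction of the stock run."""
    stock_copy = stock_total * copy_fraction
    unaffected = stock_total - stock_copy
    patched_copy = patched_total - unaffected
    return patched_copy / stock_copy


print(f"overall: {PATCHED_TOTAL / STOCK_TOTAL:.2f}x")
for fraction in (1.0, 0.5, 0.2):
    slowdown = implied_copy_slowdown(STOCK_TOTAL, PATCHED_TOTAL, fraction)
    print(f"COPY = {fraction:.0%} of stock run -> COPY slowdown {slowdown:.1f}x")
```

The smaller the share of the stock run that COPY accounts for, the larger the slowdown on COPY itself has to be to explain the same totals.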
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company