On Mon, 2009-07-13 at 15:53 -0500, Dan Armbrust wrote:
> > So this thought leads to a couple of other things Dan could test.
> > First, see if turning off full_page_writes makes the hiccup go away.
> > If so, we know the problem is in this area (though still not exactly
> > which reason); if not we need another idea. That's not a good permanent
> > fix though, since it reduces crash safety. The other knobs to
> > experiment with are synchronous_commit and wal_sync_method. If the
> > stalls are due to commits waiting for additional xlog to get written,
> > then async commit should stop them. I'm not sure if changing
> > wal_sync_method can help, but it'd be worth experimenting with.
> >
> All of my testing to date has been done with synchronous_commit=off
>
> I just tried setting full_page_writes=off - and like magic, the entire
> hiccup went away.

OK, that seems clear.

I mistakenly referred to the CRC calculation happening while the lock
was held, which confused the discussion. The lock *is* held for longer
when we have backup blocks, and the lock does need to be acquired twice
immediately after a checkpoint.

Neither of those two effects appears, on its own, sufficient to
explain the delay. We should conjecture that a traffic jam exists and go
looking for it.

I propose adding a DTrace probe immediately after the "goto begin" at
line 740 of xlog.c, so that we can start tracing from the first backend
following a checkpoint and turn tracing off once all backends have
completed a transaction.
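
To make the intended tracing window concrete, here is a rough toy model
of the on/off logic only. It is a sketch, not PostgreSQL code: the class
and method names are hypothetical, the backend ids are made up, and the
real probe would be a DTrace probe in xlog.c rather than anything like
this.

```python
class TraceWindow:
    """Toy model of the proposed tracing window (hypothetical, not xlog.c)."""

    def __init__(self, backend_ids):
        # Backends that have not yet completed a transaction in this window.
        self.pending = set(backend_ids)
        self.active = False
        self.events = []

    def record(self, backend_id, what):
        # Events are kept only while the window is open.
        if self.active:
            self.events.append((backend_id, what))

    def retry_after_checkpoint(self, backend_id):
        # Stands in for the point just after "goto begin": the first
        # backend to reach it after a checkpoint switches tracing on.
        if not self.active and self.pending:
            self.active = True
        self.record(backend_id, "retry")

    def transaction_done(self, backend_id):
        # Tracing switches off once every backend has completed a transaction.
        self.record(backend_id, "commit")
        self.pending.discard(backend_id)
        if not self.pending:
            self.active = False


window = TraceWindow([1, 2])
window.record(1, "insert")          # before the window opens: ignored
window.retry_after_checkpoint(1)    # opens the window
window.record(2, "insert")
window.transaction_done(1)
window.transaction_done(2)          # last backend done: window closes
```

The point of bounding the window this way is to keep the trace small
enough to read while still covering every backend that might be stuck
in the jam.
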
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support