Re: Checkpoint Tuning Question - Mailing list pgsql-general

From Tom Lane
Subject Re: Checkpoint Tuning Question
Date
Msg-id 2479.1247418610@sss.pgh.pa.us
In response to Re: Checkpoint Tuning Question  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Checkpoint Tuning Question
Re: Checkpoint Tuning Question
List pgsql-general
Simon Riggs <simon@2ndQuadrant.com> writes:
> This causes us to queue for the WALInsertLock twice at exactly the time
> when every caller needs to calculate the CRC for complete blocks. So we
> queue twice when the lock-hold-time is consistently high, causing queue
> lengths to go ballistic.

You keep saying that, and it keeps not being true, because the CRC
calculation is *not* done while holding the lock.

It is true that the very first XLogInsert call in each backend after
a checkpoint starts will have to go back and redo its CRC calculation,
but that's a one-time waste of CPU.  It's hard to see how it could have
continuing effects over several seconds, especially in a system that
has CPU to spare.

What I think might be the cause is that just after a checkpoint starts,
quite a large proportion of XLogInserts will include full-page buffer
copies, thus leading to an overall higher rate of WAL creation.  That
means longer hold times for WALInsertLock due to spending more time
copying data into the WAL buffers, and it also means more WAL that has
to be synced to disk before a transaction can commit.  I'm still
convinced that Dan's problem ultimately comes down to inadequate disk
bandwidth, so I think the latter point is probably the key.
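
To make the full-page-write effect concrete, here is a rough toy model, not a measurement of any real system. It assumes the default 8 kB block size, a made-up 100-byte ordinary WAL record, and uniformly random page access; the function name and all parameters are hypothetical. The first change to a page after a checkpoint starts costs a full-page image, later changes to the same page do not, so WAL volume spikes right after the checkpoint and then decays.

```python
import random

PAGE_SIZE = 8192     # PostgreSQL default block size in bytes
RECORD_SIZE = 100    # hypothetical size of an ordinary WAL record


def wal_bytes_per_interval(n_pages, updates_per_interval, n_intervals,
                           full_page_writes=True, seed=0):
    """Return the WAL bytes generated in each interval after a checkpoint starts.

    Toy model: every update touches one uniformly random page. With
    full_page_writes enabled, the first update to a page since the
    checkpoint also logs a copy of the whole page.
    """
    rng = random.Random(seed)
    touched = set()  # pages already logged in full since the checkpoint
    totals = []
    for _ in range(n_intervals):
        total = 0
        for _ in range(updates_per_interval):
            page = rng.randrange(n_pages)
            total += RECORD_SIZE
            if full_page_writes and page not in touched:
                # First change to this page since the checkpoint:
                # the record carries a full-page image.
                touched.add(page)
                total += PAGE_SIZE
        totals.append(total)
    return totals


if __name__ == "__main__":
    for label, fpw in (("full_page_writes on ", True),
                       ("full_page_writes off", False)):
        print(label, wal_bytes_per_interval(10_000, 5_000, 6, fpw))
```

In this model the first interval after the checkpoint produces many times the WAL of later intervals, while the run with full-page images disabled stays flat. If the disks are already near their limit, that early burst is what commits would have to wait behind.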

So this thought leads to a couple of other things Dan could test.
First, see if turning off full_page_writes makes the hiccup go away.
If so, we know the problem is in this area (though still not which of
the two effects is responsible); if not, we need another idea.  Turning
it off is not a good permanent fix though, since it reduces crash
safety.  The other knobs to
experiment with are synchronous_commit and wal_sync_method.  If the
stalls are due to commits waiting for additional xlog to get written,
then async commit should stop them.  I'm not sure if changing
wal_sync_method can help, but it'd be worth experimenting with.
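
One way to keep these experiments honest is to change a single setting per run, so that any change in the stall can be attributed to that knob. The sketch below only generates postgresql.conf-style lines for each trial; it does not touch a server. The baseline values are assumptions (the default wal_sync_method in particular is platform-dependent), and the helper names are hypothetical.

```python
# Assumed baseline; check the running server's actual values first.
BASELINE = {
    "full_page_writes": "on",
    "synchronous_commit": "on",
    "wal_sync_method": "fdatasync",  # platform-dependent default, assumed here
}

# One (setting, trial value) pair per experiment.
TRIALS = [
    ("full_page_writes", "off"),     # diagnostic only: reduces crash safety
    ("synchronous_commit", "off"),   # async commit
    ("wal_sync_method", "open_sync"),
]


def trial_configs(baseline, trials):
    """Yield (changed_setting, settings) with exactly one knob changed per trial."""
    for name, value in trials:
        settings = dict(baseline)
        settings[name] = value
        yield name, settings


def render(settings):
    """Format settings as postgresql.conf-style 'name = value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


if __name__ == "__main__":
    for changed, settings in trial_configs(BASELINE, TRIALS):
        print(f"# trial: vary {changed}")
        print(render(settings))
        print()
```

Each trial would then be run against the same workload and compared with the baseline stall at checkpoint start.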

            regards, tom lane
