On 09.03.2012 13:40, Fujii Masao wrote:
> I found the bug which causes walsender to enter into busy loop
> when replication connection is terminated. Walsender consumes
> lots of CPU resource (%sys), and this situation lasts until it has
> detected the termination of replication connection and exited.
>
> The cause of this bug is that the walsender loop doesn't call
> ResetLatch at all in the above case. Since the latch remains set,
> the walsender loop cannot sleep on the latch, i.e., WaitLatch
> always returns immediately.
>
> We can fix this bug by adding ResetLatch into the top of the
> walsender loop. Patch attached.
>
> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has
> already been fixed by the commit
> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit
> refactors and refines the walsender loop logic in addition to
> adding ResetLatch. So I'm tempted to backport this commit
> (except the deletion of wal_sender_delay) to 9.1 rather than
> applying the attached patch. OTOH, attached patch is quite simple,
> and its impact on 9.1 would be very small, so it's easy to backport that.
> Thought?
This patch makes the code that follows bogus:
> /*
> * If we don't have any pending data in the output buffer, try to send
> * some more.
> */
> if (!pq_is_send_pending())
> {
> XLogSend(output_message, &caughtup);
>
> /*
> * Even if we wrote all the WAL that was available when we started
> * sending, more might have arrived while we were sending this
> * batch. We had the latch set while sending, so we have not
> * received any signals from that time. Let's arm the latch again,
> * and after that check that we're still up-to-date.
> */
> if (caughtup && !pq_is_send_pending())
> {
> ResetLatch(&MyWalSnd->latch);
>
> XLogSend(output_message, &caughtup);
> }
> }
The comment is no longer valid, and the calls to ResetLatch and XLogSend
are no longer necessary, once you have the ResetLatch() call at the top
of the loop.
I also think we should backport commit
cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of
wal_sender_delay).
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com