Re: Bug: walsender and high CPU usage - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Bug: walsender and high CPU usage
Date
Msg-id 4F5DD6EB.5050309@enterprisedb.com
In response to Bug: walsender and high CPU usage  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Bug: walsender and high CPU usage  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On 09.03.2012 13:40, Fujii Masao wrote:
> I found a bug that causes walsender to enter a busy loop
> when the replication connection is terminated. Walsender consumes
> a lot of CPU (%sys), and this lasts until it detects the
> termination of the replication connection and exits.
>
> The cause of this bug is that the walsender loop doesn't call
> ResetLatch at all in the above case. Since the latch remains set,
> the walsender loop cannot sleep on the latch, i.e., WaitLatch
> always returns immediately.
>
> We can fix this bug by adding ResetLatch into the top of the
> walsender loop. Patch attached.
>
> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has
> already been fixed by the commit
> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit
> refactors and refines the walsender loop logic in addition to
> adding ResetLatch. So I'm tempted to backport this commit
> (except the deletion of wal_sender_delay) to 9.1 rather than
> applying the attached patch. OTOH, the attached patch is quite simple,
> and its impact on 9.1 would be very small, so it's easy to backport.
> Thoughts?

This patch makes the code that follows bogus:

>         /*
>          * If we don't have any pending data in the output buffer, try to send
>          * some more.
>          */
>         if (!pq_is_send_pending())
>         {
>             XLogSend(output_message, &caughtup);
>
>             /*
>              * Even if we wrote all the WAL that was available when we started
>              * sending, more might have arrived while we were sending this
>              * batch. We had the latch set while sending, so we have not
>              * received any signals from that time. Let's arm the latch again,
>              * and after that check that we're still up-to-date.
>              */
>             if (caughtup && !pq_is_send_pending())
>             {
>                 ResetLatch(&MyWalSnd->latch);
>
>                 XLogSend(output_message, &caughtup);
>             }
>         }

The comment is no longer valid, and the calls to ResetLatch and XLogSend 
are no longer necessary, once you have the ResetLatch() call at the top 
of the loop.
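
To illustrate why a latch that is never reset makes the loop spin, here is a minimal toy sketch. It is not the walsender code: ToyLatch, toy_set_latch, toy_reset_latch, toy_wait_latch and count_busy_wakeups are hypothetical names invented for this example, and the "wait" only reports whether it would have returned immediately instead of actually sleeping.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a latch; NOT PostgreSQL's real Latch type. */
typedef struct ToyLatch
{
    bool is_set;
} ToyLatch;

static void toy_set_latch(ToyLatch *latch)   { latch->is_set = true; }
static void toy_reset_latch(ToyLatch *latch) { latch->is_set = false; }

/*
 * Toy wait: returns true if the wait would return immediately because the
 * latch is already set, false if the caller would have gone to sleep.
 */
static bool toy_wait_latch(const ToyLatch *latch)
{
    return latch->is_set;
}

/*
 * Run a simplified sender loop for the given number of iterations and
 * count how many waits returned immediately (that is, busy-loop passes).
 * A single wakeup arrives before the loop starts and nothing sets the
 * latch afterwards.
 */
static int count_busy_wakeups(int iterations, bool reset_at_top)
{
    ToyLatch latch = { false };
    int      busy = 0;

    toy_set_latch(&latch);          /* one wakeup, e.g. from a signal */

    for (int i = 0; i < iterations; i++)
    {
        if (reset_at_top)
            toy_reset_latch(&latch);    /* re-arm before checking for work */

        /* ... the real loop would try to send data here ... */

        if (toy_wait_latch(&latch))
            busy++;                     /* latch still set: no sleep happened */
    }
    return busy;
}
```

In this sketch, count_busy_wakeups(n, false) counts every iteration as a busy pass, because nothing ever clears the latch, while count_busy_wakeups(n, true) counts none. With the reset at the top of the loop, any work check that follows it already runs with the latch armed, which is why a second reset-and-recheck further down adds nothing.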

I also think we should backport commit 
cff75130b5f63e45423c2ed90d6f2e84c21ef840 (except for the removal of 
wal_sender_delay).

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

