Re: Bug: walsender and high CPU usage - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Bug: walsender and high CPU usage
Msg-id CAHGQGwEBE2Ccp8jpUC+8BnsSboPC3nUBBxQUv1doXkB1kmSw1g@mail.gmail.com
In response to Re: Bug: walsender and high CPU usage  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Bug: walsender and high CPU usage
List pgsql-hackers
On Mon, Mar 12, 2012 at 10:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 09.03.2012 13:40, Fujii Masao wrote:
>>>
>>> I found a bug that causes walsender to enter a busy loop
>>> when the replication connection is terminated. Walsender consumes
>>> a lot of CPU (%sys), and this situation lasts until it has
>>> detected the termination of the replication connection and exited.
>>>
>>> The cause of this bug is that the walsender loop doesn't call
>>> ResetLatch at all in the above case. Since the latch remains set,
>>> the walsender loop cannot sleep on the latch, i.e., WaitLatch
>>> always returns immediately.
>>>
>>> We can fix this bug by adding a ResetLatch call at the top of the
>>> walsender loop. Patch attached.
>>>
>>> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has
>>> already been fixed by the commit
>>> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit
>>> refactors and refines the walsender loop logic in addition to
>>> adding ResetLatch. So I'm tempted to backport this commit
>>> (except the deletion of wal_sender_delay) to 9.1 rather than
>>> applying the attached patch. OTOH, the attached patch is quite simple,
>>> and its impact on 9.1 would be very small, so it would be easy to backport.
>>> Thoughts?
>>
>>
>> This patch makes the code that follows bogus:
>>
>>>                /*
>>>                 * If we don't have any pending data in the output buffer, try to send
>>>                 * some more.
>>>                 */
>>>                if (!pq_is_send_pending())
>>>                {
>>>                        XLogSend(output_message, &caughtup);
>>>
>>>                        /*
>>>                         * Even if we wrote all the WAL that was available when we started
>>>                         * sending, more might have arrived while we were sending this
>>>                         * batch. We had the latch set while sending, so we have not
>>>                         * received any signals from that time. Let's arm the latch again,
>>>                         * and after that check that we're still up-to-date.
>>>                         */
>>>                        if (caughtup && !pq_is_send_pending())
>>>                        {
>>>                                ResetLatch(&MyWalSnd->latch);
>>>
>>>                                XLogSend(output_message, &caughtup);
>>>                        }
>>>                }
>>
>>
>> The comment is no longer valid, and the calls to ResetLatch and XLogSend are
>> no longer necessary, once you have the ResetLatch() call at the top of the
>> loop.
>
> Right.
>
>> I also think we should backport commit
>> cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of
>> wal_sender_delay.
>
> Agreed. The attached patch is the same as
> cff75130b5f63e45423c2ed90d6f2e84c21ef840,
> except for the removal of wal_sender_delay. Could you apply this?

Oh, I forgot to attach the patch. It is really attached this time.
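
For anyone following the thread without the source at hand, the busy-loop mechanism described in the quoted report can be sketched with a toy latch. This is only an illustration under stated assumptions: ToyLatch, toy_set_latch, toy_reset_latch, toy_wait_latch and count_immediate_wakeups are hypothetical names made up for this sketch, not PostgreSQL's latch API and not the code in the patch. It just shows why a latch that is never reset makes every wait return immediately.

```c
#include <stdbool.h>

/* Toy stand-in for a latch. Hypothetical; NOT PostgreSQL's Latch. */
typedef struct
{
    bool is_set;
} ToyLatch;

static void toy_set_latch(ToyLatch *latch)   { latch->is_set = true; }
static void toy_reset_latch(ToyLatch *latch) { latch->is_set = false; }

/*
 * Returns true if the caller would have gone to sleep (latch not set),
 * false if the wait returns immediately because the latch is still set.
 */
static bool toy_wait_latch(const ToyLatch *latch)
{
    return !latch->is_set;
}

/*
 * Run a simplified sender loop for `iterations` passes and count how often
 * the wait returned immediately. One wakeup is delivered before the loop
 * starts; no further wakeups arrive.
 */
static int count_immediate_wakeups(bool reset_at_top, int iterations)
{
    ToyLatch latch = { false };
    int      immediate = 0;

    toy_set_latch(&latch);

    for (int i = 0; i < iterations; i++)
    {
        /* Clear the latch before looking for work, so that a wakeup
         * arriving after this point is not lost. */
        if (reset_at_top)
            toy_reset_latch(&latch);

        /* ... a real loop would check for and send pending data here ... */

        if (!toy_wait_latch(&latch))
            immediate++;    /* no sleep happened: this pass just spins */
    }

    return immediate;
}
```

In this sketch, count_immediate_wakeups(false, 5) counts five immediate returns (the loop never sleeps), while count_immediate_wakeups(true, 5) counts none.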

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
