Re: Bug: walsender and high CPU usage - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: Bug: walsender and high CPU usage |
Date | |
Msg-id | CAHGQGwEBE2Ccp8jpUC+8BnsSboPC3nUBBxQUv1doXkB1kmSw1g@mail.gmail.com Whole thread Raw |
In response to | Re: Bug: walsender and high CPU usage (Fujii Masao <masao.fujii@gmail.com>) |
Responses |
Re: Bug: walsender and high CPU usage
|
List | pgsql-hackers |
On Mon, Mar 12, 2012 at 10:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 09.03.2012 13:40, Fujii Masao wrote: >>> >>> I found the bug which causes walsender to enter into busy loop >>> when replication connection is terminated. Walsender consumes >>> lots of CPU resource (%sys), and this situation lasts until it has >>> detected the termination of replication connection and exited. >>> >>> The cause of this bug is that the walsender loop doesn't call >>> ResetLatch at all in the above case. Since the latch remains set, >>> the walsender loop cannot sleep on the latch, i.e., WaitLatch >>> always returns immediately. >>> >>> We can fix this bug by adding ResetLatch into the top of the >>> walsender loop. Patch attached. >>> >>> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has >>> already been fixed by the commit >>> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit >>> refactors and refines the walsender loop logic in addition to >>> adding ResetLatch. So I'm tempted to backport this commit >>> (except the deletion of wal_sender_delay) to 9.1 rather than >>> applying the attached patch. OTOH, attached patch is quite simple, >>> and its impact on 9.1 would be very small, so it's easy to backport that. >>> Thought? >> >> >> This patch makes the code that follows bogus: >> >>> /* >>> * If we don't have any pending data in the output buffer, >>> try to send >>> * some more. >>> */ >>> if (!pq_is_send_pending()) >>> { >>> XLogSend(output_message, &caughtup); >>> >>> /* >>> * Even if we wrote all the WAL that was available >>> when we started >>> * sending, more might have arrived while we were >>> sending this >>> * batch. We had the latch set while sending, so we >>> have not >>> * received any signals from that time. Let's arm >>> the latch again, >>> * and after that check that we're still >>> up-to-date. >>> */ >>> if (caughtup && !pq_is_send_pending()) >>> { >>> ResetLatch(&MyWalSnd->latch); >>> >>> XLogSend(output_message, &caughtup); >>> } >>> } >> >> >> The comment is no longer valid, and the calls to ResetLatch and XLogSend are >> no longer necessary, once you have the ResetLatch() call at the top of the >> loop. > > Right. > >> I also think we should backport commit >> cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of >> wal_sender_delay). > > Agreed. The attached patch is the same as > cff75130b5f63e45423c2ed90d6f2e84c21ef840, > except for the removal of wal_sender_delay. Could you apply this? Oh, I forgot to attach the patch. Patch attached really. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
pgsql-hackers by date: