Thread: Bug: walsender and high CPU usage
Hi, I found the bug which causes walsender to enter into busy loop when replication connection is terminated. Walsender consumes lots of CPU resource (%sys), and this situation lasts until it has detected the termination of replication connection and exited. The cause of this bug is that the walsender loop doesn't call ResetLatch at all in the above case. Since the latch remains set, the walsender loop cannot sleep on the latch, i.e., WaitLatch always returns immediately. We can fix this bug by adding ResetLatch into the top of the walsender loop. Patch attached. This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has already been fixed by the commit (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit refactors and refines the walsender loop logic in addition to adding ResetLatch. So I'm tempted to backport this commit (except the deletion of wal_sender_delay) to 9.1 rather than applying the attached patch. OTOH, attached patch is quite simple, and its impact on 9.1 would be very small, so it's easy to backport that. Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On 09.03.2012 13:40, Fujii Masao wrote: > I found the bug which causes walsender to enter into busy loop > when replication connection is terminated. Walsender consumes > lots of CPU resource (%sys), and this situation lasts until it has > detected the termination of replication connection and exited. > > The cause of this bug is that the walsender loop doesn't call > ResetLatch at all in the above case. Since the latch remains set, > the walsender loop cannot sleep on the latch, i.e., WaitLatch > always returns immediately. > > We can fix this bug by adding ResetLatch into the top of the > walsender loop. Patch attached. > > This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has > already been fixed by the commit > (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit > refactors and refines the walsender loop logic in addition to > adding ResetLatch. So I'm tempted to backport this commit > (except the deletion of wal_sender_delay) to 9.1 rather than > applying the attached patch. OTOH, attached patch is quite simple, > and its impact on 9.1 would be very small, so it's easy to backport that. > Thought? This patch makes the code that follows bogus: > /* > * If we don't have any pending data in the output buffer, try to send > * some more. > */ > if (!pq_is_send_pending()) > { > XLogSend(output_message, &caughtup); > > /* > * Even if we wrote all the WAL that was available when we started > * sending, more might have arrived while we were sending this > * batch. We had the latch set while sending, so we have not > * received any signals from that time. Let's arm the latch again, > * and after that check that we're still up-to-date. > */ > if (caughtup && !pq_is_send_pending()) > { > ResetLatch(&MyWalSnd->latch); > > XLogSend(output_message, &caughtup); > } > } The comment is no longer valid, and the calls to ResetLatch and XLogSend are no longer necessary, once you have the ResetLatch() call at the top of the loop. I also think we should backport commit cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 09.03.2012 13:40, Fujii Masao wrote: >> >> I found the bug which causes walsender to enter into busy loop >> when replication connection is terminated. Walsender consumes >> lots of CPU resource (%sys), and this situation lasts until it has >> detected the termination of replication connection and exited. >> >> The cause of this bug is that the walsender loop doesn't call >> ResetLatch at all in the above case. Since the latch remains set, >> the walsender loop cannot sleep on the latch, i.e., WaitLatch >> always returns immediately. >> >> We can fix this bug by adding ResetLatch into the top of the >> walsender loop. Patch attached. >> >> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has >> already been fixed by the commit >> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit >> refactors and refines the walsender loop logic in addition to >> adding ResetLatch. So I'm tempted to backport this commit >> (except the deletion of wal_sender_delay) to 9.1 rather than >> applying the attached patch. OTOH, attached patch is quite simple, >> and its impact on 9.1 would be very small, so it's easy to backport that. >> Thought? > > > This patch makes the code that follows bogus: > >> /* >> * If we don't have any pending data in the output buffer, >> try to send >> * some more. >> */ >> if (!pq_is_send_pending()) >> { >> XLogSend(output_message, &caughtup); >> >> /* >> * Even if we wrote all the WAL that was available >> when we started >> * sending, more might have arrived while we were >> sending this >> * batch. We had the latch set while sending, so we >> have not >> * received any signals from that time. Let's arm >> the latch again, >> * and after that check that we're still >> up-to-date. >> */ >> if (caughtup && !pq_is_send_pending()) >> { >> ResetLatch(&MyWalSnd->latch); >> >> XLogSend(output_message, &caughtup); >> } >> } > > > The comment is no longer valid, and the calls to ResetLatch and XLogSend are > no longer necessary, once you have the ResetLatch() call at the top of the > loop. Right. > I also think we should backport commit > cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of > wal_sender_delay). Agreed. The attached patch is the same as cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Could you apply this? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Mar 12, 2012 at 10:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 09.03.2012 13:40, Fujii Masao wrote: >>> >>> I found the bug which causes walsender to enter into busy loop >>> when replication connection is terminated. Walsender consumes >>> lots of CPU resource (%sys), and this situation lasts until it has >>> detected the termination of replication connection and exited. >>> >>> The cause of this bug is that the walsender loop doesn't call >>> ResetLatch at all in the above case. Since the latch remains set, >>> the walsender loop cannot sleep on the latch, i.e., WaitLatch >>> always returns immediately. >>> >>> We can fix this bug by adding ResetLatch into the top of the >>> walsender loop. Patch attached. >>> >>> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has >>> already been fixed by the commit >>> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit >>> refactors and refines the walsender loop logic in addition to >>> adding ResetLatch. So I'm tempted to backport this commit >>> (except the deletion of wal_sender_delay) to 9.1 rather than >>> applying the attached patch. OTOH, attached patch is quite simple, >>> and its impact on 9.1 would be very small, so it's easy to backport that. >>> Thought? >> >> >> This patch makes the code that follows bogus: >> >>> /* >>> * If we don't have any pending data in the output buffer, >>> try to send >>> * some more. >>> */ >>> if (!pq_is_send_pending()) >>> { >>> XLogSend(output_message, &caughtup); >>> >>> /* >>> * Even if we wrote all the WAL that was available >>> when we started >>> * sending, more might have arrived while we were >>> sending this >>> * batch. We had the latch set while sending, so we >>> have not >>> * received any signals from that time. Let's arm >>> the latch again, >>> * and after that check that we're still >>> up-to-date. >>> */ >>> if (caughtup && !pq_is_send_pending()) >>> { >>> ResetLatch(&MyWalSnd->latch); >>> >>> XLogSend(output_message, &caughtup); >>> } >>> } >> >> >> The comment is no longer valid, and the calls to ResetLatch and XLogSend are >> no longer necessary, once you have the ResetLatch() call at the top of the >> loop. > > Right. > >> I also think we should backport commit >> cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of >> wal_sender_delay). > > Agreed. The attached patch is the same as > cff75130b5f63e45423c2ed90d6f2e84c21ef840, > except for the removal of wal_sender_delay. Could you apply this? Oh, I forgot to attach the patch. Patch attached really. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On 12.03.2012 15:30, Fujii Masao wrote: > On Mon, Mar 12, 2012 at 10:27 PM, Fujii Masao<masao.fujii@gmail.com> wrote: >> On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> I also think we should backport commit >>> cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of >>> wal_sender_delay). >> >> Agreed. The attached patch is the same as >> cff75130b5f63e45423c2ed90d6f2e84c21ef840, >> except for the removal of wal_sender_delay. Could you apply this? > > Oh, I forgot to attach the patch. Patch attached really. Thanks, committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com