Thread: 001_rep_changes.pl stalls
Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is important for promptly updating pg_stat_replication. When caught up, we should impose that logic before every sleep. The one-line fix is to sleep in WalSndLoop() only when pq_is_send_pending(), not when caught up. On my regular development machine, src/test/subscription/t/001_rep_changes.pl stalls for ~10s at this wait_for_catchup: $node_publisher->safe_psql('postgres', "DELETE FROM tab_rep"); # Restart the publisher and check the state of the subscriber which # should be in a streaming state after catching up. $node_publisher->stop('fast'); $node_publisher->start; $node_publisher->wait_for_catchup('tap_sub'); That snippet emits three notable physical WAL records. There's a Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and an XLOG/FPI_FOR_HINT. The buildfarm has stalled there, but it happens probably less than half the time. Examples[1] showing the stall: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check Here's the most-relevant walsender call tree: WalSndLoop XLogSendLogical (caller invokes once per loop iteration, via send_data callback) XLogReadRecord (caller invokes once) ReadPageInternal (caller invokes twice in this test; more calls are possible) logical_read_xlog_page (caller skips when page is same as last call, else invokes 1-2 times via state->read_page()callback, registered in StartLogicalReplication) WalSndWaitForWal (caller invokes once; has fast path) The cause is a race involving the flow of reply messages (send_feedback() messages) from logical apply worker to walsender. Here are two sequencing patterns; the more-indented parts are what differ. Stalling pattern: sender reads Transaction/COMMIT and sends the changes receiver applies the changes receiver send_feedback() reports progress up to Transaction/COMMIT sender accepts the report sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop() receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT sender wakes up, accepts the report sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr" receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT Non-stalling pattern (more prevalent with lower machine performance): sender reads Transaction/COMMIT and sends the changes sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop() receiver applies the changes receiver send_feedback() reports progress up to Transaction/COMMIT sender wakes up, accepts the report sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr" receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT The fix is to test "MyWalSnd->write < sentPtr" before more sleeps. 
The test is unnecessary when sleeping due to pq_is_send_pending(); in that case, the receiver is not idle and will reply before idling. I changed WalSndLoop() to sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep will happen in WalSndWaitForWal(). Attached. I don't know whether this is important outside of testing scenarios. I lean against back-patching, but I will back-patch if someone thinks this qualifies as a performance bug. Thanks, nm [1] I spot-checked only my animals, since I wanted to experiment on an affected animal.
Attachment
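The patch itself is attached rather than inlined; based on the description
above, the change to WalSndLoop() is roughly the following shape (a sketch
against the walsender.c of that era, so the exact condition and context may
differ from the attachment):

	/* Before (sketch): sleep when caught up, or when the output buffer
	 * still holds unsent data. */
	if ((WalSndCaughtUp && !streamingDoneSending) ||
		pq_is_send_pending())

	/* After (sketch): sleep here only for a full output buffer. When
	 * caught up, control returns to send_data; for logical rep that
	 * reaches WalSndWaitForWal(), whose "MyWalSnd->write < sentPtr" test
	 * sends a keepalive before each sleep, keeping pg_stat_replication
	 * fresh. */
	if (pq_is_send_pending())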
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
> important for promptly updating pg_stat_replication. When caught up, we
> should impose that logic before every sleep. The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.

This seems to have made the following race condition easier to hit:
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

Now it happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22

While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad. I am inclined to revert the fix
after a week. Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug. Would
anyone like to argue for a revert earlier, later, or not at all?

There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01

Most-relevant lines of the test script:

  $node_master->safe_psql('postgres',
      "INSERT INTO decoding(blah) VALUES ('afterbb');");
  $node_master->safe_psql('postgres', 'CHECKPOINT');
  $node_master->stop('immediate');

The failure suggested the INSERT was not replicated before the immediate
stop. I can reproduce that consistently, before or after the fix for
$SUBJECT, by modifying walsender to delay 0.2s before sending WAL:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
+	PG_SETMASK(&BlockSig);
+	pg_usleep(200 * 1000);
+	PG_SETMASK(&UnBlockSig);
 	pq_putmessage_noblock('d', output_message.data, output_message.len);

I will shortly push a fix adding a wait_for_catchup to the test. I don't
know if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.
Noah Misch <noah@leadboat.com> writes:
> This seems to have made the following race condition easier to hit:
> https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

Yeah, I just came to the same guess in the other thread.

> While I don't think that indicates anything wrong with the fix for $SUBJECT,
> creating more buildfarm noise is itself bad. I am inclined to revert the fix
> after a week. Not immediately, in case it uncovers lower-probability bugs.
> I'd then re-commit it after one of those threads fixes the other bug. Would
> anyone like to argue for a revert earlier, later, or not at all?

I don't think you should revert. Those failures are (just) often enough
to be annoying but I do not think that a proper fix is very far away.

			regards, tom lane
On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > This seems to have made the following race condition easier to hit:
> > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
>
> Yeah, I just came to the same guess in the other thread.
>
> > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > creating more buildfarm noise is itself bad. I am inclined to revert the fix
> > after a week. Not immediately, in case it uncovers lower-probability bugs.
> > I'd then re-commit it after one of those threads fixes the other bug. Would
> > anyone like to argue for a revert earlier, later, or not at all?
>
> I don't think you should revert. Those failures are (just) often enough
> to be annoying but I do not think that a proper fix is very far away.

That works for me, but an actual defect may trigger a revert. Fujii Masao
reported high walsender CPU usage after this patch. The patch caused idle
physical walsenders to use 100% CPU. When caught up, the
WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
until more WAL is available. XLogSendPhysical() just returns when caught up.
No amount of WAL is too small for physical replication to dispatch, but
logical replication needs the full xl_tot_len of a record. Some options:

1. Make XLogSendPhysical() more like XLogSendLogical(), calling
   WalSndWaitForWal() when no WAL is available. A quick version of this
   passes tests, but I'll need to audit WalSndWaitForWal() for things that are
   wrong for physical replication.

2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
   insufficient WAL is available. This complicates the xlogreader.h API to
   pass back "wait for this XLogRecPtr", and we'd then persist enough state to
   resume decoding. This doesn't have any advantages to make up for those.

3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
   WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
   further drift between the two wait sites; on the other hand, one could
   refactor later to help avoid that.

4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
   minimal fix, but it crudely punches through the abstraction between
   WalSndLoop() and its WalSndSendDataCallback.

I'm favoring (1). Other preferences?
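To make the asymmetry concrete, here is a simplified sketch of the two
send_data callbacks when caught up; these are not the actual walsender.c
bodies, and the elided parts are marked:

/* Physical rep: any flushed bytes are shippable, so "caught up" is just a
 * flag plus an early return; WalSndLoop() decides how to wait. */
static void
XLogSendPhysical(void)
{
	if (sentPtr >= GetFlushRecPtr())
	{
		WalSndCaughtUp = true;
		return;					/* nothing to send; caller may sleep */
	}
	/* ... read WAL from sentPtr and pq_putmessage_noblock() it ... */
}

/* Logical rep: decoding needs a whole record (xl_tot_len bytes), so the
 * page-read callback blocks inside WalSndWaitForWal() until enough WAL is
 * flushed; the sleep happens below send_data, not in WalSndLoop(). */
static void
XLogSendLogical(void)
{
	char	   *errm;
	XLogRecord *record;

	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
	/* ... LogicalDecodingProcessRecord(record); set WalSndCaughtUp once
	 * the reader reaches GetFlushRecPtr() ... */
}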
At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > This seems to have made the following race condition easier to hit:
> > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> >
> > Yeah, I just came to the same guess in the other thread.
> >
> > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > creating more buildfarm noise is itself bad. I am inclined to revert the fix
> > > after a week. Not immediately, in case it uncovers lower-probability bugs.
> > > I'd then re-commit it after one of those threads fixes the other bug. Would
> > > anyone like to argue for a revert earlier, later, or not at all?
> >
> > I don't think you should revert. Those failures are (just) often enough
> > to be annoying but I do not think that a proper fix is very far away.
>
> That works for me, but an actual defect may trigger a revert. Fujii Masao
> reported high walsender CPU usage after this patch. The patch caused idle
> physical walsenders to use 100% CPU. When caught up, the
> WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> until more WAL is available. XLogSendPhysical() just returns when caught up.
> No amount of WAL is too small for physical replication to dispatch, but
> logical replication needs the full xl_tot_len of a record. Some options:
>
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>    WalSndWaitForWal() when no WAL is available. A quick version of this
>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>    wrong for physical replication.
>
> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>    insufficient WAL is available. This complicates the xlogreader.h API to
>    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>    resume decoding. This doesn't have any advantages to make up for those.
>
> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
>    further drift between the two wait sites; on the other hand, one could
>    refactor later to help avoid that.
>
> 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
>    minimal fix, but it crudely punches through the abstraction between
>    WalSndLoop() and its WalSndSendDataCallback.
>
> I'm favoring (1). Other preferences?

Starting from the current shape, I think 1 is preferable, since that
waiting logic is no longer shared between logical and physical
replications. But I'm not sure I like calling WalSndWaitForWal()
(maybe with previous location + 1?), because the function seems to do
too much.

By the way, if latch is consumed in WalSndLoop, succeeding call to
WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
cause missing wakeups? (in other words, overlooking of wakeup latch).
Since the only source other than timeout of walsender wakeup is latch,
we should avoid wasteful consuming of latch. (It is the same issue
with [1]).

If wakeup signal is not remembered on walsender (like
InterruptPending), WalSndPhysical cannot enter a sleep with
confidence.

[1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry, I wrote something wrong.

At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in
> > On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > This seems to have made the following race condition easier to hit:
> > > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> > >
> > > Yeah, I just came to the same guess in the other thread.
> > >
> > > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > > creating more buildfarm noise is itself bad. I am inclined to revert the fix
> > > > after a week. Not immediately, in case it uncovers lower-probability bugs.
> > > > I'd then re-commit it after one of those threads fixes the other bug. Would
> > > > anyone like to argue for a revert earlier, later, or not at all?
> > >
> > > I don't think you should revert. Those failures are (just) often enough
> > > to be annoying but I do not think that a proper fix is very far away.
> >
> > That works for me, but an actual defect may trigger a revert. Fujii Masao
> > reported high walsender CPU usage after this patch. The patch caused idle
> > physical walsenders to use 100% CPU. When caught up, the
> > WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> > until more WAL is available. XLogSendPhysical() just returns when caught up.
> > No amount of WAL is too small for physical replication to dispatch, but
> > logical replication needs the full xl_tot_len of a record. Some options:
> >
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available. A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
> >
> > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available. This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> >    resume decoding. This doesn't have any advantages to make up for those.
> >
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
> >
> > 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
> >
> > I'm favoring (1). Other preferences?
>
> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replications. But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too much.
>
> By the way, if latch is consumed in WalSndLoop, succeeding call to
> WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> cause missing wakeups? (in other words, overlooking of wakeup latch).

- Since the only source other than timeout of walsender wakeup is latch,
- we should avoid wasteful consuming of latch. (It is the same issue
- with [1]).

+ Since walsender is woken up by LSN advancement via latch, we should
+ avoid wasteful consuming of latch. (It is the same issue with [1]).

> If wakeup signal is not remembered on walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.
>
>
> [1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in
>> I'm favoring (1). Other preferences?

> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replications. But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too much.

I'm far from an expert on this code, but it does look like there's a lot
of stuff in WalSndWaitForWal that is specific to logical rep, so I'm not
sure that (1) is workable. At the very least there'd have to be a bunch
more conditionals in that function than there are now. In the end, a
separate copy for physical rep might be better.

(BTW, I think this code is in desperate need of a renaming campaign to
make it clearer which functions are for logical rep, physical rep, or
both.)

			regards, tom lane
On 2020/04/17 14:41, Noah Misch wrote:
> On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
>> Noah Misch <noah@leadboat.com> writes:
>>> This seems to have made the following race condition easier to hit:
>>> https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
>>> https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
>>
>> Yeah, I just came to the same guess in the other thread.
>>
>>> While I don't think that indicates anything wrong with the fix for $SUBJECT,
>>> creating more buildfarm noise is itself bad. I am inclined to revert the fix
>>> after a week. Not immediately, in case it uncovers lower-probability bugs.
>>> I'd then re-commit it after one of those threads fixes the other bug. Would
>>> anyone like to argue for a revert earlier, later, or not at all?
>>
>> I don't think you should revert. Those failures are (just) often enough
>> to be annoying but I do not think that a proper fix is very far away.
>
> That works for me, but an actual defect may trigger a revert. Fujii Masao
> reported high walsender CPU usage after this patch. The patch caused idle
> physical walsenders to use 100% CPU. When caught up, the
> WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> until more WAL is available. XLogSendPhysical() just returns when caught up.
> No amount of WAL is too small for physical replication to dispatch, but
> logical replication needs the full xl_tot_len of a record. Some options:
>
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>    WalSndWaitForWal() when no WAL is available. A quick version of this
>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>    wrong for physical replication.

(1) makes even the physical replication walsender sleep in two places,
which seems to make the code for physical replication more complicated
than necessary. I'd like to avoid (1) if possible.

> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>    insufficient WAL is available. This complicates the xlogreader.h API to
>    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>    resume decoding. This doesn't have any advantages to make up for those.
>
> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
>    further drift between the two wait sites; on the other hand, one could
>    refactor later to help avoid that.

Since the additional call of WalSndKeepalive() is necessary only for
logical replication, it should be copied to, e.g., XLogSendLogical(),
instead of WalSndLoop()? For example, when XLogSendLogical() sets
WalSndCaughtUp to true, it should call WalSndKeepalive()?

The root problem seems to be that when a WAL record that's a no-op for
logical rep is processed, the keepalive message is not sent immediately,
in spite of the fact that we want pg_stat_replication to be updated
promptly. (3) seems to address this problem directly and looks better
to me.

> 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
>    minimal fix, but it crudely punches through the abstraction between
>    WalSndLoop() and its WalSndSendDataCallback.

(4) also looks good because it's simple, if we can redesign those
functions in good shape.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
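For concreteness, here is a hypothetical sketch of the XLogSendLogical()
variant of (3) floated above; this is not what got committed, and the guard
condition is borrowed from the keepalive in WalSndWaitForWal():

	/* In XLogSendLogical(), at the point where catch-up is detected: */
	if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
	{
		WalSndCaughtUp = true;

		/* Hypothetical: nudge an idle receiver immediately, so its reply
		 * refreshes pg_stat_replication without waiting for the keepalive
		 * in WalSndWaitForWal(). */
		if (MyWalSnd->write < sentPtr && !waiting_for_ping_response)
			WalSndKeepalive(false);
	}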
On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> > cause missing wakeups? (in other words, overlooking of wakeup latch).
>
> - Since the only source other than timeout of walsender wakeup is latch,
> - we should avoid wasteful consuming of latch. (It is the same issue
> - with [1]).
>
> + Since walsender is woken up by LSN advancement via latch, we should
> + avoid wasteful consuming of latch. (It is the same issue with [1]).
>
> > If wakeup signal is not remembered on walsender (like
> > InterruptPending), WalSndPhysical cannot enter a sleep with
> > confidence.

No; per latch.h, "What must be avoided is placing any checks for asynchronous
events after WaitLatch and before ResetLatch, as that creates a race
condition." In other words, the thing to avoid is calling ResetLatch()
without next examining all pending work that a latch would signal. Each
walsender.c WaitLatch call does follow the rules.

On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> On 2020/04/17 14:41, Noah Misch wrote:
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available. A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
>
> (1) makes even the physical replication walsender sleep in two places,
> which seems to make the code for physical replication more complicated
> than necessary. I'd like to avoid (1) if possible.

Good point.

> > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available. This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> >    resume decoding. This doesn't have any advantages to make up for those.
> >
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
>
> Since the additional call of WalSndKeepalive() is necessary only for
> logical replication, it should be copied to, e.g., XLogSendLogical(),
> instead of WalSndLoop()? For example, when XLogSendLogical() sets
> WalSndCaughtUp to true, it should call WalSndKeepalive()?

We'd send a keepalive even when pq_flush_if_writable() can't empty the output
buffer. That could be acceptable, but it's not ideal.

> The root problem seems to be that when a WAL record that's a no-op for
> logical rep is processed, the keepalive message is not sent immediately,
> in spite of the fact that we want pg_stat_replication to be updated
> promptly.

The degree of promptness should be predictable, at least. If we removed the
WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
not be prompt, but they would be predictable. I do, however, think prompt
updates are worthwhile.

> (3) seems to address this problem directly and looks better to me.
>
> > 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
>
> (4) also looks good because it's simple, if we can redesign those
> functions in good shape.

Let's do that. I'm attaching the replacement implementation and the revert
of v1.
Attachment
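The v2 attachment is likewise not inlined in the archive. Judging from the
comment quoted in the review below, and from the shape that ultimately stayed
in walsender.c, the heart of option (4) is roughly this test in WalSndLoop():

		/*
		 * Block if we have unsent data. XXX For logical replication, let
		 * WalSndWaitForWal() handle any other blocking; idle receivers need
		 * its additional actions. For physical replication, also block if
		 * caught up; its send_data does not block.
		 */
		if ((WalSndCaughtUp && send_data != XLogSendLogical &&
			 !streamingDoneSending) ||
			pq_is_send_pending())
		{
			/* ... compute sleep time and WaitLatchOrSocket() ... */
		}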
On 2020/04/18 16:01, Noah Misch wrote:
> On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
>> At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>>> By the way, if latch is consumed in WalSndLoop, succeeding call to
>>> WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
>>> cause missing wakeups? (in other words, overlooking of wakeup latch).
>>
>> - Since the only source other than timeout of walsender wakeup is latch,
>> - we should avoid wasteful consuming of latch. (It is the same issue
>> - with [1]).
>>
>> + Since walsender is woken up by LSN advancement via latch, we should
>> + avoid wasteful consuming of latch. (It is the same issue with [1]).
>>
>>> If wakeup signal is not remembered on walsender (like
>>> InterruptPending), WalSndPhysical cannot enter a sleep with
>>> confidence.
>
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition." In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal. Each
> walsender.c WaitLatch call does follow the rules.
>
> On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
>> On 2020/04/17 14:41, Noah Misch wrote:
>>> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>>>    WalSndWaitForWal() when no WAL is available. A quick version of this
>>>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>>>    wrong for physical replication.
>>
>> (1) makes even the physical replication walsender sleep in two places,
>> which seems to make the code for physical replication more complicated
>> than necessary. I'd like to avoid (1) if possible.
>
> Good point.
>
>>> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>>>    insufficient WAL is available. This complicates the xlogreader.h API to
>>>    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>>>    resume decoding. This doesn't have any advantages to make up for those.
>>>
>>> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>>>    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
>>>    further drift between the two wait sites; on the other hand, one could
>>>    refactor later to help avoid that.
>>
>> Since the additional call of WalSndKeepalive() is necessary only for
>> logical replication, it should be copied to, e.g., XLogSendLogical(),
>> instead of WalSndLoop()? For example, when XLogSendLogical() sets
>> WalSndCaughtUp to true, it should call WalSndKeepalive()?
>
> We'd send a keepalive even when pq_flush_if_writable() can't empty the output
> buffer. That could be acceptable, but it's not ideal.
>
>> The root problem seems to be that when a WAL record that's a no-op for
>> logical rep is processed, the keepalive message is not sent immediately,
>> in spite of the fact that we want pg_stat_replication to be updated
>> promptly.
>
> The degree of promptness should be predictable, at least. If we removed the
> WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
> not be prompt, but they would be predictable. I do, however, think prompt
> updates are worthwhile.
>
>> (3) seems to address this problem directly and looks better to me.
>>
>>> 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
>>>    minimal fix, but it crudely punches through the abstraction between
>>>    WalSndLoop() and its WalSndSendDataCallback.
>>
>> (4) also looks good because it's simple, if we can redesign those
>> functions in good shape.
>
> Let's do that. I'm attaching the replacement implementation and the revert
> of v1.

Thanks for the patch! Though referencing XLogSendLogical inside WalSndLoop()
might be a bit ugly, I'm fine with this change because it's simple and easier
to understand.

+		 * Block if we have unsent data.  XXX For logical replication, let
+		 * WalSndWaitForWal() handle any other blocking; idle receivers need
+		 * its additional actions.  For physical replication, also block if
+		 * caught up; its send_data does not block.

It might be better to s/WalSndWaitForWal()/send_data()? Because not only
WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().

    frame #2: 0x0000000106bcfa84 postgres`WalSndWriteData(ctx=0x00007fb2a4812d20, lsn=22608080, xid=488, last_write=false) at walsender.c:1247:2
    frame #3: 0x0000000106b98295 postgres`OutputPluginWrite(ctx=0x00007fb2a4812d20, last_write=false) at logical.c:540:2
    frame #4: 0x00000001073fe9b8 pgoutput.so`send_relation_and_attrs(relation=0x00000001073ba2c0, ctx=0x00007fb2a4812d20) at pgoutput.c:353:2
    frame #5: 0x00000001073fe7a0 pgoutput.so`maybe_send_schema(ctx=0x00007fb2a4812d20, relation=0x00000001073ba2c0, relentry=0x00007fb2a483aa60) at pgoutput.c:315:2
    frame #6: 0x00000001073fd4c0 pgoutput.so`pgoutput_change(ctx=0x00007fb2a4812d20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at pgoutput.c:394:2
    frame #7: 0x0000000106b99094 postgres`change_cb_wrapper(cache=0x00007fb2a482ed20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at logical.c:753:2
    frame #8: 0x0000000106ba2200 postgres`ReorderBufferCommit(rb=0x00007fb2a482ed20, xid=488, commit_lsn=22621088, end_lsn=22621136, commit_time=640675460323211, origin_id=0, origin_lsn=0) at reorderbuffer.c:1653:7
    frame #9: 0x0000000106b93c10 postgres`DecodeCommit(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8, parsed=0x00007ffee954bf90, xid=488) at decode.c:637:2
    frame #10: 0x0000000106b92fa9 postgres`DecodeXactOp(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8) at decode.c:245:5
    frame #11: 0x0000000106b92aee postgres`LogicalDecodingProcessRecord(ctx=0x00007fb2a4812d20, record=0x00007fb2a4812fe0) at decode.c:114:4
    frame #12: 0x0000000106bd2a16 postgres`XLogSendLogical at walsender.c:2863:3

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
> +		 * Block if we have unsent data.  XXX For logical replication, let
> +		 * WalSndWaitForWal() handle any other blocking; idle receivers need
> +		 * its additional actions.  For physical replication, also block if
> +		 * caught up; its send_data does not block.
>
> It might be better to s/WalSndWaitForWal()/send_data()? Because not only
> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().

Thanks for reviewing. WalSndWriteData() blocks when we have unsent data,
which is the same cause for blocking in WalSndLoop(). Since the comment you
quote says we let WalSndWaitForWal() "handle any other blocking", I don't
think your proposed change makes it more correct. Also, if someone wants to
refactor this, the place to look is WalSndWaitForWal(), not any other part of
send_data().
At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> >
> > - Since the only source other than timeout of walsender wakeup is latch,
> > - we should avoid wasteful consuming of latch. (It is the same issue
> > - with [1]).
> >
> > + Since walsender is woken up by LSN advancement via latch, we should
> > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> >
> > > If wakeup signal is not remembered on walsender (like
> > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > confidence.
>
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition." In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal. Each
> walsender.c WaitLatch call does follow the rules.

I didn't mean that, of course. I thought of more or less the same:
moving the trigger from latch to signal, where the handler sets a flag
and calls SetLatch(). If we use a bare latch, we should avoid falsely
entering a sleep, which also makes things complex.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
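For reference, the "remember the wakeup" scheme being sketched here is the
usual flag-plus-SetLatch() pattern. A hypothetical illustration follows;
walsender does not actually use a signal for this wakeup, and
wakeup_pending/wakeup_handler are made-up names:

static volatile sig_atomic_t wakeup_pending = false;

/* Signal handler: record the event, then set the latch. */
static void
wakeup_handler(SIGNAL_ARGS)
{
	wakeup_pending = true;
	SetLatch(MyLatch);
}

/* A wait site. Per latch.h, pending work is re-checked after ResetLatch(),
 * so a wakeup consumed at one wait site is never lost to another. */
static void
example_wait_loop(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);
		if (wakeup_pending)
		{
			wakeup_pending = false;
			/* handle the work that prompted the wakeup */
		}
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, 0);
	}
}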
At Mon, 20 Apr 2020 14:30:08 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
> On 2020/04/18 16:01, Noah Misch wrote:
> > On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> >>> 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
> >>>    minimal fix, but it crudely punches through the abstraction between
> >>>    WalSndLoop() and its WalSndSendDataCallback.
> >>
> >> (4) also looks good because it's simple, if we can redesign those
> >> functions in good shape.
> >
> > Let's do that. I'm attaching the replacement implementation and the
> > revert of v1.
>
> Thanks for the patch! Though referencing XLogSendLogical inside
> WalSndLoop() might be a bit ugly, I'm fine with this change because
> it's simple and easier to understand.

I thought that if we do this, read_data could return a boolean that
indicates whether to wait for a latch or an incoming packet, or could
return a wake event mask.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in
> > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > >
> > > - Since the only source other than timeout of walsender wakeup is latch,
> > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > - with [1]).
> > >
> > > + Since walsender is woken up by LSN advancement via latch, we should
> > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > >
> > > > If wakeup signal is not remembered on walsender (like
> > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > confidence.
> >
> > No; per latch.h, "What must be avoided is placing any checks for asynchronous
> > events after WaitLatch and before ResetLatch, as that creates a race
> > condition." In other words, the thing to avoid is calling ResetLatch()
> > without next examining all pending work that a latch would signal. Each
> > walsender.c WaitLatch call does follow the rules.
>
> I didn't mean that, of course. I thought of more or less the same:
> moving the trigger from latch to signal, where the handler sets a flag
> and calls SetLatch(). If we use a bare latch, we should avoid falsely
> entering a sleep, which also makes things complex.

I don't understand. If there's a defect, can you write a test case or
describe a sequence of events (e.g. at line X, variable Y has value Z)?
At Mon, 20 Apr 2020 00:59:54 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in
> > > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > > WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> > > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > > >
> > > > - Since the only source other than timeout of walsender wakeup is latch,
> > > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > > - with [1]).
> > > >
> > > > + Since walsender is woken up by LSN advancement via latch, we should
> > > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > > >
> > > > > If wakeup signal is not remembered on walsender (like
> > > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > > confidence.
> > >
> > > No; per latch.h, "What must be avoided is placing any checks for asynchronous
> > > events after WaitLatch and before ResetLatch, as that creates a race
> > > condition." In other words, the thing to avoid is calling ResetLatch()
> > > without next examining all pending work that a latch would signal. Each
> > > walsender.c WaitLatch call does follow the rules.
> >
> > I didn't mean that, of course. I thought of more or less the same:
> > moving the trigger from latch to signal, where the handler sets a flag
> > and calls SetLatch(). If we use a bare latch, we should avoid falsely
> > entering a sleep, which also makes things complex.
>
> I don't understand. If there's a defect, can you write a test case or
> describe a sequence of events (e.g. at line X, variable Y has value Z)?

Indeed. Anyway, the current version cannot have such an issue. Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/04/20 16:02, Noah Misch wrote:
> On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
>> +		 * Block if we have unsent data.  XXX For logical replication, let
>> +		 * WalSndWaitForWal() handle any other blocking; idle receivers need
>> +		 * its additional actions.  For physical replication, also block if
>> +		 * caught up; its send_data does not block.
>>
>> It might be better to s/WalSndWaitForWal()/send_data()? Because not only
>> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
>> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().
>
> Thanks for reviewing. WalSndWriteData() blocks when we have unsent data,
> which is the same cause for blocking in WalSndLoop(). Since the comment you
> quote says we let WalSndWaitForWal() "handle any other blocking", I don't
> think your proposed change makes it more correct.

I was misreading this as something like "any other blocking than the
blocking in WalSndLoop()". OK, I have no more comments on the patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Apr 20, 2020 at 07:24:28PM +0900, Fujii Masao wrote:
> I was misreading this as something like "any other blocking than the
> blocking in WalSndLoop()". OK, I have no more comments on the patch.

Patch looks rather sane to me at quick glance. I can see that WAL senders
are now not stuck at 100% CPU per process when sitting idle, for both
physical and logical replication. Thanks.
--
Michael