Thread: 001_rep_changes.pl stalls

001_rep_changes.pl stalls

From
Noah Misch
Date:
Executive summary: the "MyWalSnd->write < sentPtr" test in WalSndWaitForWal()
is important for promptly updating pg_stat_replication.  When caught up, we
should impose that logic before every sleep.  The one-line fix is to sleep in
WalSndLoop() only when pq_is_send_pending(), not when caught up.
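
For reference, that logic is roughly the following, near the bottom of
WalSndWaitForWal()'s wait loop (paraphrased; the real function does more
around it):

    /* Paraphrased sketch: before sleeping, ping an otherwise-idle receiver
     * so that its reply updates MyWalSnd->write/flush, which is what
     * pg_stat_replication reports. */
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
        WalSndKeepalive(false);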


On my regular development machine, src/test/subscription/t/001_rep_changes.pl
stalls for ~10s at this wait_for_catchup:

  $node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");

  # Restart the publisher and check the state of the subscriber which
  # should be in a streaming state after catching up.
  $node_publisher->stop('fast');
  $node_publisher->start;

  $node_publisher->wait_for_catchup('tap_sub');

That snippet emits three notable physical WAL records.  There's a
Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and
an XLOG/FPI_FOR_HINT.

The buildfarm has stalled there as well, but probably less than half the
time.  Examples[1] showing the stall:


https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check

Here's the most-relevant walsender call tree:

WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are possible)
        logical_read_xlog_page (caller skips this when the page is the same as in the last call, else invokes it 1-2 times via the state->read_page() callback registered in StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)
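
The "fast path" there is, roughly, an early return at the top of
WalSndWaitForWal() (paraphrased from walsender.c):

    /* Paraphrased sketch: if we already know enough WAL has been flushed,
     * return without sleeping and without sending anything. */
    if (RecentFlushPtr != InvalidXLogRecPtr &&
        loc <= RecentFlushPtr)
        return RecentFlushPtr;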


The cause is a race involving the flow of reply messages (send_feedback()
messages) from the logical apply worker to the walsender.  Here are two
sequencing patterns; the more-indented parts are what differ.  Stalling pattern:

  sender reads Transaction/COMMIT and sends the changes
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

Non-stalling pattern (more prevalent with lower machine performance):

  sender reads Transaction/COMMIT and sends the changes
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT


The fix is to test "MyWalSnd->write < sentPtr" before more sleeps.  The test
is unnecessary when sleeping due to pq_is_send_pending(); in that case, the
receiver is not idle and will reply before idling.  I changed WalSndLoop() to
sleep only for pq_is_send_pending().  For all other sleep reasons, the sleep
will happen in WalSndWaitForWal().  Attached.  I don't know whether this is
important outside of testing scenarios.  I lean against back-patching, but I
will back-patch if someone thinks this qualifies as a performance bug.
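
For illustration, the shape of that one-line change (a sketch of the idea;
the attached patch is authoritative, and the real WalSndLoop() test also
involves streamingDoneSending):

    if (pq_is_send_pending())   /* was: WalSndCaughtUp || pq_is_send_pending() */
    {
        /* sleep on the latch/socket until more work is possible */
    }

When merely caught up, control now falls through, so the next
WalSndWaitForWal() call runs its "MyWalSnd->write < sentPtr" keepalive check
before it sleeps.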

Thanks,
nm

[1] I spot-checked only my animals, since I wanted to experiment on an
affected animal.


Re: 001_rep_changes.pl stalls

From
Noah Misch
Date:
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" test in WalSndWaitForWal()
> is important for promptly updating pg_stat_replication.  When caught up, we
> should impose that logic before every sleep.  The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.

This seems to have made the following race condition easier to hit:
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

It has now happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22

While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad.  I am inclined to revert the fix
after a week.  Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug.  Would
anyone like to argue for a revert earlier, later, or not at all?


There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01

Most-relevant lines of the test script:

    $node_master->safe_psql('postgres',
        "INSERT INTO decoding(blah) VALUES ('afterbb');");
    $node_master->safe_psql('postgres', 'CHECKPOINT');
    $node_master->stop('immediate');

The failure suggested the INSERT was not replicated before the immediate stop.
I can reproduce that consistently, before or after the fix for $SUBJECT, by
modifying walsender to delay 0.2s before sending WAL:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
 
+    PG_SETMASK(&BlockSig);
+    pg_usleep(200 * 1000);
+    PG_SETMASK(&UnBlockSig);
     pq_putmessage_noblock('d', output_message.data, output_message.len);

I will shortly push a fix adding a wait_for_catchup to the test.  I don't know
if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.



Re: 001_rep_changes.pl stalls

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> This seems to have made the following race condition easier to hit:
> https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

Yeah, I just came to the same guess in the other thread.

> While I don't think that indicates anything wrong with the fix for $SUBJECT,
> creating more buildfarm noise is itself bad.  I am inclined to revert the fix
> after a week.  Not immediately, in case it uncovers lower-probability bugs.
> I'd then re-commit it after one of those threads fixes the other bug.  Would
> anyone like to argue for a revert earlier, later, or not at all?

I don't think you should revert.  Those failures are (just) often enough
to be annoying but I do not think that a proper fix is very far away.

            regards, tom lane



Re: 001_rep_changes.pl stalls

From
Noah Misch
Date:
On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > This seems to have made the following race condition easier to hit:
> > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> 
> Yeah, I just came to the same guess in the other thread.
> 
> > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > creating more buildfarm noise is itself bad.  I am inclined to revert the fix
> > after a week.  Not immediately, in case it uncovers lower-probability bugs.
> > I'd then re-commit it after one of those threads fixes the other bug.  Would
> > anyone like to argue for a revert earlier, later, or not at all?
> 
> I don't think you should revert.  Those failures are (just) often enough
> to be annoying but I do not think that a proper fix is very far away.

That works for me, but an actual defect may trigger a revert.  Fujii Masao
reported high walsender CPU usage after this patch.  The patch caused idle
physical walsenders to use 100% CPU.  When caught up, the
WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
until more WAL is available.  XLogSendPhysical() just returns when caught up.
No amount of WAL is too small for physical replication to dispatch, but
logical replication needs the full xl_tot_len of a record.  Some options:

1. Make XLogSendPhysical() more like XLogSendLogical(), calling
   WalSndWaitForWal() when no WAL is available.  A quick version of this
   passes tests, but I'll need to audit WalSndWaitForWal() for things that are
   wrong for physical replication.

2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
   insufficient WAL is available.  This complicates the xlogreader.h API to
   pass back "wait for this XLogRecPtr", and we'd then persist enough state to
   resume decoding.  This doesn't have any advantages to make up for those.

3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
   WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
   further drift between the two wait sites; on the other hand, one could
   refactor later to help avoid that.

4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
   minimal fix, but it crudely punches through the abstraction between
   WalSndLoop() and its WalSndSendDataCallback.

I'm favoring (1).  Other preferences?
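
To illustrate why the v1 change spins for physical replication, here is a
simplified skeleton of the main loop under that change (illustrative only;
the real WalSndLoop() does much more per iteration):

    for (;;)
    {
        /* ... process any standby replies, timeouts, etc. ... */

        if (!pq_is_send_pending())
            send_data();        /* XLogSendPhysical() returns immediately when
                                 * caught up; XLogSendLogical() ends up
                                 * sleeping inside WalSndWaitForWal() */

        if (pq_is_send_pending())   /* v1: the only remaining sleep condition */
        {
            /* wait for the socket to become writable */
        }

        /* With physical replication caught up and nothing buffered, neither
         * branch blocks, so the loop spins at 100% CPU. */
    }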



Re: 001_rep_changes.pl stalls

From
Kyotaro Horiguchi
Date:
At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > This seems to have made the following race condition easier to hit:
> > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> > 
> > Yeah, I just came to the same guess in the other thread.
> > 
> > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > creating more buildfarm noise is itself bad.  I am inclined to revert the fix
> > > after a week.  Not immediately, in case it uncovers lower-probability bugs.
> > > I'd then re-commit it after one of those threads fixes the other bug.  Would
> > > anyone like to argue for a revert earlier, later, or not at all?
> > 
> > I don't think you should revert.  Those failures are (just) often enough
> > to be annoying but I do not think that a proper fix is very far away.
> 
> That works for me, but an actual defect may trigger a revert.  Fujii Masao
> reported high walsender CPU usage after this patch.  The patch caused idle
> physical walsenders to use 100% CPU.  When caught up, the
> WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> until more WAL is available.  XLogSendPhysical() just returns when caught up.
> No amount of WAL is too small for physical replication to dispatch, but
> logical replication needs the full xl_tot_len of a record.  Some options:
> 
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>    WalSndWaitForWal() when no WAL is available.  A quick version of this
>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>    wrong for physical replication.
> 
> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>    insufficient WAL is available.  This complicates the xlogreader.h API to
>    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>    resume decoding.  This doesn't have any advantages to make up for those.
> 
> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
>    further drift between the two wait sites; on the other hand, one could
>    refactor later to help avoid that.
> 
> 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
>    minimal fix, but it crudely punches through the abstraction between
>    WalSndLoop() and its WalSndSendDataCallback.
> 
> I'm favoring (1).  Other preferences?

Starting from the current shape, I think 1 is preferable, since that
waiting logic is no longer shared between logical and physical
replication.  But I'm not sure I like calling WalSndWaitForWal()
(maybe with previous location + 1?), because the function seems to do
too much.

By the way, if latch is consumed in WalSndLoop, succeeding call to
WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
cause missing wakeups? (in other words, overlooking of wakeup latch).
Since the only source other than timeout of walsender wakeup is latch,
we should avoid wasteful consuming of latch. (It is the same issue
with [1]).

If wakeup signal is not remembered on walsender (like
InterruptPending), WalSndPhysical cannot enter a sleep with
confidence.


[1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 001_rep_changes.pl stalls

From
Kyotaro Horiguchi
Date:
Sorry, I wrote something wrong.

At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > This seems to have made the following race condition easier to hit:
> > > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> > > 
> > > Yeah, I just came to the same guess in the other thread.
> > > 
> > > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > > creating more buildfarm noise is itself bad.  I am inclined to revert the fix
> > > > after a week.  Not immediately, in case it uncovers lower-probability bugs.
> > > > I'd then re-commit it after one of those threads fixes the other bug.  Would
> > > > anyone like to argue for a revert earlier, later, or not at all?
> > > 
> > > I don't think you should revert.  Those failures are (just) often enough
> > > to be annoying but I do not think that a proper fix is very far away.
> > 
> > That works for me, but an actual defect may trigger a revert.  Fujii Masao
> > reported high walsender CPU usage after this patch.  The patch caused idle
> > physical walsenders to use 100% CPU.  When caught up, the
> > WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> > until more WAL is available.  XLogSendPhysical() just returns when caught up.
> > No amount of WAL is too small for physical replication to dispatch, but
> > logical replication needs the full xl_tot_len of a record.  Some options:
> > 
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available.  A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
> > 
> > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available.  This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> >    resume decoding.  This doesn't have any advantages to make up for those.
> > 
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
> > 
> > 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
> > 
> > I'm favoring (1).  Other preferences?
> 
> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replication.  But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too much.
> 
> By the way, if latch is consumed in WalSndLoop, succeeding call to
> WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> cause missing wakeups? (in other words, overlooking of wakeup latch).

- Since the only source other than timeout of walsender wakeup is latch,
- we should avoid wasteful consuming of latch. (It is the same issue
- with [1]).

+ Since walsender is wokeup by LSN advancement via latch, we should
+ avoid wasteful consuming of latch. (It is the same issue with [1]).


> If wakeup signal is not remembered on walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.
> 
> 
> [1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 001_rep_changes.pl stalls

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in 
>> I'm favoring (1).  Other preferences?

> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replication.  But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too much.

I'm far from an expert on this code, but it does look like there's
a lot of stuff in WalSndWaitForWal that is specific to logical rep,
so I'm not sure that (1) is workable.  At the very least there'd
have to be a bunch more conditionals in that function than there are
now.  In the end, a separate copy for physical rep might be better.

(BTW, I think this code is in desperate need of a renaming
campaign to make it clearer which functions are for logical rep,
physical rep, or both.)

            regards, tom lane



Re: 001_rep_changes.pl stalls

From
Fujii Masao
Date:

On 2020/04/17 14:41, Noah Misch wrote:
> On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
>> Noah Misch <noah@leadboat.com> writes:
>>> This seems to have made the following race condition easier to hit:
>>> https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
>>> https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
>>
>> Yeah, I just came to the same guess in the other thread.
>>
>>> While I don't think that indicates anything wrong with the fix for $SUBJECT,
>>> creating more buildfarm noise is itself bad.  I am inclined to revert the fix
>>> after a week.  Not immediately, in case it uncovers lower-probability bugs.
>>> I'd then re-commit it after one of those threads fixes the other bug.  Would
>>> anyone like to argue for a revert earlier, later, or not at all?
>>
>> I don't think you should revert.  Those failures are (just) often enough
>> to be annoying but I do not think that a proper fix is very far away.
> 
> That works for me, but an actual defect may trigger a revert.  Fujii Masao
> reported high walsender CPU usage after this patch.  The patch caused idle
> physical walsenders to use 100% CPU.  When caught up, the
> WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> until more WAL is available.  XLogSendPhysical() just returns when caught up.
> No amount of WAL is too small for physical replication to dispatch, but
> logical replication needs the full xl_tot_len of a record.  Some options:
> 
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>     WalSndWaitForWal() when no WAL is available.  A quick version of this
>     passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>     wrong for physical replication.

(1) makes even the physical replication walsender sleep in two places,
which seems to make the code for physical replication more complicated
than necessary. I'd like to avoid (1) if possible.

> 
> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>     insufficient WAL is available.  This complicates the xlogreader.h API to
>     pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>     resume decoding.  This doesn't have any advantages to make up for those.
> 
> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>     WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
>     further drift between the two wait sites; on the other hand, one could
>     refactor later to help avoid that.

Since the additional call of WalSndKeepalive() is necessary only for
logical replication, it should be copied to, e.g., XLogSendLogical(),
instead of WalSndLoop()? For example, when XLogSendLogical() sets
WalSndCaughtUp to true, it should call WalSndKeepalive()?

The root problem seems to be that when a WAL record that's a no-op for
logical rep is processed, a keepalive message is not sent immediately,
even though we want pg_stat_replication to be updated promptly.
(3) seems to address this problem directly and looks better to me.

> 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
>     minimal fix, but it crudely punches through the abstraction between
>     WalSndLoop() and its WalSndSendDataCallback.

(4) also looks good because it's simple, if we can redesign those
functions in good shape.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: 001_rep_changes.pl stalls

From
Noah Misch
Date:
On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > cause missing wakeups? (in other words, overlooking of wakeup latch).
> 
> - Since the only source other than timeout of walsender wakeup is latch,
> - we should avoid wasteful consuming of latch. (It is the same issue
> - with [1]).
> 
> + Since walsender is wokeup by LSN advancement via latch, we should
> + avoid wasteful consuming of latch. (It is the same issue with [1]).
> 
> 
> > If wakeup signal is not remembered on walsender (like
> > InterruptPending), WalSndPhysical cannot enter a sleep with
> > confidence.

No; per latch.h, "What must be avoided is placing any checks for asynchronous
events after WaitLatch and before ResetLatch, as that creates a race
condition."  In other words, the thing to avoid is calling ResetLatch()
without next examining all pending work that a latch would signal.  Each
walsender.c WaitLatch call does follow the rules.
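
For reference, the loop shape that rule implies looks roughly like this
(adapted from the latch.h comments; not the walsender code itself, and the
work checks, sleep time, and wait event are placeholders):

    for (;;)
    {
        ResetLatch(MyLatch);

        /* Check for ALL work a latch-setter might have signaled, only after
         * ResetLatch(); a SetLatch() arriving after the reset is then seen
         * either by this check or by the next WaitLatch(). */
        if (work_is_pending())      /* placeholder for the real checks */
            do_the_work();          /* placeholder */
        else
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             sleeptime_ms, 0 /* wait_event_info placeholder */);
    }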

On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> On 2020/04/17 14:41, Noah Misch wrote:
> >1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available.  A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
> 
> (1) makes even the physical replication walsender sleep in two places,
> which seems to make the code for physical replication more complicated
> than necessary. I'd like to avoid (1) if possible.

Good point.

> >2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available.  This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> >    resume decoding.  This doesn't have any advantages to make up for those.
> >
> >3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
> 
> Since the additional call of WalSndKeepalive() is necessary only for
> logical replication, it should be copied to, e.g., XLogSendLogical(),
> instead of WalSndLoop()? For example, when XLogSendLogical() sets
> WalSndCaughtUp to true, it should call WalSndKeepalive()?

We'd send a keepalive even when pq_flush_if_writable() can't empty the output
buffer.  That could be acceptable, but it's not ideal.

> The root problem seems to be that when a WAL record that's a no-op for
> logical rep is processed, a keepalive message is not sent immediately,
> even though we want pg_stat_replication to be updated promptly.

The degree of promptness should be predictable, at least.  If we removed the
WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
not be prompt, but they would be predictable.  I do, however, think prompt
updates are worthwhile.

> (3) seems to address this problem directly and looks better to me.
> 
> >4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
> 
> (4) also looks good because it's simple, if we can redesign those
> functions in good shape.

Let's do that.  I'm attaching the replacement implementation and the revert of
v1.
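
A sketch of the shape that replacement takes, going by the comment quoted
downthread (the attached patch is authoritative):

    /* Sketch: keep the WalSndLoop() wait, but take it only for physical
     * replication (whose send_data never blocks on its own) or when the
     * output buffer still has unsent data.  Logical replication's other
     * blocking stays in WalSndWaitForWal(). */
    if ((WalSndCaughtUp && send_data != XLogSendLogical) ||
        pq_is_send_pending())
    {
        /* sleep on the latch/socket */
    }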


Re: 001_rep_changes.pl stalls

From
Fujii Masao
Date:

On 2020/04/18 16:01, Noah Misch wrote:
> On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
>> At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>>> By the way, if latch is consumed in WalSndLoop, succeeding call to
>>> WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
>>> cause missing wakeups? (in other words, overlooking of wakeup latch).
>>
>> - Since the only source other than timeout of walsender wakeup is latch,
>> - we should avoid wasteful consuming of latch. (It is the same issue
>> - with [1]).
>>
>> + Since walsender is wokeup by LSN advancement via latch, we should
>> + avoid wasteful consuming of latch. (It is the same issue with [1]).
>>
>>
>>> If wakeup signal is not remembered on walsender (like
>>> InterruptPending), WalSndPhysical cannot enter a sleep with
>>> confidence.
> 
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition."  In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal.  Each
> walsender.c WaitLatch call does follow the rules.
> 
> On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
>> On 2020/04/17 14:41, Noah Misch wrote:
>>> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>>>     WalSndWaitForWal() when no WAL is available.  A quick version of this
>>>     passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>>>     wrong for physical replication.
>>
>> (1) makes even the physical replication walsender sleep in two places,
>> which seems to make the code for physical replication more complicated
>> than necessary. I'd like to avoid (1) if possible.
> 
> Good point.
> 
>>> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>>>     insufficient WAL is available.  This complicates the xlogreader.h API to
>>>     pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>>>     resume decoding.  This doesn't have any advantages to make up for those.
>>>
>>> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>>>     WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
>>>     further drift between the two wait sites; on the other hand, one could
>>>     refactor later to help avoid that.
>>
>> Since the additional call of WalSndKeepalive() is necessary only for
>> logical replication, it should be copied to, e.g., XLogSendLogical(),
>> instead of WalSndLoop()? For example, when XLogSendLogical() sets
>> WalSndCaughtUp to true, it should call WalSndKeepalive()?
> 
> We'd send a keepalive even when pq_flush_if_writable() can't empty the output
> buffer.  That could be acceptable, but it's not ideal.
> 
>> The root problem seems to be that when a WAL record that's a no-op for
>> logical rep is processed, a keepalive message is not sent immediately,
>> even though we want pg_stat_replication to be updated promptly.
> 
> The degree of promptness should be predictable, at least.  If we removed the
> WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
> not be prompt, but they would be predictable.  I do, however, think prompt
> updates are worthwhile.
> 
>> (3) seems to address this problem directly and looks better to me.
>>
>>> 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
>>>     minimal fix, but it crudely punches through the abstraction between
>>>     WalSndLoop() and its WalSndSendDataCallback.
>>
>> (4) also looks good because it's simple, if we can redesign those
>> functions in good shape.
> 
> Let's do that.  I'm attaching the replacement implementation and the revert of
> v1.

Thanks for the patch! Though referencing XLogSendLogical inside WalSndLoop()
might be a bit ugly, I'm fine with this change because it's simple and easier
to understand.

+         * Block if we have unsent data.  XXX For logical replication, let
+         * WalSndWaitForWal() handle any other blocking; idle receivers need
+         * its additional actions.  For physical replication, also block if
+         * caught up; its send_data does not block.

It might be better to s/WalSndWaitForWal()/send_data()? Because not only
WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().

     frame #2: 0x0000000106bcfa84 postgres`WalSndWriteData(ctx=0x00007fb2a4812d20, lsn=22608080, xid=488, last_write=false) at walsender.c:1247:2
     frame #3: 0x0000000106b98295 postgres`OutputPluginWrite(ctx=0x00007fb2a4812d20, last_write=false) at logical.c:540:2
     frame #4: 0x00000001073fe9b8 pgoutput.so`send_relation_and_attrs(relation=0x00000001073ba2c0, ctx=0x00007fb2a4812d20) at pgoutput.c:353:2
     frame #5: 0x00000001073fe7a0 pgoutput.so`maybe_send_schema(ctx=0x00007fb2a4812d20, relation=0x00000001073ba2c0, relentry=0x00007fb2a483aa60) at pgoutput.c:315:2
     frame #6: 0x00000001073fd4c0 pgoutput.so`pgoutput_change(ctx=0x00007fb2a4812d20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at pgoutput.c:394:2
     frame #7: 0x0000000106b99094 postgres`change_cb_wrapper(cache=0x00007fb2a482ed20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at logical.c:753:2
     frame #8: 0x0000000106ba2200 postgres`ReorderBufferCommit(rb=0x00007fb2a482ed20, xid=488, commit_lsn=22621088, end_lsn=22621136, commit_time=640675460323211, origin_id=0, origin_lsn=0) at reorderbuffer.c:1653:7
     frame #9: 0x0000000106b93c10 postgres`DecodeCommit(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8, parsed=0x00007ffee954bf90, xid=488) at decode.c:637:2
     frame #10: 0x0000000106b92fa9 postgres`DecodeXactOp(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8) at decode.c:245:5
     frame #11: 0x0000000106b92aee postgres`LogicalDecodingProcessRecord(ctx=0x00007fb2a4812d20, record=0x00007fb2a4812fe0) at decode.c:114:4
     frame #12: 0x0000000106bd2a16 postgres`XLogSendLogical at walsender.c:2863:3

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: 001_rep_changes.pl stalls

From
Noah Misch
Date:
On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
> +         * Block if we have unsent data.  XXX For logical replication, let
> +         * WalSndWaitForWal() handle any other blocking; idle receivers need
> +         * its additional actions.  For physical replication, also block if
> +         * caught up; its send_data does not block.
> 
> It might be better to s/WalSndWaitForWal()/send_data()? Because not only
> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().

Thanks for reviewing.  WalSndWriteData() blocks when we have unsent data,
which is the same cause for blocking in WalSndLoop().  Since the comment you
quote says we let WalSndWaitForWal() "handle any other blocking", I don't
think your proposed change makes it more correct.  Also, if someone wants to
refactor this, the place to look is WalSndWaitForWal(), not any other part of
send_data().



Re: 001_rep_changes.pl stalls

From
Kyotaro Horiguchi
Date:
At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > 
> > - Since the only source other than timeout of walsender wakeup is latch,
> > - we should avoid wasteful consuming of latch. (It is the same issue
> > - with [1]).
> > 
> > + Since walsender is wokeup by LSN advancement via latch, we should
> > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > 
> > 
> > > If wakeup signal is not remembered on walsender (like
> > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > confidence.
> 
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition."  In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal.  Each
> walsender.c WaitLatch call does follow the rules.

I didn't mean that, of course.  I was thinking of more or less the same
thing: moving the trigger from latch to signal, with the handler setting a
flag and calling SetLatch().  If we use a bare latch, we have to avoid
falsely entering a sleep, which also makes things complex.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 001_rep_changes.pl stalls

From
Kyotaro Horiguchi
Date:
At Mon, 20 Apr 2020 14:30:08 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2020/04/18 16:01, Noah Misch wrote:
> > On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> >>> 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is
> >>>    the minimal fix, but it crudely punches through the abstraction between
> >>>    WalSndLoop() and its WalSndSendDataCallback.
> >>
> >> (4) also looks good because it's simple, if we can redesign those
> >> functions in good shape.
> > Let's do that.  I'm attaching the replacement implementation and the
> > revert of
> > v1.
> 
> Thanks for the patch! Though referencing XLogSendLogical inside
> WalSndLoop() might be a bit ugly, I'm fine with this change because
> it's simple and easier to understand.

I thought that if we do this, read_data should return a boolean that
indicates whether to wait for the latch or for an incoming packet, or
else return a wake-event mask.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 001_rep_changes.pl stalls

From
Noah Misch
Date:
On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > > 
> > > - Since the only source other than timeout of walsender wakeup is latch,
> > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > - with [1]).
> > > 
> > > + Since walsender is wokeup by LSN advancement via latch, we should
> > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > > 
> > > 
> > > > If wakeup signal is not remembered on walsender (like
> > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > confidence.
> > 
> > No; per latch.h, "What must be avoided is placing any checks for asynchronous
> > events after WaitLatch and before ResetLatch, as that creates a race
> > condition."  In other words, the thing to avoid is calling ResetLatch()
> > without next examining all pending work that a latch would signal.  Each
> > walsender.c WaitLatch call does follow the rules.
> 
> I didn't mean that, of course.  I was thinking of more or less the same
> thing: moving the trigger from latch to signal, with the handler setting a
> flag and calling SetLatch().  If we use a bare latch, we have to avoid
> falsely entering a sleep, which also makes things complex.

I don't understand.  If there's a defect, can you write a test case or
describe a sequence of events (e.g. at line X, variable Y has value Z)?



Re: 001_rep_changes.pl stalls

From
Kyotaro Horiguchi
Date:
At Mon, 20 Apr 2020 00:59:54 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > > On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> > > > At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > > > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > > > > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > > > > cause missing wakeups? (in other words, overlooking of wakeup latch).
> > > > 
> > > > - Since the only source other than timeout of walsender wakeup is latch,
> > > > - we should avoid wasteful consuming of latch. (It is the same issue
> > > > - with [1]).
> > > > 
> > > > + Since walsender is wokeup by LSN advancement via latch, we should
> > > > + avoid wasteful consuming of latch. (It is the same issue with [1]).
> > > > 
> > > > 
> > > > > If wakeup signal is not remembered on walsender (like
> > > > > InterruptPending), WalSndPhysical cannot enter a sleep with
> > > > > confidence.
> > > 
> > > No; per latch.h, "What must be avoided is placing any checks for asynchronous
> > > events after WaitLatch and before ResetLatch, as that creates a race
> > > condition."  In other words, the thing to avoid is calling ResetLatch()
> > > without next examining all pending work that a latch would signal.  Each
> > > walsender.c WaitLatch call does follow the rules.
> > 
> > I didn't mean that, of course.  I was thinking of more or less the same
> > thing: moving the trigger from latch to signal, with the handler setting a
> > flag and calling SetLatch().  If we use a bare latch, we have to avoid
> > falsely entering a sleep, which also makes things complex.
> 
> I don't understand.  If there's a defect, can you write a test case or
> describe a sequence of events (e.g. at line X, variable Y has value Z)?

Indeed.  Anyway, the current version cannot have such an issue.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: 001_rep_changes.pl stalls

From
Fujii Masao
Date:

On 2020/04/20 16:02, Noah Misch wrote:
> On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
>> +         * Block if we have unsent data.  XXX For logical replication, let
>> +         * WalSndWaitForWal() handle any other blocking; idle receivers need
>> +         * its additional actions.  For physical replication, also block if
>> +         * caught up; its send_data does not block.
>>
>> It might be better to s/WalSndWaitForWal()/send_data()? Because not only
>> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
>> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().
> 
> Thanks for reviewing.  WalSndWriteData() blocks when we have unsent data,
> which is the same cause for blocking in WalSndLoop().  Since the comment you
> quote says we let WalSndWaitForWal() "handle any other blocking", I don't
> think your proposed change makes it more correct.

I was misreading this as something like "any other blocking than
the blocking in WalSndLoop()". Ok, I have no more comments on
the patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: 001_rep_changes.pl stalls

From
Michael Paquier
Date:
On Mon, Apr 20, 2020 at 07:24:28PM +0900, Fujii Masao wrote:
> I was misreading this as something like "any other blocking than
> the blocking in WalSndLoop()". Ok, I have no more comments on
> the patch.

Patch looks rather sane to me at a quick glance.  I can see that WAL
senders are now not stuck at 100% CPU per process when sitting idle,
for both physical and logical replication.  Thanks.
--
Michael
