Re: 001_rep_changes.pl stalls - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: 001_rep_changes.pl stalls
Date
Msg-id 20200417.170629.1231777174468368420.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: 001_rep_changes.pl stalls  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Sorry , I wrote something wrong.

At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > This seems to have made the following race condition easier to hit:
> > > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> > > 
> > > Yeah, I just came to the same guess in the other thread.
> > > 
> > > > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > > > creating more buildfarm noise is itself bad.  I am inclined to revert the fix
> > > > after a week.  Not immediately, in case it uncovers lower-probability bugs.
> > > > I'd then re-commit it after one of those threads fixes the other bug.  Would
> > > > anyone like to argue for a revert earlier, later, or not at all?
> > > 
> > > I don't think you should revert.  Those failures are (just) often enough
> > > to be annoying but I do not think that a proper fix is very far away.
> > 
> > That works for me, but an actual defect may trigger a revert.  Fujii Masao
> > reported high walsender CPU usage after this patch.  The patch caused idle
> > physical walsenders to use 100% CPU.  When caught up, the
> > WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
> > until more WAL is available.  XLogSendPhysical() just returns when caught up.
> > No amount of WAL is too small for physical replication to dispatch, but
> > logical replication needs the full xl_tot_len of a record.  Some options:
> > 
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available.  A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
> > 
> > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available.  This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
> >    resume decoding.  This doesn't have any advantages to make up for those.
> > 
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
> > 
> > 4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
> > 
> > I'm favoring (1).  Other preferences?
> 
> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replications.  But I'm not sure I like calling WalSndWaitForWal()
> (maybe with previous location + 1?), because the function seems to do
> too-much.
> 
> By the way, if latch is consumed in WalSndLoop, succeeding call to
> WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> cause missing wakeups? (in other words, overlooking of wakeup latch).

- Since the only source other than timeout of walsender wakeup is latch,
- we should avoid wasteful consuming of latch. (It is the same issue
- with [1]).

+ Since walsender is wokeup by LSN advancement via latch, we should
+ avoid wasteful consuming of latch. (It is the same issue with [1]).


> If wakeup signal is not remembered on walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.
> 
> 
> [1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority