Re: 001_rep_changes.pl stalls - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: 001_rep_changes.pl stalls |
Date | |
Msg-id | 20200417.170629.1231777174468368420.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: 001_rep_changes.pl stalls (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
List | pgsql-hackers |
Sorry , I wrote something wrong. At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in > > On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote: > > > Noah Misch <noah@leadboat.com> writes: > > > > This seems to have made the following race condition easier to hit: > > > > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com > > > > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us > > > > > > Yeah, I just came to the same guess in the other thread. > > > > > > > While I don't think that indicates anything wrong with the fix for $SUBJECT, > > > > creating more buildfarm noise is itself bad. I am inclined to revert the fix > > > > after a week. Not immediately, in case it uncovers lower-probability bugs. > > > > I'd then re-commit it after one of those threads fixes the other bug. Would > > > > anyone like to argue for a revert earlier, later, or not at all? > > > > > > I don't think you should revert. Those failures are (just) often enough > > > to be annoying but I do not think that a proper fix is very far away. > > > > That works for me, but an actual defect may trigger a revert. Fujii Masao > > reported high walsender CPU usage after this patch. The patch caused idle > > physical walsenders to use 100% CPU. When caught up, the > > WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps > > until more WAL is available. XLogSendPhysical() just returns when caught up. > > No amount of WAL is too small for physical replication to dispatch, but > > logical replication needs the full xl_tot_len of a record. Some options: > > > > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling > > WalSndWaitForWal() when no WAL is available. A quick version of this > > passes tests, but I'll need to audit WalSndWaitForWal() for things that are > > wrong for physical replication. > > > > 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when > > insufficient WAL is available. This complicates the xlogreader.h API to > > pass back "wait for this XLogRecPtr", and we'd then persist enough state to > > resume decoding. This doesn't have any advantages to make up for those. > > > > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the > > WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks > > further drift between the two wait sites; on the other hand, one could > > refactor later to help avoid that. > > > > 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the > > minimal fix, but it crudely punches through the abstraction between > > WalSndLoop() and its WalSndSendDataCallback. > > > > I'm favoring (1). Other preferences? > > Starting from the current shape, I think 1 is preferable, since that > waiting logic is no longer shared between logical and physical > replications. But I'm not sure I like calling WalSndWaitForWal() > (maybe with previous location + 1?), because the function seems to do > too-much. > > By the way, if latch is consumed in WalSndLoop, succeeding call to > WalSndWaitForWal cannot be woke-up by the latch-set. Doesn't that > cause missing wakeups? (in other words, overlooking of wakeup latch). - Since the only source other than timeout of walsender wakeup is latch, - we should avoid wasteful consuming of latch. (It is the same issue - with [1]). + Since walsender is wokeup by LSN advancement via latch, we should + avoid wasteful consuming of latch. (It is the same issue with [1]). > If wakeup signal is not remembered on walsender (like > InterruptPending), WalSndPhysical cannot enter a sleep with > confidence. > > > [1] https://www.postgresql.org/message-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: