On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link. I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause. I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here. The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop. We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with. Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.
Yeah that makes sense, we can not simply wait on another latch which
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
looks simple for resolving this issue.
--
Regards,
Dilip Kumar
Google