Re: Logrep launcher race conditions leading to slow tests - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Logrep launcher race conditions leading to slow tests
Date
Msg-id CAFiTN-sDAYW6Ygq1_LNkkaXGFo9w-8jxnQCUtFq99Le5-xHLzg@mail.gmail.com
Whole thread Raw
In response to Re: Logrep launcher race conditions leading to slow tests  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link.  I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause.  I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here.  The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop.  We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with.  Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.

Yeah that makes sense, we can not simply wait on another latch which
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
looks simple for resolving this issue.

--
Regards,
Dilip Kumar
Google



pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Next
From: jian he
Date:
Subject: Re: SQL:2023 JSON simplified accessor support