Thread: Re: Logrep launcher race conditions leading to slow tests

Re: Logrep launcher race conditions leading to slow tests

From: Tom Lane

Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
>> latch event so that it can keep waiting for the worker to launch.
>> It neglects to set the latch again, allowing ApplyLauncherMain
>> to miss events.

> There was a previous discussion to fix this behavior. Heikki has
> proposed a similar fix for this, but at the caller. See the patch
> attached in email [1].

Ah, thank you for that link.  I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause.  I don't much like any of the patches proposed
in that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here.  The
only reason WaitForReplicationWorkerAttach needs to pay attention
to the process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
also true in the ApplyLauncherMain loop.  We can't realistically
expect other processes to signal different latches depending on
where the launcher is waiting, so those cases have to be triggered
by the same latch.

However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
we have some other triggering condition, which is the business
of some outer code level to deal with.  Having it re-set the latch
to allow that to happen promptly after it returns seems like a
pretty straightforward answer to me.
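
In code terms, the shape I have in mind for WaitForReplicationWorkerAttach
is roughly this (abbreviated sketch, not an exact hunk):

    bool        dropped_latch = false;

    for (;;)
    {
        /* ... break out once the worker has attached or given up ... */

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
            dropped_latch = true;
        }
    }

    /*
     * If we had to clear a latch event in order to wait, be sure to restore
     * it before exiting.  Otherwise caller may miss events.
     */
    if (dropped_latch)
        SetLatch(MyLatch);

The only new moving parts are remembering that we consumed a latch event
and putting it back on the way out.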

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in
favor of getting signaled somehow, and I agree with that as a
possible future improvement.  But I don't especially see why that
demands another latch; in fact, unless we want to teach WaitLatch
to be able to wait on more than one latch, it *can't* be a
separate latch from the one that receives CHECK_FOR_INTERRUPTS()
conditions.

>> 4. In process_syncing_tables_for_apply (the other caller of
>> logicalrep_worker_launch), it seems okay to ignore the
>> result of logicalrep_worker_launch, but I think it should
>> fill hentry->last_start_time before not after the call.

> With this, won't we end up retrying to launch the worker sooner if the
> launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and
I think that's the right behavior for the same reason that it's
right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
whether or not logicalrep_worker_launch succeeds.  If the worker
launch fails, we don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying.  My desire to change
this code is just based on the idea that it's not clear what else
if anything looks at this hashtable, and by the time that
logicalrep_worker_launch returns the system state could be a lot
different.  (For instance, the worker could have started and
failed already.)  So, just as in ApplyLauncherMain, I'd rather
store the start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit
(void) cast to show that that's intentional.  Otherwise Coverity
is likely to complain about how we're ignoring the result in
one place and not the other.
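
Concretely, in process_syncing_tables_for_apply() that would look about
like this (sketch only; argument list elided, local variable names as I
recall them from the existing code):

    /*
     * Record the attempt time before launching; by the time the call
     * returns, the system state could be a lot different.
     */
    hentry->last_start_time = now;

    /* Result deliberately ignored; retry timing is governed above. */
    (void) logicalrep_worker_launch(...);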

            regards, tom lane



Re: Logrep launcher race conditions leading to slow tests

From: Amit Kapila

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link.  I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause.  I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here.  The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop.  We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with.  Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.
>
> I do note Heikki's concern about whether we could get rid of the
> sleep-and-retry looping in WaitForReplicationWorkerAttach in
> favor of getting signaled somehow, and I agree with that as a
> possible future improvement.  But I don't especially see why that
> demands another latch; in fact, unless we want to teach WaitLatch
> to be able to wait on more than one latch, it *can't* be a
> separate latch from the one that receives CHECK_FOR_INTERRUPTS()
> conditions.
>

I agree that we don't need another latch, and I think your patch solves
it in the best possible way.  The only minor point that might be worth
changing is the following comment:
+ /*
+ * If we had to clear a latch event in order to wait, be sure to restore
+ * it before exiting.  Otherwise caller may miss events.
+ */
+ if (dropped_latch)
..

The part of the comment "Otherwise caller may miss events." is clear
to me, but I'm not sure it would be equally easy for everyone to
understand which other events the code is talking about here.  Something
along the lines of what Heikki wrote, "We use the same latch to be
signalled about subscription changes and workers exiting, so we might
have missed some notifications, if those events happened
concurrently.", is more specific.
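
Say, something like:

/*
 * If we had to clear a latch event in order to wait, be sure to restore
 * it before exiting.  We use the same latch to be signalled about
 * subscription changes and workers exiting, so the caller could otherwise
 * miss events that happened concurrently while we were waiting here.
 */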

> >> 4. In process_syncing_tables_for_apply (the other caller of
> >> logicalrep_worker_launch), it seems okay to ignore the
> >> result of logicalrep_worker_launch, but I think it should
> >> fill hentry->last_start_time before not after the call.
>
> > With this, won't we end up retrying to launch the worker sooner if the
> > launch took time, but still failed to launch the worker?
>
> That code already does update last_start_time unconditionally, and
> I think that's the right behavior for the same reason that it's
> right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
> whether or not logicalrep_worker_launch succeeds.  If the worker
> launch fails, we don't want to retry instantly, we want to wait
> wal_retrieve_retry_interval before retrying.  My desire to change
> this code is just based on the idea that it's not clear what else
> if anything looks at this hashtable, and by the time that
> logicalrep_worker_launch returns the system state could be a lot
> different.  (For instance, the worker could have started and
> failed already.)  So, just as in ApplyLauncherMain, I'd rather
> store the start time before calling logicalrep_worker_launch.
>
> BTW, it strikes me that if we're going to leave
> process_syncing_tables_for_apply() ignoring the result of
> logicalrep_worker_launch, it'd be smart to insert an explicit
> (void) cast to show that that's intentional.  Otherwise Coverity
> is likely to complain about how we're ignoring the result in
> one place and not the other.
>

Sounds reasonable.

--
With Regards,
Amit Kapila.



Re: Logrep launcher race conditions leading to slow tests

From: Dilip Kumar

On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link.  I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause.  I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here.  The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop.  We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with.  Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.

Yeah, that makes sense: we cannot simply wait on another latch that
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
is a simple way of resolving this issue.

--
Regards,
Dilip Kumar
Google