Re: Logrep launcher race conditions leading to slow tests - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Logrep launcher race conditions leading to slow tests
Date:
Msg-id: CAA4eK1LL5CKEjNtgi3DZ-nokztZivkVNZqkD0PYMTswPVvHOOA@mail.gmail.com
In response to: Re: Logrep launcher race conditions leading to slow tests (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link. I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause. I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here. The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop. We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with. Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.
>
> I do note Heikki's concern about whether we could get rid of the
> sleep-and-retry looping in WaitForReplicationWorkerAttach in
> favor of getting signaled somehow, and I agree with that as a
> possible future improvement. But I don't especially see why that
> demands another latch; in fact, unless we want to teach WaitLatch
> to be able to wait on more than one latch, it *can't* be a
> separate latch from the one that receives CHECK_FOR_INTERRUPTS()
> conditions.
>

I agree that we don't need another latch, and I think your patch solves
it in the best possible way. The only minor point, if you think it makes
sense to change, is the following comment:

+ /*
+  * If we had to clear a latch event in order to wait, be sure to restore
+  * it before exiting.  Otherwise caller may miss events.
+  */
+ if (dropped_latch)
..

The part of the comment "Otherwise caller may miss events." is clear to
me, but I'm not sure it would be equally easy for everyone to understand
what other events the code is talking about here. Something along the
lines of what Heikki wrote, "We use the same latch to be signalled about
subscription changes and workers exiting, so we might have missed some
notifications, if those events happened concurrently.", is more specific.
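To make that concrete, here is a rough sketch of the pattern being
discussed: the wait loop consumes the process latch while waiting for the
worker to attach, remembers that it did so, and restores the latch before
returning so that ApplyLauncherMain still sees the pending event. This is
only an illustration of the idea, not the committed patch or the actual
PostgreSQL source; worker_is_attached() is a made-up placeholder, and the
usual postgres.h, miscadmin.h and storage/latch.h includes are assumed:

/* Illustrative sketch only -- not the committed code. */
static void
WaitForWorkerAttachSketch(void)
{
    bool        dropped_latch = false;

    for (;;)
    {
        int         rc;

        CHECK_FOR_INTERRUPTS();

        if (worker_is_attached())   /* hypothetical placeholder */
            break;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /* Clear the latch so the next WaitLatch() can sleep ... */
            ResetLatch(MyLatch);

            /* ... but remember we consumed an event meant for our caller. */
            dropped_latch = true;
        }
    }

    /*
     * If we had to clear a latch event in order to wait, restore it before
     * exiting; otherwise the caller (ApplyLauncherMain) may miss
     * notifications about subscription changes or exiting workers that
     * arrived in the meantime, since they are signalled via the same latch.
     */
    if (dropped_latch)
        SetLatch(MyLatch);
}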
> >> 4. In process_syncing_tables_for_apply (the other caller of
> >> logicalrep_worker_launch), it seems okay to ignore the
> >> result of logicalrep_worker_launch, but I think it should
> >> fill hentry->last_start_time before not after the call.
>
> > With this, won't we end up retrying to launch the worker sooner if the
> > launch took time, but still failed to launch the worker?
>
> That code already does update last_start_time unconditionally, and
> I think that's the right behavior for the same reason that it's
> right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
> whether or not logicalrep_worker_launch succeeds. If the worker
> launch fails, we don't want to retry instantly, we want to wait
> wal_retrieve_retry_interval before retrying. My desire to change
> this code is just based on the idea that it's not clear what else
> if anything looks at this hashtable, and by the time that
> logicalrep_worker_launch returns the system state could be a lot
> different. (For instance, the worker could have started and
> failed already.) So, just as in ApplyLauncherMain, I'd rather
> store the start time before calling logicalrep_worker_launch.
>
> BTW, it strikes me that if we're going to leave
> process_syncing_tables_for_apply() ignoring the result of
> logicalrep_worker_launch, it'd be smart to insert an explicit
> (void) cast to show that that's intentional. Otherwise Coverity
> is likely to complain about how we're ignoring the result in
> one place and not the other.
>

Sounds reasonable.

--
With Regards,
Amit Kapila.
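For readers skimming the thread, a rough sketch of the ordering being
suggested for process_syncing_tables_for_apply(). The retry-interval guard
shown here is an assumption about the surrounding code, the launch
arguments are deliberately elided, and this is not the actual patch:

/* Illustrative sketch only -- not the actual PostgreSQL source. */
if (!hentry->last_start_time ||
    TimestampDifferenceExceeds(hentry->last_start_time, now,
                               wal_retrieve_retry_interval))
{
    /*
     * Record the attempt time before launching, so a slow or failed
     * launch still waits wal_retrieve_retry_interval before the next
     * retry, and so the hash entry describes when we tried rather than
     * whatever the system state looks like after the call returns.
     */
    hentry->last_start_time = now;

    /* Result deliberately ignored; the (void) cast documents that. */
    (void) logicalrep_worker_launch(/* ... launch arguments elided ... */);
}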