Thread: Re: Logrep launcher race conditions leading to slow tests
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
>> latch event so that it can keep waiting for the worker to launch.
>> It neglects to set the latch again, allowing ApplyLauncherMain
>> to miss events.

> There was a previous discussion to fix this behavior. Heikki has
> proposed a similar fix for this, but at the caller. See the patch
> attached in email [1].

Ah, thank you for that link.  I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause.  I don't much like any of the patches proposed in
that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here.  The only
reason WaitForReplicationWorkerAttach needs to pay attention to the
process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is also
true in the ApplyLauncherMain loop.  We can't realistically expect
other processes to signal different latches depending on where the
launcher is waiting, so those cases have to be triggered by the same
latch.

However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns, we
have some other triggering condition, which is the business of some
outer code level to deal with.  Having it re-set the latch to allow
that to happen promptly after it returns seems like a pretty
straightforward answer to me.

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in favor of
getting signaled somehow, and I agree with that as a possible future
improvement.  But I don't especially see why that demands another
latch; in fact, unless we want to teach WaitLatch to be able to wait
on more than one latch, it *can't* be a separate latch from the one
that receives CHECK_FOR_INTERRUPTS() conditions.

>> 4. In process_syncing_tables_for_apply (the other caller of
>> logicalrep_worker_launch), it seems okay to ignore the
>> result of logicalrep_worker_launch, but I think it should
>> fill hentry->last_start_time before not after the call.

> With this, won't we end up retrying to launch the worker sooner if
> the launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and I
think that's the right behavior for the same reason that it's right
for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime whether
or not logicalrep_worker_launch succeeds.  If the worker launch
fails, we don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying.  My desire to change
this code is just based on the idea that it's not clear what else, if
anything, looks at this hashtable, and by the time that
logicalrep_worker_launch returns the system state could be a lot
different.  (For instance, the worker could have started and failed
already.)  So, just as in ApplyLauncherMain, I'd rather store the
start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit (void)
cast to show that that's intentional.  Otherwise Coverity is likely
to complain about how we're ignoring the result in one place and not
the other.
regards, tom lane
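
[A minimal sketch of the shape of the fix described above.  The wait-loop
skeleton follows launcher.c conventions (WaitLatch/ResetLatch/SetLatch on
MyLatch), but the attach/death checks are reduced to hypothetical helpers
and the details differ from the real WaitForReplicationWorkerAttach; this
is an illustration, not the committed patch.]

    /*
     * Sketch only: the real WaitForReplicationWorkerAttach takes more
     * arguments and does its checks under LogicalRepWorkerLock.
     * worker_has_attached() and worker_has_died() are hypothetical helpers.
     */
    static bool
    WaitForReplicationWorkerAttach(LogicalRepWorker *worker)
    {
        bool        result = false;
        bool        dropped_latch = false;

        for (;;)
        {
            int         rc;

            if (worker_has_attached(worker))    /* hypothetical check */
            {
                result = true;
                break;
            }
            if (worker_has_died(worker))        /* hypothetical check */
                break;

            /* Sleep briefly; the process latch lets us wake up early. */
            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                           10L, WAIT_EVENT_BGWORKER_STARTUP);

            if (rc & WL_LATCH_SET)
            {
                /* Clear the latch so that we can keep waiting ... */
                ResetLatch(MyLatch);
                CHECK_FOR_INTERRUPTS();
                /* ... but remember that we consumed a latch event. */
                dropped_latch = true;
            }
        }

        /*
         * If we had to clear a latch event in order to wait, restore it
         * before returning; otherwise the caller (ApplyLauncherMain) could
         * miss the event that set it.
         */
        if (dropped_latch)
            SetLatch(MyLatch);

        return result;
    }

[The point of the trailing SetLatch() is that the launcher's next
WaitLatch() in ApplyLauncherMain returns immediately instead of sleeping,
so whatever condition set the latch while we were waiting for the worker
gets rechecked promptly rather than being lost until the next wakeup.]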
On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link.  I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause.  I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here.  The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop.  We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with.  Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.
>
> I do note Heikki's concern about whether we could get rid of the
> sleep-and-retry looping in WaitForReplicationWorkerAttach in
> favor of getting signaled somehow, and I agree with that as a
> possible future improvement.  But I don't especially see why that
> demands another latch; in fact, unless we want to teach WaitLatch
> to be able to wait on more than one latch, it *can't* be a
> separate latch from the one that receives CHECK_FOR_INTERRUPTS()
> conditions.
>

I agree that we don't need another latch, and I think your patch
solves it in the best possible way.  The only minor point you may
want to consider changing is the following comment:

+	/*
+	 * If we had to clear a latch event in order to wait, be sure to restore
+	 * it before exiting. Otherwise caller may miss events.
+	 */
+	if (dropped_latch)
..

The part of the comment "Otherwise caller may miss events." is clear
to me, but I'm not sure it would be equally easy for everyone to
understand which other events the code is talking about here.
Something along the lines of what Heikki wrote, "We use the same
latch to be signalled about subscription changes and workers exiting,
so we might have missed some notifications, if those events happened
concurrently.", is more specific.

> >> 4. In process_syncing_tables_for_apply (the other caller of
> >> logicalrep_worker_launch), it seems okay to ignore the
> >> result of logicalrep_worker_launch, but I think it should
> >> fill hentry->last_start_time before not after the call.
>
> > With this, won't we end up retrying to launch the worker sooner if the
> > launch took time, but still failed to launch the worker?
>
> That code already does update last_start_time unconditionally, and
> I think that's the right behavior for the same reason that it's
> right for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime
> whether or not logicalrep_worker_launch succeeds.  If the worker
> launch fails, we don't want to retry instantly, we want to wait
> wal_retrieve_retry_interval before retrying.  My desire to change
> this code is just based on the idea that it's not clear what else
> if anything looks at this hashtable, and by the time that
> logicalrep_worker_launch returns the system state could be a lot
> different.  (For instance, the worker could have started and
> failed already.)  So, just as in ApplyLauncherMain, I'd rather
> store the start time before calling logicalrep_worker_launch.
>
> BTW, it strikes me that if we're going to leave
> process_syncing_tables_for_apply() ignoring the result of
> logicalrep_worker_launch, it'd be smart to insert an explicit
> (void) cast to show that that's intentional.  Otherwise Coverity
> is likely to complain about how we're ignoring the result in
> one place and not the other.
>

Sounds reasonable.

--
With Regards,
Amit Kapila.
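
[A small, self-contained sketch of the pattern being agreed to for
process_syncing_tables_for_apply: fill in the start time before the launch
attempt, and make the ignored result explicit with a (void) cast.  The
names here (StartTimeEntry, launch_worker, maybe_launch_sync_worker) are
hypothetical stand-ins for illustration, not the real tablesync.c code.]

    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical stand-in for the last_start_times hash entry. */
    typedef struct StartTimeEntry
    {
        time_t      last_start_time;   /* role of hentry->last_start_time */
    } StartTimeEntry;

    static bool
    launch_worker(void)
    {
        /* Pretend to ask for a new background worker; this can fail. */
        return true;
    }

    static void
    maybe_launch_sync_worker(StartTimeEntry *hentry, time_t now,
                             long retry_interval)
    {
        if (now - hentry->last_start_time < retry_interval)
            return;             /* too soon to retry after the last attempt */

        /*
         * Record the attempt time before launching.  Whether or not the
         * launch succeeds, and however long it takes, the next retry waits
         * out the full interval, and the entry is filled in before the
         * system state can change under us during the launch.
         */
        hentry->last_start_time = now;

        /* The result is intentionally ignored; the cast makes that explicit. */
        (void) launch_worker();
    }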
On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> >> latch event so that it can keep waiting for the worker to launch.
> >> It neglects to set the latch again, allowing ApplyLauncherMain
> >> to miss events.
>
> > There was a previous discussion to fix this behavior. Heikki has
> > proposed a similar fix for this, but at the caller. See the patch
> > attached in email [1].
>
> Ah, thank you for that link.  I vaguely recalled that we'd discussed
> this strange behavior before, but I did not realize that anyone had
> diagnosed a cause.  I don't much like any of the patches proposed
> in that thread though --- they seem overcomplicated or off-point.
>
> I do not think we can switch to having two latches here.  The
> only reason WaitForReplicationWorkerAttach needs to pay attention
> to the process latch at all is that it needs to service
> CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is
> also true in the ApplyLauncherMain loop.  We can't realistically
> expect other processes to signal different latches depending on
> where the launcher is waiting, so those cases have to be triggered
> by the same latch.
>
> However, WaitForReplicationWorkerAttach can't service any
> latch-setting conditions other than those managed by
> CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns,
> we have some other triggering condition, which is the business
> of some outer code level to deal with.  Having it re-set the latch
> to allow that to happen promptly after it returns seems like a
> pretty straightforward answer to me.

Yeah, that makes sense: we cannot simply wait on another latch that
is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch
looks like a simple way of resolving this issue.

--
Regards,
Dilip Kumar
Google