Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Improving the latch handling between logical replication launcher and worker processes. |
Date | |
Msg-id | CALDaNm1cOeam9SR6vQ+tr3vnFqd9Q5poND-JfFB5kL-NNdds+Q@mail.gmail.com |
In response to | Re: Improving the latch handling between logical replication launcher and worker processes. (vignesh C <vignesh21@gmail.com>) |
List | pgsql-hackers |
On Mon, 8 Jul 2024 at 17:46, vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 05/07/2024 14:07, vignesh C wrote:
> > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> I'm don't quite understand the problem we're trying to fix:
> > >>
> > >>> Currently the launcher's latch is used for the following: a) worker
> > >>> process attach b) worker process exit and c) subscription creation.
> > >>> Since this same latch is used for multiple cases, the launcher process
> > >>> is not able to handle concurrent scenarios like: a) Launcher started a
> > >>> new apply worker and waiting for apply worker to attach and b) create
> > >>> subscription sub2 sending launcher wake up signal. In this scenario,
> > >>> both of them will set latch of the launcher process, the launcher
> > >>> process is not able to identify that both operations have occurred 1)
> > >>> worker is attached 2) subscription is created and apply worker should
> > >>> be started. As a result the apply worker does not get started for the
> > >>> new subscription created immediately and gets started after the
> > >>> timeout of 180 seconds.
> > >>
> > >> I don't see how we could miss a notification. Yes, both events will set
> > >> the same latch. Why is that a problem?
> > >
> > > The launcher process waits for the apply worker to attach at
> > > WaitForReplicationWorkerAttach function. The launcher's latch is
> > > getting set concurrently for another create subscription and apply
> > > worker attached. The launcher now detects the latch is set while
> > > waiting at WaitForReplicationWorkerAttach, it resets the latch and
> > > proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
> > > the latch has already been reset). Further details are provided below.
> > >
> > > The loop will see that the new
> > >> worker has attached, and that the new subscription has been created, and
> > >> process both events. Right?
> > >
> > > Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> > > processing the create subscription event.
> > >
> > > Slightly detailing further:
> > > In the scenario when we execute two concurrent create subscription
> > > commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> > > CREATE SUBSCRIPTION sub2.
> > > In a few random scenarios, the following events may unfold:
> > > After the first create subscription command(sub1), the backend will
> > > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > > Subsequently, the launcher process will reset the latch and identify
> > > the addition of a new subscription in the pg_subscription list. The
> > > launcher process will proceed to request the postmaster to start the
> > > apply worker background process (sub1) and request the postmaster to
> > > notify the launcher once the apply worker(sub1) has been started.
> > > Launcher will then wait for the apply worker(sub1) to attach in the
> > > WaitForReplicationWorkerAttach function.
> > > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> > > executed concurrently, also set the launcher's latch(because of
> > > ApplyLauncherWakeupAtCommit).
> > > Simultaneously when the launcher remains in the
> > > WaitForReplicationWorkerAttach function waiting for apply worker of
> > > sub1 to start, the postmaster will start the apply worker for
> > > subscription sub1 and send a SIGUSR1 signal to the launcher process
> > > via ReportBackgroundWorkerPID. Upon receiving this signal, the
> > > launcher process will then set its latch.
> > >
> > > Now, the launcher's latch has been set concurrently after the apply
> > > worker for sub1 is started and the execution of the CREATE
> > > SUBSCRIPTION sub2 command.
> > >
> > > At this juncture, the launcher, which had been awaiting the attachment
> > > of the apply worker, detects that the latch is set and proceeds to
> > > reset it. The reset operation of the latch occurs within the following
> > > section of code in WaitForReplicationWorkerAttach:
> > > ...
> > > rc = WaitLatch(MyLatch,
> > >                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > >                10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > > if (rc & WL_LATCH_SET)
> > > {
> > >     ResetLatch(MyLatch);
> > >     CHECK_FOR_INTERRUPTS();
> > > }
> > > ...
> > >
> > > After resetting the latch here, the activation signal intended to
> > > start the apply worker for subscription sub2 is no longer present. The
> > > launcher will return to the ApplyLauncherMain function, where it will
> > > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > > processing the create subscription request (i.e., creating a new apply
> > > worker for sub2).
> > > The issue arises from the latch being reset in
> > > WaitForReplicationWorkerAttach, which can occasionally delay the
> > > synchronization of table data for the subscription.
> >
> > Ok, I see it now. Thanks for the explanation. So the problem isn't using
> > the same latch for different purposes per se. It's that we're trying to
> > use it in a nested fashion, resetting it in the inner loop.
> >
> > Looking at the proposed patch more closely:
> >
> > > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> > >      * We need timeout because we generally don't get notified via latch
> > >      * about the worker attach. But we don't expect to have to wait long.
> > >      */
> > > -   rc = WaitLatch(MyLatch,
> > > +   rc = WaitLatch(&worker->launcherWakeupLatch,
> > >                    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > >                    10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > >     if (rc & WL_LATCH_SET)
> > >     {
> > > -       ResetLatch(MyLatch);
> > > +       ResetLatch(&worker->launcherWakeupLatch);
> > >         CHECK_FOR_INTERRUPTS();
> > >     }
> > > }
> >
> > The comment here is now outdated, right? The patch adds an explicit
> > SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> > about the attachment.
>
> I will update the comment for this.
>
> > Is the launcherWakeupLatch field protected by some lock, to protect
> > which process owns it? OwnLatch() is called in
> > logicalrep_worker_stop_internal() without holding a lock for example. Is
> > there a guarantee that only one process can do that at the same time?
>
> I have analyzed a few scenarios, I will analyze the remaining
> scenarios and update on this.
>
> > What happens if a process never calls DisownLatch(), e.g. because it
> > bailed out with an error while waiting for the worker to attach or stop?
>
> I tried a lot of scenarios by erroring out after ownlatch call and did
> not hit the panic code of OwnLatch yet. I will try a few more
> scenarios that I have in mind and see if we can hit the PANIC in
> OwnLatch and update on this.
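To make the lost wakeup quoted above easier to see, here is a minimal sketch of the two paths that race on the launcher's process latch. This is not the actual launcher.c code: the sketch_* helper names are made up, and the backend-side wakeup is simplified (in reality it travels via a signal whose handler ends up setting the launcher's latch).

#include "postgres.h"

#include "miscadmin.h"			/* CHECK_FOR_INTERRUPTS() */
#include "storage/latch.h"		/* Latch, MyLatch, WaitLatch(), ResetLatch(), SetLatch() */
#include "utils/wait_event.h"	/* WAIT_EVENT_BGWORKER_STARTUP (location varies by branch) */

/* Backend side, simplified: CREATE SUBSCRIPTION wakes the launcher at commit. */
static void
sketch_wakeup_launcher(Latch *launcher_latch)
{
	/* Event B: "a new subscription needs an apply worker". */
	SetLatch(launcher_latch);
}

/* Launcher side: nested wait for the just-started worker to attach. */
static void
sketch_wait_for_attach(volatile bool *worker_attached)
{
	while (!*worker_attached)
	{
		int			rc;

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   10L, WAIT_EVENT_BGWORKER_STARTUP);

		if (rc & WL_LATCH_SET)
		{
			/*
			 * Event A (worker attached, reported via SIGUSR1, which sets
			 * the launcher's latch) and event B above arrive on the same
			 * latch, so this ResetLatch() clears them indistinguishably.
			 * When the launcher returns to ApplyLauncherMain() the latch
			 * is no longer set, and it sleeps for DEFAULT_NAPTIME_PER_CYCLE
			 * instead of starting the apply worker for sub2.
			 */
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
		}
	}
}

This is the nested-reset pattern Heikki describes above: the problem is not that one latch serves several purposes, but that the inner loop resets it and thereby consumes the CREATE SUBSCRIPTION wakeup.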
I was able to hit the PANIC in OwnLatch by creating a large number of
subscriptions doing table sync with a limited number of slots. This
scenario and the above one are slightly related. I need some more time
to prepare the fix.

Regards,
Vignesh
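As a rough illustration of the ownership hazard discussed above (only a sketch, not the fix: it assumes the launcherWakeupLatch field the proposed patch adds, and sketch_wait_for_attach_on() is a made-up helper): a shared latch can have only one owner at a time, so OwnLatch() PANICs if the latch is still owned, which is what can happen when a process errors out before reaching DisownLatch(). One way to guarantee the ownership is released on every exit path is a PG_TRY()/PG_FINALLY() block:

#include "postgres.h"

#include "replication/worker_internal.h"	/* LogicalRepWorker */
#include "storage/latch.h"					/* OwnLatch(), DisownLatch() */

/* Body would be the dedicated-latch variant of the wait loop sketched earlier. */
static void sketch_wait_for_attach_on(Latch *latch);

static void
sketch_wait_with_owned_latch(LogicalRepWorker *worker)
{
	/*
	 * launcherWakeupLatch is the shared latch the proposed patch adds to
	 * LogicalRepWorker; it does not exist in current sources.  A shared
	 * latch has a single owner, so this PANICs if another process (or a
	 * previous attempt that never disowned it) still owns the latch.
	 */
	OwnLatch(&worker->launcherWakeupLatch);

	PG_TRY();
	{
		sketch_wait_for_attach_on(&worker->launcherWakeupLatch);
	}
	PG_FINALLY();
	{
		/* Runs on both normal and error exit, releasing ownership. */
		DisownLatch(&worker->launcherWakeupLatch);
	}
	PG_END_TRY();
}

Whether a lock is also needed so that two processes cannot reach OwnLatch() on the same worker's latch concurrently (the logicalrep_worker_stop_internal() case Heikki raises) is a separate question from the error-path cleanup shown here.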