Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Improving the latch handling between logical replication launcher and worker processes. |
Date | |
Msg-id | CALDaNm0dw6uP2LW1YhCsF+khF59AVxmkcptM762GEVS5YZWbHg@mail.gmail.com |
In response to | Re: Improving the latch handling between logical replication launcher and worker processes. (Peter Smith <smithpb2250@gmail.com>) |
List | pgsql-hackers |
On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > > c) Don't reset the latch at worker attach and allow launcher main to
> > > > identify and handle it. For this there is a patch v6-0002 available at
> > > > [2].
> > >
> > > This option c seems the easiest. Can you explain what are the
> > > drawbacks of using this approach?
> >
> > This solution will resolve the issue. However, one drawback to
> > consider is that because we're not resetting the latch, in this
> > scenario, the launcher process will need to perform an additional
> > round of acquiring subscription details and determining whether the
> > worker should start, regardless of any changes in subscriptions.
> >
>
> Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
> was not expecting to get notified.
>
> e.g.1. The WaitList comment in the function says so:
> /*
>  * We need timeout because we generally don't get notified via latch
>  * about the worker attach. But we don't expect to have to wait long.
>  */
>
> e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
> WaitForReplicationWorkerAttach was waiting for) is not doing any
> SetLatch. So that matches what the comment said.
>
> ~~~
>
> AFAICT the original problem reported by this thread happened because
> the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
> by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
> wasn't expecting to be notified at all.
>
> ~~~
>
> Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:
>
> You said above that one drawback is "the launcher process will need to
> perform an additional round of acquiring subscription details and
> determining whether the worker should start, regardless of any changes
> in subscriptions"
>
> I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
> during the attaching of other workers then the latch would (now after
> option c) remain set and so the WaitLatch of ApplyLauncherMain would
> be notified and/or return immediately and causing an immediate
> re-iteration of the "foreach(lc, sublist)" loop.
>
> But I don't understand why that is a problem.
>
> a) I didn't know what you meant "regardless of any changes in
> subscriptions" because I think the troublesome SetLatch originated
> from the CREATE SUBSCRIPTION and so there *is* a change to
> subscriptions.

The latch gets set as follows: when a new subscription is created, the
launcher asks the postmaster to start a new apply worker. The postmaster
starts the apply worker and sends SIGUSR1 to the launcher (this is done
from do_start_bgworker & ReportBackgroundWorkerPID). On receiving this
signal, the launcher sets its latch.

Now, there are two possible scenarios:
a) Another subscription is created concurrently: the launcher goes
through the subscription list, finds the newly created subscription, and
starts a new apply worker for it. This is ok.
b) No subscription is created concurrently: since the latch has not been
reset (it is still set), the launcher goes through the subscription list
again and finds that there are no new subscriptions. This is the
additional check.
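
To make the two scenarios concrete, here is a toy model in C. It is only
a sketch, not PostgreSQL code: the latch is reduced to a boolean flag,
and all the toy_* names and the ToyOutcome struct are made up for
illustration. It assumes the attach wait either resets the latch (as the
current code does) or leaves it alone (option c), and then checks whether
the launcher main loop would wake up immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a process latch: just a flag. */
typedef struct ToyLatch
{
    bool is_set;
} ToyLatch;

/* What the launcher main loop does right after the attach wait. */
typedef struct ToyOutcome
{
    bool rescanned_immediately; /* woke up without waiting for the timeout */
    bool new_sub_noticed;       /* a concurrently created subscription was seen */
} ToyOutcome;

void
toy_set_latch(ToyLatch *latch)
{
    assert(latch != NULL);
    latch->is_set = true;
}

void
toy_reset_latch(ToyLatch *latch)
{
    assert(latch != NULL);
    latch->is_set = false;
}

/*
 * Model one launcher cycle after it has asked for a new apply worker.
 *
 * reset_in_attach_wait: true models the current behaviour (the attach wait
 *                       resets the latch), false models option c.
 * concurrent_create:    true if a CREATE SUBSCRIPTION sets the latch while
 *                       the launcher is waiting for the worker to attach.
 */
ToyOutcome
toy_launcher_cycle(bool reset_in_attach_wait, bool concurrent_create)
{
    ToyLatch   latch = {false};
    ToyOutcome out = {false, false};

    /* The worker-started notification (SIGUSR1) sets the launcher's latch. */
    toy_set_latch(&latch);

    /* A concurrent CREATE SUBSCRIPTION sets the very same latch. */
    if (concurrent_create)
        toy_set_latch(&latch);

    /* Attach wait: a reset here cannot tell the two notifications apart. */
    if (reset_in_attach_wait)
        toy_reset_latch(&latch);

    /* Main loop wait: a set latch means immediate wakeup and a list rescan. */
    if (latch.is_set)
    {
        toy_reset_latch(&latch);
        out.rescanned_immediately = true;
        out.new_sub_noticed = concurrent_create;
    }

    return out;
}
```

For example, toy_launcher_cycle(true, true) returns with both fields
false, which corresponds to the lost wakeup reported in this thread,
while toy_launcher_cycle(false, false) returns rescanned_immediately =
true and new_sub_noticed = false, which is the additional check of
scenario b).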

I'm talking about the second scenario, where no subscription is created
concurrently. In this case, as the latch has not been reset, we perform
an additional check on the subscription list. There is no problem with
this. This additional check can occur in the existing code too, if the
function WaitForReplicationWorkerAttach returns from the initial if
check, i.e. if the worker has already started when this check happens.

Regards,
Vignesh