Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers

From vignesh C
Subject Re: Improving the latch handling between logical replication launcher and worker processes.
Date
Msg-id CALDaNm0dw6uP2LW1YhCsF+khF59AVxmkcptM762GEVS5YZWbHg@mail.gmail.com
Whole thread Raw
In response to Re: Improving the latch handling between logical replication launcher and worker processes.  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > > c) Don't reset the latch at worker attach and allow launcher main to
> > > > identify and handle it. For this there is a patch v6-0002 available at
> > > > [2].
> > >
> > > This option c seems the easiest. Can you explain what are the
> > > drawbacks of using this approach?
> >
> > This solution will resolve the issue. However, one drawback to
> > consider is that because we're not resetting the latch, in this
> > scenario, the launcher process will need to perform an additional
> > round of acquiring subscription details and determining whether the
> > worker should start, regardless of any changes in subscriptions.
> >
>
> Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
> was not expecting to get notified.
>
> e.g.1. The WaitList comment in the function says so:
> /*
>  * We need timeout because we generally don't get notified via latch
>  * about the worker attach.  But we don't expect to have to wait long.
>  */
>
> e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
> WaitForReplicationWorkerAttach  was waiting for) is not doing any
> SetLatch. So that matches what the comment said.
>
> ~~~
>
> AFAICT the original problem reported by this thread happened because
> the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
> by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
> wasn't expecting to be notified at all.
>
> ~~~
>
> Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:
>
> You said above that one drawback is "the launcher process will need to
> perform an additional round of acquiring subscription details and
> determining whether the worker should start, regardless of any changes
> in subscriptions"
>
> I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
> during the attaching of other workers then the latch would (now after
> option c) remain set and so the WaitLatch of ApplyLauncherMain would
> be notified and/or return immediately end causing an immediate
> re-iteration of the "foreach(lc, sublist)" loop.
>
> But I don't understand why that is a problem.
>
> a) I didn't know what you meant "regardless of any changes in
> subscriptions" because I think the troublesome SetLatch originated
> from the CREATE SUBSCRIPTION and so there *is* a change to
> subscriptions.

The process of setting the latch unfolds as follows: Upon creating a
new subscription, the launcher process initiates a request to the
postmaster, prompting it to initiate a new apply worker process.
Subsequently, the postmaster commences the apply worker process and
dispatches a SIGUSR1 signal to the launcher process(this is done from
do_start_bgworker & ReportBackgroundWorkerPID). Upon receiving this
signal, the launcher process sets the latch.
Now, there are two potential scenarios:
a) Concurrent Creation of Another Subscription: In this situation, the
launcher traverses the subscription list to detect the creation of a
new subscription and proceeds to initiate a new apply worker for the
concurrently created subscription. This is ok. b) Absence of
Concurrent Subscription Creation: In this case, since the latch
remains unset, the launcher iterates through the subscription list and
identifies the absence of new subscriptions. This verification occurs
as the latch remains unset. Here there is an additional check.
I'm talking about the second scenario where no subscription is
concurrently created. In this case, as the latch remains unset, we
perform an additional check on the subscription list. There is no
problem with this.
This additional check can occur in the existing code too if the
function WaitForReplicationWorkerAttach returns from the initial if
check i.e. if the worker already started when this check happens.

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Next
From: Will Mortensen
Date:
Subject: Re: Exposing the lock manager's WaitForLockers() to SQL