Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Improving the latch handling between logical replication launcher and worker processes.
Date
Msg-id CAHut+PuYxOqfiw3v5GTMK10Hs0T0kTdMw6CmkFbCb5C52HEt2w@mail.gmail.com
Whole thread Raw
In response to Re: Improving the latch handling between logical replication launcher and worker processes.  (vignesh C <vignesh21@gmail.com>)
Responses Re: Improving the latch handling between logical replication launcher and worker processes.
List pgsql-hackers
On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> >
> > > c) Don't reset the latch at worker attach and allow launcher main to
> > > identify and handle it. For this there is a patch v6-0002 available at
> > > [2].
> >
> > This option c seems the easiest. Can you explain what are the
> > drawbacks of using this approach?
>
> This solution will resolve the issue. However, one drawback to
> consider is that because we're not resetting the latch, in this
> scenario, the launcher process will need to perform an additional
> round of acquiring subscription details and determining whether the
> worker should start, regardless of any changes in subscriptions.
>

Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
was not expecting to get notified.

e.g.1. The WaitList comment in the function says so:
/*
 * We need timeout because we generally don't get notified via latch
 * about the worker attach.  But we don't expect to have to wait long.
 */

e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
WaitForReplicationWorkerAttach  was waiting for) is not doing any
SetLatch. So that matches what the comment said.

~~~

AFAICT the original problem reported by this thread happened because
the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:

You said above that one drawback is "the launcher process will need to
perform an additional round of acquiring subscription details and
determining whether the worker should start, regardless of any changes
in subscriptions"

I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
during the attaching of other workers then the latch would (now after
option c) remain set and so the WaitLatch of ApplyLauncherMain would
be notified and/or return immediately end causing an immediate
re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't know what you meant "regardless of any changes in
subscriptions" because I think the troublesome SetLatch originated
from the CREATE SUBSCRIPTION and so there *is* a change to
subscriptions.

b) We will need to repeat that sublist loop anyway to start the worker
for the new CREATE SUBSCRIPTION, and we need to do that at the
earliest opportunity because the whole point of the SetLatch is so the
CREATE SUBSCRIPTION worker can get started promptly. So the earlier we
do that the better, right?

c) AFAICT there is no danger of accidentally tying to starting workers
who are still in the middle of trying to start (from the previous
iteration) because those cases should be guarded by the
ApplyLauncherGetWorkerStartTime logic.

~~

To summarise, I felt removing the ResetLatch and the WL_LATCH_SET
bitset (like your v6-0002 patch does) is not only an easy way of
fixing the problem reported by this thread, but also it now makes that
WaitForReplicationWorkerAttach code behave like the comment ("we
generally don't get notified via latch about the worker attach") is
saying.

(Unless there are some other problems with it that I can't see)

======
Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: "Long Song"
Date:
Subject: Re:Avoid an odd undefined behavior with memcmp (src/bin/pg_rewind/pg_rewind.c)
Next
From: Chris Cleveland
Date:
Subject: Implementing CustomScan over an index