Re: Improving the latch handling between logical replication launcher and worker processes. - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Improving the latch handling between logical replication launcher and worker processes. |
Date | |
Msg-id | CAHut+PuYxOqfiw3v5GTMK10Hs0T0kTdMw6CmkFbCb5C52HEt2w@mail.gmail.com Whole thread Raw |
In response to | Re: Improving the latch handling between logical replication launcher and worker processes. (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Improving the latch handling between logical replication launcher and worker processes.
|
List | pgsql-hackers |
On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > c) Don't reset the latch at worker attach and allow launcher main to > > > identify and handle it. For this there is a patch v6-0002 available at > > > [2]. > > > > This option c seems the easiest. Can you explain what are the > > drawbacks of using this approach? > > This solution will resolve the issue. However, one drawback to > consider is that because we're not resetting the latch, in this > scenario, the launcher process will need to perform an additional > round of acquiring subscription details and determining whether the > worker should start, regardless of any changes in subscriptions. > Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach was not expecting to get notified. e.g.1. The WaitList comment in the function says so: /* * We need timeout because we generally don't get notified via latch * about the worker attach. But we don't expect to have to wait long. */ e.g.2 The logicalrep_worker_attach() function (which is AFAIK what WaitForReplicationWorkerAttach was waiting for) is not doing any SetLatch. So that matches what the comment said. ~~~ AFAICT the original problem reported by this thread happened because the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW wasn't expecting to be notified at all. ~~~ Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach: You said above that one drawback is "the launcher process will need to perform an additional round of acquiring subscription details and determining whether the worker should start, regardless of any changes in subscriptions" I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens during the attaching of other workers then the latch would (now after option c) remain set and so the WaitLatch of ApplyLauncherMain would be notified and/or return immediately end causing an immediate re-iteration of the "foreach(lc, sublist)" loop. But I don't understand why that is a problem. a) I didn't know what you meant "regardless of any changes in subscriptions" because I think the troublesome SetLatch originated from the CREATE SUBSCRIPTION and so there *is* a change to subscriptions. b) We will need to repeat that sublist loop anyway to start the worker for the new CREATE SUBSCRIPTION, and we need to do that at the earliest opportunity because the whole point of the SetLatch is so the CREATE SUBSCRIPTION worker can get started promptly. So the earlier we do that the better, right? c) AFAICT there is no danger of accidentally tying to starting workers who are still in the middle of trying to start (from the previous iteration) because those cases should be guarded by the ApplyLauncherGetWorkerStartTime logic. ~~ To summarise, I felt removing the ResetLatch and the WL_LATCH_SET bitset (like your v6-0002 patch does) is not only an easy way of fixing the problem reported by this thread, but also it now makes that WaitForReplicationWorkerAttach code behave like the comment ("we generally don't get notified via latch about the worker attach") is saying. (Unless there are some other problems with it that I can't see) ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: