Thread: Improving the latch handling between logical replication launcher and worker processes.
Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
Hi,

Currently the launcher's latch is used for the following: a) worker process attach, b) worker process exit, and c) subscription creation. Since this same latch is used for multiple cases, the launcher process is not able to handle concurrent scenarios like: a) the launcher started a new apply worker and is waiting for the apply worker to attach, and b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In this scenario, both of them will set the launcher process's latch, and the launcher is not able to identify that both operations have occurred: 1) a worker has attached, and 2) a subscription has been created and an apply worker should be started. As a result the apply worker does not get started immediately for the newly created subscription, and only gets started after the timeout of 180 seconds.

I have started a new thread for this based on suggestions at [1]. We could improve this in one of the following ways:

a) Introduce a new latch to handle worker attach and exit.

b) Add a new GUC launcher_retry_time, which gives more flexibility to users, as suggested by Amit at [1]. Before 5a3a953, wal_retrieve_retry_interval played a similar role to the suggested new GUC launcher_retry_time; e.g. even if a worker was launched, the launcher only waited wal_retrieve_retry_interval before the next round.

c) Don't reset the latch at worker attach, and allow the launcher's main loop to identify and handle it. For this there is a patch v6-0002 available at [2].

I'm not sure which approach is better in this case. I was thinking we should perhaps add a new latch to handle worker attach and exit. Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KR29XfBi5rObgV06xcBLn7y%2BLCuxcSMdKUkKZK740L2w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh
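[Editor's note: to illustrate the lost-wakeup mechanics outside PostgreSQL, here is a minimal, self-contained pthread analogue of a set/reset latch. It is an illustration only, not PostgreSQL code; all names and timings are invented. Two events set the same "latch" while its owner is busy; because a latch only remembers that it was set, not how many times, a single reset in a nested wait consumes both wakeups.]

/* lost_wakeup.c -- build with: cc lost_wakeup.c -o lost_wakeup -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* A toy latch: "set" is a single bit, so multiple sets coalesce. */
typedef struct
{
	pthread_mutex_t mu;
	pthread_cond_t	cv;
	bool			is_set;
} ToyLatch;

static ToyLatch latch = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false};
static bool worker_attached = false;	/* stands in for the worker's shmem slot */
static bool sub2_created = false;		/* stands in for a new pg_subscription row */

static void
set_latch(void)
{
	pthread_mutex_lock(&latch.mu);
	latch.is_set = true;
	pthread_cond_signal(&latch.cv);
	pthread_mutex_unlock(&latch.mu);
}

static void
wait_and_reset_latch(void)
{
	pthread_mutex_lock(&latch.mu);
	while (!latch.is_set)
		pthread_cond_wait(&latch.cv, &latch.mu);
	latch.is_set = false;		/* the ResetLatch() analogue */
	pthread_mutex_unlock(&latch.mu);
}

static void *
attach_notifier(void *arg)		/* "postmaster reports worker started" */
{
	sleep(1);
	worker_attached = true;
	set_latch();
	return NULL;
}

static void *
create_subscription(void *arg)	/* "concurrent CREATE SUBSCRIPTION sub2" */
{
	sleep(1);
	sub2_created = true;
	set_latch();				/* coalesces with the set above */
	return NULL;
}

int
main(void)
{
	pthread_t	t1, t2;

	pthread_create(&t1, NULL, attach_notifier, NULL);
	pthread_create(&t2, NULL, create_subscription, NULL);

	sleep(2);					/* the "launcher" is busy registering the
								 * bgworker; both events fire meanwhile */

	/* Nested wait for worker attach: one reset eats BOTH wakeups. */
	wait_and_reset_latch();
	printf("inner wait: worker_attached=%d, latch reset\n", worker_attached);

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Back in the "main loop": the sub2 wakeup is gone, so without a
	 * timeout we would sleep here indefinitely. */
	printf("outer loop: latch is_set=%d, but sub2_created=%d -> wakeup lost\n",
		   latch.is_set, sub2_created);
	return 0;
}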
RE: Improving the latch handling between logical replication launcher and worker processes.
From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, April 25, 2024 4:59 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> Currently the launcher's latch is used for the following: a) worker process
> attach, b) worker process exit, and c) subscription creation. Since this same
> latch is used for multiple cases, the launcher process is not able to handle
> concurrent scenarios like: a) the launcher started a new apply worker and is
> waiting for the apply worker to attach, and b) CREATE SUBSCRIPTION sub2
> sending the launcher a wakeup signal. In this scenario, both of them will set
> the launcher process's latch, and the launcher is not able to identify that
> both operations have occurred: 1) a worker has attached, and 2) a
> subscription has been created and an apply worker should be started. As a
> result the apply worker does not get started immediately for the newly
> created subscription, and only gets started after the timeout of 180 seconds.
>
> I have started a new thread for this based on suggestions at [1].
>
> a) Introduce a new latch to handle worker attach and exit.

I found that the startup process also uses two latches (see recoveryWakeupLatch) for different purposes, so maybe this is OK. But note that both the logical launcher and the apply worker call logicalrep_worker_launch(), so if we only add one new latch, both will wait on the same new latch, and the launcher may consume a SetLatch that should have been consumed by the apply worker (although that is rare).

> b) Add a new GUC launcher_retry_time, which gives more flexibility to users,
> as suggested by Amit at [1]. Before 5a3a953, wal_retrieve_retry_interval
> played a similar role to the suggested new GUC launcher_retry_time; e.g. even
> if a worker was launched, the launcher only waited wal_retrieve_retry_interval
> before the next round.

IIUC, the issue does not happen frequently, and may not be noticeable where apply workers aren't restarted often. So I am not sure it's worth adding a new GUC.

> c) Don't reset the latch at worker attach, and allow the launcher's main loop
> to identify and handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found we do something similar in RegisterSyncRequest() and WalSummarizerMain().

Best Regards,
Hou zj
RE: Improving the latch handling between logical replication launcher and worker processes.
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Vignesh,

Thanks for raising the idea!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for the launcher, so I feel we may be able to create a latch per wait event. Is there a reason to introduce only "a" (single) latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before 5a3a953, the
> wal_retrieve_retry_interval plays a similar role as the suggested new
> GUC launcher_retry_time, e.g. even if a worker is launched, the
> launcher only waits wal_retrieve_retry_interval time before the next round.

Hmm. My concern is how users would estimate the value. The default would presumably be 3 min, but should users change it? If so, to how long? I think even if it becomes tunable, they cannot tune it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Does this mean you want to remove the ResetLatch() from WaitForReplicationWorkerAttach()? If so, what about this scenario: 1) the launcher is waiting in WaitForReplicationWorkerAttach() for the worker to attach, and 2) a subscription is created before the worker attaches. In that case, the launcher becomes unable to sleep because the latch is set but never reset, which may waste CPU time.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Peter Smith
Date:
On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> ...
> a) Introduce a new latch to handle worker attach and exit.

IIUC there is normally only the "shared" latch or a "process local" latch, i.e. AFAICT it is quite uncommon to invent new latches as option a proposes. I did not see any examples of new latches (e.g. MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source other than the 'recoveryWakeupLatch' mentioned above by Hou-san. So this option could be OK, but OTOH since there is hardly any precedent, maybe that should be taken as an indication to avoid doing it this way (???)

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1].

I'm not sure that introducing a new GUC is a good option, because this seems a rare problem to have -- it will be hard to tune, since it will be difficult to know you even have this problem, and then difficult to know that it is fixed. Anyway, this made me consider more what the WaitLatch timeout value should be. Here are some thoughts:

Idea 1) I was wondering where that DEFAULT_NAPTIME_PER_CYCLE value of 180 seconds came from, or was it just a made-up number? AFAICT it came into existence in the first pub/sub commit [1], but there is no explanation of why 180s was chosen. Anyway, I assume a low value (5s?) would be bad because it incurs unacceptable CPU usage, right? But if 180s is too long and 5s is too short, then what does a "good" number even look like? E.g., if 60s is deemed OK, then is there any harm in just defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving it at that?

Idea 2) Another idea could be to use a "dynamic timeout" in the WaitLatch of ApplyLauncherMain. Let me try to explain my thought bubble (a sketch of this idea follows after this message):
- If the preceding foreach(lc, sublist) loop launched any workers, then the WaitLatch timeout can be a much shorter 10s.
- If the preceding foreach(lc, sublist) loop did NOT launch any workers (the most common case), then the WaitLatch timeout can be the usual 180s.

IIUC this strategy still gives any recently launched workers enough time to attach to shared memory, but it also means any concurrent CREATE SUBSCRIPTION will be addressed within 10s instead of 180s. Maybe this is sufficient to make an already rare problem insignificant.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

This option c seems the easiest. Can you explain the drawbacks of using this approach?

======
[1] github - https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf

Kind Regards,
Peter Smith.
Fujitsu Australia
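[Editor's note: a minimal sketch of what "idea 2" above might look like around ApplyLauncherMain()'s wait. This is not a real patch; launched_any is an assumed local variable that the preceding foreach(lc, sublist) loop would set whenever it called logicalrep_worker_launch().]

/* Hypothetical fragment sketching "idea 2"; launched_any is assumed. */
long		wait_time = launched_any
	? 10000L						/* 10s: recheck soon in case a wakeup was eaten */
	: DEFAULT_NAPTIME_PER_CYCLE;	/* the usual 180s nap */

(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 wait_time,
				 WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
ResetLatch(MyLatch);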
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Fri, 10 May 2024 at 07:39, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> Thanks for raising the idea!
>
> > a) Introduce a new latch to handle worker attach and exit.
>
> Just to confirm - there are three wait events for the launcher, so I feel we
> may be able to create a latch per wait event. Is there a reason to introduce
> only "a" (single) latch?

One latch is enough: we can use the new latch for both worker start and worker exit, and the other existing latch can be used for the remaining purposes. Something like the attached patch.

Regards,
Vignesh
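[Editor's note: a rough sketch of the shape of that change. The field name matches the diff quoted later in the thread; everything else, including initialization and locking details of the actual attached patch, is assumed here.]

/* Sketch only: a dedicated latch in each worker slot, so attach/exit
 * notifications stop competing with subscription-creation wakeups on
 * the launcher's own process latch. */
typedef struct LogicalRepWorker
{
	/* ... existing fields ... */
	Latch		launcherWakeupLatch;	/* launcher waits here for attach/exit */
} LogicalRepWorker;

/*
 * Launcher, before waiting for the worker:
 *     OwnLatch(&worker->launcherWakeupLatch);
 * (InitSharedLatch() would have been done once at slot initialization.)
 *
 * Worker, once it has attached, and again when it exits:
 *     SetLatch(&worker->launcherWakeupLatch);
 */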
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > c) Don't reset the latch at worker attach and allow launcher main to
> > identify and handle it. For this there is a patch v6-0002 available at
> > [2].
>
> This option c seems the easiest. Can you explain the drawbacks of
> using this approach?

This solution will resolve the issue. However, one drawback to consider is that because we're not resetting the latch, in this scenario the launcher process will need to perform an additional round of acquiring the subscription details and determining whether a worker should start, regardless of any changes in subscriptions.

Regards,
Vignesh
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Peter Smith
Date:
On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > c) Don't reset the latch at worker attach and allow launcher main to
> > > identify and handle it. For this there is a patch v6-0002 available at
> > > [2].
> >
> > This option c seems the easiest. Can you explain the drawbacks of
> > using this approach?
>
> This solution will resolve the issue. However, one drawback to
> consider is that because we're not resetting the latch, in this
> scenario the launcher process will need to perform an additional
> round of acquiring the subscription details and determining whether a
> worker should start, regardless of any changes in subscriptions.
>

Hmm. IIUC the WaitLatch of the launcher's WaitForReplicationWorkerAttach was not expecting to get notified.

e.g.1. The comment at that WaitLatch call says so:

/*
 * We need timeout because we generally don't get notified via latch
 * about the worker attach. But we don't expect to have to wait long.
 */

e.g.2. The logicalrep_worker_attach() function (which is AFAIK what WaitForReplicationWorkerAttach was waiting for) does not do any SetLatch. That matches what the comment says.

~~~

AFAICT the original problem reported by this thread happened because the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled by the WaitLatch/ResetLatch in WaitForReplicationWorkerAttach, which wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach.

You said above that one drawback is "the launcher process will need to perform an additional round of acquiring the subscription details and determining whether a worker should start, regardless of any changes in subscriptions".

I think you mean that if some CREATE SUBSCRIPTION (i.e. SetLatch) happens during the attaching of other workers, then the latch would (now, after option c) remain set, and so the WaitLatch of ApplyLauncherMain would be notified and/or return immediately, causing an immediate re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't understand what you meant by "regardless of any changes in subscriptions", because I think the troublesome SetLatch originated from the CREATE SUBSCRIPTION, and so there *is* a change to subscriptions.

b) We will need to repeat that sublist loop anyway to start the worker for the new CREATE SUBSCRIPTION, and we need to do that at the earliest opportunity, because the whole point of the SetLatch is so the CREATE SUBSCRIPTION worker can get started promptly. So the earlier we do that the better, right?

c) AFAICT there is no danger of accidentally trying to start workers that are still in the middle of starting (from the previous iteration), because those cases should be guarded by the ApplyLauncherGetWorkerStartTime logic.

~~

To summarise, I felt removing the ResetLatch and the WL_LATCH_SET bit (like your v6-0002 patch does) is not only an easy way of fixing the problem reported by this thread, but it also makes that WaitForReplicationWorkerAttach code behave like the comment ("we generally don't get notified via latch about the worker attach") says. A minimal sketch of that change follows below. (Unless there are some other problems with it that I can't see.)

======
Kind Regards,
Peter Smith.
Fujitsu Australia
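[Editor's note: a minimal sketch of option c as summarised above. The real v6-0002 patch is at [2] in the first message; this fragment only illustrates the shape of the change in WaitForReplicationWorkerAttach()'s wait loop.]

/* Option (c), sketched: wait purely on the timeout and leave the
 * launcher's process latch alone, so a SetLatch() from a concurrent
 * CREATE SUBSCRIPTION survives until ApplyLauncherMain()'s WaitLatch(). */
(void) WaitLatch(MyLatch,
				 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,	/* WL_LATCH_SET removed */
				 10L, WAIT_EVENT_BGWORKER_STARTUP);
CHECK_FOR_INTERRUPTS();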
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > c) Don't reset the latch at worker attach and allow launcher main to
> > > > identify and handle it. For this there is a patch v6-0002 available at
> > > > [2].
> > >
> > > This option c seems the easiest. Can you explain the drawbacks of
> > > using this approach?
> >
> > This solution will resolve the issue. However, one drawback to
> > consider is that because we're not resetting the latch, in this
> > scenario the launcher process will need to perform an additional
> > round of acquiring the subscription details and determining whether a
> > worker should start, regardless of any changes in subscriptions.
>
> Hmm. IIUC the WaitLatch of the launcher's WaitForReplicationWorkerAttach
> was not expecting to get notified.
>
> e.g.1. The comment at that WaitLatch call says so:
>
> /*
>  * We need timeout because we generally don't get notified via latch
>  * about the worker attach. But we don't expect to have to wait long.
>  */
>
> e.g.2. The logicalrep_worker_attach() function (which is AFAIK what
> WaitForReplicationWorkerAttach was waiting for) does not do any
> SetLatch. That matches what the comment says.
>
> AFAICT the original problem reported by this thread happened because
> the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
> by the WaitLatch/ResetLatch in WaitForReplicationWorkerAttach, which
> wasn't expecting to be notified at all.
>
> Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach.
>
> You said above that one drawback is "the launcher process will need to
> perform an additional round of acquiring the subscription details and
> determining whether a worker should start, regardless of any changes
> in subscriptions".
>
> I think you mean that if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
> during the attaching of other workers, then the latch would (now, after
> option c) remain set, and so the WaitLatch of ApplyLauncherMain would
> be notified and/or return immediately, causing an immediate
> re-iteration of the "foreach(lc, sublist)" loop.
>
> But I don't understand why that is a problem.
>
> a) I didn't understand what you meant by "regardless of any changes in
> subscriptions", because I think the troublesome SetLatch originated
> from the CREATE SUBSCRIPTION, and so there *is* a change to
> subscriptions.

The process of setting the latch unfolds as follows: upon creating a new subscription, the launcher process requests the postmaster to start a new apply worker process. The postmaster then starts the apply worker process and dispatches a SIGUSR1 signal to the launcher process (this is done from do_start_bgworker & ReportBackgroundWorkerPID). Upon receiving this signal, the launcher process sets its latch.

Now, there are two potential scenarios:

a) Concurrent creation of another subscription: in this case, the launcher traverses the subscription list, detects the newly created subscription, and starts a new apply worker for the concurrently created subscription. This is OK.

b) No concurrent subscription creation: in this case, since the latch remains set (it was never reset), the launcher still iterates through the subscription list, only to find that there are no new subscriptions. This is the additional check.

I'm talking about the second scenario, where no subscription is concurrently created. In that case, as the latch remains set, we perform an additional pass over the subscription list. There is no real problem with this; the same additional check can occur in the existing code too, if WaitForReplicationWorkerAttach returns from its initial "if" check, i.e. if the worker has already started by the time that check happens.

Regards,
Vignesh
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Heikki Linnakangas
Date:
I don't quite understand the problem we're trying to fix:

> Currently the launcher's latch is used for the following: a) worker
> process attach, b) worker process exit, and c) subscription creation.
> Since this same latch is used for multiple cases, the launcher process
> is not able to handle concurrent scenarios like: a) the launcher started
> a new apply worker and is waiting for the apply worker to attach, and
> b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In
> this scenario, both of them will set the launcher process's latch, and
> the launcher is not able to identify that both operations have
> occurred: 1) a worker has attached, and 2) a subscription has been
> created and an apply worker should be started. As a result the apply
> worker does not get started immediately for the newly created
> subscription, and only gets started after the timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set the same latch. Why is that a problem? The loop will see that the new worker has attached, and that the new subscription has been created, and process both events. Right?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I don't quite understand the problem we're trying to fix:
>
> > Currently the launcher's latch is used for the following: a) worker
> > process attach, b) worker process exit, and c) subscription creation.
> > Since this same latch is used for multiple cases, the launcher process
> > is not able to handle concurrent scenarios like: a) the launcher started
> > a new apply worker and is waiting for the apply worker to attach, and
> > b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In
> > this scenario, both of them will set the launcher process's latch, and
> > the launcher is not able to identify that both operations have
> > occurred: 1) a worker has attached, and 2) a subscription has been
> > created and an apply worker should be started. As a result the apply
> > worker does not get started immediately for the newly created
> > subscription, and only gets started after the timeout of 180 seconds.
>
> I don't see how we could miss a notification. Yes, both events will set
> the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach in the WaitForReplicationWorkerAttach function. The launcher's latch gets set concurrently for another CREATE SUBSCRIPTION and for the apply worker attach. The launcher detects that the latch is set while waiting in WaitForReplicationWorkerAttach, resets the latch, proceeds to the main loop, and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has already been reset). Further details are provided below.

> The loop will see that the new
> worker has attached, and that the new subscription has been created, and
> process both events. Right?

Since the latch is reset in WaitForReplicationWorkerAttach, it skips processing the create subscription event.

In slightly more detail: consider executing two concurrent CREATE SUBSCRIPTION commands, first CREATE SUBSCRIPTION sub1, followed immediately by CREATE SUBSCRIPTION sub2. In a few random scenarios, the following events may unfold:

After the first CREATE SUBSCRIPTION command (sub1), the backend will set the launcher's latch because of ApplyLauncherWakeupAtCommit. Subsequently, the launcher process will reset the latch and identify the addition of a new subscription in the pg_subscription list. The launcher process will then request the postmaster to start the apply worker background process (sub1) and to notify the launcher once the apply worker (sub1) has been started. The launcher will then wait for the apply worker (sub1) to attach in the WaitForReplicationWorkerAttach function.

Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was executed concurrently, also sets the launcher's latch (because of ApplyLauncherWakeupAtCommit).

While the launcher remains in the WaitForReplicationWorkerAttach function waiting for the apply worker of sub1 to start, the postmaster will start the apply worker for subscription sub1 and send a SIGUSR1 signal to the launcher process via ReportBackgroundWorkerPID. Upon receiving this signal, the launcher process will set its latch.

So now the launcher's latch has been set concurrently: once after the apply worker for sub1 started, and once by the execution of CREATE SUBSCRIPTION sub2.

At this juncture the launcher, which had been awaiting the attachment of the apply worker, detects that the latch is set and resets it. The reset occurs within the following section of code in WaitForReplicationWorkerAttach:

...
rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}
...

After the latch is reset here, the wakeup intended to start the apply worker for subscription sub2 is no longer present. The launcher returns to the ApplyLauncherMain function, where it waits for DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before processing the create subscription request (i.e., creating a new apply worker for sub2). The issue arises from the latch being reset in WaitForReplicationWorkerAttach, which can occasionally delay the synchronization of table data for the subscription.

Another concurrency scenario, as mentioned by Alexander Lakhin at [1], involves an apply worker exiting with failure concurrently with a CREATE SUBSCRIPTION. The underlying cause of both issues is the same: the reset of the latch in WaitForReplicationWorkerAttach. Even when there is no failure, such issues contribute to a delay of 180 seconds in the execution of buildfarm tests and local tests. This was noticed in the buildfarm, where 027_nosuperuser.pl took more than 180 seconds at [2]:

...
[21:11:21] t/025_rep_changes_for_schema.pl .... ok     3340 ms ( 0.00 usr  0.00 sys +  0.43 cusr  0.23 csys =  0.66 CPU)
[21:11:25] t/026_stats.pl ..................... ok     4499 ms ( 0.00 usr  0.00 sys +  0.42 cusr  0.21 csys =  0.63 CPU)
[21:14:31] t/027_nosuperuser.pl ............... ok   185457 ms ( 0.01 usr  0.00 sys +  5.91 cusr  2.68 csys =  8.60 CPU)
[21:14:35] t/028_row_filter.pl ................ ok     3998 ms ( 0.01 usr  0.00 sys +  0.86 cusr  0.34 csys =  1.21 CPU)
...

And it was also noticed by Robert at [3].

[1] - https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com
[2] - https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check
[3] - https://www.postgresql.org/message-id/CA%2BTgmoZkj%3DA39i4obKXADMhzJW%3D6dyGq-C1aGfb%2BjUy9XvxwYA%40mail.gmail.com

Regards,
Vignesh
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Heikki Linnakangas
Date:
On 05/07/2024 14:07, vignesh C wrote:
> On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> I don't quite understand the problem we're trying to fix:
>>
>>> Currently the launcher's latch is used for the following: a) worker
>>> process attach, b) worker process exit, and c) subscription creation.
>>> Since this same latch is used for multiple cases, the launcher process
>>> is not able to handle concurrent scenarios like: a) the launcher started
>>> a new apply worker and is waiting for the apply worker to attach, and
>>> b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In
>>> this scenario, both of them will set the launcher process's latch, and
>>> the launcher is not able to identify that both operations have
>>> occurred: 1) a worker has attached, and 2) a subscription has been
>>> created and an apply worker should be started. As a result the apply
>>> worker does not get started immediately for the newly created
>>> subscription, and only gets started after the timeout of 180 seconds.
>>
>> I don't see how we could miss a notification. Yes, both events will set
>> the same latch. Why is that a problem?
>
> The launcher process waits for the apply worker to attach in the
> WaitForReplicationWorkerAttach function. The launcher's latch gets set
> concurrently for another CREATE SUBSCRIPTION and for the apply worker
> attach. The launcher detects that the latch is set while waiting in
> WaitForReplicationWorkerAttach, resets the latch, proceeds to the main
> loop, and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has already
> been reset). Further details are provided below.
>
>> The loop will see that the new
>> worker has attached, and that the new subscription has been created, and
>> process both events. Right?
>
> Since the latch is reset in WaitForReplicationWorkerAttach, it skips
> processing the create subscription event.
>
> In slightly more detail: consider executing two concurrent CREATE
> SUBSCRIPTION commands, first CREATE SUBSCRIPTION sub1, followed
> immediately by CREATE SUBSCRIPTION sub2. In a few random scenarios,
> the following events may unfold:
>
> After the first CREATE SUBSCRIPTION command (sub1), the backend will
> set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> Subsequently, the launcher process will reset the latch and identify
> the addition of a new subscription in the pg_subscription list. The
> launcher process will then request the postmaster to start the apply
> worker background process (sub1) and to notify the launcher once the
> apply worker (sub1) has been started. The launcher will then wait for
> the apply worker (sub1) to attach in the WaitForReplicationWorkerAttach
> function.
>
> Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
> executed concurrently, also sets the launcher's latch (because of
> ApplyLauncherWakeupAtCommit).
>
> While the launcher remains in the WaitForReplicationWorkerAttach
> function waiting for the apply worker of sub1 to start, the postmaster
> will start the apply worker for subscription sub1 and send a SIGUSR1
> signal to the launcher process via ReportBackgroundWorkerPID. Upon
> receiving this signal, the launcher process will set its latch.
>
> So now the launcher's latch has been set concurrently: once after the
> apply worker for sub1 started, and once by the execution of CREATE
> SUBSCRIPTION sub2.
>
> At this juncture the launcher, which had been awaiting the attachment
> of the apply worker, detects that the latch is set and resets it. The
> reset occurs within the following section of code in
> WaitForReplicationWorkerAttach:
>
> ...
> rc = WaitLatch(MyLatch,
>                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                10L, WAIT_EVENT_BGWORKER_STARTUP);
>
> if (rc & WL_LATCH_SET)
> {
>     ResetLatch(MyLatch);
>     CHECK_FOR_INTERRUPTS();
> }
> ...
>
> After the latch is reset here, the wakeup intended to start the apply
> worker for subscription sub2 is no longer present. The launcher
> returns to the ApplyLauncherMain function, where it waits for
> DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before processing the
> create subscription request (i.e., creating a new apply worker for
> sub2). The issue arises from the latch being reset in
> WaitForReplicationWorkerAttach, which can occasionally delay the
> synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using the same latch for different purposes per se. It's that we're trying to use it in a nested fashion, resetting it in the inner loop.

Looking at the proposed patch more closely:

> @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>   * We need timeout because we generally don't get notified via latch
>   * about the worker attach. But we don't expect to have to wait long.
>   */
> - rc = WaitLatch(MyLatch,
> + rc = WaitLatch(&worker->launcherWakeupLatch,
>        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>        10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>   if (rc & WL_LATCH_SET)
>   {
> -     ResetLatch(MyLatch);
> +     ResetLatch(&worker->launcherWakeupLatch);
>       CHECK_FOR_INTERRUPTS();
>   }
> }

The comment here is now outdated, right? The patch adds an explicit SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher about the attachment.

Is the launcherWakeupLatch field protected by some lock, to protect which process owns it? OwnLatch() is called in logicalrep_worker_stop_internal() without holding a lock, for example. Is there a guarantee that only one process can do that at the same time?

What happens if a process never calls DisownLatch(), e.g. because it bailed out with an error while waiting for the worker to attach or stop?

As an alternative, smaller fix, I think we could do the attached. It forces the launcher's main loop to do another iteration if it calls logicalrep_worker_launch(). That extra iteration should pick up any missed notifications.

Your solution with an additional latch seems better because it makes WaitForReplicationWorkerAttach() react more quickly, without the 10 s wait. I'm surprised we have that in the first place; 10 s seems like a pretty long time to wait for a parallel apply worker to start. Why was that ever OK?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
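[Editor's note: a sketch of the smaller alternative described above. The actual patch is in the attachment to this mail; "launched" and "need_worker" here are assumed locals used only to illustrate the control flow.]

/* ApplyLauncherMain() main loop, sketched: skip the nap once if anything
 * was launched this cycle, so a wakeup consumed inside
 * WaitForReplicationWorkerAttach() is picked up on the next pass. */
bool		launched = false;

foreach(lc, sublist)
{
	/* ... existing per-subscription checks ... */
	if (need_worker)			/* assumed: sub enabled, no worker yet */
	{
		logicalrep_worker_launch( /* ... existing arguments ... */ );
		launched = true;
	}
}

if (!launched)
{
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
	ResetLatch(MyLatch);
}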
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Amit Kapila
Date:
On Fri, Jul 5, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place; 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?
>

Doesn't that call wait for 10 milliseconds? The comment atop WaitLatch ("The "timeout" is given in milliseconds...") indicates the timeout is in milliseconds.

-- 
With Regards,
Amit Kapila.
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/07/2024 14:07, vignesh C wrote:
> > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> I don't quite understand the problem we're trying to fix:
> >>
> >>> Currently the launcher's latch is used for the following: a) worker
> >>> process attach, b) worker process exit, and c) subscription creation.
> >>> Since this same latch is used for multiple cases, the launcher process
> >>> is not able to handle concurrent scenarios like: a) the launcher started
> >>> a new apply worker and is waiting for the apply worker to attach, and
> >>> b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In
> >>> this scenario, both of them will set the launcher process's latch, and
> >>> the launcher is not able to identify that both operations have
> >>> occurred: 1) a worker has attached, and 2) a subscription has been
> >>> created and an apply worker should be started. As a result the apply
> >>> worker does not get started immediately for the newly created
> >>> subscription, and only gets started after the timeout of 180 seconds.
> >>
> >> I don't see how we could miss a notification. Yes, both events will set
> >> the same latch. Why is that a problem?
> >
> > The launcher process waits for the apply worker to attach in the
> > WaitForReplicationWorkerAttach function. The launcher's latch gets set
> > concurrently for another CREATE SUBSCRIPTION and for the apply worker
> > attach. The launcher detects that the latch is set while waiting in
> > WaitForReplicationWorkerAttach, resets the latch, proceeds to the main
> > loop, and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has already
> > been reset). Further details are provided below.
> >
> > In slightly more detail: consider executing two concurrent CREATE
> > SUBSCRIPTION commands, first CREATE SUBSCRIPTION sub1, followed
> > immediately by CREATE SUBSCRIPTION sub2. In a few random scenarios,
> > the following events may unfold:
> >
> > After the first CREATE SUBSCRIPTION command (sub1), the backend will
> > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > Subsequently, the launcher process will reset the latch and identify
> > the addition of a new subscription in the pg_subscription list. The
> > launcher process will then request the postmaster to start the apply
> > worker background process (sub1) and to notify the launcher once the
> > apply worker (sub1) has been started. The launcher will then wait for
> > the apply worker (sub1) to attach in the WaitForReplicationWorkerAttach
> > function.
> >
> > Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
> > executed concurrently, also sets the launcher's latch (because of
> > ApplyLauncherWakeupAtCommit).
> >
> > While the launcher remains in the WaitForReplicationWorkerAttach
> > function waiting for the apply worker of sub1 to start, the postmaster
> > will start the apply worker for subscription sub1 and send a SIGUSR1
> > signal to the launcher process via ReportBackgroundWorkerPID. Upon
> > receiving this signal, the launcher process will set its latch.
> >
> > So now the launcher's latch has been set concurrently: once after the
> > apply worker for sub1 started, and once by the execution of CREATE
> > SUBSCRIPTION sub2.
> >
> > At this juncture the launcher, which had been awaiting the attachment
> > of the apply worker, detects that the latch is set and resets it.
> >
> > After the latch is reset there, the wakeup intended to start the apply
> > worker for subscription sub2 is no longer present. The launcher
> > returns to the ApplyLauncherMain function, where it waits for
> > DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before processing the
> > create subscription request (i.e., creating a new apply worker for
> > sub2). The issue arises from the latch being reset in
> > WaitForReplicationWorkerAttach, which can occasionally delay the
> > synchronization of table data for the subscription.
>
> Ok, I see it now. Thanks for the explanation. So the problem isn't using
> the same latch for different purposes per se. It's that we're trying to
> use it in a nested fashion, resetting it in the inner loop.
>
> Looking at the proposed patch more closely:
>
> > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> >   * We need timeout because we generally don't get notified via latch
> >   * about the worker attach. But we don't expect to have to wait long.
> >   */
> > - rc = WaitLatch(MyLatch,
> > + rc = WaitLatch(&worker->launcherWakeupLatch,
> >        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >        10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> >   if (rc & WL_LATCH_SET)
> >   {
> > -     ResetLatch(MyLatch);
> > +     ResetLatch(&worker->launcherWakeupLatch);
> >       CHECK_FOR_INTERRUPTS();
> >   }
> > }
>
> The comment here is now outdated, right? The patch adds an explicit
> SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> about the attachment.

I will update the comment for this.

> Is the launcherWakeupLatch field protected by some lock, to protect
> which process owns it? OwnLatch() is called in
> logicalrep_worker_stop_internal() without holding a lock, for example. Is
> there a guarantee that only one process can do that at the same time?

I have analyzed a few scenarios; I will analyze the remaining scenarios and update on this.

> What happens if a process never calls DisownLatch(), e.g. because it
> bailed out with an error while waiting for the worker to attach or stop?

I tried a lot of scenarios by erroring out after the OwnLatch call and have not hit the PANIC code path in OwnLatch yet. I will try a few more scenarios that I have in mind to see whether we can hit the PANIC in OwnLatch, and update on this.

> As an alternative, smaller fix, I think we could do the attached. It
> forces the launcher's main loop to do another iteration, if it calls
> logicalrep_worker_launch(). That extra iteration should pick up any
> missed notifications.

This also works. I had another solution along similar lines at [1], where we don't reset the latch in the inner loop. I'm not sure which solution we should proceed with. Thoughts?

> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place; 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?

Here 10 means 10 milliseconds, so it just waits for 10 milliseconds before the next iteration.

[1] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Amit Kapila
Date:
On Mon, Jul 8, 2024 at 5:47 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > As an alternative, smaller fix, I think we could do the attached. It
> > forces the launcher's main loop to do another iteration, if it calls
> > logicalrep_worker_launch(). That extra iteration should pick up any
> > missed notifications.
>
> This also works.
>

The minor drawback would be that in many cases the extra iteration would not lead to anything meaningful.

-- 
With Regards,
Amit Kapila.
Re: Improving the latch handling between logical replication launcher and worker processes.
From
vignesh C
Date:
On Mon, 8 Jul 2024 at 17:46, vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 05/07/2024 14:07, vignesh C wrote:
> > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> I don't quite understand the problem we're trying to fix:
> > >>
> > >>> Currently the launcher's latch is used for the following: a) worker
> > >>> process attach, b) worker process exit, and c) subscription creation.
> > >>> Since this same latch is used for multiple cases, the launcher process
> > >>> is not able to handle concurrent scenarios like: a) the launcher started
> > >>> a new apply worker and is waiting for the apply worker to attach, and
> > >>> b) CREATE SUBSCRIPTION sub2 sending the launcher a wakeup signal. In
> > >>> this scenario, both of them will set the launcher process's latch, and
> > >>> the launcher is not able to identify that both operations have
> > >>> occurred: 1) a worker has attached, and 2) a subscription has been
> > >>> created and an apply worker should be started. As a result the apply
> > >>> worker does not get started immediately for the newly created
> > >>> subscription, and only gets started after the timeout of 180 seconds.
> > >>
> > >> I don't see how we could miss a notification. Yes, both events will set
> > >> the same latch. Why is that a problem?
> > >
> > > The launcher process waits for the apply worker to attach in the
> > > WaitForReplicationWorkerAttach function. The launcher's latch gets set
> > > concurrently for another CREATE SUBSCRIPTION and for the apply worker
> > > attach. The launcher detects that the latch is set while waiting in
> > > WaitForReplicationWorkerAttach, resets the latch, proceeds to the main
> > > loop, and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has already
> > > been reset).
> > >
> > > While the launcher remains in the WaitForReplicationWorkerAttach
> > > function waiting for the apply worker of sub1 to start, the postmaster
> > > will start the apply worker for subscription sub1 and send a SIGUSR1
> > > signal to the launcher process via ReportBackgroundWorkerPID. Upon
> > > receiving this signal, the launcher process will set its latch.
> > >
> > > So now the launcher's latch has been set concurrently: once after the
> > > apply worker for sub1 started, and once by the execution of CREATE
> > > SUBSCRIPTION sub2.
> > >
> > > At this juncture the launcher, which had been awaiting the attachment
> > > of the apply worker, detects that the latch is set and resets it.
> > >
> > > After the latch is reset there, the wakeup intended to start the apply
> > > worker for subscription sub2 is no longer present. The launcher
> > > returns to the ApplyLauncherMain function, where it waits for
> > > DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before processing the
> > > create subscription request (i.e., creating a new apply worker for
> > > sub2). The issue arises from the latch being reset in
> > > WaitForReplicationWorkerAttach, which can occasionally delay the
> > > synchronization of table data for the subscription.
> >
> > Ok, I see it now. Thanks for the explanation. So the problem isn't using
> > the same latch for different purposes per se. It's that we're trying to
> > use it in a nested fashion, resetting it in the inner loop.
> >
> > Looking at the proposed patch more closely:
> >
> > > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> > >   * We need timeout because we generally don't get notified via latch
> > >   * about the worker attach. But we don't expect to have to wait long.
> > >   */
> > > - rc = WaitLatch(MyLatch,
> > > + rc = WaitLatch(&worker->launcherWakeupLatch,
> > >        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > >        10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > >   if (rc & WL_LATCH_SET)
> > >   {
> > > -     ResetLatch(MyLatch);
> > > +     ResetLatch(&worker->launcherWakeupLatch);
> > >       CHECK_FOR_INTERRUPTS();
> > >   }
> > > }
> >
> > The comment here is now outdated, right? The patch adds an explicit
> > SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> > about the attachment.
>
> I will update the comment for this.
>
> > Is the launcherWakeupLatch field protected by some lock, to protect
> > which process owns it? OwnLatch() is called in
> > logicalrep_worker_stop_internal() without holding a lock, for example. Is
> > there a guarantee that only one process can do that at the same time?
>
> I have analyzed a few scenarios; I will analyze the remaining
> scenarios and update on this.
>
> > What happens if a process never calls DisownLatch(), e.g. because it
> > bailed out with an error while waiting for the worker to attach or stop?
>
> I tried a lot of scenarios by erroring out after the OwnLatch call and
> have not hit the PANIC code path in OwnLatch yet. I will try a few more
> scenarios that I have in mind to see whether we can hit the PANIC in
> OwnLatch, and update on this.

I could reproduce the PANIC in OwnLatch by running many CREATE SUBSCRIPTION commands doing table sync with a limited number of slots. This and the above scenario are slightly related; I need some more time to prepare the fix.

Regards,
Vignesh
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Kyotaro Horiguchi
Date:
At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C <vignesh21@gmail.com> wrote in
> The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest patch (v2), and the frequent Own/Disown-Latch operations caught my attention. Additionally, handling multiple concurrently operating trigger sources with nested latch waits seems bug-prone, which I'd prefer to avoid from both a readability and a safety perspective.

With that in mind, I'd like to suggest an alternative approach. I may not be fully aware of all the previous discussions, so apologies if this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal() wait on a latch after verifying the desired state. This ensures that even if there are spurious or missed wakeups, they won't cause issues. In contrast, ApplyLauncherMain() enters a latch wait without checking the desired state first. Consequently, if another process sets the latch to wake up the main loop while the former two functions are waiting, that wakeup could be missed. If my understanding is correct, the problem lies in ApplyLauncherMain() not checking the expected state before beginning to wait on the latch. There is no issue with waiting if the state hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that triggers a main-loop wakeup before calling WaitLatch(). To do this, we could add a flag in LogicalRepCtxStruct to signal that the main loop has immediate tasks to handle. ApplyLauncherWakeup() would set this flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check this flag before calling WaitLatch() and skip the WaitLatch() call if the flag is set.

I think this approach could solve the issue without adding complexity. What do you think?

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
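[Editor's note: a sketch of this proposal. The flag name is invented here: LogicalRepCtxStruct has no such member today, and ApplyLauncherWakeup() actually wakes the launcher by sending SIGUSR1 to its pid.]

typedef struct LogicalRepCtxStruct
{
	/* ... existing fields ... */
	sig_atomic_t wakeup_pending;	/* hypothetical: main loop has work */
} LogicalRepCtxStruct;

/* ApplyLauncherWakeup(), sketched: publish the flag, then wake up */
LogicalRepCtx->wakeup_pending = true;
/* ... send SIGUSR1 to the launcher's pid, as today ... */

/*
 * ApplyLauncherMain(), before sleeping: skip the wait if work is already
 * pending, and clear the flag before the next scan so a wakeup arriving
 * during the scan is not lost.
 */
if (!LogicalRepCtx->wakeup_pending)
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
ResetLatch(MyLatch);
LogicalRepCtx->wakeup_pending = false;

The ordering mirrors the usual latch protocol: the flag is set before the signal is sent, and cleared before pg_subscription is re-read, so a wakeup can be duplicated but never dropped.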
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Heikki Linnakangas
Date:
On 04/09/2024 14:24, vignesh C wrote:
> On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>> I think this approach could solve the issue without adding
>> complexity. What do you think?
>
> I agree that this approach is simpler than the other approach. How
> about something like the attached patch to handle the same?

I haven't looked at these new patches from the last few days, but please also note the work at https://www.postgresql.org/message-id/476672e7-62f1-4cab-a822-f3a8e949dd3f%40iki.fi. If those "interrupts" patches are committed, this is pretty straightforward to fix by using a separate interrupt bit for this, as the patch on that thread does.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improving the latch handling between logical replication launcher and worker processes.
From
Alexander Lakhin
Date:
Hello,

04.09.2024 16:53, Heikki Linnakangas wrote:
> On 04/09/2024 14:24, vignesh C wrote:
>>
>> I agree that this approach is simpler than the other approach. How
>> about something like the attached patch to handle the same?
>
> I haven't looked at these new patches from the last few days, but please
> also note the work at
> https://www.postgresql.org/message-id/476672e7-62f1-4cab-a822-f3a8e949dd3f%40iki.fi.
> If those "interrupts" patches are committed, this is pretty straightforward
> to fix by using a separate interrupt bit for this, as the patch on that
> thread does.

I'd also like to add that this issue leads to buildfarm test failures, because of the race condition between

#define DEFAULT_NAPTIME_PER_CYCLE 180000L

and

$timeout_default = 180

That is, in the situation described above, where "the apply worker does not get started for the new subscription created immediately and gets started after the timeout of 180 seconds", 014_binary.pl can fail if wait_for_log()'s 180 seconds pass sooner:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2025-02-09%2011%3A45%3A05

I reproduced this failure locally when running 50 014_binary tests in parallel, and got failures on iterations 4, 14, and 10. But with PG_TEST_TIMEOUT_DEFAULT=190, 30 iterations passed for me (8 of them took 180+ seconds).

Best regards,
Alexander Lakhin
Neon (https://neon.tech)