Thread: Improving the latch handling between logical replication launcher and worker processes.

Hi,

Currently the launcher's latch is used for the following: a) worker
process attach, b) worker process exit, and c) subscription creation.
Since the same latch is used for multiple cases, the launcher process
cannot handle concurrent scenarios such as: a) the launcher started a
new apply worker and is waiting for it to attach, and b) CREATE
SUBSCRIPTION sub2 sends the launcher a wakeup signal. In this
scenario, both events set the launcher's latch, and the launcher
cannot tell that two things happened: 1) the worker attached, and 2) a
subscription was created for which an apply worker should be started.
As a result, the apply worker for the newly created subscription is
not started immediately; it starts only after the 180-second timeout.
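The lost wakeup can be sketched with a minimal, self-contained C model
of a latch (illustrative only; ToyLatch and observed_wakeups are
made-up names, not PostgreSQL code). Two distinct events set the same
flag, and a single reset consumes both:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-flag latch: setting it is idempotent, so N sets before a
 * wait collapse into one wakeup; a reset clears them all. */
typedef struct { bool is_set; } ToyLatch;

static void toy_set_latch(ToyLatch *l)   { l->is_set = true; }
static void toy_reset_latch(ToyLatch *l) { l->is_set = false; }

/* Model the launcher: n_events latch-sets arrive (worker attach,
 * CREATE SUBSCRIPTION) while it waits in the inner attach loop.
 * Returns how many wakeups the launcher actually observes. */
static int observed_wakeups(int n_events)
{
    ToyLatch latch = { false };
    int      seen = 0;

    for (int i = 0; i < n_events; i++)
        toy_set_latch(&latch);      /* both events set the same latch */

    /* Inner loop (the attach wait): wakes once... */
    if (latch.is_set)
    {
        seen++;
        toy_reset_latch(&latch);    /* ...and the reset clears BOTH */
    }

    /* Outer loop (the launcher main loop): latch is clear, so no
     * wakeup; the second event is handled only after the long nap. */
    if (latch.is_set)
        seen++;

    return seen;
}
```

With two events only one wakeup is observed, which is the behavior
described above.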

I have started a new thread for this based on suggestions at [1].

We could improve this in one of the following ways:
a) Introduce a new latch to handle worker attach and exit.
b) Add a new GUC launcher_retry_time, which gives more flexibility to
users, as suggested by Amit at [1]. Before 5a3a953, the
wal_retrieve_retry_interval played a similar role to the suggested new
GUC launcher_retry_time; e.g., even if a worker was launched, the
launcher waited only wal_retrieve_retry_interval before the next round.
c) Don't reset the latch at worker attach and allow the launcher's
main loop to identify and handle it. For this there is a patch,
v6-0002, available at [2].

I'm not sure which approach is better here. I was wondering whether we
should add a new latch to handle worker attach and exit.
Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KR29XfBi5rObgV06xcBLn7y%2BLCuxcSMdKUkKZK740L2w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh



RE: Improving the latch handling between logical replication launcher and worker processes.
From: "Zhijie Hou (Fujitsu)"

On Thursday, April 25, 2024 4:59 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> Hi,
> 
> Currently the launcher's latch is used for the following: a) worker process attach
> b) worker process exit and c) subscription creation.
> Since this same latch is used for multiple cases, the launcher process is not able
> to handle concurrent scenarios like: a) Launcher started a new apply worker and
> waiting for apply worker to attach and b) create subscription sub2 sending
> launcher wake up signal. In this scenario, both of them will set latch of the
> launcher process, the launcher process is not able to identify that both
> operations have occurred 1) worker is attached 2) subscription is created and
> apply worker should be started. As a result the apply worker does not get
> started for the new subscription created immediately and gets started after the
> timeout of 180 seconds.
> 
> I have started a new thread for this based on suggestions at [1].
> 
> a) Introduce a new latch to handle worker attach and exit.

I found that the startup process also uses two latches (see
recoveryWakeupLatch) for different purposes, so maybe this is OK. But
note that both the logical replication launcher and the apply worker
call logicalrep_worker_launch(); if we add only one new latch, both
will wait on the same new latch, and the launcher may consume a
SetLatch that should have been consumed by the apply worker (although
that would be rare).
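That concern can be modeled with the same kind of toy latch (ToyLatch
and toy_consume are hypothetical names used purely for illustration):
whichever waiter checks first consumes the wakeup, leaving nothing for
the other.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool is_set; } ToyLatch;

/* A waiter checks the shared latch and, if set, consumes (resets) it.
 * Returns true if this waiter got the wakeup. */
static bool toy_consume(ToyLatch *l)
{
    if (l->is_set)
    {
        l->is_set = false;
        return true;
    }
    return false;
}
```

If the launcher and an apply worker both wait on one shared new latch,
a single SetLatch intended for the worker can be claimed by the
launcher instead, and the worker then sees nothing until its timeout.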

> b) Add a new GUC launcher_retry_time which gives more flexibility to users as
> suggested by Amit at [1]. Before 5a3a953, the wal_retrieve_retry_interval plays
> a similar role as the suggested new GUC launcher_retry_time, e.g. even if a
> worker is launched, the launcher only wait wal_retrieve_retry_interval time
> before next round.

IIUC, the issue does not happen frequently, and may not be noticeable
where apply workers aren't restarted often. So I am slightly unsure
whether it's worth adding a new GUC.

> c) Don't reset the latch at worker attach and allow launcher main to identify and
> handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found that we do something similar in
RegisterSyncRequest() and WalSummarizerMain().

Best Regards,
Hou zj

RE: Improving the latch handling between logical replication launcher and worker processes.
From: "Hayato Kuroda (Fujitsu)"
Dear Vignesh,

Thanks for raising this idea!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm: there are three wait events for the launcher, so I
feel we may be able to create a latch per wait event. Is there a
reason to introduce only a single new latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before 5a3a953, the
> wal_retrieve_retry_interval plays a similar role as the suggested new
> GUC launcher_retry_time, e.g. even if a worker is launched, the
> launcher only wait wal_retrieve_retry_interval time before next round.

Hmm. My concern is how users would estimate the value. The default
would presumably be 3 min, but should users change it? If so, to what?
I think that even if it becomes tunable, users cannot tune it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Do you mean that you want to remove the ResetLatch() from
WaitForReplicationWorkerAttach()? If so, what about this scenario:

1) The launcher is waiting in WaitForReplicationWorkerAttach() for the
worker to attach, and 2) a subscription is created before the worker
attaches. In this case the launcher becomes un-sleepable, because the
latch is set but never reset. It may waste CPU time.
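The busy-wait concern can be sketched as follows (again a toy model,
with made-up names TOY_WL_*, ToyLatch and nonsleeping_iterations): if
the latch is set but never reset, every subsequent wait returns
immediately instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_WL_LATCH_SET = 1, TOY_WL_TIMEOUT = 2 };

typedef struct { bool is_set; } ToyLatch;

/* Toy WaitLatch: returns immediately if the latch is set; otherwise
 * pretend we slept until the timeout expired. */
static int toy_wait_latch(const ToyLatch *l)
{
    return l->is_set ? TOY_WL_LATCH_SET : TOY_WL_TIMEOUT;
}

/* Run `iters` rounds of the attach-wait loop with the latch already
 * set (e.g. by an earlier CREATE SUBSCRIPTION); count the iterations
 * that return without sleeping. */
static int nonsleeping_iterations(bool reset_in_loop, int iters)
{
    ToyLatch latch = { true };
    int      wakeups = 0;

    for (int i = 0; i < iters; i++)
    {
        if (toy_wait_latch(&latch) & TOY_WL_LATCH_SET)
        {
            wakeups++;
            if (reset_in_loop)
                latch.is_set = false;   /* current code: reset, then sleep */
        }
    }
    return wakeups;
}
```

With the reset removed, every iteration of the attach wait returns
immediately, which is the CPU-waste scenario described above.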

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
>
...
> a) Introduce a new latch to handle worker attach and exit.

IIUC there is normally only the "shared" latch or a "process-local"
latch; AFAICT it is quite uncommon to invent new latches as option (a)
proposes. I did not see any examples of creating new latches (e.g.
MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source other than the
'recoveryWakeupLatch' mentioned above by Hou-san. So this option could
be OK, but OTOH, since there is hardly any precedent, maybe that
should be taken as an indication to avoid doing it this way (?)

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1].

I'm not sure that introducing a new GUC is a good option, because this
seems a rare problem to have -- it will be hard to tune, since it will
be difficult to know you even have this problem and then difficult to
know that it is fixed. Anyway, this made me consider more what the
WaitLatch timeout value should be. Here are some thoughts:

Idea 1)

I was wondering where that DEFAULT_NAPTIME_PER_CYCLE value of 180
seconds came from, or was it just a made-up number? AFAICT it came
into existence in the first pub/sub commit [1], but there is no
explanation for why 180s was chosen. Anyway, I assume a low value
(5s?) would be bad because it incurs unacceptable CPU usage, right?
But if 180s is too long and 5s is too short, what does a "good" number
even look like? E.g., if 60s is deemed OK, is there any harm in just
defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving it at that?

Idea 2)

Another idea could be to use a "dynamic timeout" in the WaitLatch of
ApplyLauncherMain. Let me try to explain my thought bubble:
- If the preceding foreach(lc, sublist) loop launched any workers,
then the WaitLatch timeout can be a much shorter 10s.
- If the preceding foreach(lc, sublist) loop did NOT launch any
workers (the most common case), then the WaitLatch timeout can be the
usual 180s.

IIUC this strategy will still give any recently launched workers
enough time to attach to shared memory, but it also means that any
concurrent CREATE SUBSCRIPTION will be addressed within 10s instead of
180s. Maybe this is sufficient to make an already rare problem
insignificant.
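Idea 2 amounts to a one-line policy. A sketch (launcher_nap_ms is a
hypothetical helper, and the 10 s short nap is only the value floated
above, not an established constant):

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_NAPTIME_PER_CYCLE_MS (180 * 1000L)  /* existing 180 s nap */
#define SHORT_NAPTIME_MS             (10 * 1000L)   /* proposed short nap */

/* Choose the ApplyLauncherMain WaitLatch timeout: nap briefly if the
 * preceding sublist loop launched any workers, else nap normally. */
static long launcher_nap_ms(bool launched_any_worker)
{
    return launched_any_worker ? SHORT_NAPTIME_MS
                               : DEFAULT_NAPTIME_PER_CYCLE_MS;
}
```

The common case (no worker launched) keeps the existing 180 s behavior;
only the cycle immediately after a launch pays the shorter nap.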


> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

This option (c) seems the easiest. Can you explain what the drawbacks
of this approach are?

======
[1] github -
https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf

Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, 10 May 2024 at 07:39, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> Thanks for raising idea!
>
> > a) Introduce a new latch to handle worker attach and exit.
>
> Just to confirm - there are three wait events for launchers, so I feel we may be
> able to create latches per wait event. Is there a reason to introduce
> "a" latch?

One latch is enough: we can use the new latch for both worker start
and worker exit. The other existing latch can be used for other
purposes. Something like the attached patch.

Regards,
Vignesh

On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >
>
> > c) Don't reset the latch at worker attach and allow launcher main to
> > identify and handle it. For this there is a patch v6-0002 available at
> > [2].
>
> This option c seems the easiest. Can you explain what are the
> drawbacks of using this approach?

This solution will resolve the issue. However, one drawback to
consider is that, because we're not resetting the latch, the launcher
process will need to perform an additional round of fetching the
subscription details and determining whether a worker should be
started, regardless of whether any subscriptions changed.

Regards,
Vignesh



On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> >
> > > c) Don't reset the latch at worker attach and allow launcher main to
> > > identify and handle it. For this there is a patch v6-0002 available at
> > > [2].
> >
> > This option c seems the easiest. Can you explain what are the
> > drawbacks of using this approach?
>
> This solution will resolve the issue. However, one drawback to
> consider is that because we're not resetting the latch, in this
> scenario, the launcher process will need to perform an additional
> round of acquiring subscription details and determining whether the
> worker should start, regardless of any changes in subscriptions.
>

Hmm. IIUC the WaitLatch in the launcher's
WaitForReplicationWorkerAttach was not expecting to get notified.

e.g.1. The WaitLatch comment in the function says so:
/*
 * We need timeout because we generally don't get notified via latch
 * about the worker attach.  But we don't expect to have to wait long.
 */

e.g.2. The logicalrep_worker_attach() function (which AFAIK is what
WaitForReplicationWorkerAttach was waiting for) does not do any
SetLatch. So that matches what the comment says.

~~~

AFAICT the original problem reported by this thread happened because
the SetLatch (from CREATE SUBSCRIPTION) was inadvertently gobbled by
the WaitForReplicationWorkerAttach WaitLatch/ResetLatch, which BTW
wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:

You said above that one drawback is "the launcher process will need to
perform an additional round of acquiring subscription details and
determining whether the worker should start, regardless of any changes
in subscriptions"

I think you mean that if some CREATE SUBSCRIPTION (i.e. SetLatch)
happens during the attaching of other workers, then the latch would
(now, after option c) remain set, so the WaitLatch of
ApplyLauncherMain would be notified and/or return immediately, causing
an immediate re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't understand what you meant by "regardless of any changes in
subscriptions", because I think the troublesome SetLatch originated
from the CREATE SUBSCRIPTION, and so there *is* a change to
subscriptions.

b) We will need to repeat that sublist loop anyway to start the worker
for the new CREATE SUBSCRIPTION, and we need to do that at the
earliest opportunity, because the whole point of the SetLatch is so
that the CREATE SUBSCRIPTION worker can get started promptly. So the
earlier we do that, the better, right?

c) AFAICT there is no danger of accidentally trying to start workers
that are still in the middle of starting (from the previous
iteration), because those cases should be guarded by the
ApplyLauncherGetWorkerStartTime logic.

~~

To summarise, I felt that removing the ResetLatch and the WL_LATCH_SET
bitset (like your v6-0002 patch does) is not only an easy way of
fixing the problem reported by this thread, but it also makes the
WaitForReplicationWorkerAttach code behave as the comment ("we
generally don't get notified via latch about the worker attach") says.

(Unless there are some other problems with it that I can't see)

======
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > > c) Don't reset the latch at worker attach and allow launcher main to
> > > > identify and handle it. For this there is a patch v6-0002 available at
> > > > [2].
> > >
> > > This option c seems the easiest. Can you explain what are the
> > > drawbacks of using this approach?
> >
> > This solution will resolve the issue. However, one drawback to
> > consider is that because we're not resetting the latch, in this
> > scenario, the launcher process will need to perform an additional
> > round of acquiring subscription details and determining whether the
> > worker should start, regardless of any changes in subscriptions.
> >
>
> Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
> was not expecting to get notified.
>
> e.g.1. The WaitList comment in the function says so:
> /*
>  * We need timeout because we generally don't get notified via latch
>  * about the worker attach.  But we don't expect to have to wait long.
>  */
>
> e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
> WaitForReplicationWorkerAttach  was waiting for) is not doing any
> SetLatch. So that matches what the comment said.
>
> ~~~
>
> AFAICT the original problem reported by this thread happened because
> the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
> by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
> wasn't expecting to be notified at all.
>
> ~~~
>
> Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:
>
> You said above that one drawback is "the launcher process will need to
> perform an additional round of acquiring subscription details and
> determining whether the worker should start, regardless of any changes
> in subscriptions"
>
> I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
> during the attaching of other workers then the latch would (now after
> option c) remain set and so the WaitLatch of ApplyLauncherMain would
> be notified and/or return immediately end causing an immediate
> re-iteration of the "foreach(lc, sublist)" loop.
>
> But I don't understand why that is a problem.
>
> a) I didn't know what you meant "regardless of any changes in
> subscriptions" because I think the troublesome SetLatch originated
> from the CREATE SUBSCRIPTION and so there *is* a change to
> subscriptions.

The latch is set as follows: upon creation of a new subscription, the
launcher process requests the postmaster to start a new apply worker
process. The postmaster then starts the apply worker and sends a
SIGUSR1 signal to the launcher process (this is done from
do_start_bgworker & ReportBackgroundWorkerPID). Upon receiving this
signal, the launcher's latch is set.
Now, there are two potential scenarios:
a) Concurrent creation of another subscription: the launcher traverses
the subscription list, detects the newly created subscription, and
starts a new apply worker for it. This is OK.
b) No concurrent subscription creation: since the latch was not reset,
the launcher iterates through the subscription list and finds no new
subscriptions. This is the additional check.
I'm talking about the second scenario, where no subscription is
created concurrently. In this case, since the latch was not reset, we
perform an additional pass over the subscription list. There is no
problem with this.
This additional check can occur with the existing code too, if
WaitForReplicationWorkerAttach returns from its initial check, i.e. if
the worker had already started when that check ran.

Regards,
Vignesh



I don't quite understand the problem we're trying to fix:

> Currently the launcher's latch is used for the following: a) worker
> process attach b) worker process exit and c) subscription creation.
> Since this same latch is used for multiple cases, the launcher process
> is not able to handle concurrent scenarios like: a) Launcher started a
> new apply worker and waiting for apply worker to attach and b) create
> subscription sub2 sending launcher wake up signal. In this scenario,
> both of them will set latch of the launcher process, the launcher
> process is not able to identify that both operations have occurred 1)
> worker is attached 2) subscription is created and apply worker should
> be started. As a result the apply worker does not get started for the
> new subscription created immediately and gets started after the
> timeout of 180 seconds.

I don't see how we could miss a notification. Yes, both events will set 
the same latch. Why is that a problem? The loop will see that the new 
worker has attached, and that the new subscription has been created, and 
process both events. Right?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I'm don't quite understand the problem we're trying to fix:
>
> > Currently the launcher's latch is used for the following: a) worker
> > process attach b) worker process exit and c) subscription creation.
> > Since this same latch is used for multiple cases, the launcher process
> > is not able to handle concurrent scenarios like: a) Launcher started a
> > new apply worker and waiting for apply worker to attach and b) create
> > subscription sub2 sending launcher wake up signal. In this scenario,
> > both of them will set latch of the launcher process, the launcher
> > process is not able to identify that both operations have occurred 1)
> > worker is attached 2) subscription is created and apply worker should
> > be started. As a result the apply worker does not get started for the
> > new subscription created immediately and gets started after the
> > timeout of 180 seconds.
>
> I don't see how we could miss a notification. Yes, both events will set
> the same latch. Why is that a problem?

The launcher process waits for the apply worker to attach in the
WaitForReplicationWorkerAttach function. The launcher's latch gets set
concurrently by another CREATE SUBSCRIPTION and by the apply worker
attaching. The launcher detects that the latch is set while waiting in
WaitForReplicationWorkerAttach, resets the latch, proceeds to the main
loop, and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has
already been reset). Further details are provided below.

> The loop will see that the new
> worker has attached, and that the new subscription has been created, and
> process both events. Right?

Since the latch is reset at WaitForReplicationWorkerAttach, it skips
processing the create subscription event.

Slightly more detail:
Consider the scenario where we execute two concurrent CREATE
SUBSCRIPTION commands: first CREATE SUBSCRIPTION sub1, followed
immediately by CREATE SUBSCRIPTION sub2.
In a few random scenarios, the following events may unfold:
After the first create subscription command (sub1), the backend sets
the launcher's latch because of ApplyLauncherWakeupAtCommit.
Subsequently, the launcher process resets the latch and identifies the
addition of a new subscription in the pg_subscription list. The
launcher then requests the postmaster to start the apply worker
background process (sub1) and to notify the launcher once the apply
worker (sub1) has been started. The launcher then waits for the apply
worker (sub1) to attach in the WaitForReplicationWorkerAttach
function.
Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
executed concurrently, also sets the launcher's latch (because of
ApplyLauncherWakeupAtCommit).
While the launcher is still in WaitForReplicationWorkerAttach waiting
for the apply worker of sub1 to start, the postmaster starts the apply
worker for subscription sub1 and sends a SIGUSR1 signal to the
launcher process via ReportBackgroundWorkerPID. Upon receiving this
signal, the launcher's latch is set.

Now, the launcher's latch has been set concurrently, both by the start
of the apply worker for sub1 and by the execution of the CREATE
SUBSCRIPTION sub2 command.

At this juncture, the launcher, which had been awaiting the attachment
of the apply worker, detects that the latch is set and proceeds to
reset it. The reset operation of the latch occurs within the following
section of code in WaitForReplicationWorkerAttach:
...
rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}
...

After the latch is reset here, the wakeup intended to start the apply
worker for subscription sub2 is no longer present. The launcher
returns to the ApplyLauncherMain function, where it waits for
DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before processing the
create subscription request (i.e., creating a new apply worker for
sub2).
The issue arises from the latch being reset in
WaitForReplicationWorkerAttach, which can occasionally delay the
synchronization of table data for the subscription.
Another concurrency scenario, as mentioned by Alexander Lakhin at [1],
involves "apply worker exiting with failure and concurrent create
subscription". The underlying cause of both issues is the same: the
reset of the latch in WaitForReplicationWorkerAttach.
Even when there is no failure, such issues contribute to a delay of
180 seconds in the execution of buildfarm tests and local tests.

This was noticed in the buildfarm, where 027_nosuperuser.pl took more
than 180 seconds at [2]:
...
[21:11:21] t/025_rep_changes_for_schema.pl .... ok     3340 ms ( 0.00
usr  0.00 sys +  0.43 cusr  0.23 csys =  0.66 CPU)
[21:11:25] t/026_stats.pl ..................... ok     4499 ms ( 0.00
usr  0.00 sys +  0.42 cusr  0.21 csys =  0.63 CPU)
[21:14:31] t/027_nosuperuser.pl ............... ok   185457 ms ( 0.01
usr  0.00 sys +  5.91 cusr  2.68 csys =  8.60 CPU)
[21:14:35] t/028_row_filter.pl ................ ok     3998 ms ( 0.01
usr  0.00 sys +  0.86 cusr  0.34 csys =  1.21 CPU)
...

And also by Robert at [3].

[1] - https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com
[2] -
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check
[3] -
https://www.postgresql.org/message-id/CA%2BTgmoZkj%3DA39i4obKXADMhzJW%3D6dyGq-C1aGfb%2BjUy9XvxwYA%40mail.gmail.com

Regards,
Vignesh



On 05/07/2024 14:07, vignesh C wrote:
> On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> I'm don't quite understand the problem we're trying to fix:
>>
>>> Currently the launcher's latch is used for the following: a) worker
>>> process attach b) worker process exit and c) subscription creation.
>>> Since this same latch is used for multiple cases, the launcher process
>>> is not able to handle concurrent scenarios like: a) Launcher started a
>>> new apply worker and waiting for apply worker to attach and b) create
>>> subscription sub2 sending launcher wake up signal. In this scenario,
>>> both of them will set latch of the launcher process, the launcher
>>> process is not able to identify that both operations have occurred 1)
>>> worker is attached 2) subscription is created and apply worker should
>>> be started. As a result the apply worker does not get started for the
>>> new subscription created immediately and gets started after the
>>> timeout of 180 seconds.
>>
>> I don't see how we could miss a notification. Yes, both events will set
>> the same latch. Why is that a problem?
> 
> The launcher process waits for the apply worker to attach at
> WaitForReplicationWorkerAttach function. The launcher's latch is
> getting set concurrently for another create subscription and apply
> worker attached. The launcher now detects the latch is set while
> waiting at WaitForReplicationWorkerAttach, it resets the latch and
> proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
> the latch has already been reset). Further details are provided below.
> 
> The loop will see that the new
>> worker has attached, and that the new subscription has been created, and
>> process both events. Right?
> 
> Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> processing the create subscription event.
> 
> Slightly detailing further:
> In the scenario when we execute two concurrent create subscription
> commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> CREATE SUBSCRIPTION sub2.
> In a few random scenarios, the following events may unfold:
> After the first create subscription command(sub1), the backend will
> set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> Subsequently, the launcher process will reset the latch and identify
> the addition of a new subscription in the pg_subscription list. The
> launcher process will proceed to request the postmaster to start the
> apply worker background process (sub1) and request the postmaster to
> notify the launcher once the apply worker(sub1) has been started.
> Launcher will then wait for the apply worker(sub1) to attach in the
> WaitForReplicationWorkerAttach function.
> Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> executed concurrently, also set the launcher's latch(because of
> ApplyLauncherWakeupAtCommit).
> Simultaneously when the launcher remains in the
> WaitForReplicationWorkerAttach function waiting for apply worker of
> sub1 to start, the postmaster will start the apply worker for
> subscription sub1 and send a SIGUSR1 signal to the launcher process
> via ReportBackgroundWorkerPID. Upon receiving this signal, the
> launcher process will then set its latch.
> 
> Now, the launcher's latch has been set concurrently after the apply
> worker for sub1 is started and the execution of the CREATE
> SUBSCRIPTION sub2 command.
> 
> At this juncture, the launcher, which had been awaiting the attachment
> of the apply worker, detects that the latch is set and proceeds to
> reset it. The reset operation of the latch occurs within the following
> section of code in WaitForReplicationWorkerAttach:
> ...
> rc = WaitLatch(MyLatch,
>     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>     10L, WAIT_EVENT_BGWORKER_STARTUP);
> 
> if (rc & WL_LATCH_SET)
> {
> ResetLatch(MyLatch);
> CHECK_FOR_INTERRUPTS();
> }
> ...
> 
> After resetting the latch here, the activation signal intended to
> start the apply worker for subscription sub2 is no longer present. The
> launcher will return to the ApplyLauncherMain function, where it will
> await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> processing the create subscription request (i.e., creating a new apply
> worker for sub2).
> The issue arises from the latch being reset in
> WaitForReplicationWorkerAttach, which can occasionally delay the
> synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using 
the same latch for different purposes per se. It's that we're trying to 
use it in a nested fashion, resetting it in the inner loop.

Looking at the proposed patch more closely:

> @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>           * We need timeout because we generally don't get notified via latch
>           * about the worker attach.  But we don't expect to have to wait long.
>           */
> -        rc = WaitLatch(MyLatch,
> +        rc = WaitLatch(&worker->launcherWakeupLatch,
>                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                         10L, WAIT_EVENT_BGWORKER_STARTUP);
>  
>          if (rc & WL_LATCH_SET)
>          {
> -            ResetLatch(MyLatch);
> +            ResetLatch(&worker->launcherWakeupLatch);
>              CHECK_FOR_INTERRUPTS();
>          }
>      }

The comment here is now outdated, right? The patch adds an explicit 
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher 
about the attachment.

Is the launcherWakeupLatch field protected by some lock, to protect 
which process owns it? OwnLatch() is called in 
logicalrep_worker_stop_internal() without holding a lock for example. Is 
there a guarantee that only one process can do that at the same time?

What happens if a process never calls DisownLatch(), e.g. because it 
bailed out with an error while waiting for the worker to attach or stop?

As an alternative, smaller fix, I think we could do the attached. It 
forces the launcher's main loop to do another iteration, if it calls 
logicalrep_worker_launch(). That extra iteration should pick up any 
missed notifications.
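That smaller fix can be sketched as a toy loop (a model with the
made-up name naps_before_pending_handled, not the actual patch):
forcing one extra iteration after a launch lets the rescan pick up the
swallowed wakeup without waiting out a nap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy launcher cycle: `pending_sub` models a CREATE SUBSCRIPTION whose
 * latch wakeup was swallowed by the attach wait. Count how many naps
 * (of up to 180 s each) pass before the pending subscription gets a
 * worker, with and without the forced extra iteration. */
static int naps_before_pending_handled(bool force_extra_iteration)
{
    bool pending_sub = true;   /* sub2: its wakeup was already consumed */
    bool launched    = true;   /* we launched sub1's worker this cycle */
    int  naps        = 0;

    while (pending_sub)
    {
        if (!(launched && force_extra_iteration))
            naps++;            /* no wakeup left, so wait out the nap */
        launched = false;

        /* The next iteration rescans pg_subscription and finds sub2. */
        pending_sub = false;
    }
    return naps;
}
```

With the forced iteration, the pending subscription is handled with no
nap at all; without it, at least one full nap passes first.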

Your solution with an additional latch seems better because it makes 
WaitForReplicationWorkerAttach() react more quickly, without the 10 s 
wait. I'm surprised we have that in the first place, 10 s seems like a 
pretty long time to wait for a parallel apply worker to start. Why was 
that ever OK?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On Fri, Jul 5, 2024 at 6:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>
> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place, 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?
>

Isn't the call wait for 10 milliseconds?  The comment atop
WaitLatch("The "timeout" is given in milliseconds...) indicates the
timeout is in milliseconds.

--
With Regards,
Amit Kapila.



On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/07/2024 14:07, vignesh C wrote:
> > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> I'm don't quite understand the problem we're trying to fix:
> >>
> >>> Currently the launcher's latch is used for the following: a) worker
> >>> process attach b) worker process exit and c) subscription creation.
> >>> Since this same latch is used for multiple cases, the launcher process
> >>> is not able to handle concurrent scenarios like: a) Launcher started a
> >>> new apply worker and waiting for apply worker to attach and b) create
> >>> subscription sub2 sending launcher wake up signal. In this scenario,
> >>> both of them will set latch of the launcher process, the launcher
> >>> process is not able to identify that both operations have occurred 1)
> >>> worker is attached 2) subscription is created and apply worker should
> >>> be started. As a result the apply worker does not get started for the
> >>> new subscription created immediately and gets started after the
> >>> timeout of 180 seconds.
> >>
> >> I don't see how we could miss a notification. Yes, both events will set
> >> the same latch. Why is that a problem?
> >
> > The launcher process waits for the apply worker to attach at
> > WaitForReplicationWorkerAttach function. The launcher's latch is
> > getting set concurrently for another create subscription and apply
> > worker attached. The launcher now detects the latch is set while
> > waiting at WaitForReplicationWorkerAttach, it resets the latch and
> > proceeds to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE (as
> > the latch has already been reset). Further details are provided below.
> >
> >> The loop will see that the new
> >> worker has attached, and that the new subscription has been created, and
> >> process both events. Right?
> >
> > Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> > processing the create subscription event.
> >
> > Slightly detailing further:
> > In the scenario when we execute two concurrent create subscription
> > commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> > CREATE SUBSCRIPTION sub2.
> > In a few random scenarios, the following events may unfold:
> > After the first create subscription command(sub1), the backend will
> > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > Subsequently, the launcher process will reset the latch and identify
> > the addition of a new subscription in the pg_subscription list. The
> > launcher process will proceed to request the postmaster to start the
> > apply worker background process (sub1) and request the postmaster to
> > notify the launcher once the apply worker(sub1) has been started.
> > Launcher will then wait for the apply worker(sub1) to attach in the
> > WaitForReplicationWorkerAttach function.
> > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> > executed concurrently, also set the launcher's latch(because of
> > ApplyLauncherWakeupAtCommit).
> > Simultaneously when the launcher remains in the
> > WaitForReplicationWorkerAttach function waiting for apply worker of
> > sub1 to start, the postmaster will start the apply worker for
> > subscription sub1 and send a SIGUSR1 signal to the launcher process
> > via ReportBackgroundWorkerPID. Upon receiving this signal, the
> > launcher process will then set its latch.
> >
> > Now, the launcher's latch has been set concurrently after the apply
> > worker for sub1 is started and the execution of the CREATE
> > SUBSCRIPTION sub2 command.
> >
> > At this juncture, the launcher, which had been awaiting the attachment
> > of the apply worker, detects that the latch is set and proceeds to
> > reset it. The reset operation of the latch occurs within the following
> > section of code in WaitForReplicationWorkerAttach:
> > ...
> > rc = WaitLatch(MyLatch,
> >                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >                10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> > if (rc & WL_LATCH_SET)
> > {
> >     ResetLatch(MyLatch);
> >     CHECK_FOR_INTERRUPTS();
> > }
> > ...
> >
> > After resetting the latch here, the activation signal intended to
> > start the apply worker for subscription sub2 is no longer present. The
> > launcher will return to the ApplyLauncherMain function, where it will
> > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > processing the create subscription request (i.e., creating a new apply
> > worker for sub2).
> > The issue arises from the latch being reset in
> > WaitForReplicationWorkerAttach, which can occasionally delay the
> > synchronization of table data for the subscription.
>
> Ok, I see it now. Thanks for the explanation. So the problem isn't using
> the same latch for different purposes per se. It's that we're trying to
> use it in a nested fashion, resetting it in the inner loop.
>
> Looking at the proposed patch more closely:
>
> > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> >                * We need timeout because we generally don't get notified via latch
> >                * about the worker attach.  But we don't expect to have to wait long.
> >                */
> > -             rc = WaitLatch(MyLatch,
> > +             rc = WaitLatch(&worker->launcherWakeupLatch,
> >                                          WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >                                          10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> >               if (rc & WL_LATCH_SET)
> >               {
> > -                     ResetLatch(MyLatch);
> > +                     ResetLatch(&worker->launcherWakeupLatch);
> >                       CHECK_FOR_INTERRUPTS();
> >               }
> >       }
>
> The comment here is now outdated, right? The patch adds an explicit
> SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> about the attachment.

I will update the comment for this.

> Is the launcherWakeupLatch field protected by some lock, to protect
> which process owns it? OwnLatch() is called in
> logicalrep_worker_stop_internal() without holding a lock for example. Is
> there a guarantee that only one process can do that at the same time?

I have analyzed a few scenarios, I will analyze the remaining
scenarios and update on this.

> What happens if a process never calls DisownLatch(), e.g. because it
> bailed out with an error while waiting for the worker to attach or stop?

I tried a lot of scenarios by erroring out after the OwnLatch call and
did not hit the PANIC code of OwnLatch yet. I will try a few more
scenarios that I have in mind and see if we can hit the PANIC in
OwnLatch and update on this.

> As an alternative, smaller fix, I think we could do the attached. It
> forces the launcher's main loop to do another iteration, if it calls
> logicalrep_worker_launch(). That extra iteration should pick up any
> missed notifications.

This also works. I had another solution along similar lines to the one
you proposed at [1], where we don't reset the latch in the inner loop.
I'm not sure which solution we should proceed with. Thoughts?

> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place, 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?

Here 10 means 10 milliseconds, so it just waits for 10 milliseconds
before going to the next iteration.

[1] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh



On Mon, Jul 8, 2024 at 5:47 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
>
> > As an alternative, smaller fix, I think we could do the attached. It
> > forces the launcher's main loop to do another iteration, if it calls
> > logicalrep_worker_launch(). That extra iteration should pick up any
> > missed notifications.
>
> This also works.
>

The minor drawback would be that in many cases the extra iteration
would not lead to anything meaningful.

--
With Regards,
Amit Kapila.



On Mon, 8 Jul 2024 at 17:46, vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 05/07/2024 14:07, vignesh C wrote:
> > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> I don't quite understand the problem we're trying to fix:
> > >>
> > >>> Currently the launcher's latch is used for the following: a) worker
> > >>> process attach b) worker process exit and c) subscription creation.
> > >>> Since this same latch is used for multiple cases, the launcher process
> > >>> is not able to handle concurrent scenarios like: a) Launcher started a
> > >>> new apply worker and waiting for apply worker to attach and b) create
> > >>> subscription sub2 sending launcher wake up signal. In this scenario,
> > >>> both of them will set latch of the launcher process, the launcher
> > >>> process is not able to identify that both operations have occurred 1)
> > >>> worker is attached 2) subscription is created and apply worker should
> > >>> be started. As a result the apply worker does not get started for the
> > >>> new subscription created immediately and gets started after the
> > >>> timeout of 180 seconds.
> > >>
> > >> I don't see how we could miss a notification. Yes, both events will set
> > >> the same latch. Why is that a problem?
> > >
> > > The launcher process waits for the apply worker to attach at
> > > WaitForReplicationWorkerAttach function. The launcher's latch is
> > > getting set concurrently for another create subscription and apply
> > > worker attached. The launcher now detects the latch is set while
> > > waiting at WaitForReplicationWorkerAttach, it resets the latch and
> > > proceeds to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE (as
> > > the latch has already been reset). Further details are provided below.
> > >
> > >> The loop will see that the new
> > >> worker has attached, and that the new subscription has been created, and
> > >> process both events. Right?
> > >
> > > Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> > > processing the create subscription event.
> > >
> > > Slightly detailing further:
> > > In the scenario when we execute two concurrent create subscription
> > > commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> > > CREATE SUBSCRIPTION sub2.
> > > In a few random scenarios, the following events may unfold:
> > > After the first create subscription command(sub1), the backend will
> > > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > > Subsequently, the launcher process will reset the latch and identify
> > > the addition of a new subscription in the pg_subscription list. The
> > > launcher process will proceed to request the postmaster to start the
> > > apply worker background process (sub1) and request the postmaster to
> > > notify the launcher once the apply worker(sub1) has been started.
> > > Launcher will then wait for the apply worker(sub1) to attach in the
> > > WaitForReplicationWorkerAttach function.
> > > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> > > executed concurrently, also set the launcher's latch(because of
> > > ApplyLauncherWakeupAtCommit).
> > > Simultaneously when the launcher remains in the
> > > WaitForReplicationWorkerAttach function waiting for apply worker of
> > > sub1 to start, the postmaster will start the apply worker for
> > > subscription sub1 and send a SIGUSR1 signal to the launcher process
> > > via ReportBackgroundWorkerPID. Upon receiving this signal, the
> > > launcher process will then set its latch.
> > >
> > > Now, the launcher's latch has been set concurrently after the apply
> > > worker for sub1 is started and the execution of the CREATE
> > > SUBSCRIPTION sub2 command.
> > >
> > > At this juncture, the launcher, which had been awaiting the attachment
> > > of the apply worker, detects that the latch is set and proceeds to
> > > reset it. The reset operation of the latch occurs within the following
> > > section of code in WaitForReplicationWorkerAttach:
> > > ...
> > > rc = WaitLatch(MyLatch,
> > >                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > >                10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > > if (rc & WL_LATCH_SET)
> > > {
> > >     ResetLatch(MyLatch);
> > >     CHECK_FOR_INTERRUPTS();
> > > }
> > > ...
> > >
> > > After resetting the latch here, the activation signal intended to
> > > start the apply worker for subscription sub2 is no longer present. The
> > > launcher will return to the ApplyLauncherMain function, where it will
> > > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > > processing the create subscription request (i.e., creating a new apply
> > > worker for sub2).
> > > The issue arises from the latch being reset in
> > > WaitForReplicationWorkerAttach, which can occasionally delay the
> > > synchronization of table data for the subscription.
> >
> > Ok, I see it now. Thanks for the explanation. So the problem isn't using
> > the same latch for different purposes per se. It's that we're trying to
> > use it in a nested fashion, resetting it in the inner loop.
> >
> > Looking at the proposed patch more closely:
> >
> > > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> > >                * We need timeout because we generally don't get notified via latch
> > >                * about the worker attach.  But we don't expect to have to wait long.
> > >                */
> > > -             rc = WaitLatch(MyLatch,
> > > +             rc = WaitLatch(&worker->launcherWakeupLatch,
> > >                                          WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > >                                          10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > >               if (rc & WL_LATCH_SET)
> > >               {
> > > -                     ResetLatch(MyLatch);
> > > +                     ResetLatch(&worker->launcherWakeupLatch);
> > >                       CHECK_FOR_INTERRUPTS();
> > >               }
> > >       }
> >
> > The comment here is now outdated, right? The patch adds an explicit
> > SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> > about the attachment.
>
> I will update the comment for this.
>
> > Is the launcherWakeupLatch field protected by some lock, to protect
> > which process owns it? OwnLatch() is called in
> > logicalrep_worker_stop_internal() without holding a lock for example. Is
> > there a guarantee that only one process can do that at the same time?
>
> I have analyzed a few scenarios, I will analyze the remaining
> scenarios and update on this.
>
> > What happens if a process never calls DisownLatch(), e.g. because it
> > bailed out with an error while waiting for the worker to attach or stop?
>
> I tried a lot of scenarios by erroring out after the OwnLatch call and
> did not hit the PANIC code of OwnLatch yet. I will try a few more
> scenarios that I have in mind and see if we can hit the PANIC in
> OwnLatch and update on this.

I could encounter the PANIC in OwnLatch by having a lot of CREATE
SUBSCRIPTION commands doing a table sync with a limited number of
slots. This and the above scenario are slightly related. I need some
more time to prepare the fix.

Regards,
Vignesh



At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C <vignesh21@gmail.com> wrote in 
> The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest
patch (v2), and the frequent Own/Disown-Latch operations caught my
attention. Additionally, handling multiple concurrently operating
trigger sources with nested latch waits seems bug-prone, which I’d
prefer to avoid from both a readability and safety perspective.

With that in mind, I’d like to suggest an alternative approach. I may
not be fully aware of all the previous discussions, so apologies if
this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and
logicalrep_worker_stop_internal() wait on a latch after verifying the
desired state. This ensures that even if there are spurious or missed
wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
enters a latch wait without checking the desired state
first. Consequently, if another process sets the latch to wake up the
main loop while the former two functions are waiting, that wakeup
could be missed. If my understanding is correct, the problem lies in
ApplyLauncherMain() not checking the expected state before beginning
to wait on the latch. There is no issue with waiting if the state
hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that
triggers a main loop wakeup before calling WaitLatch(). To do this, we
could add a flag in LogicalRepCtxStruct to signal that the main loop
has immediate tasks to handle. ApplyLauncherWakeup() would set this
flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
this flag before calling WaitLatch() and skip the WaitLatch() call if
the flag is set.

I think this approach could solve the issue without adding
complexity. What do you think?

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

On 04/09/2024 14:24, vignesh C wrote:
> On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>> I think this approach could solve the issue without adding
>> complexity. What do you think?
> 
> I agree that this approach is more simple than the other approach. How
> about something like the attached patch to handle the same.

I haven't looked at these new patches from the last few days, but please 
also note the work at 
https://www.postgresql.org/message-id/476672e7-62f1-4cab-a822-f3a8e949dd3f%40iki.fi. 
If those "interrupts" patches are committed, this is pretty 
straightforward to fix by using a separate interrupt bit for this, as 
the patch on that thread does.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Hello,

04.09.2024 16:53, Heikki Linnakangas wrote:
> On 04/09/2024 14:24, vignesh C wrote:
>>
>> I agree that this approach is more simple than the other approach. How
>> about something like the attached patch to handle the same.
>
> I haven't looked at these new patches from the last few days, but please also note the work at
> https://www.postgresql.org/message-id/476672e7-62f1-4cab-a822-f3a8e949dd3f%40iki.fi. If those "interrupts" patches are
> committed, this is pretty straightforward to fix by using a separate interrupt bit for this, as the patch on that
> thread does.
>

I'd also like to add that this issue leads to buildfarm test failures,
because of the race condition between
#define DEFAULT_NAPTIME_PER_CYCLE 180000L
and
$timeout_default = 180

That is, in the situation described above ("the apply worker does not get
started for the new subscription created immediately and gets started
after the timeout of 180 seconds"), 014_binary.pl can fail if
wait_for_log()'s 180 seconds elapse sooner:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2025-02-09%2011%3A45%3A05

I reproduced this failure locally when running 50 014_binary tests in
parallel and got failures on iterations 4, 14, 10.

But with PG_TEST_TIMEOUT_DEFAULT=190, 30 iterations passed for me (8 of
them took 180+ seconds).

Best regards,
Alexander Lakhin
Neon (https://neon.tech)