Re: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers

From vignesh C
Subject Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Msg-id CALDaNm0X8oUiW1CzniPZsDxqjP-VoYuEvb1h7NFXohKc1P5HEw@mail.gmail.com
In response to Re: Race condition in FetchTableStates() breaks synchronization of subscription tables  (Alexander Lakhin <exclusion@gmail.com>)
Responses Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List pgsql-hackers
On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup()
> called by a backend executing CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(300000L); added
> just below
>              if (!worker_in_use)
>                  return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 csys =  9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when launcher waits for start of another
> worker (logical replication worker for subscription "admin_sub"), launched
> just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. A wakeup call can be sent during
subscription creation/modification and when an apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, because
doing so can delay the start of the apply worker until the 180-second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not
reset the latch there; instead it lets ApplyLauncherMain reset the latch
and check whether any new or missing worker needs to be started.
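
To illustrate the lost wakeup, here is a minimal, self-contained sketch. It
is not PostgreSQL code and not the attached patch: ToyLatch and every toy_*
name are hypothetical stand-ins, and the latch is modeled as a single sticky
flag. It only shows why resetting the latch inside the attach wait can cost
a full nap, while leaving the reset to the main loop does not.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a process latch: one sticky "wakeup pending" flag. */
typedef struct
{
    bool is_set;
} ToyLatch;

/* Assumed to mirror the 180-second nap (DEFAULT_NAPTIME_PER_CYCLE) mentioned above. */
#define TOY_NAPTIME_MS 180000L

static void toy_set_latch(ToyLatch *latch)   { latch->is_set = true; }
static void toy_reset_latch(ToyLatch *latch) { latch->is_set = false; }

/*
 * Models the wait for a just-launched worker to attach.  A wakeup (for
 * example from CREATE SUBSCRIPTION) arrives while we are waiting.  If the
 * wait resets the latch itself, that wakeup is consumed here and the main
 * loop never sees it.
 */
static void
toy_wait_for_attach(ToyLatch *latch, bool reset_in_wait)
{
    toy_set_latch(latch);           /* wakeup arrives during the wait */
    if (reset_in_wait)
        toy_reset_latch(latch);     /* problematic: wakeup is lost */
}

/*
 * Returns how long (in ms) the toy main loop sleeps before it next looks
 * for workers to start.  A set latch means "do not sleep"; an unset latch
 * means "sleep for the whole nap time".
 */
long
toy_launcher_delay_ms(bool reset_in_wait)
{
    ToyLatch latch = { false };
    long     slept_ms;

    toy_wait_for_attach(&latch, reset_in_wait);

    slept_ms = latch.is_set ? 0L : TOY_NAPTIME_MS;
    toy_reset_latch(&latch);        /* only the main loop resets the latch */
    return slept_ms;
}
```

In this toy model, toy_launcher_delay_ms(true) reports the full nap time,
while toy_launcher_delay_ms(false) reports no delay, since the pending
wakeup is still visible to the main loop.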

Regards,
Vignesh
