Re: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers
From: vignesh C
Subject: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Msg-id: CALDaNm16SW3X2snvTKu-YQxGWGhjWLWj5H9eqXc-bTK4inUtJw@mail.gmail.com
In response to: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables (Alexander Lakhin <exclusion@gmail.com>)
Responses: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List: pgsql-hackers
On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Vignesh and Hou-san,
>
> 01.02.2024 07:59, vignesh C wrote:
> > Here is an updated patch which changes the boolean variable to a
> > tri-state enum and sets table states to valid only if no invalidations
> > have occurred while the list is being prepared.
>
> While testing the v3 patch, I observed another anomaly with the 014_binary
> test. When I run the test in parallel, I see that sometimes one of the test
> instances runs much longer than the others. For example:
> 49 All tests successful.
> 49 Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys +  0.43 cusr  0.30 csys =  0.77 CPU)
> 49 Result: PASS
> 12 All tests successful.
> 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys +  0.46 cusr  0.40 csys =  0.89 CPU)
> 12 Result: PASS
>
> As far as I can see, this happens due to another race condition, this time
> in launcher.c.
> For such a three-minute case I see in _subscriber.log:
> 2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
>    type "INSERT" in transaction 757, finished at 0/153C838
> 2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does
>    not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
>    type "INSERT" in transaction 757, finished at 0/153C838
> ...
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
>    a int PRIMARY KEY
>    );
> 2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for
>    subscription 16398"
> 2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with
>    exit code 1
> ...
> 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> ...
> 2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
> (there is no interesting activity between 14:33:13 and 14:36:13)
>
> So the logical replication apply worker exited because CREATE TABLE had not
> been executed on the subscriber yet, and a new replication worker started
> only because a timeout occurred in WaitLatch(..., wait_time, ...) inside
> ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
> i.e., 180 sec).
>
> But in the normal (fast) case, the same WaitLatch exits because MyLatch is
> set as a result of:
> logical replication apply worker| logicalrep_worker_onexit() ->
>    ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
> launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).
>
> In the bad case, I see that SetLatch() is called as well, but then the latch
> gets reset by the following code in WaitForReplicationWorkerAttach():
>         rc = WaitLatch(MyLatch,
>                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                        10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>         if (rc & WL_LATCH_SET)
>         {
>             ResetLatch(MyLatch);
>             CHECK_FOR_INTERRUPTS();
>         }
>
> With pg_usleep(300000); added just before ResetLatch, and
>  $node_subscriber->safe_psql(
>      'postgres', qq(
> +SELECT pg_sleep(0.5);
>  CREATE TABLE public.test_mismatching_types (
> in 014_binary.pl, I can see the anomaly without running tests in parallel,
> just by running that test in a loop:
> for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
> ...
> iteration 2
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
> Result: PASS
> iteration 3
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
> Result: PASS
> ...
>
> In other words, the abnormal test execution takes place due to the
> following sequence of events:
> 1. The logical replication launcher launches an apply worker and waits for
> it to attach:
> ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
> 2. The apply worker exits due to some error ("logical replication target
> relation does not exist", in our case) and sends a signal to set the
> launcher's latch.
> 3. The launcher sets the latch in procsignal_sigusr1_handler(), but then
> resets it inside WaitForReplicationWorkerAttach().
> 4. The launcher gets back to ApplyLauncherMain(), where it waits for the
> latch (which is no longer set) or for a timeout (which happens only after
> DEFAULT_NAPTIME_PER_CYCLE ms).
>
> Moreover, with that sleep added in WaitForReplicationWorkerAttach(), the
> test 027_nosuperuser takes 3+ minutes on each run for me.

Thanks for the detailed steps; with them I was able to reproduce the issue
in my environment. The attached patch has a proposed fix in which the latch
is not set when the apply worker exits immediately after starting.

Regards,
Vignesh
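The lost-wakeup mechanism described above is easy to model outside PostgreSQL. Below is a minimal, self-contained C sketch of the race: it is not PostgreSQL code, and every name in it (ToyLatch, wait_latch, and so on) is an illustrative stand-in, with a mutex/condvar pair playing the role of the real latch. An inner wait, standing in for WaitForReplicationWorkerAttach(), consumes and resets the one wakeup, so the outer wait, standing in for ApplyLauncherMain(), can only finish by timeout:

/*
 * lost_wakeup_demo.c -- toy analogue of the race described above.
 * NOT PostgreSQL code; the "latch" is just a mutex/condvar pair.
 * Build with: cc lost_wakeup_demo.c -o lost_wakeup_demo -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef struct
{
    pthread_mutex_t mu;
    pthread_cond_t cv;
    bool        is_set;
} ToyLatch;

static ToyLatch latch = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false};

static void
set_latch(ToyLatch *l)
{
    pthread_mutex_lock(&l->mu);
    l->is_set = true;
    pthread_cond_signal(&l->cv);
    pthread_mutex_unlock(&l->mu);
}

static void
reset_latch(ToyLatch *l)
{
    pthread_mutex_lock(&l->mu);
    l->is_set = false;
    pthread_mutex_unlock(&l->mu);
}

/* Wait until the latch is set or timeout_ms elapses; true if it was set. */
static bool
wait_latch(ToyLatch *l, long timeout_ms)
{
    struct timespec deadline;
    bool        was_set;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    pthread_mutex_lock(&l->mu);
    while (!l->is_set)
        if (pthread_cond_timedwait(&l->cv, &l->mu, &deadline) != 0)
            break;              /* timed out */
    was_set = l->is_set;
    pthread_mutex_unlock(&l->mu);
    return was_set;
}

/* Stand-in for the apply worker exiting right away and waking the launcher. */
static void *
worker(void *arg)
{
    (void) arg;
    usleep(100 * 1000);         /* worker "exits" shortly after starting */
    set_latch(&latch);          /* the one and only wakeup */
    return NULL;
}

int
main(void)
{
    pthread_t   tid;
    time_t      t0;

    pthread_create(&tid, NULL, worker, NULL);

    /* Inner wait, standing in for WaitForReplicationWorkerAttach(); the
     * 300 ms sleep widens the race window, like pg_usleep(300000) above. */
    usleep(300 * 1000);
    if (wait_latch(&latch, 10))
        reset_latch(&latch);    /* the wakeup is consumed here */

    /* Outer wait, standing in for ApplyLauncherMain(); with the wakeup
     * gone, it can only end by timeout (3 s stands in for the 180 s nap). */
    t0 = time(NULL);
    wait_latch(&latch, 3000);
    printf("outer wait took ~%ld s (~0 s without the race)\n",
           (long) (time(NULL) - t0));
    pthread_join(tid, NULL);
    return 0;
}

Running the demo prints a ~3 s outer wait, mirroring the 180 s nap seen in the test logs; removing the reset_latch() call (or the 300 ms sleep) makes the outer wait return immediately.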
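For completeness, one generic way to keep an inner wait from starving the outer loop in the toy model above is to hand a consumed wakeup back before returning. This is purely a sketch against the toy code, not the attached patch (which instead avoids setting the latch when the worker exits immediately after starting), and it is deliberately simplistic: re-arming unconditionally would make a real inner loop spin, so it only illustrates that a consumed wakeup must be propagated somehow.

    /* In the toy inner wait: if a wakeup was consumed, re-arm the latch
     * before returning, so the outer wait in main() still sees it. */
    if (wait_latch(&latch, 10))
    {
        reset_latch(&latch);
        /* ... inner-loop work (e.g. checking worker attach) goes here ... */
        set_latch(&latch);      /* don't swallow wakeups meant for the caller */
    }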