Re: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers
From: vignesh C
Subject: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Msg-id: CALDaNm16SW3X2snvTKu-YQxGWGhjWLWj5H9eqXc-bTK4inUtJw@mail.gmail.com
In response to: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables (Alexander Lakhin <exclusion@gmail.com>)
Responses: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List: pgsql-hackers
On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Vignesh and Hou-san,
>
> 01.02.2024 07:59, vignesh C wrote:
> > Here is an updated patch which changes the boolean variable to a
> > tri-state enum and sets table states to valid only if no invalidations
> > have occurred while the list is being prepared.
>
> While testing the v3 patch, I observed another anomaly with the 014_binary
> test. When I run the test in parallel, I see that sometimes one of the test
> instances runs much longer than the others. For example:
> 49 All tests successful.
> 49 Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys +  0.43 cusr  0.30 csys =  0.77 CPU)
> 49 Result: PASS
> 12 All tests successful.
> 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys +  0.46 cusr  0.40 csys =  0.89 CPU)
> 12 Result: PASS
>
> As far as I can see, this happens due to another race condition, this time
> in launcher.c.
> For such a three-minute case I see in _subscriber.log:
> 2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
>    type "INSERT" in transaction 757, finished at 0/153C838
> 2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does
>    not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message
>    type "INSERT" in transaction 757, finished at 0/153C838
> ...
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
>    a int PRIMARY KEY
>    );
> 2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for
>    subscription 16398"
> 2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with
>    exit code 1
> ...
> 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> ...
> 2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
> (there is no interesting activity between 14:33:13 and 14:36:13)
>
> So the logical replication apply worker exited because CREATE TABLE had not
> been executed on the subscriber yet, and a new replication worker started
> only because a timeout occurred in WaitLatch(..., wait_time, ...) inside
> ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
> i.e., 180 sec).
>
> But in the normal (fast) case, the same WaitLatch exits because MyLatch is
> set as a result of:
> logical replication apply worker| logicalrep_worker_onexit() ->
>    ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
> launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).
>
> In the bad case, I see that SetLatch() is called as well, but then the latch
> gets reset by the following code in WaitForReplicationWorkerAttach():
>         rc = WaitLatch(MyLatch,
>                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                        10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>         if (rc & WL_LATCH_SET)
>         {
>             ResetLatch(MyLatch);
>             CHECK_FOR_INTERRUPTS();
>         }
>
> With pg_usleep(300000); added just before ResetLatch, and
>  $node_subscriber->safe_psql(
>      'postgres', qq(
> +SELECT pg_sleep(0.5);
>  CREATE TABLE public.test_mismatching_types (
> in 014_binary.pl, I can see the anomaly without running tests in parallel,
> just by running that test in a loop:
> for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
> ...
> iteration 2
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
> Result: PASS
> iteration 3
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
> Result: PASS
> ...
>
> In other words, the abnormal test execution takes place due to the
> following sequence of events:
> 1. The logical replication launcher launches an apply worker and waits for
> it to attach:
> ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
> 2. The apply worker exits due to some error ("logical replication target
> relation does not exist", in our case) and sends a signal to set the
> launcher's latch.
> 3. The launcher sets the latch in procsignal_sigusr1_handler(), but then
> resets it inside WaitForReplicationWorkerAttach().
> 4. The launcher gets back to ApplyLauncherMain(), where it waits for the
> latch (which is no longer set) or for a timeout (which happens only after
> DEFAULT_NAPTIME_PER_CYCLE ms).
>
> Moreover, with that sleep added in WaitForReplicationWorkerAttach(), the
> test 027_nosuperuser takes 3+ minutes on each run for me.

Thanks for the detailed steps; with them I was able to reproduce the issue
in my environment. The attached patch has a proposed fix in which the latch
is not set when the apply worker exits immediately after starting.

Regards,
Vignesh
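The lost-wakeup mechanism described above is easy to model outside PostgreSQL. Below is a minimal, self-contained C sketch of the race: it is not PostgreSQL code, and every name in it (ToyLatch, wait_latch, and so on) is an illustrative stand-in, with a mutex/condvar pair playing the role of the real latch. An inner wait, standing in for WaitForReplicationWorkerAttach(), consumes and resets the one wakeup, so the outer wait, standing in for ApplyLauncherMain(), can only finish by timeout:

/*
 * lost_wakeup_demo.c -- toy analogue of the race described above.
 * NOT PostgreSQL code; the "latch" is just a mutex/condvar pair.
 * Build with: cc lost_wakeup_demo.c -o lost_wakeup_demo -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef struct
{
    pthread_mutex_t mu;
    pthread_cond_t cv;
    bool        is_set;
} ToyLatch;

static ToyLatch latch = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false};

static void
set_latch(ToyLatch *l)
{
    pthread_mutex_lock(&l->mu);
    l->is_set = true;
    pthread_cond_signal(&l->cv);
    pthread_mutex_unlock(&l->mu);
}

static void
reset_latch(ToyLatch *l)
{
    pthread_mutex_lock(&l->mu);
    l->is_set = false;
    pthread_mutex_unlock(&l->mu);
}

/* Wait until the latch is set or timeout_ms elapses; true if it was set. */
static bool
wait_latch(ToyLatch *l, long timeout_ms)
{
    struct timespec deadline;
    bool        was_set;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    pthread_mutex_lock(&l->mu);
    while (!l->is_set)
        if (pthread_cond_timedwait(&l->cv, &l->mu, &deadline) != 0)
            break;              /* timed out */
    was_set = l->is_set;
    pthread_mutex_unlock(&l->mu);
    return was_set;
}

/* Stand-in for the apply worker exiting right away and waking the launcher. */
static void *
worker(void *arg)
{
    (void) arg;
    usleep(100 * 1000);         /* worker "exits" shortly after starting */
    set_latch(&latch);          /* the one and only wakeup */
    return NULL;
}

int
main(void)
{
    pthread_t   tid;
    time_t      t0;

    pthread_create(&tid, NULL, worker, NULL);

    /* Inner wait, standing in for WaitForReplicationWorkerAttach(); the
     * 300 ms sleep widens the race window, like pg_usleep(300000) above. */
    usleep(300 * 1000);
    if (wait_latch(&latch, 10))
        reset_latch(&latch);    /* the wakeup is consumed here */

    /* Outer wait, standing in for ApplyLauncherMain(); with the wakeup
     * gone, it can only end by timeout (3 s stands in for the 180 s nap). */
    t0 = time(NULL);
    wait_latch(&latch, 3000);
    printf("outer wait took ~%ld s (~0 s without the race)\n",
           (long) (time(NULL) - t0));
    pthread_join(tid, NULL);
    return 0;
}

Running the demo prints a ~3 s outer wait, mirroring the 180 s nap seen in the test logs; removing the reset_latch() call (or the 300 ms sleep) makes the outer wait return immediately.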
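For completeness, one generic way to keep an inner wait from starving the outer loop in the toy model above is to hand a consumed wakeup back before returning. This is purely a sketch against the toy code, not the attached patch (which instead avoids setting the latch when the worker exits immediately after starting), and it is deliberately simplistic: re-arming unconditionally would make a real inner loop spin, so it only illustrates that a consumed wakeup must be propagated somehow.

    /* In the toy inner wait: if a wakeup was consumed, re-arm the latch
     * before returning, so the outer wait in main() still sees it. */
    if (wait_latch(&latch, 10))
    {
        reset_latch(&latch);
        /* ... inner-loop work (e.g. checking worker attach) goes here ... */
        set_latch(&latch);      /* don't swallow wakeups meant for the caller */
    }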