From: Alexander Lakhin
Subject: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Msg-id: 858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
In response to: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables (vignesh C <vignesh21@gmail.com>)
Responses: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List: pgsql-hackers

Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:
> Here is an updated patch which changes the boolean variable to a
> tri-state enum and set stable state to valid only if no invalidations
> have been occurred while the list is being prepared.

While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than the others. For example:

49 All tests successful.
49 Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys +  0.43 cusr  0.30 csys =  0.77 CPU)
49 Result: PASS

12 All tests successful.
12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys +  0.46 cusr  0.40 csys =  0.89 CPU)
12 Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c. For such a three-minute run, I see the following in
_subscriber.log:

2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
            a int PRIMARY KEY
        );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"

(There is no interesting activity between 14:33:13 and 14:36:13.)

So the logical replication apply worker exited because CREATE TABLE had not
been executed on the subscriber yet, and a new replication worker started
because a timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
i.e. 180 seconds).

But in the normal (fast) case, the same WaitLatch() exits because MyLatch
is set as a result of:

apply worker:  logicalrep_worker_onexit() -> ApplyLauncherWakeup() ->
               kill(LogicalRepCtx->launcher_pid, SIGUSR1)
launcher:      procsignal_sigusr1_handler() -> SetLatch(MyLatch)

In the bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
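To show the failure mode in isolation, here is a minimal standalone sketch of
that lost-wakeup pattern (my own illustration with POSIX threads; the names
and the two-second timeout are made up for the demo, this is not code from
launcher.c). One auto-reset "latch" is shared by two waits with different
purposes, so the short inner wait consumes the notification intended for the
long outer wait, which then can finish only by timeout:

/*
 * lost_wakeup.c -- illustrative sketch only, not PostgreSQL code.
 * Build with: cc -pthread lost_wakeup.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool latch_set = false;

/* Analogue of the worker-exit path: ApplyLauncherWakeup() -> SetLatch(). */
static void *worker_exit_notify(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&mu);
    latch_set = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
    return NULL;
}

/* Analogue of WaitLatch() followed by an unconditional ResetLatch():
 * wait up to timeout_ms for the latch, then consume (reset) it. */
static bool wait_and_reset_latch(long timeout_ms)
{
    struct timespec deadline;
    bool        was_set;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&mu);
    while (!latch_set &&
           pthread_cond_timedwait(&cv, &mu, &deadline) == 0)
        ;                       /* re-check after spurious wakeups */
    was_set = latch_set;
    latch_set = false;          /* the reset that loses the wakeup */
    pthread_mutex_unlock(&mu);
    return was_set;
}

int main(void)
{
    pthread_t   worker;

    /* The "worker" exits and notifies us before we start waiting. */
    pthread_create(&worker, NULL, worker_exit_notify, NULL);
    pthread_join(worker, NULL);

    /* Inner wait, as in WaitForReplicationWorkerAttach(): it sees the
     * latch set and resets it, consuming the worker-exit notification. */
    printf("inner wait consumed latch: %d\n", wait_and_reset_latch(100));

    /* Outer wait, as in ApplyLauncherMain(): the latch is no longer set,
     * so this wait can finish only by timeout (DEFAULT_NAPTIME_PER_CYCLE
     * in the real code; 2 seconds here to keep the demo short). */
    printf("outer wait saw latch:      %d\n", wait_and_reset_latch(2000));

    return 0;
}

Running it prints 1 and then 0: the first wait eats the notification, and the
second one can only time out, which matches what I see happening with the
launcher's latch.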
With pg_usleep(300000); added just before that ResetLatch() and the following
change in 014_binary.pl:

 $node_subscriber->safe_psql(
     'postgres', qq(
+SELECT pg_sleep(0.5);
 CREATE TABLE public.test_mismatching_types (

I can see the anomaly without running tests in parallel, just by running the
test in a loop:

for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the following
sequence of events:
1. The logical replication launcher starts a replication worker and waits for
   it to attach: ApplyLauncherMain() -> logicalrep_worker_launch() ->
   WaitForReplicationWorkerAttach().
2. The worker exits due to some error (the logical replication target
   relation does not exist, in our case) and sends a signal to set the
   launcher's latch.
3. The launcher sets the latch in procsignal_sigusr1_handler(), but then
   resets it inside WaitForReplicationWorkerAttach().
4. The launcher gets back to ApplyLauncherMain(), where it waits for the
   latch (which is not set) or a timeout (which occurs only after
   DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep added in WaitForReplicationWorkerAttach(), the test
027_nosuperuser executes for 3+ minutes on each run for me. And we can find
such abnormal executions on the buildfarm too:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check

So, if we had PG_TEST_TIMEOUT_DEFAULT less than DEFAULT_NAPTIME_PER_CYCLE, we
would notice this situation as a test failure, but those timeouts are equal
(accidentally?); hence, say, wait_for_log() just waits long enough for the
expected replication worker to start eventually.

As to the fixes for the initial issue, v3 and v2 both look good to me, but v2
is more appealing for back-patching (if this fix is going to be back-patched),
IMHO.

Best regards,
Alexander