Re: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Date
Msg-id 858a7622-2c81-1687-d1df-1322dfcb2e72@gmail.com
In response to Re: Race condition in FetchTableStates() breaks synchronization of subscription tables  (vignesh C <vignesh21@gmail.com>)
Responses Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
List pgsql-hackers
Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:
> Here is an updated patch which changes the boolean variable to a
> tri-state enum and set stable state to valid only if no invalidations
> have been occurred while the list is being prepared.
>

While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49      All tests successful.
49      Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  0.30 csys =  0.77 CPU)
49      Result: PASS
12      All tests successful.
12      Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  0.40 csys =  0.89 CPU)
12      Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
             a int PRIMARY KEY
         );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE had not
been executed on the subscriber yet, and a new replication worker started
only because a timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
i.e. 180 sec).

But in a normal (fast) case, the same WaitLatch exits early because MyLatch
is set as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
   ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                        10L, WAIT_EVENT_BGWORKER_STARTUP);

         if (rc & WL_LATCH_SET)
         {
             ResetLatch(MyLatch);
             CHECK_FOR_INTERRUPTS();
         }

With pg_usleep(300000); added just before ResetLatch and
  $node_subscriber->safe_psql(
      'postgres', qq(
+SELECT pg_sleep(0.5);
      CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following sequence of events:
1. The launcher launches a replication worker and waits for it to attach:
  ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. The replication worker exits due to some error (the logical replication
  target relation does not exist, in our case) and sends a signal to set
  the launcher's latch.
3. The launcher sets the latch in procsignal_sigusr1_handler(), but then
  resets it inside WaitForReplicationWorkerAttach(), losing the wakeup.
4. The launcher gets back to ApplyLauncherMain(), where it waits for the
  latch (no longer set) or a timeout (which happens only after
  DEFAULT_NAPTIME_PER_CYCLE, i.e. 180 sec).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.
And we can find such abnormal execution on the buildfarm too:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check

So, if PG_TEST_TIMEOUT_DEFAULT were less than DEFAULT_NAPTIME_PER_CYCLE,
we would notice this situation as a test failure, but those timeouts are
equal (accidentally?), hence, say, wait_for_log() just waits long enough
for the expected replication worker to start eventually.

As to the fixes for the initial issue, v3 and v2 both look good to me, but
v2 is more appealing for back-patching (if this fix is going to be
back-patched), IMHO.

Best regards,
Alexander


