BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss - Mailing list pgsql-bugs

From: PG Bug reporting form
Subject: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
Date:
Msg-id: 18155-33f555bcc7e1105e@postgresql.org
Responses: Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18155
Logged by:          Drew Callahan
Email address:      callaan@amazon.com
PostgreSQL version: 16.0
Operating system:   Ubuntu
Description:

Depending on the major version, an untimely timeout or termination of the
main apply worker while the table sync worker is waiting for the
subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
two really painful experiences.

On Rel14+, the untimely exit can leave the main apply worker indefinitely
stuck waiting for a table sync worker, which has already exited and will not
be launched again, to change the subscription relstate to
SUBREL_STATE_SYNCDONE. To unwedge the subscription, a system restart is
required to clear the corrupted transient subscription relstate data.
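
For illustration only, here is the shape of that wedge reduced to a tiny
standalone C toy (none of these names are PostgreSQL's, and the real leader
re-checks the shared relstate from within its apply loop rather than via a
bounded poll): the leader keeps re-checking a state that no remaining
process will ever advance.

#include <stdio.h>
#include <unistd.h>

/*
 * Stand-in for the shared subscription relstate the table sync worker was
 * supposed to advance to SUBREL_STATE_SYNCDONE ('s').  With that worker gone
 * and never relaunched, nothing ever updates it.
 */
static char relstate = 'w';

int
main(void)
{
	/* The leader's side of the handshake, bounded here only so it exits. */
	for (int checks = 0; checks < 10; checks++)
	{
		if (relstate == 's')
			return 0;			/* never happens in this scenario */
		sleep(1);				/* real code waits and re-checks */
	}
	printf("still waiting for 's'; only clearing the stale state unblocks us\n");
	return 0;
}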

On Rel13-, the untimely exit can instead lead to silent data loss. This
occurs when the table sync worker performed its copy at LSN X while the main
apply worker is already at LSN Y > X; the table sync worker is then required
to apply, in a catch-up phase, all changes between X and Y that the main
apply worker skipped for that table. Due to the untimely exit, the table
sync worker assumes that the main apply worker was actually behind, skips
the catch-up work, and exits. As a result, all changes between X and Y are
lost for that table.
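
A tiny standalone C model of that catch-up decision (invented names; lsn_t
merely stands in for XLogRecPtr) makes the lost range concrete: when Y is
never communicated, the worker acts as though no catch-up were needed and
(X, Y] is never applied to the table.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t lsn_t;			/* illustrative stand-in for XLogRecPtr */

/*
 * The table sync worker must apply (X, Y] itself when the leader apply
 * worker is already ahead of the copy.  The bug is the "leader unknown"
 * branch: the worker behaves as if no catch-up were needed.
 */
static bool
catchup_required(lsn_t copy_done_lsn, lsn_t leader_lsn, bool leader_known)
{
	if (!leader_known)
		return false;			/* buggy assumption: leader is not ahead */
	return leader_lsn > copy_done_lsn;
}

int
main(void)
{
	lsn_t		X = 1000;		/* COPY of the table finished here */
	lsn_t		Y = 2500;		/* leader apply worker is already here */

	/* The leader timed out and exited, so Y is never communicated. */
	if (!catchup_required(X, Y, false))
		printf("sync marked done; changes in (%llu, %llu] never reach the table\n",
			   (unsigned long long) X, (unsigned long long) Y);
	return 0;
}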

The cause of both issues is that wait_for_worker_state_change() is treated
like a void function by the table sync worker. However, if the main apply
worker does not currently exist on the system, for example because a timeout
caused it to exit safely, then the function returns false *before* the state
change has occurred.
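
In other words, a boolean that can legitimately be false is being discarded.
A minimal standalone illustration of the pattern, with invented names (the
real function loops on a latch and bails out when the leader is gone):

#include <stdbool.h>
#include <stdio.h>

static char relstate = 'w';		/* think SUBREL_STATE_SYNCWAIT */

/*
 * Condensed stand-in for wait_for_worker_state_change(): returns false when
 * the leader apply worker is gone, leaving the state unchanged.
 */
static bool
wait_for_state(char expected, bool leader_alive)
{
	if (!leader_alive)
		return false;			/* no state change happened */
	relstate = expected;
	return true;
}

int
main(void)
{
	/* The buggy pattern: the return value is discarded as if void. */
	(void) wait_for_state('c', false);	/* 'c' ~ SUBREL_STATE_CATCHUP */

	/* Execution carries on even though the state never changed. */
	printf("continuing with relstate '%c' instead of 'c'\n", relstate);
	return 0;
}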

The fix for this issue is relatively simple: we just need to validate that
the state actually did change. If it did not, we can circle back later to
check again, which is relatively cheap on Rel14+. A potential solution is
pasted below. A solution on Rel13- would instead want to introduce a retry
loop in wait_for_worker_state_change() to avoid having to redo the entire
COPY, which is very expensive on large tables; a rough sketch of that retry
shape follows the diff.

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e2cee92cf2..25be978e15 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1529,10 +1529,30 @@ copy_table_done:
        SpinLockRelease(&MyLogicalRepWorker->relmutex);

        /*
-        * Finally, wait until the leader apply worker tells us to catch up and
-        * then return to let LogicalRepApplyLoop do it.
+        * Finally, wait until the main apply worker tells us to catch up and then
+        * return to let LogicalRepApplyLoop do it. If this wait fails due to the
+        * apply worker not being up, we fail this attempt and force a retry.
         */
-       wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
+       if (!wait_for_worker_state_change(SUBREL_STATE_CATCHUP))
+       {
+               ereport(ERROR,
+                               (errcode(ERRCODE_INTERNAL_ERROR),
+                               errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+                                               "unable to finish due to main apply worker being unavailable.",
+                                               MySubscription->name, MyLogicalRepWorker->relid),
+                               errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+       }
+
+       if (MyLogicalRepWorker->relstate != SUBREL_STATE_CATCHUP)
+       {
+               ereport(ERROR,
+                               (errcode(ERRCODE_INTERNAL_ERROR),
+                               errmsg("logical replication table synchronization worker for subscription \"%s\", relid \"%u\" "
+                                               "unable to make progress while apply worker is down",
+                                               MySubscription->name, MyLogicalRepWorker->relid),
+                               errhint("this worker will attempt to sync the table again once logical replication has resumed")));
+       }
+
        return slotname;
 }
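
For Rel13-, the retry idea mentioned above could take roughly the following
shape. This is a standalone sketch of the control flow only, not a patch:
the attempt limit, the sleep, and the helper names are invented, and the
real code would wait on its latch under LogicalRepWorkerLock rather than
sleeping.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_ATTEMPTS	60		/* invented bound: ~1 minute at 1s/attempt */

static int leader_back_in = 5;	/* pretend the leader restarts after 5 checks */

/* Stand-in for "is the leader apply worker registered and running?" */
static bool
leader_apply_worker_alive(void)
{
	return --leader_back_in <= 0;
}

/*
 * Shape of the retry: keep re-checking for the leader for a while before
 * giving up, so the already-finished COPY is not thrown away on the first
 * miss.
 */
static bool
wait_for_state_change_with_retry(char expected_state)
{
	(void) expected_state;		/* real code would also wait for this state */

	for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++)
	{
		if (leader_apply_worker_alive())
			return true;
		sleep(1);
	}
	return false;				/* only now is redoing the sync unavoidable */
}

int
main(void)
{
	printf("state change %s\n",
		   wait_for_state_change_with_retry('c') ? "reachable" : "missed");
	return 0;
}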

