Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss - Mailing list pgsql-bugs

From Callahan, Drew
Subject Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
Date
Msg-id 3AF2B908-D293-43F9-AE06-73C7797C96B0@amazon.com
In response to Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
Hi Amit,

Thanks for commenting; you're correct. I had misremembered the case and, upon review, found that
I had the situation inverted for Rel14. We saw that none of the main apply workers were launching
and none of the table apply workers were transitioning out of the syncwait state. As a result, they
kept replaying changes, since the logic only checks whether we have crossed the LSN while in the
CATCHUP state.
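
For reference, the stuck tables can also be seen from the subscriber side with a query along these
lines (a sketch, not the exact query we ran; SYNCWAIT and CATCHUP are in-memory states and never
appear in the catalog, so affected tables simply never leave their last durable state):

-- Durable per-table sync state on the subscriber. Tables whose sync worker is
-- stuck in syncwait keep showing 'd' (datasync) or, on Rel14+, 'f'
-- (finishedcopy) here instead of ever reaching 's' (syncdone) / 'r' (ready).
SELECT su.subname, sr.srrelid::regclass AS relation, sr.srsubstate, sr.srsublsn
FROM pg_subscription_rel sr
JOIN pg_subscription su ON su.oid = sr.srsubid
WHERE sr.srsubstate <> 'r'
ORDER BY su.subname, relation;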

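The publisher-side slot state we captured is shown below. The exact statement isn't preserved, but
judging by the columns it was something along these lines (a reconstruction; the pg_wal_lsn_diff
column appears to be computed against restart_lsn):

SELECT s.slot_name, s.plugin, s.slot_type, s.datoid, s.database, s.temporary,
       s.active, s.active_pid, s.xmin, s.catalog_xmin, s.restart_lsn,
       s.confirmed_flush_lsn,
       pg_current_wal_lsn(),
       pg_current_wal_flush_lsn(),
       pg_current_wal_insert_lsn(),
       pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) AS pg_wal_lsn_diff
FROM pg_replication_slots s;
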
                slot_name                 |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn | pg_current_wal_lsn | pg_current_wal_flush_lsn | pg_current_wal_insert_lsn | pg_wal_lsn_diff
------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------+--------------------+--------------------------+---------------------------+-----------------
 main_slot_5                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1512105843 | 22D5/33FC5B60 | 22D5/34C8D910       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    203488955600
 main_slot_3                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515152530 | 22D5/71A15460 | 22D5/7323FA88       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202454733776
 main_slot_1                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515398672 | 22D5/76AB8500 | 22D5/77806A10       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202370179888
 main_slot_2                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515398672 | 22D5/76AB8500 | 22D5/77039830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202370179888
 main_slot_0                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1515277554 | 22D5/74320CF8 | 22D5/74539970       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    202411694904
 main_slot_4                              | pgoutput | logical   |  16396 | users    | f         | f      |            |      |   1509956699 | 22D5/8643A70  | 22D5/AEFE230        | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |    204220345792
 pg_21077_sync_20174_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |      97665 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21173_sync_20205_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8349 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21121_sync_20141_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4319 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21178_sync_20206_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8279 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21081_sync_20101_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4065 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21081_sync_20109_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8302 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21173_sync_20195_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4385 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20183_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       6690 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21178_sync_20198_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       4548 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20152_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |      97564 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21121_sync_20149_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       8384 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
 pg_21077_sync_20159_7063584474467025194  | pgoutput | logical   |  16396 | users    | f         | t      |       5559 |      |   2116325349 | 2304/94B86518 | 2304/94DF6830       | 2304/94DF6830      | 2304/94DF6830            | 2304/94E68B18             |         2556696
(18 rows)

On the server side, we did not see evidence of walsenders being launched for the main slots. As a result,
the gap kept increasing, since the table apply workers still had not transitioned to the catchup state
even after several hours.
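
In case it helps, this is the sort of publisher-side check we would use to see which walsenders exist
and which slots they are serving (a sketch, not a capture from the incident):

-- Connected walsenders, joined to the slots they currently have acquired.
-- With the main apply workers failing to reconnect, only the pg_*_sync_*
-- slots show up as served; none of the main_slot_* slots do.
SELECT sr.pid, sr.application_name, sr.state, sr.sent_lsn, sr.flush_lsn,
       rs.slot_name
FROM pg_stat_replication sr
LEFT JOIN pg_replication_slots rs ON rs.active_pid = sr.pid;

-- Cross-check against the backend list.
SELECT pid, application_name, backend_start
FROM pg_stat_activity
WHERE backend_type = 'walsender';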

Thanks,
Drew


On Fri, Oct 13, 2023 at 6:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> Depending on the Major Version an untimely timeout or termination of the
> main apply worker when the table apply worker is waiting for the
> subscription relstate to change to SUBREL_STATE_CATCHUP can lead to one of
> two really painful experiences.
>
> If on Rel14+, the untimely exit can lead to the main apply worker becoming
> indefinitely stuck while it waits for a table apply worker, which has already
> exited and won't be launched again, to change the subscription relstate to
> SUBREL_STATE_SYNCDONE. In order to unwedge, a system restart is required to
> clear the corrupted transient subscription relstate data.
>
> If on Rel13+, then the untimely exit can lead to silent data loss. This will
> occur if the table apply worker performed a copy at LSN X. If the main apply
> worker is now at LSN Y > X, the system requires the table sync worker to
> apply all changes between X & Y that were skipped by the main apply worker
> in a catch up phase. Due to the untimely exit, the table apply worker will
> assume that the main apply worker was actually behind, skip the catch up
> work, and exit. As a result, all data between X & Y will be lost for that
> table.
>
> The cause of both issues is that wait_for_worker_state_change() is handled
> like a void function by the table apply worker. However, if the main apply
> worker does not currently exist on the system due to some issue such as a
> timeout triggering it to safely exit, then the function will return *before*
> the state change has occurred and return false.
>


But even if wait_for_worker_state_change() returns false, ideally the
table sync worker shouldn't exit without marking the relstate as
SYNCDONE. The table sync worker should keep looping till the state
changes to SUBREL_STATE_CATCHUP. See process_syncing_tables_for_sync(),
which doesn't allow the table sync worker to exit. Your observation
doesn't match this analysis, so can you please share how you reached
the conclusion that the table sync worker will exit and won't restart?


--
With Regards,
Amit Kapila.



