Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAA4eK1+q_qmp45weGeW4yBmHV4k-kiFYNFNH4oeg_N8vGjC+Ew@mail.gmail.com
Whole thread Raw
In response to RE: Fix slot synchronization with two_phase decoding enabled  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> The recently added tests for slotsync on two-phase enabled slots failed[1] due
> to a broken assertion triggered while decoding the COMMIT PREPARED record on
> the promoted standby.
>
> -----
> Analysis
> -----
>
>         if ((txn->final_lsn < two_phase_at) && is_commit)
>         {
>                 /*
>                  * txn must have been marked as a prepared transaction and skipped but
>                  * not sent a prepare. Also, the prepare info must have been updated
>                  * in txn even if we skip prepare.
>                  */
>                 Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) ==
>                            (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));
>
> I think the issue stem from the PREPARED transaction, which was skipped on the
> primary, not being skipped on the promoted standby. The root cause appears to
> be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced
> to the standby. This results in the confirmed_flush (0/6004F90) of the synced
> slot being less than the synced two_phase_at field (0/6005118). Consequently,
> the PREPARE record's LSN exceeds the synced confirmed_flush during
> decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when
> decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but
> wasn't skipped.
>
> -- The LOGs that proves the confirmed_flush is not synced:
>
> decoding on original slot (primary):
>         LOG:  0/6004F90 has been already streamed, forwarding to 0/6005118
>         ...
>         DETAIL:  Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
>
> decoding on synced slot (promted standby) - failed run:
>         DETAIL:  Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8.
>
> decoding on synced slot (promted standby) - success run:
>         LOG:  0/6004F90 has been already streamed, forwarding to 0/6005118
>         ...
>         DETAIL:  Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
> --
>
> The logs above clearly show that during the test failure, the starting position
> (6004F90) was less than that of a successful run. Additionally, it was lower
> than the starting position of the remote slot on the primary, indicating a
> failure to synchronize confirmed_lsn.
>
> The reason confirmed_flush isn't synced should stem from the following check,
> which skip the syncing of the confirmed_flush value. I also reproduced the
> assertion failure by creating a scenario that leads to sync omission due to
> this check:
>
>         /*
>          * Don't overwrite if we already have a newer catalog_xmin and
>          * restart_lsn.
>          */
>         if (remote_slot->restart_lsn < slot->data.restart_lsn ||
>                 TransactionIdPrecedes(remote_slot->catalog_xmin,
>                                                           slot->data.catalog_xmin))
>         {
>
>
> This check triggered because the synced catalog_xmin surpassed that of the
> source slot (It can be seen from the log that the restart_lsn was synced to the
> same value) ). This situation arises due to several factors:
>
> On the publisher(primary), the slot may retain an older catalog_xmin and
> restart_lsn because it is being consumed by a walsender. In this scenario, if
> the apply worker does not immediately confirm that changes have been flushed,
> the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old
> value for candidate_catalog_xmin/candidate_restart_lsn and would not update
> them until the apply worker confirms via LogicalConfirmReceivedLocation.
>
> However, the slotsync worker has the opportunity to begin WAL reading from a
> more recent point, potentially advancing catalog_xmin/restart_lsn to newer
> values.
>
> And it's also possible to advance only catalog_xmin while keep restart_lsn
> unchanged, by starting a transaction before the running_xact record. In this
> case, it would pass builder->last_serialized_snapshot as restart_lsn to
> LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would
> point to the position of the last running_xact record instead of the current
> one, or the value would be NULL if no snapshot has been serialized before, so
> it would not advance restart_lsn. The advancement of catalog_xmin is not
> blocked by running transaction, as it would either use the XID of the running
> transaction or the oldestRunningXid as candidate.
>

Your analysis appears correct to me.

>
> -----
> Fix
> -----
>
> I think we should keep the confirmed_flush even if the previous synced
> restart_lsn/catalog_xmin is newer. Attachments include a patch for the same.
>

This will fix the case we are facing but adds a new rule for slot
synchronization. Can we think of a simpler way to fix this by avoiding
updating other slot fields (like two_phase, two_phase_at) if
restart_lsn or catalog_xmin of the local slot is ahead of the remote
slot?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: n_ins_since_vacuum stats for aborted transactions
Next
From: Konstantin Knizhnik
Date:
Subject: Re: Changing shared_buffers without restart