Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Fix slot synchronization with two_phase decoding enabled |
Date | |
Msg-id | CAA4eK1+q_qmp45weGeW4yBmHV4k-kiFYNFNH4oeg_N8vGjC+Ew@mail.gmail.com Whole thread Raw |
In response to | RE: Fix slot synchronization with two_phase decoding enabled ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Hi, > > The recently added tests for slotsync on two-phase enabled slots failed[1] due > to a broken assertion triggered while decoding the COMMIT PREPARED record on > the promoted standby. > > ----- > Analysis > ----- > > if ((txn->final_lsn < two_phase_at) && is_commit) > { > /* > * txn must have been marked as a prepared transaction and skipped but > * not sent a prepare. Also, the prepare info must have been updated > * in txn even if we skip prepare. > */ > Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) == > (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)); > > I think the issue stem from the PREPARED transaction, which was skipped on the > primary, not being skipped on the promoted standby. The root cause appears to > be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced > to the standby. This results in the confirmed_flush (0/6004F90) of the synced > slot being less than the synced two_phase_at field (0/6005118). Consequently, > the PREPARE record's LSN exceeds the synced confirmed_flush during > decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when > decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but > wasn't skipped. > > -- The LOGs that proves the confirmed_flush is not synced: > > decoding on original slot (primary): > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. > > decoding on synced slot (promted standby) - failed run: > DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8. > > decoding on synced slot (promted standby) - success run: > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. > -- > > The logs above clearly show that during the test failure, the starting position > (6004F90) was less than that of a successful run. Additionally, it was lower > than the starting position of the remote slot on the primary, indicating a > failure to synchronize confirmed_lsn. > > The reason confirmed_flush isn't synced should stem from the following check, > which skip the syncing of the confirmed_flush value. I also reproduced the > assertion failure by creating a scenario that leads to sync omission due to > this check: > > /* > * Don't overwrite if we already have a newer catalog_xmin and > * restart_lsn. > */ > if (remote_slot->restart_lsn < slot->data.restart_lsn || > TransactionIdPrecedes(remote_slot->catalog_xmin, > slot->data.catalog_xmin)) > { > > > This check triggered because the synced catalog_xmin surpassed that of the > source slot (It can be seen from the log that the restart_lsn was synced to the > same value) ). This situation arises due to several factors: > > On the publisher(primary), the slot may retain an older catalog_xmin and > restart_lsn because it is being consumed by a walsender. In this scenario, if > the apply worker does not immediately confirm that changes have been flushed, > the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old > value for candidate_catalog_xmin/candidate_restart_lsn and would not update > them until the apply worker confirms via LogicalConfirmReceivedLocation. > > However, the slotsync worker has the opportunity to begin WAL reading from a > more recent point, potentially advancing catalog_xmin/restart_lsn to newer > values. > > And it's also possible to advance only catalog_xmin while keep restart_lsn > unchanged, by starting a transaction before the running_xact record. In this > case, it would pass builder->last_serialized_snapshot as restart_lsn to > LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would > point to the position of the last running_xact record instead of the current > one, or the value would be NULL if no snapshot has been serialized before, so > it would not advance restart_lsn. The advancement of catalog_xmin is not > blocked by running transaction, as it would either use the XID of the running > transaction or the oldestRunningXid as candidate. > Your analysis appears correct to me. > > ----- > Fix > ----- > > I think we should keep the confirmed_flush even if the previous synced > restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. > This will fix the case we are facing but adds a new rule for slot synchronization. Can we think of a simpler way to fix this by avoiding updating other slot fields (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot is ahead of the remote slot? -- With Regards, Amit Kapila.
pgsql-hackers by date: