Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers
From | Nisha Moond |
---|---|
Subject | Re: Fix slot synchronization with two_phase decoding enabled |
Date | |
Msg-id | CABdArM5ziR1sfmnf84QK-rK=2PKkr-veyNt=Q_j=C0Vm5NJ0ng@mail.gmail.com Whole thread Raw |
In response to | RE: Fix slot synchronization with two_phase decoding enabled ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
HI, Please find the patches attached for all three approaches. On Wed, Apr 9, 2025 at 10:45 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote: > > > > On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > > > > > > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > Thank you for the explanation! I agree that the issue happens in > > > > > > these > > > cases. > > > > > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > > > doesn't have any transactions that are prepared before the point > > > > > > of enabling two_phase. For example, when the slotsync worker > > > > > > fetches the remote slot, it remembers the confirmed_flush_lsn > > > > > > (say > > > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > > > slot is newly created with enabling two_phase, and then it makes > > > > > > the slot 'sync-ready' once it confirmed that the slot's > > > > > > restart_lsn passed > > > LSN-1. Does it work? > > > > > > > > > > Thanks for the idea! > > > > > > > > > > We considered a similar approach in [1] to confirm there is no > > > > > prepared transactions before two_phase_at, but the issue is that > > > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > > > this case, the slot may have already been marked as sync-ready > > > > > before the two_phase flag is enabled, as slotsync is unaware of > > > > > potential > > > future changes to the two_phase flag. > > > > > > > > This can happen because when copy_data is true, tablesync can take a > > > > long time to complete the sync and in the meantime, slot without a > > > > two_phase flag would have been synced to standby. Such a slot would > > > > be marked as sync-ready even if we follow the calculation proposed > > > > by Sawada-san. Note that we enable two_phase once all the tables are > > > > in ready state (See run_apply_worker() and comments atop worker.c > > > > (TWO_PHASE TRANSACTIONS)). > > > > > > Right. It doesn't make sense to make the slot not-sync-ready and then > > > back to sync-ready. > > > > > > While I agree with the approach for HEAD and it seems difficult to > > > find a solution, I'm concerned that disallowing to use both failover > > > and two_phase in a minor release would affect users much. Users who > > > are already using that combination might end up needing to re-think > > > their system architecture. So I'm trying to narrow down use cases > > > where we're going to prohibit or to find workarounds. > > We analyzed more for the backbranch fix, and here is a summary of different > approaches that could be used for PG17. > > ---------- > Approach 1 > ---------- > > This is the original approach implemented in V4 patch. > > In this approach, we entirely disallow enabling both failover and the two-phase > feature together for a replication slot or subscription. > > pros: > > This restriction is simple to implement and easy for users to comprehend. > > cons: > > Users would be unable to use the two-phase feature in conjunction with > failover. > > Following the upgrade to a new release with this fix, existing subscriptions > that have both failover and two-phase enabled would require manual re-creation, > which is time-consuming. > Patch "v5_aprch_1-0001" implements the above Approach 1. > ---------- > Approach 2 > ---------- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. > > Users are not always forced to re-create subscriptions post-upgrade. > > cons: > > The logic involved for (restart_lsn > two_phase_at) might be less intuitive for > users. > > After upgrading, it's recommended that users verify whether two_phase_at for a > source slot is less than restart_lsn of the synced slot. If it isn't, users > should be careful when using the synced slot on a standby. This might be > complex to understand. > Patch "v5_aprch_2-0001" implements the above Approach 2. > ---------- > Approach 3 > ---------- > > This approach is similar to Approach 2 but simplifies some aspects by avoiding > checks for prepared transactions during slot creation. It disallows enabling > both failover and twophase during slot creation, but permits altering failover > to true once ensured that no prepared transaction exists (restart_lsn > > two_phase_at). > > pros: > > Like Approach 2, this reduces user restrictions compared to completely > disallowing failover and two-phase usage together. > > Users are not always forced to re-create subscriptions post-upgrade. > > cons: > > User still could not enable failover and twophase when creating a slot. > > Similar to approach 1, the check (restart_lsn > two_phase_at) might be less > intuitive for users. > > After upgrading, it's recommended that users verify whether two_phase_at for a > source slot is less than restart_lsn of the synced slot. If it isn't, users > should be careful when using the synced slot on a standby. This might be > complex to understand. > Patch "v5_aprch_3-0001" implements the above Approach 3. Thanks Hou-san for implementing approach-2 and providing the patch. -- Thanks, Nisha
Attachment
pgsql-hackers by date: