RE: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id OS3PR01MB571846125A89761A47389CA594AE2@OS3PR01MB5718.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote:

> 
> On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:
> > >
> > > >
> > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thank you for the explanation! I agree that the issue happens in these
> cases.

Just to share the steps to reproduce the issue when creating the subscription
With (create_slot=true, failover = true, two_phase=true, copy_data=false) for reference.

- pub: begin a txn A. This is to stop the slot from building a consistent
  snapshot immediately.
- sub: create subscription with (slot_name='sub', create_slot=true, failover =
  true, two_phase=true, copy_data=false);
- pub: Attach to the walsender that will create the slot and add a checkpoint
  in SnapBuildFindSnapshot() -> (builder->state = SNAPBUILD_FULL_SNAPSHOT;).
  For now, the state should be BUILDING_SNAPSHOT.
- pub: begin another txn B and commit txn A. The snapshot should reach
  FULL_SNAPSHOT now.
- pub: prepared a txn C and then commit txn B.
- pub: release the checkpoint in SnapBuildFindSnapshot(), then it should reach
  the SNAPBUILD_CONSISTENT. Now we have a prepared txn whose prepare_lsn is
  less than the two_phase_at.
- stop the primary and promote the standby.
- sub: alter the subscription to use the new primary as the publisher.
- commit the prepared transaction on new primary, the following error will be
  reported on subscriber:

LOG:  logical replication apply worker for subscription "sub" has started
ERROR:  prepared transaction with identifier "pg_gid_16398_764" does not exist.

> > > >
> > > > As another idea, I wonder if we could somehow defer to make the
> > > > synced slot as 'sync-ready' until we can ensure that the slot
> > > > doesn't have any transactions that are prepared before the point
> > > > of enabling two_phase. For example, when the slotsync worker
> > > > fetches the remote slot, it remembers the confirmed_flush_lsn (say
> > > > LSN-1) if the local slot's two_phase becomes true or the local
> > > > slot is newly created with enabling two_phase, and then it makes
> > > > the slot 'sync-ready' once it confirmed that the slot's restart_lsn passed
> LSN-1. Does it work?
> > >
> > > Thanks for the idea!
> > >
> > > We considered a similar approach in [1] to confirm there is no
> > > prepared transactions before two_phase_at, but the issue is that
> > > when the two_phase flag is switched from 'false' to 'true' (as in
> > > the case with (copy_data=true, failover=true, two_phase=true)). In
> > > this case, the slot may have already been marked as sync-ready
> > > before the two_phase flag is enabled, as slotsync is unaware of potential
> future changes to the two_phase flag.
> >
> > This can happen because when copy_data is true, tablesync can take a
> > long time to complete the sync and in the meantime, slot without a
> > two_phase flag would have been synced to standby. Such a slot would be
> > marked as sync-ready even if we follow the calculation proposed by
> > Sawada-san. Note that we enable two_phase once all the tables are in
> > ready state (See run_apply_worker() and comments atop worker.c
> > (TWO_PHASE TRANSACTIONS)).
> 
> Right. It doesn't make sense to make the slot not-sync-ready and then back to
> sync-ready.
> 
> While I agree with the approach for HEAD and it seems difficult to find a
> solution, I'm concerned that disallowing to use both failover and two_phase in
> a minor release would affect users much. Users who are already using that
> combination might end up needing to re-think their system architecture. So I'm
> trying to narrow down use cases where we're going to prohibit or to find
> workarounds.
> 
> If we agree with the fix for HEAD, we can push the fix for HEAD first, which
> would be better to be done sooner as it needs to bump the catversion. We can
> discuss the ideas and workarounds for v17 later.

Agreed. I will think more on it. One workaround could be skipping the prepared
transaction when such an issue arises, followed by manually replicating the
skipped changes.

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: Draft for basic NUMA observability
Next
From: Daniel Gustafsson
Date:
Subject: Re: Extensible user mapping handler for FDW