On Thu, Apr 3, 2025 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote:
> > >
> > > >
> > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thank you for the explanation! I agree that the issue happens in these cases.
> > > >
> > > > As another idea, I wonder if we could somehow defer marking the synced
> > > > slot as 'sync-ready' until we can ensure that the slot doesn't have
> > > > any transactions that were prepared before the point of enabling
> > > > two_phase. For example, when the slotsync worker fetches the remote
> > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local
> > > > slot's two_phase becomes true or the local slot is newly created with
> > > > two_phase enabled, and then it marks the slot 'sync-ready' once it
> > > > has confirmed that the slot's restart_lsn has passed LSN-1. Would that work?
> > >
> > > Thanks for the idea!
> > >
> > > We considered a similar approach in [1] to confirm there are no prepared
> > > transactions before two_phase_at, but the issue arises when the two_phase
> > > flag is switched from 'false' to 'true' (as in the case with
> > > (copy_data=true, failover=true, two_phase=true)). In this case, the slot
> > > may have already been marked as sync-ready before the two_phase flag is
> > > enabled, as slotsync is unaware of potential future changes to the
> > > two_phase flag.
> >
> > This can happen because when copy_data is true, tablesync can take a
> > long time to complete the sync and, in the meantime, a slot without
> > the two_phase flag would have been synced to the standby. Such a slot
> > would be marked as sync-ready even if we follow the calculation proposed
> > by Sawada-san. Note that we enable two_phase once all the tables are in
> > the ready state (see run_apply_worker() and the comments atop worker.c
> > (TWO_PHASE TRANSACTIONS)).
>
> Right. It doesn't make sense to make the slot not-sync-ready and then
> back to sync-ready.
>
> While I agree with the approach for HEAD and it seems difficult to
> find a solution, I'm concerned that disallowing the combination of
> failover and two_phase in a minor release would affect users
> significantly. Users who are already using that combination might end
> up needing to rethink their system architecture. So I'm trying to
> narrow down the use cases we are going to prohibit, or to find
> workarounds.
>
> If we agree with the fix for HEAD, we can push the fix for HEAD first,
> which is better done sooner since it requires a catversion bump. We can
> discuss the ideas and workarounds for v17 later.
>
Thanks, I'll push the patch for HEAD and then keep thinking about whether
there is a better way to deal with the problem in v17. BTW, the problem in
v17 can happen only in a much narrower set of cases, as explained in the
emails above.
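
To make the sequence discussed above concrete, here is a toy Python model of
the deferred sync-ready rule and of the copy_data=true case that defeats it.
This is only a sketch under stated assumptions: RemoteSlot, SyncedSlot,
sync_once and the scenario functions are hypothetical names invented for
illustration, LSNs are plain integers, and none of this is the actual
slotsync code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RemoteSlot:
    """Hypothetical view of the primary's slot as fetched by slotsync."""
    two_phase: bool
    confirmed_flush_lsn: int
    restart_lsn: int


@dataclass
class SyncedSlot:
    """Hypothetical standby-side state; not PostgreSQL's actual struct."""
    two_phase: bool = False
    sync_ready: bool = False
    wait_lsn: Optional[int] = None  # "LSN-1" in the proposal


def sync_once(local: SyncedSlot, remote: RemoteSlot,
              is_new: bool = False) -> Tuple[bool, bool]:
    """Run one sync cycle under the proposed rule.

    Returns (sync_ready, revoked), where revoked is True if a slot that
    was already sync-ready had to be flipped back to not-sync-ready.
    """
    was_ready = local.sync_ready
    # Remember confirmed_flush_lsn when two_phase turns on, or when the
    # slot is newly created with two_phase already enabled.
    if remote.two_phase and (is_new or not local.two_phase):
        local.wait_lsn = remote.confirmed_flush_lsn
    local.two_phase = remote.two_phase
    # Ready only once restart_lsn has reached the remembered LSN (if any).
    local.sync_ready = (local.wait_lsn is None
                        or remote.restart_lsn >= local.wait_lsn)
    return local.sync_ready, was_ready and not local.sync_ready


def scenario_two_phase_from_start() -> List[Tuple[bool, bool]]:
    """Slot is created with two_phase enabled: deferral works."""
    local = SyncedSlot()
    remote = RemoteSlot(two_phase=True, confirmed_flush_lsn=100, restart_lsn=50)
    results = [sync_once(local, remote, is_new=True)]
    remote.restart_lsn = 120  # restart_lsn has now passed LSN-1
    results.append(sync_once(local, remote))
    return results


def scenario_two_phase_enabled_late() -> List[Tuple[bool, bool]]:
    """copy_data=true: two_phase is enabled only after tablesync finishes."""
    local = SyncedSlot()
    remote = RemoteSlot(two_phase=False, confirmed_flush_lsn=100, restart_lsn=90)
    # Slotsync cannot know two_phase will change later, so the slot is
    # marked sync-ready right away.
    results = [sync_once(local, remote, is_new=True)]
    # All tables reach the ready state; the primary enables two_phase.
    remote.two_phase = True
    remote.confirmed_flush_lsn = 200
    remote.restart_lsn = 150
    results.append(sync_once(local, remote))
    return results


if __name__ == "__main__":
    print(scenario_two_phase_from_start())
    print(scenario_two_phase_enabled_late())
```

In the second scenario the only way to honour the rule is to revoke an
already sync-ready slot, which is the not-sync-ready-and-back transition that
we agreed above doesn't make sense.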
--
With Regards,
Amit Kapila.