Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAA4eK1KC1pYsH6THkLHMRm+kViXUNpDYpo+HbxLFr5MfMKnV7Q@mail.gmail.com
Whole thread Raw
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, May 6, 2025 at 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, May 5, 2025 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > While I cannot be entirely certain of my analysis, I believe the root
> > > cause might be related to the backward movement of the confirmed_flush
> > > LSN. The following scenario seems possible:
> > >
> > > 1. The walsender enables the two_phase and sets two_phase_at (which
> > > should be the same as confirmed_flush).
> > > 2. The slot's confirmed_flush regresses for some reason.
> > > 3. The slotsync worker retrieves the remote slot information and
> > > enables two_phase for the local slot.
> > >
> >
> > Yes, this is possible. Here is my theory as to how it can happen in
> > the current case. In the failed test, after the primary has prepared a
> > transaction, the transaction won't be replicated to the subscriber as
> > two_phase was not enabled for the slot. However, subsequent keepalive
> > messages can send the latest WAL location to the subscriber and get
> > the confirmation of the same from the subscriber without its origin
> > being moved. Now, after we restart the apply worker (due to
> > disable/enable for a subscription), it will use the previous
> > origin_lsn to temporarily move back the confirmed flush LSN as
> > explained in one of the previous emails in another thread [1]. During
> > this temporary movement of confirm flush LSN, the slotsync worker
> > fetches the two_phase_at and confirm_flush_lsn values, leading to the
> > assertion failure. We see this issue intermittently because it depends
> > on the timing of slotsync worker's request to fetch the slot's value.
> >
> > If this theory is correct, then we need something on the lines of what
> > Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to
> > fix it.
>
> This theory seems plausible. According to the thread you shared, we
> didn't address the issue of backward movement in the slot's
> confirmed_flush LSN. I think that we should carefully consider whether
> this behavior is acceptable in the first place. If we decide to
> maintain this behavior, we would need to modify the slotsync to deal
> with such scenarios (note we have already ensured that the synced
> slot's confirmed_flush cannot regress). However, I'm concerned about a
> case like where the server crashes immediately after persisting the
> retreated confirmed_flush LSN (along with restart_lsn or xmin updates)
> to disk. In such a case, wouldn't the walsender potentially send
> duplicate data?
>

Yes that could happen but last time when we tried we couldn't
reproduce such a case. However, we will try some more to see if we can
reproduce a problem even without slotsync. Having said that, we will
also try to reproduce the failure with slotsync in the test where BF
is showing failure to confirm the above theory.

>
 If so, we would need a patch like Vignesh proposed.
>

Right, I feel that is the right thing to do, even if we could prove
that it can lead to failure in slotsync.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: 2025-05-08 release announcement draft
Next
From: Jack Ng
Date:
Subject: RE: Changing shared_buffers without restart