On Mon, May 5, 2025 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > While I cannot be entirely certain of my analysis, I believe the root
> > cause might be related to the backward movement of the confirmed_flush
> > LSN. The following scenario seems possible:
> >
> > 1. The walsender enables the two_phase and sets two_phase_at (which
> > should be the same as confirmed_flush).
> > 2. The slot's confirmed_flush regresses for some reason.
> > 3. The slotsync worker retrieves the remote slot information and
> > enables two_phase for the local slot.
> >
>
> Yes, this is possible. Here is my theory as to how it can happen in
> the current case. In the failed test, after the primary has prepared a
> transaction, the transaction won't be replicated to the subscriber as
> two_phase was not enabled for the slot. However, subsequent keepalive
> messages can send the latest WAL location to the subscriber and get
> the confirmation of the same from the subscriber without its origin
> being moved. Now, after we restart the apply worker (due to
> disable/enable for a subscription), it will use the previous
> origin_lsn to temporarily move back the confirmed flush LSN as
> explained in one of the previous emails in another thread [1]. During
> this temporary movement of confirm flush LSN, the slotsync worker
> fetches the two_phase_at and confirm_flush_lsn values, leading to the
> assertion failure. We see this issue intermittently because it depends
> on the timing of slotsync worker's request to fetch the slot's value.
>
> If this theory is correct, then we need something on the lines of what
> Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward) to
> fix it.
This theory seems plausible. According to the thread you shared, we
didn't address the issue of backward movement in the slot's
confirmed_flush LSN. I think that we should carefully consider whether
this behavior is acceptable in the first place. If we decide to
maintain this behavior, we would need to modify the slotsync to deal
with such scenarios (note we have already ensured that the synced
slot's confirmed_flush cannot regress). However, I'm concerned about a
case like where the server crashes immediately after persisting the
retreated confirmed_flush LSN (along with restart_lsn or xmin updates)
to disk. In such a case, wouldn't the walsender potentially send
duplicate data? If so, we would need a patch like Vignesh proposed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com