Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAA4eK1JPGytL1b9VHM263B8pGGpibV+S7OD_hdaFSkH4X0u4XA@mail.gmail.com
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Fix slot synchronization with two_phase decoding enabled
List pgsql-hackers
On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> While I cannot be entirely certain of my analysis, I believe the root
> cause might be related to the backward movement of the confirmed_flush
> LSN. The following scenario seems possible:
>
> 1. The walsender enables the two_phase and sets two_phase_at (which
> should be the same as confirmed_flush).
> 2. The slot's confirmed_flush regresses for some reason.
> 3. The slotsync worker retrieves the remote slot information and
> enables two_phase for the local slot.
>

Yes, this is possible. Here is my theory as to how it can happen in
the current case. In the failed test, after the primary has prepared a
transaction, the transaction won't be replicated to the subscriber
because two_phase was not enabled for the slot. However, subsequent
keepalive messages can send the latest WAL location to the subscriber,
and the subscriber can confirm that location without its origin being
advanced. Now, after we restart the apply worker (due to
disable/enable for a subscription), it will use the previous
origin_lsn, which temporarily moves the confirmed_flush LSN backward,
as explained in one of the previous emails in another thread [1]. If
the slotsync worker fetches the two_phase_at and confirmed_flush_lsn
values during this temporary backward movement, we hit the assertion
failure. We see this issue intermittently because it depends on the
timing of the slotsync worker's request to fetch the slot's values.

If this theory is correct, then we need something along the lines of
what Vignesh proposed in email [2] (Confirm_flush_dont_allow_backward)
to fix it.
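
To make the idea concrete, here is a rough, self-contained sketch of
the kind of guard I have in mind. It is not the code from the patch in
[2] and it does not use the real ReplicationSlot structure: DemoSlot
and all demo_* functions are hypothetical names made up for this
illustration, and I am assuming the failing assertion expects
two_phase_at <= confirmed_flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's WAL position type (assumption for this sketch). */
typedef uint64_t XLogRecPtr;

/* Hypothetical, simplified slot state; NOT the real ReplicationSlot. */
typedef struct DemoSlot
{
	XLogRecPtr	confirmed_flush;	/* last position confirmed by subscriber */
	XLogRecPtr	two_phase_at;		/* position at which two_phase was enabled */
	bool		two_phase;
} DemoSlot;

/* Unguarded update: blindly accepts whatever position is reported. */
static void
demo_set_confirmed_flush(DemoSlot *slot, XLogRecPtr lsn)
{
	assert(slot != NULL);
	slot->confirmed_flush = lsn;
}

/*
 * Guarded update: ignores a reported position that is not ahead of the
 * current one, so confirmed_flush never moves backward. Returns true if
 * the slot was advanced.
 */
static bool
demo_advance_confirmed_flush(DemoSlot *slot, XLogRecPtr lsn)
{
	assert(slot != NULL);
	if (lsn <= slot->confirmed_flush)
		return false;
	slot->confirmed_flush = lsn;
	return true;
}

/*
 * The invariant a reader of the slot (such as the slotsync worker) is
 * assumed to rely on: once two_phase is enabled, two_phase_at must not
 * be ahead of confirmed_flush.
 */
static bool
demo_snapshot_is_consistent(const DemoSlot *slot)
{
	assert(slot != NULL);
	return !slot->two_phase || slot->two_phase_at <= slot->confirmed_flush;
}
```

With the unguarded setter, a restarted apply worker reporting an older
origin_lsn leaves the slot in a state where a concurrent reader sees
two_phase_at ahead of confirmed_flush. With the guarded version the
stale report is simply ignored, so a reader never observes that window.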

[1]: https://www.postgresql.org/message-id/CAA4eK1%2BzWQwOe5G8zCYGvErnaXh5%2BDbyg_A1Z3uywSf_4%3DT0UA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CALDaNm3hgow2%2BoEov5jBk4iYP5eQrUCF1yZtW7%2BdV3J__p4KLQ%40mail.gmail.com

--
With Regards,
Amit Kapila.