RE: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Fix slot synchronization with two_phase decoding enabled
Msg-id OS0PR01MB57164AB5716AF2E477D53F6F9489A@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> 
> On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > While I cannot be entirely certain of my analysis, I believe the root
> > cause might be related to the backward movement of the confirmed_flush
> > LSN. The following scenario seems possible:
> >
> > 1. The walsender enables the two_phase and sets two_phase_at (which
> > should be the same as confirmed_flush).
> > 2. The slot's confirmed_flush regresses for some reason.
> > 3. The slotsync worker retrieves the remote slot information and
> > enables two_phase for the local slot.
> >
> 
> Yes, this is possible. Here is my theory as to how it can happen in the current
> case. In the failed test, after the primary has prepared a transaction, the
> transaction won't be replicated to the subscriber as two_phase was not
> enabled for the slot. However, subsequent keepalive messages can send the
> latest WAL location to the subscriber and get the confirmation of the same from
> the subscriber without its origin being moved. Now, after we restart the apply
> worker (due to disable/enable for a subscription), it will use the previous
> origin_lsn to temporarily move back the confirmed flush LSN as explained in
> one of the previous emails in another thread [1]. During this temporary
> movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> and confirm_flush_lsn values, leading to the assertion failure. We see this
> issue intermittently because it depends on the timing of slotsync worker's
> request to fetch the slot's value.

Based on this theory, I was able to reproduce the buildfarm (BF) failure in the
040 TAP test on HEAD after applying the 0001 patch. I used an injection point
to stop the walsender from sending a keepalive before it receives the old
origin position from the apply worker, which ensures that confirmed_flush
consistently moves backward before slotsync.

Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
using the attached script (after applying the injection point patch). This
issue arises if we disable the subscription immediately after
confirmed_flush_lsn moves backward, which prevents the walsender from
advancing confirmed_flush_lsn again.

In this case, if a transaction was prepared before two_phase_at, then after
re-enabling the subscription, the walsender replicates that prepared
transaction once when decoding the PREPARE record and again when decoding the
COMMIT PREPARED record. In such cases, the apply worker keeps reporting the
error:

ERROR: transaction identifier "pg_gid_16387_755" is already in use.
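
To make the double delivery easier to follow, here is a toy model of the two decoding decisions involved. It is only a sketch under stated assumptions, not PostgreSQL source: LSNs are plain integers, the function names are hypothetical, and the two conditions (send at PREPARE when the record is at or after confirmed_flush; re-send at COMMIT PREPARED when the PREPARE lies before two_phase_at) are my simplified reading of the behavior described above. The exact boundary comparisons are illustrative.

```python
# Toy model only; not PostgreSQL source. LSNs are plain integers and all
# function names are hypothetical.

def sent_at_prepare(prepare_lsn: int, confirmed_flush: int) -> bool:
    # Assumption: the PREPARE record is decoded and sent when it lies at or
    # after the point where decoding restarts (the slot's confirmed_flush).
    return prepare_lsn >= confirmed_flush


def resent_at_commit_prepared(prepare_lsn: int, two_phase_at: int) -> bool:
    # Assumption: when COMMIT PREPARED is decoded, the transaction is
    # replayed as a prepare if its PREPARE record lies before two_phase_at.
    return prepare_lsn < two_phase_at


def prepare_deliveries(prepare_lsn: int, confirmed_flush: int,
                       two_phase_at: int) -> int:
    # Number of times the subscriber would receive the prepared transaction.
    return (int(sent_at_prepare(prepare_lsn, confirmed_flush))
            + int(resent_at_commit_prepared(prepare_lsn, two_phase_at)))


# Expected state: confirmed_flush equals two_phase_at, PREPARE is before both.
normal = prepare_deliveries(prepare_lsn=100, confirmed_flush=200,
                            two_phase_at=200)

# Problem state: confirmed_flush has moved back behind the PREPARE record
# while two_phase_at stayed where it was.
regressed = prepare_deliveries(prepare_lsn=100, confirmed_flush=50,
                               two_phase_at=200)

print(normal, regressed)
```

In this model the transaction is delivered once while confirmed_flush and two_phase_at agree, and twice once confirmed_flush has regressed below the PREPARE record, which matches the repeated "already in use" error on the subscriber.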

Apart from the above, we're investigating whether the same issue can occur in
the back branches and will share the results once they are ready.

Best Regards,
Hou zj
