Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAD21AoA99fyUcbgGpxzhqWh+MxC5iTVdm0_GDKCV5SHxaEEfng@mail.gmail.com
Whole thread Raw
In response to RE: Fix slot synchronization with two_phase decoding enabled  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses Re: Fix slot synchronization with two_phase decoding enabled
List pgsql-hackers
On Mon, Apr 28, 2025 at 4:33 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, Apr 28, 2025 at 5:33 PM shveta malik wrote:
> >
> > On Mon, Apr 28, 2025 at 2:27 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote:
> > > >
> > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com>
> > > > >
> > > > > -----
> > > > > Fix
> > > > > -----
> > > > >
> > > > > I think we should keep the confirmed_flush even if the previous
> > > > > synced restart_lsn/catalog_xmin is newer. Attachments include a
> > > > > patch for the
> > > > same.
> > > > >
> > > >
> > > > This will fix the case we are facing but adds a new rule for slot
> > synchronization.
> > > > Can we think of a simpler way to fix this by avoiding updating other
> > > > slot fields (like two_phase, two_phase_at) if restart_lsn or
> > > > catalog_xmin of the local slot is ahead of the remote slot?
> > >
> > > Since the fix for xmin advancement during fast-forward decoding has
> > > been pushed (commit aaf9e95), I'm attaching the V2 patch to address
> > > the assert failure by avoiding updating other slot fields (like
> > > two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot is
> > ahead.
> > >
> >
> > The patch looks good to me. We can have minor comment tweak:
> >
> > + * Syncing only two_phase_at, without also syncing the latest
> > + * confirmed_lsn, might lead to transactions between the old
> > + * confirmed_lsn and two_phase_at being unexpectedly decoded and sent
> > + * to the subscriber.
> >
> > We can append "following a promotion".
>
> Thanks for reviewing. Here is V3 patch that addressed it.
>
> BTW, I also did some tests to confirm the catalog_xmin could
> still be ahead in some case, and here is an example:
>
> 1. Create a failover replication slot named 'logicalslot' on primary
>   and acquire it in the walsender.
>
> 2. Log two standby snapshots on primary. Before logging, call txid_current()
>  To assign a xid, so that each standby snapshot will hold a new xid in its oldestrunningXid field:
>    - txid_current();
>    - `0/3000420` - `running_xacts` (no running transactions, oldestrunningXid = 755)
>    - txid_current();
>    - `0/3000488` - `running_xacts` (no running transactions, oldestrunningXid = 756)
>
> 3. The walsender sets `0/3000420` as the `candidate_restart_lsn`, 755 as
>    `candidate_catalog_xmin`, skipping the second `running_xacts` because
>    `candidate_restart_lsn`/`candidate_catalog_xmin` is already valid.
>
> 4. After receiving a reply from the apply worker, the walsender assigns
>    `0/3000420` to `restart_lsn`, `755` to `catalog_xmin`. At this point, the
>    replication slot 'logicalslot' has `restart_lsn` set to `0/3000420`,
>    `catalog_xmin` set to `755`.
>
> 5. On the standby, execute `pg_sync_replication_slots()` to synchronize
>    'logicalslot'.
>
> 6. During synchronization, with the initial `restart_lsn` at `0/3000420`, the
>    sync slot reaches a consistent point at this position. As a result, it does
>    not update `candidate_restart_lsn` and `candidate_catalog_xmin` at
>    consistent point (refer to `SnapBuildProcessRunningXacts()`).
>
> 7. The sync process identifies the second standby snapshot at `0/3000488` and
>    uses its LSN as `candidate_restart_lsn`, and use the oldestrunningXid `756`
>    as `candidate_catalog_xmin`, eventually updating it to `restart_lsn` and
>    `catalog_xmin`.
>
> 8. Now, the synced slot holds `restart_lsn` at `0/3000488`, `catalog_xmin` at
>    `756`, which are all ahead of the remote slot on the primary server.
>
> Attaching a script to reproduce the same.
>
> Note that, to reproduce this stably, we'd better modify the value of
> LOG_SNAPSHOT_INTERVAL_MS in bgwriter.c to a bigger number to avoid
> unexpected xl_running_xacts logging.

Thank you for sharing the patch and reproducible steps. I agree that
this issue still happens on HEAD and should be fixed.

If I understand the problem correctly, we need to avoid setting the
values such that two_phase_at > confirmed_lsn. Because otherwise if a
transaction is prepared between them, we end up decoding the prepared
transaction twice or with an assertion failure if it's enabled. It
never happens on the primary since we always set the same value to
two_phase_at as confirmed_lsn.

Your fix looks good to me. Is it worth considering putting an
assertion to verify if new two_phase_at is equal to or greater than
confirmed_lsn (or at least it doesn't go backward), when syncing
two_phase_at?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Large expressions in indexes can't be stored (non-TOASTable)
Next
From: Nikhil Kumar Veldanda
Date:
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression