RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id OS0PR01MB5716A12A7A637DE044E6F20E94112@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > I have one comment on the 0001 patch:
> >
> > Thanks for the comments!
> >
> > >
> > > +       /*
> > > +        * The changes made by this and later transactions are still
> > > non-removable
> > > +        * to allow for the detection of update_deleted conflicts
> > > + when
> > > applying
> > > +        * changes in this logical replication worker.
> > > +        *
> > > +        * Note that this info cannot directly protect dead tuples from
> being
> > > +        * prematurely frozen or removed. The logical replication launcher
> > > +        * asynchronously collects this info to determine whether to
> > > + advance
> > > the
> > > +        * xmin value of the replication slot.
> > > +        *
> > > +        * Therefore, FullTransactionId that includes both the
> > > transaction ID and
> > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > +        * critical because without considering the epoch, the transaction
> ID
> > > +        * alone may appear as if it is in the future due to transaction ID
> > > +        * wraparound.
> > > +        */
> > > +       FullTransactionId oldest_nonremovable_xid;
> > >
> > > The last paragraph of the comment mentions that we need to use
> > > FullTransactionId to properly compare XIDs even after the XID
> > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > prevents XIDs from being wraparound, no? I mean that workers'
> > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > xmin) are never away from more than 2^31 XIDs.
> >
> > I think the issue is that the launcher may create the replication slot
> > after the apply worker has already set the 'oldest_nonremovable_xid'
> > because the launcher are doing that asynchronously. So, Before the
> > slot is created, there's a window where transaction IDs might wrap
> > around. If initially the apply worker has computed a candidate_xid
> > (755) and the xid wraparound before the launcher creates the slot,
> > causing the new current xid to be (740), then the old
> > candidate_xid(755) looks like a xid in the future, and the launcher
> > could advance the xmin to 755 which cause the dead tuples to be removed
> prematurely.
> > (We are trying to reproduce this to ensure that it's a real issue and
> > will share after finishing)
> 
> The slot's first xmin is calculated by
> GetOldestSafeDecodingTransactionId(false). The initial computed
> cancidate_xid could be newer than this xid?

I think the issue occurs when the slot is created after an XID wraparound. As a
result, GetOldestSafeDecodingTransactionId() returns the current XID
(after wraparound), which appears older than the computed candidate_xid (e.g.,
oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
issue in [1]. What do you think ?

[1] https://www.postgresql.org/message-id/CABdArM6P0zoEVRN%2B3YHNET_oOaAVOKc-EPUnXiHkcBJ-uDKQVw%40mail.gmail.com

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Jeremy Schneider
Date:
Subject: Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
Next
From: Masahiko Sawada
Date:
Subject: Re: Conflict detection for update_deleted in logical replication