RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id OS0PR01MB5716C8B3C364EEE86F91B3D294102@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Conflict detection for update_deleted in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> I have one comment on the 0001 patch:

Thanks for the comments!

> 
> +       /*
> +        * The changes made by this and later transactions are still non-removable
> +        * to allow for the detection of update_deleted conflicts when applying
> +        * changes in this logical replication worker.
> +        *
> +        * Note that this info cannot directly protect dead tuples from being
> +        * prematurely frozen or removed. The logical replication launcher
> +        * asynchronously collects this info to determine whether to advance the
> +        * xmin value of the replication slot.
> +        *
> +        * Therefore, FullTransactionId that includes both the transaction ID and
> +        * its epoch is used here instead of a single Transaction ID. This is
> +        * critical because without considering the epoch, the transaction ID
> +        * alone may appear as if it is in the future due to transaction ID
> +        * wraparound.
> +        */
> +       FullTransactionId oldest_nonremovable_xid;
> 
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID wraparound
> happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.

I think the issue is that the launcher may create the replication slot after
the apply worker has already set 'oldest_nonremovable_xid', because the
launcher does that asynchronously. So, before the slot is created, there is
a window in which transaction IDs might wrap around. Suppose the apply worker
initially computed a candidate_xid (755), and the XIDs wrap around before the
launcher creates the slot, so that the new current xid is (740). Then the old
candidate_xid (755) looks like an xid in the future, and the launcher could
advance the xmin to 755, which would cause dead tuples to be removed
prematurely. (We are trying to reproduce this to confirm that it is a real
issue, and will share the results once finished.)
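
To make the 755/740 example concrete, here is a minimal standalone sketch. It
is not PostgreSQL code: the type and function names (Xid32, FullXid64,
xid32_precedes, make_full_xid, full_xid_precedes) are hypothetical stand-ins,
and the epoch values 0 and 1 are assumptions chosen for illustration. It only
shows why a modulo-2^32 comparison can place the old candidate in the future,
while a 64-bit value carrying the epoch does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
typedef uint32_t Xid32;		/* 32-bit transaction ID, wraps around */
typedef uint64_t FullXid64;	/* epoch in the high 32 bits, xid in the low 32 */

/*
 * Circular (modulo-2^32) comparison: a precedes b when the signed 32-bit
 * difference is negative. Only meaningful while the two values are less
 * than 2^31 apart.
 */
static bool
xid32_precedes(Xid32 a, Xid32 b)
{
	return (int32_t) (a - b) < 0;
}

/* Combine an epoch and a 32-bit xid into one 64-bit value. */
static FullXid64
make_full_xid(uint32_t epoch, Xid32 xid)
{
	return ((FullXid64) epoch << 32) | xid;
}

/* With the epoch included, a plain integer comparison is sufficient. */
static bool
full_xid_precedes(FullXid64 a, FullXid64 b)
{
	return a < b;
}
```

With these helpers, xid32_precedes(740, 755) is true, so the stale candidate
755 looks newer than the current xid 740. In contrast,
full_xid_precedes(make_full_xid(0, 755), make_full_xid(1, 740)) is true, so
the candidate is correctly seen as being in the past.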

We thought of another approach, which is to create/drop this slot as soon as
one enables/disables detect_update_deleted (e.g. create/drop the slot during
DDL). But it seems complicated to control concurrent slot creation and
dropping. For example, if backend A enables detect_update_deleted, it will
create a slot. But if another backend B is disabling detect_update_deleted at
the same time, then the newly created slot may be dropped by backend B. I
thought about checking the number of subscriptions that enable
detect_update_deleted before dropping the slot in backend B, but the
subscription changes made by backend A may not be visible yet (e.g. not
committed yet).
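
The visibility problem can be sketched with a toy model. This is only an
illustration under simplifying assumptions: SubRow, count_visible_enabled and
backend_b_would_drop_slot are hypothetical names, and "visible" is reduced to
a single committed flag rather than real snapshot rules.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a subscription row as another backend might see it. */
typedef struct SubRow
{
	bool		detect_update_deleted;	/* option value stored in the row */
	bool		committed;		/* has the writing transaction committed? */
} SubRow;

/* Count subscriptions that enable the option and are visible (committed). */
static size_t
count_visible_enabled(const SubRow *rows, size_t nrows)
{
	size_t		n = 0;

	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].committed && rows[i].detect_update_deleted)
			n++;
	}
	return n;
}

/* Backend B's tentative rule: drop the slot if no visible subscription needs it. */
static bool
backend_b_would_drop_slot(const SubRow *rows, size_t nrows)
{
	return count_visible_enabled(rows, nrows) == 0;
}
```

If backend A's row has the option enabled but is not yet committed, and the
only committed row has the option disabled, backend B counts zero and drops
the slot that backend A has just created.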

Does that make sense to you, or do you have some other ideas?

Best Regards,
Hou zj
