Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAD21AoA+XRb8KC8tdb89Y4zS5bjFv2zOAJNBCu9yVDLgZ+-YAg@mail.gmail.com
In response to Re: Conflict detection for update_deleted in logical replication  (Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Tue, Dec 24, 2024 at 6:43 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. A few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is called from three places:
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move the call into ReplicationSlotCreate(), or combine it into
> > ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions, because that
> would prevent the launcher from creating the slot as well unless we added
> a new parameter for them, and I am not sure that's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assume the case where the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT). After the restart, the
> > slot can be acquired since SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is
> > still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thoughts?
>
> After thinking about it more, the standard approach would be to mark the
> slot as EPHEMERAL during creation and persist it after initialization, so
> I changed it like that.
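>
> Roughly, the flow becomes the following (a sketch only, not the actual
> patch code; the elided ReplicationSlotCreate() arguments vary across
> branches):
>
>     /* Create as ephemeral so a crash before initialization leaves no
>      * slot behind; ephemeral slots are dropped during startup. */
>     ReplicationSlotCreate(CONFLICT_DETECTION_SLOT, /* ... */,
>                           RS_EPHEMERAL /* , ... */);
>
>     /* ... compute and set MyReplicationSlot->data.xmin ... */
>
>     /* Only now make the slot survive a restart. */
>     ReplicationSlotPersist();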
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want
> to set up a standby to test the ERROR case, so I didn't add one.
>
> ---
>
> Attached is the new version patch set, which addresses all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be
> better if the apply worker could avoid reporting an ERROR when the
> publisher's clock lags behind that of the subscriber, so I implemented a
> new 0007 patch to allow the apply worker to wait for the clock skew to
> pass and then send a new request to the publisher for the latest status.
> The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that
> all the committing transactions have finished, any subsequent transactions
> on the publisher should be assigned a commit timestamp later than
> reply_time. Similarly, we have the time (candidate_xid_time) when the
> oldest active xid is determined, and any old transactions on the publisher
> that have already finished should have a commit timestamp earlier than
> candidate_xid_time.
>
> The apply worker can compare candidate_xid_time with reply_time. If
> candidate_xid_time is less than reply_time, it's OK to advance the xid
> immediately. If candidate_xid_time is greater than reply_time, the
> publisher's clock is behind that of the subscriber, so the apply worker
> can wait for the skew to pass before advancing the xid.
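>
> In pseudocode, the check is roughly the following (a sketch only; the
> function names are illustrative, not from the patch):
>
>     if (candidate_xid_time <= reply_time)
>     {
>         /* The publisher's clock is not behind; advance right away. */
>         advance_nonremovable_xid(candidate_xid);
>     }
>     else
>     {
>         /* The publisher's clock lags; wait out the skew and request a
>          * fresh status from the publisher before advancing. */
>         wait_for_clock_skew(candidate_xid_time - reply_time);
>         request_publisher_status();
>     }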
>
> Since this is considered an improvement, we can focus on it after pushing
> the main patches.
>

Thank you for updating the patches!

I have one comment on the 0001 patch:

+       /*
+        * The changes made by this and later transactions are still
+        * non-removable to allow for the detection of update_deleted
+        * conflicts when applying changes in this logical replication worker.
+        *
+        * Note that this info cannot directly protect dead tuples from being
+        * prematurely frozen or removed. The logical replication launcher
+        * asynchronously collects this info to determine whether to advance
+        * the xmin value of the replication slot.
+        *
+        * Therefore, FullTransactionId that includes both the transaction ID
+        * and its epoch is used here instead of a single Transaction ID. This
+        * is critical because without considering the epoch, the transaction
+        * ID alone may appear as if it is in the future due to transaction ID
+        * wraparound.
+        */
+       FullTransactionId oldest_nonremovable_xid;

The last paragraph of the comment mentions that we need to use
FullTransactionId to properly compare XIDs even after XID wraparound
happens. But once we set the oldest-nonremovable-xid, doesn't it prevent
XIDs from wrapping around? I mean that the workers'
oldest-nonremovable-xid values and the slot's non-removable xid (i.e.,
its xmin) can never be more than 2^31 XIDs apart.
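
For reference, a self-contained sketch (plain integer arithmetic, not
the server code; xid32_precedes() mimics what TransactionIdPrecedes()
does for normal XIDs) of why the modulo comparison stays correct while
two XIDs are less than 2^31 apart, and only breaks beyond that bound:

#include <stdint.h>
#include <stdio.h>

/* 32-bit comparison with wraparound semantics. */
static int
xid32_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

int
main(void)
{
    /* XIDs on either side of a wraparound, but only 512 apart: the
     * 32-bit comparison still agrees with the full 64-bit one. */
    uint64_t before = 0xFFFFFF00;           /* epoch 0, near the 32-bit max */
    uint64_t after = (1ULL << 32) + 0x100;  /* epoch 1, i.e., wrapped */

    printf("%d %d\n",
           xid32_precedes((uint32_t) before, (uint32_t) after),
           before < after);                 /* 1 1: both correct */

    /* Once two XIDs drift 2^31 apart, the 32-bit comparison becomes
     * self-contradictory; only the epoch-aware comparison is safe. */
    uint64_t far = before + (1ULL << 31);

    printf("%d %d\n",
           xid32_precedes((uint32_t) before, (uint32_t) far),
           xid32_precedes((uint32_t) far, (uint32_t) before)); /* 1 1 (!) */
    return 0;
}

In other words, the epoch only matters if the 2^31 bound can ever be
violated.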

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


