Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From shveta malik
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAJpy0uCgJ7STLky+VbzjiGkHCDrM+whVA6qPX5f=fJF4NRg-YA@mail.gmail.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Jun 2, 2025 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
> >
>
> Review comments:
> ===============
> *
> +advance_conflict_slot_xmin(FullTransactionId new_xmin)
> +{
> + FullTransactionId full_xmin;
> + FullTransactionId next_full_xid;
> +
> + Assert(MyReplicationSlot);
> + Assert(FullTransactionIdIsValid(new_xmin));
> +
> + next_full_xid = ReadNextFullTransactionId();
> +
> + /*
> + * Compute FullTransactionId for the current xmin. This handles the case
> + * where transaction ID wraparound has occurred.
> + */
> + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> + MyReplicationSlot->data.xmin);
> +
> + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
> + return;
>
> The above code suggests that the launcher could compute a new xmin
> that is less than slot's xmin. At first, this looks odd to me, but
> IIUC, this can happen when the user toggles retain_conflict_info flag
> at some random time when the launcher is trying to compute the new
> xmin value for the slot. One of the possible combinations of steps for
> this race could be as follows:
>
> 1. The subscriber has two subscriptions, A and B. Subscription A has
> retain_conflict_info as true, and B has retain_conflict_info as false
> 2. Say the launcher calls get_subscription_list(), and worker A is
> already alive.
> 3. Assuming the apply worker will restart on changing
> retain_conflict_info, the user enables retain_conflict_info for
> subscription B.
> 4. The launcher processes the subscription B first in the first cycle,
> and starts worker B. Say, worker B gets 759 as candidate_xid.
> 5. The launcher creates the conflict detection slot, xmin = 759
> 6. Say a new txn happens, worker A gets 760 as candidate_xid and
> updates it to oldest_nonremovable_xid.
> 7. The launcher processes the subscription A in the first cycle, and
> the final xmin value is 760, because it only checks the
> oldest_nonremovable_xid from worker A. The launcher then updates the
> value to slot.xmin.
> 8. In the next cycle, the launcher finds that worker B has an older
> oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
> launher would have retreated the slot's xmin unless we had the above
> check in the quoted code.
>
> I think the above race is possible because the system lets the changed
> subscription values of a subscription take effect asynchronously by
> workers. The one more similar race condition handled by the patch is
> as follows:
>
> *
> ...
> + * It's necessary to use FullTransactionId here to mitigate potential race
> + * conditions. Such scenarios might occur if the replication slot is not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field. During this period, a transaction ID wraparound
> + * could falsely make this ID appear as if it originates from the future
> + * w.r.t the transaction ID stored in the slot maintained by launcher. See
> + * advance_conflict_slot_xmin.
> ...
> + FullTransactionId oldest_nonremovable_xid;
>
> This case can happen if the user disables and immediately enables the
> retain_conflict_info option. In this case, the launcher may drop the
> slot after noticing the disable. But the apply worker may not notice
> the disable and it only notices that the retain_conflict_info is still
> enabled, so it will keep maintaining oldest_nonremovable_xid when the
> slot is not created.
>

Another case to handle is similar to above with only difference that
no transaction ID wraparound has happened. In such a case, the
launcher may end-up using worker's oldest_nonremovable_xid from the
previous cycle before user disabled-enabled retain_conflict_info. This
may result in the slot moving backward  in absence of suggested check
in advance_conflict_slot_xmin(),

thanks
Shveta



pgsql-hackers by date:

Previous
From: Ajin Cherian
Date:
Subject: Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding
Next
From: Fabrice Chapuis
Date:
Subject: synchronized_standby_slots used in logical replication