Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAJpy0uCgJ7STLky+VbzjiGkHCDrM+whVA6qPX5f=fJF4NRg-YA@mail.gmail.com Whole thread Raw |
In response to | Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Mon, Jun 2, 2025 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Attaching the V32 patch set which addressed comments in [1]~[5]. > > > > Review comments: > =============== > * > +advance_conflict_slot_xmin(FullTransactionId new_xmin) > +{ > + FullTransactionId full_xmin; > + FullTransactionId next_full_xid; > + > + Assert(MyReplicationSlot); > + Assert(FullTransactionIdIsValid(new_xmin)); > + > + next_full_xid = ReadNextFullTransactionId(); > + > + /* > + * Compute FullTransactionId for the current xmin. This handles the case > + * where transaction ID wraparound has occurred. > + */ > + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid, > + MyReplicationSlot->data.xmin); > + > + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) > + return; > > The above code suggests that the launcher could compute a new xmin > that is less than slot's xmin. At first, this looks odd to me, but > IIUC, this can happen when the user toggles retain_conflict_info flag > at some random time when the launcher is trying to compute the new > xmin value for the slot. One of the possible combinations of steps for > this race could be as follows: > > 1. The subscriber has two subscriptions, A and B. Subscription A has > retain_conflict_info as true, and B has retain_conflict_info as false > 2. Say the launcher calls get_subscription_list(), and worker A is > already alive. > 3. Assuming the apply worker will restart on changing > retain_conflict_info, the user enables retain_conflict_info for > subscription B. > 4. The launcher processes the subscription B first in the first cycle, > and starts worker B. Say, worker B gets 759 as candidate_xid. > 5. The launcher creates the conflict detection slot, xmin = 759 > 6. Say a new txn happens, worker A gets 760 as candidate_xid and > updates it to oldest_nonremovable_xid. > 7. The launcher processes the subscription A in the first cycle, and > the final xmin value is 760, because it only checks the > oldest_nonremovable_xid from worker A. The launcher then updates the > value to slot.xmin. > 8. In the next cycle, the launcher finds that worker B has an older > oldest_nonremovable_xid 759, so the minimal xid would now be 759. The > launher would have retreated the slot's xmin unless we had the above > check in the quoted code. > > I think the above race is possible because the system lets the changed > subscription values of a subscription take effect asynchronously by > workers. The one more similar race condition handled by the patch is > as follows: > > * > ... > + * It's necessary to use FullTransactionId here to mitigate potential race > + * conditions. Such scenarios might occur if the replication slot is not > + * yet created by the launcher while the apply worker has already > + * initialized this field. During this period, a transaction ID wraparound > + * could falsely make this ID appear as if it originates from the future > + * w.r.t the transaction ID stored in the slot maintained by launcher. See > + * advance_conflict_slot_xmin. > ... > + FullTransactionId oldest_nonremovable_xid; > > This case can happen if the user disables and immediately enables the > retain_conflict_info option. In this case, the launcher may drop the > slot after noticing the disable. But the apply worker may not notice > the disable and it only notices that the retain_conflict_info is still > enabled, so it will keep maintaining oldest_nonremovable_xid when the > slot is not created. > Another case to handle is similar to above with only difference that no transaction ID wraparound has happened. In such a case, the launcher may end-up using worker's oldest_nonremovable_xid from the previous cycle before user disabled-enabled retain_conflict_info. This may result in the slot moving backward in absence of suggested check in advance_conflict_slot_xmin(), thanks Shveta
pgsql-hackers by date: