RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | OS0PR01MB57162E79B4142BA4DEEE5481946FA@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | RE: Conflict detection for update_deleted in logical replication ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Wed, Jun 4, 2025 at 6:43 PM Zhijie Hou (Fujitsu) wrote: > > On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote: > > > > On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > Attaching the V32 patch set which addressed comments in [1]~[5]. > > > > > > > Review comments: > > =============== > > * > > +advance_conflict_slot_xmin(FullTransactionId new_xmin) { > > +FullTransactionId full_xmin; FullTransactionId next_full_xid; > > + > > + Assert(MyReplicationSlot); > > + Assert(FullTransactionIdIsValid(new_xmin)); > > + > > + next_full_xid = ReadNextFullTransactionId(); > > + > > + /* > > + * Compute FullTransactionId for the current xmin. This handles the > > + case > > + * where transaction ID wraparound has occurred. > > + */ > > + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid, > > + MyReplicationSlot->data.xmin); > > + > > + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) > > + return; > > > > The above code suggests that the launcher could compute a new xmin > > that is less than slot's xmin. At first, this looks odd to me, but > > IIUC, this can happen when the user toggles retain_conflict_info > > flag at some random time when the launcher is trying to compute the > > new xmin value for the slot. One of the possible combinations of > > steps for this > race could be as follows: > > > > 1. The subscriber has two subscriptions, A and B. Subscription A has > > retain_conflict_info as true, and B has retain_conflict_info as > > false 2. Say the launcher calls get_subscription_list(), and worker > > A is already > alive. > > 3. Assuming the apply worker will restart on changing > > retain_conflict_info, the user enables retain_conflict_info for subscription B. > > 4. The launcher processes the subscription B first in the first > > cycle, and starts worker B. Say, worker B gets 759 as candidate_xid. > > 5. The launcher creates the conflict detection slot, xmin = 759 6. > > Say a new txn happens, worker A gets 760 as candidate_xid and > > updates it to oldest_nonremovable_xid. > > 7. The launcher processes the subscription A in the first cycle, and > > the final xmin value is 760, because it only checks the > > oldest_nonremovable_xid from worker A. The launcher then updates the > value to slot.xmin. > > 8. In the next cycle, the launcher finds that worker B has an older > > oldest_nonremovable_xid 759, so the minimal xid would now be 759. > > The launher would have retreated the slot's xmin unless we had the > > above check in the quoted code. > > > > I think the above race is possible because the system lets the > > changed subscription values of a subscription take effect asynchronously by workers. > > The one more similar race condition handled by the patch is as follows: > > > > * > > ... > > + * It's necessary to use FullTransactionId here to mitigate > > + potential race > > + * conditions. Such scenarios might occur if the replication slot > > + is not > > + * yet created by the launcher while the apply worker has already > > + * initialized this field. During this period, a transaction ID > > + wraparound > > + * could falsely make this ID appear as if it originates from the > > + future > > + * w.r.t the transaction ID stored in the slot maintained by launcher. > > + See > > + * advance_conflict_slot_xmin. > > ... > > + FullTransactionId oldest_nonremovable_xid; > > > > This case can happen if the user disables and immediately enables > > the retain_conflict_info option. In this case, the launcher may drop > > the slot after noticing the disable. But the apply worker may not > > notice the disable and it only notices that the retain_conflict_info > > is still enabled, so it will keep maintaining > > oldest_nonremovable_xid when the slot > is not created. > > > > It is okay to handle both the race conditions, but I am worried we > > may miss some such race conditions which could lead to > > difficult-to-find bugs. So, at least for the first version of this > > option (aka for patches 0001 to 0003), we can add a condition that > > allows us to change retain_conflict_info only on disabled > > subscriptions. This will simplify the > patch. > > Agreed. > > > We can make a separate patch to > > allow changing retain_conflict_info option for enabled subscriptions. > > That will make it easier to evaluate such race conditions and the > > solutions > more deeply. > > I will prepare a separate patch soon. > > Here is the V33 patch set which includes the following changes: > > 0001: > * Renaming and typo fixes based on Shveta's comments [1] > * Comment changes suggested by Amit [2] > * Changed oldest_nonremoable_xid from FullTransactionID to TransactionID. > * Code refactoring in drop_conflict_slot_if_exists() > > 0002: > * Documentation updates suggested by Amit [2] > * Code modifications to adapt to TransactionID oldest_nonremoable_xid > > 0003: > * Documentation improvements suggested by Shveta [3] > * Added restriction: disallow changing retain_conflict_info when sub > is enabled or worker is alive One thing I forgot to mention. After adding the restriction, one race condition still exists: Consider an enabled subscription A with retain_conflict_info=false. If the launcher has fetched the subscription info from get_subscription_list() but hasn't yet started a worker for subscription A, and the user executes the following DDLs[1] during this period, it would create a race condition where the worker starts maintaining oldest_nonremovable_xid while the slot hasn't been created. To handle this, I added a check in InitializeLogRepWorker() that restarts the worker if the retain_conflict_info option changes during startup. This would ensure the worker runs with the latest configuration. This is also similar to the existing enable option check in that function. [1] ALTER SUBSCRIPTION A DISABLE; ALTER SUBSCRIPTION A SET (retain_conflict_info=true); ALTER SUBSCRIPTION A ENABLE; > > 0004: > * Simplified race condition handling due to the new restriction from > 0003 > > 0005: > * Code updates to accommodate both the TransactionID type for > oldest_nonremoable_xid and the new restriction from 0003 > > 0006: > * New test case for the restriction introduced in 0003 > > 0007: > No changes Best Regards, Hou zj
pgsql-hackers by date: