Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Nisha Moond |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CABdArM6P0zoEVRN+3YHNET_oOaAVOKc-EPUnXiHkcBJ-uDKQVw@mail.gmail.com Whole thread Raw |
In response to | RE: Conflict detection for update_deleted in logical replication ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > > > > I have one comment on the 0001 patch: > > Thanks for the comments! > > > > > + /* > > + * The changes made by this and later transactions are still > > non-removable > > + * to allow for the detection of update_deleted conflicts when > > applying > > + * changes in this logical replication worker. > > + * > > + * Note that this info cannot directly protect dead tuples from being > > + * prematurely frozen or removed. The logical replication launcher > > + * asynchronously collects this info to determine whether to advance > > the > > + * xmin value of the replication slot. > > + * > > + * Therefore, FullTransactionId that includes both the > > transaction ID and > > + * its epoch is used here instead of a single Transaction ID. This is > > + * critical because without considering the epoch, the transaction ID > > + * alone may appear as if it is in the future due to transaction ID > > + * wraparound. > > + */ > > + FullTransactionId oldest_nonremovable_xid; > > > > The last paragraph of the comment mentions that we need to use > > FullTransactionId to properly compare XIDs even after the XID wraparound > > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from > > being wraparound, no? I mean that workers' > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its > > xmin) are never away from more than 2^31 XIDs. > > I think the issue is that the launcher may create the replication slot after > the apply worker has already set the 'oldest_nonremovable_xid' because the > launcher are doing that asynchronously. So, Before the slot is created, there's > a window where transaction IDs might wrap around. If initially the apply worker > has computed a candidate_xid (755) and the xid wraparound before the launcher > creates the slot, causing the new current xid to be (740), then the old > candidate_xid(755) looks like a xid in the future, and the launcher could > advance the xmin to 755 which cause the dead tuples to be removed prematurely. > (We are trying to reproduce this to ensure that it's a real issue and will > share after finishing) > I tried to reproduce the issue described above, where an xid_wraparound occurs before the launcher creates the conflict slot, and the apply worker retains a very old xid (from before the wraparound) as its oldest_nonremovable_xid. In this scenario, the launcher will not update the apply worker's older epoch xid (oldest_nonremovable_xid = 755) as the conflict slot's xmin. This is because advance_conflict_slot_xmin() ensures proper handling by comparing the full 64-bit xids. However, this could lead to real issues if 32-bit TransactionID were used instead of 64-bit FullTransactionID. The detailed test steps and results are as below: Setup: A Publisher-Subscriber setup with logical replication. Steps done to reproduce the test scenario - On Sub - 1) Created a subscription with detect_update_deleted=off, so no conflict slot to start with. 2) Attached gdb to the launcher and put a breakpoint at advance_conflict_slot_xmin(). 3) Run "alter subscription ..... (detect_update_deleted=ON);" 4) Stopped the launcher at the start of the "advance_conflict_slot_xmin()", and blocked the creation of the conflict slot. 5) Attached another gdb session to the apply worker and made sure it has set an oldest_nonremovable_xid . In "maybe_advance_nonremovable_xid()" - (gdb) p MyLogicalRepWorker->oldest_nonremovable_xid $3 = {value = 760} -- so apply worker's oldest_nonremovable_xid = 760 6) Consumed ~4.2 billion xids to let the xid_wraparound happen. After the wraparound, the next_xid was "705", which is less than "760". 7) Released the launcher from gdb, but the apply_worker still stopped in gdb. 8) The slot gets created with xmin=705 : postgres=# select slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn, inactive_since, confirmed_flush_lsn from pg_replication_slots; slot_name | slot_type | active | xmin | catalog_xmin | restart_lsn | inactive_since | confirmed_flush_lsn -----------------------+-----------+--------+------+--------------+-------------+----------------+--------------------- pg_conflict_detection | physical | t | 705 | | | | (1 row) Next, when launcher tries to advance the slot's xmin in advance_conflict_slot_xmin() with new_xmin as the apply worker's oldest_nonremovable_xid(760), it returns without updating the slot's xmin because of below check - ```` if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) return false; ```` we are comparing the full xids (64-bit) in FullTransactionIdPrecedesOrEquals() and in this case the values are: new_xmin=760 full_xmin=4294968001 (w.r.t. xid=705) As "760 <= 4294968001", the launcher will return from here and not update the slot's xmin to "760". Above check will always be true in such scenarios. Note: The launcher would have updated the slot's xmin to 760 if 32-bit XIDs were being compared, i.e., "760 <= 705". -- Thanks, Nisha
pgsql-hackers by date: