Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: Conflict detection for update_deleted in logical replication
Msg-id: CAD21AoBA16LD01H46BFSdtJLLXPXR9XEXM8EvaOFvQ2Og0tkOw@mail.gmail.com
In response to: Re: Conflict detection for update_deleted in logical replication (shveta malik <shveta.malik@gmail.com>)
List: pgsql-hackers
On Tue, Nov 12, 2024 at 2:19 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> On Friday, October 18, 2024 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > We thought of a few options to fix this:
> > > >
> > > > 1) Add a Time-Based Subscription Option:
> > > >
> > > > We could add a new time-based option for subscriptions, such as retain_dead_tuples = '5s'. In the logical launcher, after obtaining the candidate XID, the launcher will wait for the specified time before advancing the slot.xmin. This ensures that deleted tuples are retained for at least the duration defined by this new option.
> > > >
> > > > This approach is designed to simulate the functionality of the GUC (vacuum_committs_age), but with a simpler implementation that does not impact vacuum performance. We can maintain both this time-based method and the current automatic method. If a user does not specify the time-based option, we will continue using the existing approach to retain dead tuples until all concurrent transactions from the remote node have been applied.
> > > >
> > > > 2) Modification to the Logical Walsender
> > > >
> > > > On the logical walsender, which is on a physical standby, we can build an additional connection to the physical primary to obtain the latest WAL position. This position will then be sent as feedback to the logical subscriber.
> > > >
> > > > A potential concern is that this requires the walsender to use the walreceiver API, which may seem a bit unnatural. Also, it starts an additional walsender process on the primary, as the logical walsender on the physical standby will need to communicate with this walsender to fetch the WAL position.
> > >
> > > This idea is worth considering, but I think it may not be a good approach if the physical standby is cascading. We need to restrict the update_delete conflict detection if the standby is cascading, right?
> > >
> > > The other approach is that we send current_timestamp from the subscriber and somehow check whether the physical standby has applied commit_lsn up to that commit_ts; if so, it can send that WAL position to the subscriber, otherwise wait for it to be applied. If we do this then we don't need to add a restriction for a cascaded physical standby. I think the subscriber anyway needs to wait for such an LSN to be applied on the standby before advancing the xmin, even if we get it from the primary, because the subscriber will only be able to apply and flush the WAL once it has been applied on the standby. Am I missing something?
> > >
> > > This approach has the disadvantage that we are relying on clocks being synced on both nodes, which we anyway need for conflict resolution as discussed in the thread [1]. We also need to consider the Commit Timestamp and LSN inversion issue as discussed in another thread [2] if we want to pursue this approach, because we may miss an LSN that has a prior timestamp.
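Just to make sure I follow the launcher-side wait in option 1), I imagine it boils down to something like the following sketch. The names retain_dead_tuples_ms, candidate_xid_time, and can_advance_slot_xmin are made up for illustration and are not taken from the posted patches:

#include "utils/timestamp.h"

/*
 * Sketch of the time-based retention check for option 1): after the
 * launcher obtains the candidate XID at candidate_xid_time, it may only
 * advance slot.xmin once the configured retention interval has elapsed.
 * retain_dead_tuples_ms would come from the subscription option; the
 * function and parameter names here are hypothetical.
 */
static bool
can_advance_slot_xmin(TimestampTz candidate_xid_time, int retain_dead_tuples_ms)
{
    /* No time-based option specified: keep the existing behavior. */
    if (retain_dead_tuples_ms <= 0)
        return true;

    /* Advance only once the retention window has fully elapsed. */
    return TimestampDifferenceExceeds(candidate_xid_time,
                                      GetCurrentTimestamp(),
                                      retain_dead_tuples_ms);
}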
> For the "publisher is also a standby" issue, I have modified the V8 patch to report a warning in this case. As I personally feel this is not the main use case for conflict detection, we can revisit it later, after pushing the main patches and receiving some user feedback.
>
> > The problem due to Commit Timestamp and LSN inversion is that the standby may not consider the WAL LSN from an earlier timestamp, which could lead to the removal of required dead rows on the subscriber.
> >
> > The other problem pointed out by Hou-San offlist due to Commit Timestamp and LSN inversion is that we could miss sending the WAL LSN that the subscriber requires to retain dead rows for the update_delete conflict. For example, consider the following 2-node, bidirectional setup:
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1);   ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1;      ts=10.01 AM
> >
> > Say the subscription is created with retain_dead_tuples = true/false.
> >
> > After executing T2, the apply worker on Node A will check the latest WAL flush location on Node B. By that time, T3 should have finished, so the xmin will be advanced only after applying the WAL that is later than T3. So, the dead tuple will not be removed before applying T3, which means the update_delete conflict can be detected.
> >
> > As there is a gap between when we acquire the commit_timestamp and the commit LSN, it is possible that T3 has not yet flushed its WAL even though it committed earlier than T2. If this happens, we won't be able to detect the update_deleted conflict reliably.
> >
> > Now, the one simpler idea is to acquire the commit timestamp and reserve the WAL (LSN) under the same spinlock in ReserveXLogInsertLocation(), but that could be costly as discussed in the thread [1]. The other, more localized solution is to acquire the timestamp after reserving the commit WAL LSN, outside the lock, which would solve this particular problem.
>
> Since the discussion of the WAL/LSN inversion issue is ongoing, I also thought about another approach that can fix the issue independently. The idea is to delay the non-removable xid advancement until all the remote concurrent transactions that may have been assigned an earlier timestamp have been committed.
>
> The implementation is:
>
> On the walsender, after receiving a request, it can send the oldest xid and next xid along with the WAL position.
>
> In response, the apply worker can safely advance the non-removable XID if oldest_committing_xid == nextXid, indicating that there are no race conditions.
>
> Alternatively, if oldest_committing_xid != nextXid, the apply worker might send a second request after some interval. If the subsequently obtained oldest_committing_xid surpasses or equals the initial nextXid, it indicates that all previously risky transactions have committed, and therefore the non-removable transaction ID can be advanced.
>
> Attached is the V8 patch set. Note that I put the new approach for the above race condition in a temporary patch "v8-0001_2-Maintain-xxx.patch.txt", because the approach may or may not be accepted based on the discussion in the WAL/LSN inversion thread.
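Just to confirm my understanding of the xid handshake described above, the advancement rule on the apply worker side would be roughly the following. This is only a sketch; the walsender is assumed to reply with an (oldest_committing_xid, next_xid) pair, and all names here are mine rather than the patch's:

#include "access/transam.h"

/*
 * Sketch of the rule described above.  initial_next_xid is the nextXid
 * remembered from the first reply and only matters for later replies.
 * The function and parameter names are illustrative only.
 */
static bool
can_advance_nonremovable_xid(FullTransactionId oldest_committing_xid,
                             FullTransactionId next_xid,
                             FullTransactionId initial_next_xid)
{
    /* Nothing on the publisher is currently in the middle of committing. */
    if (FullTransactionIdEquals(oldest_committing_xid, next_xid))
        return true;

    /*
     * On a later reply: every transaction that was already committing at
     * the time of the first request has finished once the reported
     * oldest_committing_xid reaches the initially observed nextXid.
     */
    return FullTransactionIdFollowsOrEquals(oldest_committing_xid,
                                            initial_next_xid);
}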
I've started to review this patch series. I've reviewed only the 0001 patch for now, but let me share some comments:

---
+       if (*phase == DTR_WAIT_FOR_WALSENDER_WAL_POS)
+       {
+               Assert(xid_advance_attempt_time);

What is this assertion for? If we want to check here that we have sent a request message to the publisher, I think it's clearer to have "Assert(xid_advance_attempt_time > 0)". I'm not sure we really need this assertion, though, since it's never false once we set xid_advance_attempt_time.

---
+               /*
+                * Do not return here because the apply worker might have already
+                * applied all changes up to remote_lsn. Instead, proceed to the
+                * next phase to check if we can immediately advance the transaction
+                * ID.
+                */
+               *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+       }

If we always proceed to the next phase, is this phase really necessary? IIUC, even if we jump to the DTR_WAIT_FOR_LOCAL_FLUSH phase right after DTR_REQUEST_WALSENDER_WAL_POS and check in DTR_WAIT_FOR_LOCAL_FLUSH whether we have received the remote WAL position, it would work fine.

---
+       /*
+        * Reaching here means the remote WAL position has been received, and
+        * all transactions up to that position on the publisher have been
+        * applied and flushed locally. So, now we can advance the
+        * non-removable transaction ID.
+        */
+       SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+       MyLogicalRepWorker->oldest_nonremovable_xid = candidate_xid;
+       SpinLockRelease(&MyLogicalRepWorker->relmutex);

How about adding a debug log message showing the new oldest_nonremovable_xid and the related LSN, to make debugging and investigation easier? For example:

    elog(LOG, "confirmed remote flush up to %X/%X: new oldest_nonremovable_xid %u",
         LSN_FORMAT_ARGS(*remote_lsn),
         XidFromFullTransactionId(candidate_xid));

---
+       /*
+        * Exit early if the user has disabled sending messages to the
+        * publisher.
+        */
+       if (wal_receiver_status_interval <= 0)
+               return;

In send_feedback(), we send a feedback message if the publisher requests one, even if wal_receiver_status_interval is 0. On the other hand, the above code means that we don't send a WAL position request, and therefore never update oldest_nonremovable_xid, if wal_receiver_status_interval is 0. I'm concerned it could be a pitfall for users.

---
% git show | grep update_delete
This set of patches aims to support the detection of update_deleted conflicts,
transactions with earlier timestamps than the DELETE, detecting update_delete
We assume that the appropriate resolution for update_deleted conflicts,
to that when detecting the update_deleted conflict, and the remote update has a
+ * to detect update_deleted conflicts.
+ * update_deleted is necessary, as the UPDATEs in remote transactions should be
+ * to allow for the detection of update_delete conflicts when applying

There are mixed uses of 'update_delete' and 'update_deleted' in the commit message and the code. I think it's better to use 'update_deleted' consistently.
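Going back to the DTR_WAIT_FOR_WALSENDER_WAL_POS comment above, the kind of simplification I had in mind is roughly the following. This is only a sketch of the control flow; request_publisher_wal_pos() and last_flushpos are placeholders for whatever the patch actually uses, not existing code:

    if (*phase == DTR_REQUEST_WALSENDER_WAL_POS)
    {
        /* Ask the walsender for its latest WAL position ... */
        request_publisher_wal_pos();

        /* ... and go straight to waiting for the local flush. */
        *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
        return;
    }

    if (*phase == DTR_WAIT_FOR_LOCAL_FLUSH)
    {
        /* Still waiting for the walsender's reply. */
        if (XLogRecPtrIsInvalid(*remote_lsn))
            return;

        /* Wait until everything up to remote_lsn is applied and flushed. */
        if (last_flushpos < *remote_lsn)
            return;

        SpinLockAcquire(&MyLogicalRepWorker->relmutex);
        MyLogicalRepWorker->oldest_nonremovable_xid = candidate_xid;
        SpinLockRelease(&MyLogicalRepWorker->relmutex);
    }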
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com