Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAA4eK1+jyC5toBww2PSqKPcV38PT2y2THixYMLxpNLZxy_chzA@mail.gmail.com |
Responses | RE: Conflict detection for update_deleted in logical replication |
List | pgsql-hackers |
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
> T1: INSERT INTO t (id, value) VALUES (1,1);
> T2: DELETE FROM t WHERE id = 1;
>
> Node B:
> T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert the update to an insert if possible
> and apply the insert), Node A will eventually hold a row (1,2) while Node B
> becomes empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position that has
> been replayed on the logical standby *AND* for which the changes occurring on
> the logical standby before that WAL position have also been replayed to the
> walsender's node (where the walsender is running). After receiving the new
> feedback message, the walsender will advance the slot.xmin based on the flush
> info, similar to the advancement of catalog_xmin. Currently, the
> effective_xmin/xmin of a logical slot are unused during logical replication, so
> I think it's safe and won't cause side-effects to reuse the xmin for this
> feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender.
> --
> I also thought of making it an automatic way, e.g. let the apply worker select
> the slots that are acquired by the walsenders which connect to the same remote
> server (e.g. if the apply worker's connection info or some other flag is the
> same as the walsender's connection info). But it seems tricky, because if some
> slots are inactive, which means the walsenders are not there, the apply worker
> could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL.
>

Don't you need to deal with versioning stuff for sending this new message? I
mean, what if the receiver side of this message is old and doesn't support this
new message?

One minor comment on 0003
=========================
1.
get_slot_confirmed_flush()
{
...
+	/*
+	 * To prevent concurrent slot dropping and creation while filtering the
+	 * slots, take the ReplicationSlotControlLock outside of the loop.
+	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	foreach_ptr(String, name, MySubscription->feedback_slots)
+	{
+		XLogRecPtr	confirmed_flush;
+		ReplicationSlot *slot;
+
+		slot = ValidateAndGetFeedbackSlot(strVal(name));

Why do we need to validate slots each time here? Isn't it better to do it once?

--
With Regards,
Amit Kapila.
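
To make the shape of the proposed option concrete, here is a minimal sketch of
how feedback_slots might be set on Node B for the bidirectional setup above,
assuming the syntax described in this thread; the subscription, publication,
connection, and slot names are made up for illustration:

-- On Node B: subscribe to Node A and list the slot(s) on Node B that serve
-- Node A's own subscription, so the apply worker can report the oldest
-- confirmed_flush LSN among them back to Node A's walsender.
-- (Hypothetical names; feedback_slots is the option proposed in this thread.)
CREATE SUBSCRIPTION sub_b_from_a
    CONNECTION 'host=node_a dbname=postgres'
    PUBLICATION pub_a
    WITH (feedback_slots = 'sub_a_from_b');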
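
The values the apply worker would aggregate are already visible in the existing
pg_replication_slots view, which may help when deciding which slots to name in
feedback_slots; the view and its columns are existing PostgreSQL, only the
interpretation below is specific to this proposal:

-- On Node B: the proposed feedback message would carry the minimum
-- confirmed_flush_lsn among the slots listed in feedback_slots.
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';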