Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAA4eK1+jyC5toBww2PSqKPcV38PT2y2THixYMLxpNLZxy_chzA@mail.gmail.com
Whole thread Raw
Responses RE: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports the WAL position that
> has been replayed on the logical standby *AND* the changes occurring on the
> logical standby before the WAL position are also replayed to the walsender's
> node (where the walsender is running). After receiving the new feedback
> message, the walsender will advance the slot.xmin based on the flush info,
> similar to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> of logical slot are unused during logical replication, so I think it's safe and
> won't cause side-effect to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> let apply worker select the slots that acquired by the walsenders which connect
> to the same remote server(e.g. if apply worker's connection info or some other
> flags is same as the walsender's connection info). But it seems tricky because
> if some slots are inactive which means the walsenders are not there, the apply
> worker could not find the correct slots to check unless we save the host along
> with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL.
>

Don't you need to deal with versioning stuff for sending this new
message? I mean what if the receiver side of this message is old and
doesn't support this new message.

One minor comment on 0003
=======================
1.
get_slot_confirmed_flush()
{
...
+ /*
+ * To prevent concurrent slot dropping and creation while filtering the
+ * slots, take the ReplicationSlotControlLock outside of the loop.
+ */
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ foreach_ptr(String, name, MySubscription->feedback_slots)
+ {
+ XLogRecPtr confirmed_flush;
+ ReplicationSlot *slot;
+
+ slot = ValidateAndGetFeedbackSlot(strVal(name));

Why do we need to validate slots each time here? Isn't it better to do it once?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Conflict detection for update_deleted in logical replication
Next
From: Daniel Gustafsson
Date:
Subject: Re: [PATCH] Fix small overread during SASLprep