Thread: Re: Conflict detection for update_deleted in logical replication

Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose the conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs when
> the apply worker cannot find the target tuple to be updated, as the tuple might
> have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an update,
> an update_missing conflict is logged. However, to facilitate future automatic
> conflict resolution, it has been agreed[1][2] that we need to detect both
> update_missing and update_deleted conflicts. Specifically, we will detect an
> update_deleted conflict if any dead tuple matching the old key value of the
> update operation is found; otherwise, it will be classified as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible solution might be converting the update to
> an insert and applying it, while for an update_deleted conflict, the preferred
> approach could be to skip the update, or to compare the timestamp of the delete
> transaction with that of the remote update transaction and choose the most
> recent one. For additional context, please refer to [3], which gives examples
> of how these differences could lead to data divergence.
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position such
> that the position has been replayed on the logical standby *AND* the changes
> occurring on the logical standby before that WAL position have also been
> replayed to the walsender's node (where the walsender is running). After
> receiving the new feedback message, the walsender will advance the slot.xmin
> based on the flush info, similar to the advancement of catalog_xmin.
> Currently, the effective_xmin/xmin of a logical slot are unused during logical
> replication, so I think it's safe and won't cause side effects to reuse the
> xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also considered making this automatic, e.g.
> letting the apply worker select the slots acquired by the walsenders that
> connect to the same remote server (e.g. if the apply worker's connection info
> or some other flag is the same as the walsender's connection info). But it
> seems tricky because if some slots are inactive, meaning the walsenders are
> not there, the apply worker could not find the correct slots to check unless
> we save the host along with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target row
> cannot be found, we perform an additional scan of the table using SnapshotAny.
> This scan aims to locate the most recently deleted row that matches the old
> column values from the remote update operation and has not yet been removed by
> VACUUM. If any such tuples are found, we report the update_deleted conflict
> along with the origin and transaction information that deleted the tuple.
>
> Please refer to the attached POC patch set, which implements the above design.
> The patch set is split into several parts to make the initial review easier.
> Please note that the patches are interdependent and cannot work independently.
>
> Thanks a lot to Kuroda-San and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated !
>
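The detection rule described above can be sketched roughly as follows. This is a minimal illustration in Python, not the patch's C code; the tuple representation and the field names (`key`, `delete_xid`, `origin`) are invented for the example:

```python
# Illustrative sketch of classifying a failed remote UPDATE lookup.
# Field names and structures are invented; the real patch works on
# heap tuples found via a SnapshotAny table scan.

def classify_update_conflict(live_tuple, dead_tuples, remote_old_key):
    """Return the conflict type for a remote UPDATE whose target lookup failed.

    live_tuple  -- the matching live tuple, or None if none was found
    dead_tuples -- dead tuples still visible to a SnapshotAny-style scan,
                   each carrying its key and delete-origin information
    """
    if live_tuple is not None:
        return "no_conflict"
    # A dead tuple matching the old key value means the row was deleted
    # (possibly by another origin) and VACUUM has not yet removed it.
    matches = [t for t in dead_tuples if t["key"] == remote_old_key]
    if matches:
        # Report the most recently deleted matching row.
        latest = max(matches, key=lambda t: t["delete_xid"])
        return ("update_deleted", latest["origin"], latest["delete_xid"])
    return ("update_missing",)
```

If VACUUM removes the dead tuple before this scan runs, the same situation degrades to update_missing, which is exactly the hazard the retention scheme above is meant to prevent.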

Thank you, Hou-San, for explaining the design. But to make it easier to
understand, would you be able to explain the sequence/timeline of the
*new* actions performed by the walsender and the apply processes for
the given example, along with the new feedback_slots configuration needed?

Node A: (Procs: walsenderA, applyA)
  T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
  T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)"
On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table using
> snapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements above
> > design. The patch set is split into some parts to make it easier for the initial
> review.
> > Please note that each patch is interdependent and cannot work
> independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On Node A, feedback_slots should include the logical slot that is used to
replicate changes from Node A to Node B. On Node B, feedback_slots should
include the logical slot that replicates changes from Node B to Node A.

Assume slot.xmin on Node A has been initialized to a valid number (740) before
the following flow:

Node A executed T1                                                       - 10.00 AM
T1 replicated and applied on Node B                                      - 10.0001 AM
Node B executed T3                                                       - 10.01 AM
Node A executed T2 (741)                                                 - 10.02 AM
T2 replicated and applied on Node B (delete_missing)                     - 10.03 AM
T3 replicated and applied on Node A (new action, detect update_deleted)  - 10.04 AM

(new action) The apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= LSN of the locally
replayed T2), and thus sends the new feedback message to Node A.         - 10.05 AM

(new action) The walsender on Node A receives the message and advances
slot.xmin.                                                               - 10.06 AM

Then, after slot.xmin is advanced to a number greater than 741, VACUUM is able
to remove the dead tuple on Node A.
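The two new actions can be mimicked with a toy model. This is Python with invented names, purely to illustrate the described protocol; the real logic lives in the apply worker and walsender, and the LSN-to-xid mapping is hand-waved here:

```python
# Toy model of the new feedback protocol described above.

def maybe_send_feedback(local_replayed_lsn, feedback_slots):
    """Apply-worker side: compute the oldest confirmed_flush_lsn among the
    configured feedback slots, and report it only once it has caught up to
    the locally replayed position of interest (e.g. the remote DELETE)."""
    oldest = min(s["confirmed_flush_lsn"] for s in feedback_slots)
    if oldest >= local_replayed_lsn:
        return oldest  # this LSN goes into the new feedback message
    return None        # not safe to let the remote node advance xmin yet

def advance_slot_xmin(slot, feedback_lsn, xid_for_lsn):
    """Walsender side: on receiving the feedback message, advance slot.xmin
    so VACUUM may remove dead tuples older than the reported position.
    xid_for_lsn is an assumed mapping from an LSN to a safe xmin value."""
    if feedback_lsn is not None:
        slot["xmin"] = max(slot["xmin"], xid_for_lsn(feedback_lsn))
    return slot["xmin"]
```

In the example, the delete transaction's xid (741) stays protected by slot.xmin until the reported position proves both conditions (a) and (b) hold.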

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > > ---
> > > THE DESIGN
> > > ---
> > >
> > > To achieve the above, we plan to allow the logical walsender to
> > > maintain and advance the slot.xmin to protect the data in the user
> > > table and introduce a new logical standby feedback message. This
> > > message reports the WAL position that has been replayed on the logical
> > > standby *AND* the changes occurring on the logical standby before the
> > > WAL position are also replayed to the walsender's node (where the
> > > walsender is running). After receiving the new feedback message, the
> > > walsender will advance the slot.xmin based on the flush info, similar
> > > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > > of logical slot are unused during logical replication, so I think it's safe and
> > won't cause side-effect to reuse the xmin for this feature.
> > >
> > > We have introduced a new subscription option
> > > (feedback_slots='slot1,...'), where these slots will be used to check
> > > condition (b): the transactions on logical standbys occurring before
> > > the replay of Node A's DELETE are replayed on Node A as well.
> > > Therefore, on Node B, users should specify the slots corresponding to
> > > Node A in this option. The apply worker will get the oldest confirmed
> > > flush LSN among the specified slots and send the LSN as a feedback
> > message to the walsender. -- I also thought of making it an automaic way, e.g.
> > > let apply worker select the slots that acquired by the walsenders
> > > which connect to the same remote server(e.g. if apply worker's
> > > connection info or some other flags is same as the walsender's
> > > connection info). But it seems tricky because if some slots are
> > > inactive which means the walsenders are not there, the apply worker
> > > could not find the correct slots to check unless we save the host along with
> > the slot's persistence data.
> > >
> > > The new feedback message is sent only if feedback_slots is not NULL.
> > > If the slots in feedback_slots are removed, a final message containing
> > > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > > the slot.xmin.
> > >
> > > To detect update_deleted conflicts during update operations, if the
> > > target row cannot be found, we perform an additional scan of the table using
> > snapshotAny.
> > > This scan aims to locate the most recently deleted row that matches
> > > the old column values from the remote update operation and has not yet
> > > been removed by VACUUM. If any such tuples are found, we report the
> > > update_deleted conflict along with the origin and transaction information
> > that deleted the tuple.
> > >
> > > Please refer to the attached POC patch set which implements above
> > > design. The patch set is split into some parts to make it easier for the initial
> > review.
> > > Please note that each patch is interdependent and cannot work
> > independently.
> > >
> > > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> > >
> > > Suggestions and comments are highly appreciated !
> > >
> >
> > Thank You Hou-San for explaining the design. But to make it easier to
> > understand, would you be able to explain the sequence/timeline of the
> > *new* actions performed by the walsender and the apply processes for the
> > given example along with new feedback_slot config needed
> >
> > Node A: (Procs: walsenderA, applyA)
> >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> >
> > Node B: (Procs: walsenderB, applyB)
> >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
>
> Thanks for reviewing! Let me elaborate further on the example:
>
> On node A, feedback_slots should include the logical slot that used to replicate changes
> from Node A to Node B. On node B, feedback_slots should include the logical
> slot that replicate changes from Node B to Node A.
>
> Assume the slot.xmin on Node A has been initialized to a valid number(740) before the
> following flow:
>
> Node A executed T1                                                                      - 10.00 AM
> T1 replicated and applied on Node B                                                     - 10.0001 AM
> Node B executed T3                                                                      - 10.01 AM
> Node A executed T2 (741)                                                                - 10.02 AM
> T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM

Not related to this feature, but do you mean delete_origin_differ here?

> T3 replicated and applied on Node A     (new action, detect update_deleted)             - 10.04 AM
>
> (new action) Apply worker on Node B has confirmed that T2 has been applied
> locally and the transactions before T2 (e.g., T3) has been replicated and
> applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the local
> replayed T2), thus send the new feedback message to Node A.                             - 10.05 AM
>
> (new action) Walsender on Node A received the message and would advance the slot.xmin.- 10.06 AM
>
> Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
> remove the dead tuple on Node A.
>

Thanks for the example. Could you please review the points below and let me
know if my understanding is correct?

1)
In a bidirectional replication setup, the user has to configure slots in such
a way that Node A's subscription's slot is Node B's feedback slot and Node B's
subscription's slot is Node A's feedback slot. Only then will this feature
work well; is that correct to say?

2)
Now coming back to multiple feedback_slots in a subscription, is the
below correct:

Say Node A has publications and subscriptions as follow:
------------------
A_pub1

A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)


Say Node B has publications and subscriptions as follow:
------------------
B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)

B_pub1
B_pub2
B_pub3

Then what will be the feedback_slots configuration for all
subscriptions of A and B? Is the below correct:
------------------
A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
B_sub1: feedback_slots=A_sub1, A_sub2, A_sub3

3)
If the above is true, then do we have a way to make sure that the user
has given this configuration exactly the above way? If users end up
giving feedback_slots as some random slot (say A_slot4) or an incomplete
list, do we validate that? (I have not looked at the code yet; I am just
trying to understand the design first.)

4)
Now coming to this:

> The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender.

There will be one apply worker on B, which will be due to B_sub1, so
will it check the confirmed_lsn of all slots A_sub1, A_sub2, A_sub3? Won't
it be sufficient to check the confirmed_lsn of, say, slot A_sub1 alone, which
has subscribed to table 't' on which the delete has been performed? The rest
of the slots (A_sub2, A_sub3) might have subscribed to different
tables?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)"
On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicate changes from Node B to Node A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid
> > number(740) before the following flow:
> >
> > Node A executed T1                                                                      - 10.00 AM
> > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > Node B executed T3                                                                      - 10.01 AM
> > Node A executed T2 (741)                                                                - 10.02 AM
> > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a mistake. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A     (new action, detect
> update_deleted)             - 10.04 AM
> >
> > (new action) Apply worker on Node B has confirmed that T2 has been
> > applied locally and the transactions before T2 (e.g., T3) has been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> >= lsn of the local
> > replayed T2), thus send the new feedback message to Node A.
> - 10.05 AM
> >
> > (new action) Walsender on Node A received the message and would
> > advance the slot.xmin.- 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> ------------------
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> ------------------
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A and
> B? Is below correct:
> ------------------
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.
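The rule implicit in the confirmed configuration can be written down as a tiny helper. This is purely illustrative (invented names and data layout); in the proposal, feedback_slots remains a manually set subscription option:

```python
# Illustrative derivation of the feedback_slots setting from the topology.
# Each subscription record is {"slot": ..., "sub_node": ..., "pub_node": ...},
# meaning a subscription on sub_node pulls changes from pub_node via slot.

def feedback_slots_for(sub_node, pub_node, subscriptions):
    """Slots that a subscription on sub_node (pulling from pub_node) should
    list in feedback_slots: the slots of subscriptions on pub_node that
    replicate changes back from sub_node."""
    return sorted(s["slot"] for s in subscriptions
                  if s["sub_node"] == pub_node and s["pub_node"] == sub_node)
```

For the example above, this yields feedback_slots=B_sub1 for A_sub1/A_sub2/A_sub3 and feedback_slots=A_sub1, A_sub2, A_sub3 for B_sub1.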

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand design
> first).

The patch doesn't validate whether the feedback slots belong to the correct
subscriptions on the remote server. It only validates that each slot is an
existing, valid, logical slot. I think there are a few challenges to
validating it further. E.g., we would need a way to identify which server a
slot is replicating changes to, which could be tricky as the slot currently
doesn't carry any info identifying the remote server. Besides, the slot could
be temporarily inactive due to some subscriber-side error, in which case we
cannot verify the subscription that uses it.
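The local-only validation described here could be sketched like this (a hypothetical helper with invented names; the patch implements the check in C against the slot data in shared memory):

```python
# Sketch of the local feedback_slots validation described above:
# each named slot must exist and be a valid logical slot. Whether it
# actually replicates to the intended remote server cannot be checked,
# since slots carry no remote-server identity.

def validate_feedback_slots(slot_names, existing_slots):
    for name in slot_names:
        slot = existing_slots.get(name)
        if slot is None:
            raise ValueError("feedback slot %r does not exist" % name)
        if slot["type"] != "logical" or slot["invalidated"]:
            raise ValueError("feedback slot %r is not a valid logical slot"
                             % name)
```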

> 
> 4)
> Now coming to this:
> 
> > The apply worker will get the oldest
> > confirmed flush LSN among the specified slots and send the LSN as a
> > feedback message to the walsender.
> 
>  There will be one apply worker on B which will be due to B_sub1, so will it
> check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> sufficient to check confimed_lsn of say slot A_sub1 alone which has
> subscribed to table 't' on which delete has been performed? Rest of the  lots
> (A_sub2, A_sub3) might have subscribed to different tables?

I think it's theoretically correct to check only A_sub1. We could document
that users can do this by identifying the tables that each subscription
replicates, but it may not be user friendly.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 2:45 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thank You Hou-San for explaining the design. But to make it easier
> > > > to understand, would you be able to explain the sequence/timeline of
> > > > the
> > > > *new* actions performed by the walsender and the apply processes for
> > > > the given example along with new feedback_slot config needed
> > > >
> > > > Node A: (Procs: walsenderA, applyA)
> > > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > > >
> > > > Node B: (Procs: walsenderB, applyB)
> > > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> > >
> > > Thanks for reviewing! Let me elaborate further on the example:
> > >
> > > On node A, feedback_slots should include the logical slot that used to
> > > replicate changes from Node A to Node B. On node B, feedback_slots
> > > should include the logical slot that replicate changes from Node B to Node A.
> > >
> > > Assume the slot.xmin on Node A has been initialized to a valid
> > > number(740) before the following flow:
> > >
> > > Node A executed T1                                                                      - 10.00 AM
> > > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > > Node B executed T3                                                                      - 10.01 AM
> > > Node A executed T2 (741)                                                                - 10.02 AM
> > > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> >
> > Not related to this feature, but do you mean delete_origin_differ here?
>
> Oh sorry, It's a miss. I meant delete_origin_differ.
>
> >
> > > T3 replicated and applied on Node A     (new action, detect
> > update_deleted)             - 10.04 AM
> > >
> > > (new action) Apply worker on Node B has confirmed that T2 has been
> > > applied locally and the transactions before T2 (e.g., T3) has been
> > > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the local
> > > replayed T2), thus send the new feedback message to Node A.
> > - 10.05 AM
> > >
> > > (new action) Walsender on Node A received the message and would
> > > advance the slot.xmin.- 10.06 AM
> > >
> > > Then, after the slot.xmin is advanced to a number greater than 741,
> > > the VACUUM would be able to remove the dead tuple on Node A.
> > >
> >
> > Thanks for the example. Can you please review below and let me know if my
> > understanding is correct.
> >
> > 1)
> > In a bidirectional replication setup, the user has to create slots in a way that
> > NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> > A's feedback slot. And then only this feature will work well, is it correct to say?
>
> Yes, your understanding is correct.
>
> >
> > 2)
> > Now coming back to multiple feedback_slots in a subscription, is the below
> > correct:
> >
> > Say Node A has publications and subscriptions as follow:
> > ------------------
> > A_pub1
> >
> > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> >
> >
> > Say Node B has publications and subscriptions as follow:
> > ------------------
> > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> >
> > B_pub1
> > B_pub2
> > B_pub3
> >
> > Then what will be the feedback_slot configuration for all subscriptions of A and
> > B? Is below correct:
> > ------------------
> > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
>
> Right. The above configurations are correct.

Okay. It seems difficult to understand this configuration from the user's perspective.

> >
> > 3)
> > If the above is true, then do we have a way to make sure that the user  has
> > given this configuration exactly the above way? If users end up giving
> > feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> > validate that? (I have not looked at code yet, just trying to understand design
> > first).
>
> The patch doesn't validate if the feedback slots belong to the correct
> subscriptions on remote server. It only validates if the slot is an existing,
> valid, logical slot. I think there are few challenges to validate it further.
> E.g. We need a way to identify the which server the slot is replicating
> changes to, which could be tricky as the slot currently doesn't have any info
> to identify the remote server. Besides, the slot could be inactive temporarily
> due to some subscriber side error, in which case we cannot verify the
> subscription that used it.

Okay, I understand the challenges here.

> >
> > 4)
> > Now coming to this:
> >
> > > The apply worker will get the oldest
> > > confirmed flush LSN among the specified slots and send the LSN as a
> > > feedback message to the walsender.
> >
> >  There will be one apply worker on B which will be due to B_sub1, so will it
> > check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> > sufficient to check confimed_lsn of say slot A_sub1 alone which has
> > subscribed to table 't' on which delete has been performed? Rest of the  lots
> > (A_sub2, A_sub3) might have subscribed to different tables?
>
> I think it's theoretically correct to only check the A_sub1. We could document
> that user can do this by identifying the tables that each subscription
> replicates, but it may not be user friendly.
>

Sorry, I fail to understand how a user can identify the tables and give
feedback_slots accordingly. I thought feedback_slots was a one-time
configuration when replication is set up (or when the setup changes in
the future); it cannot keep changing with each query. Or am I missing
something?

IMO, it is something which should be identified internally. Since the
query is on table 't1', the feedback slot which is for 't1' should be used
to check the LSN. But on rethinking, this optimization may not be worth the
effort; the identification part could be tricky, so it might be okay
to check all the slots.

~~

Another query is about a 3-node setup. I couldn't figure out what the
feedback_slots setting would be when the setup is not bidirectional. Consider
the case where there are three nodes A, B, C. Node C is subscribing to
both Node A and Node B. Node A and Node B are the ones doing the
concurrent "update" and "delete", which will both be replicated to Node
C. In this case, what will be the feedback_slots setting on Node C? We
don't have any slots here which replicate changes from Node C to Node A
or from Node C to Node B. This is given in [3] in your first
email ([1]).

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > ------------------
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > ------------------
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > ------------------
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are few challenges to validate it
> further.
> > E.g. We need a way to identify the which server the slot is
> > replicating changes to, which could be tricky as the slot currently
> > doesn't have any info to identify the remote server. Besides, the slot
> > could be inactive temporarily due to some subscriber side error, in
> > which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > alone which has subscribed to table 't' on which delete has been
> > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one time
> configuration when replication is setup (or say setup changes in future); it can
> not keep on changing with each query. Or am I missing something?

I meant that the user has all the publication information (including
the tables added in a publication) that the subscription subscribes
to, and could also have the slot_name, so I think it's possible to
identify the tables that each subscription includes and add the
feedback_slots correspondingly before starting the replication. It
would be pretty complicated although possible, so I prefer not to
mention it in the first place if it does not bring much benefit.

> 
> IMO, it is something which should be identified internally. Since the query is on
> table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> rethinking,this optimization may not be worth the effort, the identification part
> could be tricky, so it might be okay to check all the slots.

I agree that identifying these internally would add complexity.

> 
> ~~
> 
> Another query is about 3 node setup. I couldn't figure out what would be
> feedback_slots setting when it is not bidirectional, as in consider the case
> where there are three nodes A,B,C. Node C is subscribing to both Node A and
> Node B. Node A and Node B are the ones doing concurrent "update" and
> "delete" which will both be replicated to Node C. In this case what will be the
> feedback_slots setting on Node C? We don't have any slots here which will be
> replicating changes from Node C to Node A and Node C to Node B. This is given
> in [3] in your first email ([1])

Thanks for pointing this out; the link was a bit misleading. I think
the solution proposed in this thread can only be used to detect
update_deleted reliably in a bidirectional cluster. For
non-bidirectional cases, it would be trickier to predict until when we
should retain the dead tuples.


> 
> [1]:
> https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 5:56 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thanks for the example. Can you please review below and let me know
> > > > if my understanding is correct.
> > > >
> > > > 1)
> > > > In a bidirectional replication setup, the user has to create slots
> > > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > > B's sub's slot is Node A's feedback slot. And then only this feature will
> > work well, is it correct to say?
> > >
> > > Yes, your understanding is correct.
> > >
> > > >
> > > > 2)
> > > > Now coming back to multiple feedback_slots in a subscription, is the
> > > > below
> > > > correct:
> > > >
> > > > Say Node A has publications and subscriptions as follow:
> > > > ------------------
> > > > A_pub1
> > > >
> > > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > > >
> > > >
> > > > Say Node B has publications and subscriptions as follow:
> > > > ------------------
> > > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > > >
> > > > B_pub1
> > > > B_pub2
> > > > B_pub3
> > > >
> > > > Then what will be the feedback_slot configuration for all
> > > > subscriptions of A and B? Is below correct:
> > > > ------------------
> > > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> > >
> > > Right. The above configurations are correct.
> >
> > Okay. It seems difficult to understand configuration from user's perspective.
>
> Right. I think we could give an example in the document to make it clear.
>
> >
> > > >
> > > > 3)
> > > > If the above is true, then do we have a way to make sure that the
> > > > user  has given this configuration exactly the above way? If users
> > > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > > incomplete list), do we validate that? (I have not looked at code
> > > > yet, just trying to understand design first).
> > >
> > > The patch doesn't validate if the feedback slots belong to the correct
> > > subscriptions on remote server. It only validates if the slot is an
> > > existing, valid, logical slot. I think there are few challenges to validate it
> > further.
> > > E.g. We need a way to identify the which server the slot is
> > > replicating changes to, which could be tricky as the slot currently
> > > doesn't have any info to identify the remote server. Besides, the slot
> > > could be inactive temporarily due to some subscriber side error, in
> > > which case we cannot verify the subscription that used it.
> >
> > Okay, I understand the challenges here.
> >
> > > >
> > > > 4)
> > > > Now coming to this:
> > > >
> > > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > > specified slots and send the LSN as a feedback message to the
> > > > > walsender.
> > > >
> > > >  There will be one apply worker on B which will be due to B_sub1, so
> > > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > > alone which has subscribed to table 't' on which delete has been
> > > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> > different tables?
> > >
> > > I think it's theoretically correct to only check the A_sub1. We could
> > > document that user can do this by identifying the tables that each
> > > subscription replicates, but it may not be user friendly.
> > >
> >
> > Sorry, I fail to understand how user can identify the tables and give
> > feedback_slots accordingly? I thought feedback_slots is a one time
> > configuration when replication is setup (or say setup changes in future); it can
> > not keep on changing with each query. Or am I missing something?
>
> I meant that user have all the publication information(including the tables
> added in a publication) that the subscription subscribes to, and could also
> have the slot_name, so I think it's possible to identify the tables that each
> subscription includes and add the feedback_slots correspondingly before
> starting the replication. It would be pretty complicate although possible, so I
> prefer to not mention it in the first place if it could not bring much
> benefits.
>
> >
> > IMO, it is something which should be identified internally. Since the query is on
> > table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> > rethinking,this optimization may not be worth the effort, the identification part
> > could be tricky, so it might be okay to check all the slots.
>
> I agree that identifying these internally would add complexity.
>
> >
> > ~~
> >
> > Another query is about 3 node setup. I couldn't figure out what would be
> > feedback_slots setting when it is not bidirectional, as in consider the case
> > where there are three nodes A,B,C. Node C is subscribing to both Node A and
> > Node B. Node A and Node B are the ones doing concurrent "update" and
> > "delete" which will both be replicated to Node C. In this case what will be the
> > feedback_slots setting on Node C? We don't have any slots here which will be
> > replicating changes from Node C to Node A and Node C to Node B. This is given
> > in [3] in your first email ([1])
>
> Thanks for pointing this, the link was a bit misleading. I think the solution
> proposed in this thread is only used to allow detecting update_deleted reliably
> in a bidirectional cluster.  For non- bidirectional cases, it would be more
> tricky to predict the timing till when should we retain the dead tuples.
>

So in brief, this solution is only for a bidirectional setup? For
non-bidirectional setups, feedback_slots is non-configurable and thus
irrelevant.

Irrespective of the above, if the user ends up setting feedback_slots
to some random but existing slot that is not consuming changes at all,
then it may so happen that the node never sends a feedback message to
the other node, resulting in the accumulation of dead tuples on that
node. Is that a possibility?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this, the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster.  For non-
> > bidirectional cases, it would be more tricky to predict the timing till when
> should we retain the dead tuples.
> >
> 
> So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of above, if user ends up setting feedback_slot to some random but
> existing slot which is not at all consuming changes, then it may so happen that
> the node will never send feedback msg to another node resulting in
> accumulation of dead tuples on another node. Is that a possibility?

Yes, it's possible. I think this is a common situation for this kind
of user-specified option. For example, user DML will be blocked if any
inactive standby names are added to synchronous_standby_names.

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter,
say vacuum_committs_age, which would indicate that we will allow rows
to be removed only if the modified time of the tuple, as indicated by
the committs module, is older than vacuum_committs_age. We could keep
this parameter a table-level option without introducing a GUC, as it
may not apply to all tables. I checked and found that some other
replication solutions like GoldenGate also allow similar parameters
(tombstone_deletes) to be specified at the table level [2]. The other
advantage of allowing it at the table level is that it won't hamper
the performance of hot-pruning or vacuum in general. Note, I am
careful here because to decide whether to remove a dead tuple or not,
we need to compare its committs time during both hot-pruning and
vacuum.
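As a rough model of the proposed check (illustrative only; the real change would live in the C pruning/vacuum code, and all names here are assumptions): a tuple becomes removable only when it is dead by the usual visibility rules and its commit timestamp is older than the table's vacuum_committs_age.

```python
from datetime import datetime, timedelta

def tuple_removable(is_dead, commit_ts, now, vacuum_committs_age):
    """Return True if hot-pruning/vacuum may remove the tuple.

    is_dead models the existing visibility check; the committs-age
    test is the additional condition proposed above.
    """
    if not is_dead:
        return False
    return (now - commit_ts) > vacuum_committs_age

now = datetime(2024, 9, 13, 10, 30)
age = timedelta(minutes=5)
# Deleted 2 minutes ago: retained so update_deleted stays detectable.
keep = tuple_removable(True, datetime(2024, 9, 13, 10, 28), now, age)  # False
# Deleted 20 minutes ago: old enough to be removed.
gone = tuple_removable(True, datetime(2024, 9, 13, 10, 10), now, age)  # True
```

This also shows why the per-tuple cost matters: the timestamp comparison would run for every dead tuple examined during pruning on tables that set the option.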

Note that tombstone_deletes is a general concept used by replication
solutions to detect the update_deleted conflict, and time-based
purging is recommended. See [3][4]. We previously discussed having
tombstone tables to keep the deleted records' information, but it was
suggested to instead prevent the vacuum from removing the required
dead tuples, as that would be simpler than inventing a new kind of
table/store for tombstone_deletes [5]. So we came up with the idea of
feedback slots discussed in this email, but that didn't work out in
all cases and appears difficult to configure, as pointed out by
Shveta. So now we are back to one of the other ideas [1] discussed
previously to solve this problem.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
    schema_name       => 'hr',
    table_name        => 'employees',
    tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] -
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > >
> > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > feedback_slots is non-configurable and thus irrelevant.
> >
> > Right.
> >
>
> One possible idea to address the non-bidirectional case raised by
> Shveta is to use a time-based cut-off to remove dead tuples. As
> mentioned earlier in my email [1], we can define a new GUC parameter
> say vacuum_committs_age which would indicate that we will allow rows
> to be removed only if the modified time of the tuple as indicated by
> committs module is greater than the vacuum_committs_age. We could keep
> this parameter a table-level option without introducing a GUC as this
> may not apply to all tables. I checked and found that some other
> replication solutions like GoldenGate also allowed similar parameters
> (tombstone_deletes) to be specified at table level [2]. The other
> advantage of allowing it at table level is that it won't hamper the
> performance of hot-pruning or vacuum in general. Note, I am careful
> here because to decide whether to remove a dead tuple or not we need
> to compare its committs_time both during hot-pruning and vacuum.

+1 on the idea, but IIUC this value doesn't need to be significant; it
can be limited to just a few minutes, enough to handle replication
delays caused by network lag or other factors, assuming clock skew has
already been addressed.

This new parameter is necessary only for cases where an UPDATE and
DELETE on the same row occur concurrently, but the replication order
to a third node is not preserved, which could result in data
divergence. Consider the following example:

Node A:
   T1: INSERT INTO t (id, value) VALUES (1,1);  (10.01 AM)
   T2: DELETE FROM t WHERE id = 1;             (10.03 AM)

Node B:
   T3: UPDATE t SET value = 2 WHERE id = 1;    (10.02 AM)

Assume a third node (Node C) subscribes to both Node A and Node B. The
"correct" order of messages received by Node C would be T1-T3-T2, but
it could also receive them in the order T1-T2-T3, wherein T3 is
received with a lag of, say, 2 minutes. In such a scenario, T3 should
be able to recognize that the row was deleted by T2 on Node C, thereby
detecting the update_deleted conflict and skipping the apply.

The 'vacuum_committs_age' parameter should account for this lag, which
could lead to the order reversal of UPDATE and DELETE operations.

Any subsequent attempt to update the same row after conflict detection
and resolution should not pose an issue. For example, if Node A
triggers the following at 10:20 AM:
UPDATE t SET value = 3 WHERE id = 1;

Since the row has already been deleted, the UPDATE will not proceed
and therefore will not generate a replication operation on the other
nodes, indicating that vacuum need not preserve the dead row for that
long.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > feedback_slots is non-configurable and thus irrelevant.
> > >
> > > Right.
> > >
> >
> > One possible idea to address the non-bidirectional case raised by
> > Shveta is to use a time-based cut-off to remove dead tuples. As
> > mentioned earlier in my email [1], we can define a new GUC parameter
> > say vacuum_committs_age which would indicate that we will allow rows
> > to be removed only if the modified time of the tuple as indicated by
> > committs module is greater than the vacuum_committs_age. We could keep
> > this parameter a table-level option without introducing a GUC as this
> > may not apply to all tables. I checked and found that some other
> > replication solutions like GoldenGate also allowed similar parameters
> > (tombstone_deletes) to be specified at table level [2]. The other
> > advantage of allowing it at table level is that it won't hamper the
> > performance of hot-pruning or vacuum in general. Note, I am careful
> > here because to decide whether to remove a dead tuple or not we need
> > to compare its committs_time both during hot-pruning and vacuum.
>
> +1 on the idea,

I agree that this idea is much simpler than the idea originally
proposed in this thread.

IIUC vacuum_committs_age specifies a time rather than an XID age. But
how can we implement it? If it ends up affecting the vacuum cutoff, we
should be careful not to end up with the same result of
vacuum_defer_cleanup_age that was discussed before [1]. Also, I think
the implementation should not affect the performance of
ComputeXidHorizons().

> but IIUC this value doesn’t need to be significant; it
> can be limited to just a few minutes. The one which is sufficient to
> handle replication delays caused by network lag or other factors,
> assuming clock skew has already been addressed.

I think that in a non-bidirectional case the value could need to be a
large number. Is that right?

Regards,

[1] https://www.postgresql.org/message-id/20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet, but I think
that during pruning (for example, in heap_prune_satisfies_vacuum()),
apart from checking whether the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check whether the
tuple's committs is older than the configured vacuum_committs_age (for
the table) to decide whether the tuple can be removed. One thing to
consider is what to do in the case of an aggressive vacuum, where we
expect relfrozenxid to be advanced to FreezeLimit (at a minimum). We
may want to just ignore vacuum_committs_age during aggressive vacuum
and LOG if we end up removing some tuple. This will allow users to
retain deleted tuples while respecting the freeze limits, which also
avoids XID wraparound. I think we can't retain tuples forever if the
user misconfigures vacuum_committs_age, and to avoid that we can keep
a maximum limit on this parameter of, say, an hour or so. Also, users
can tune the freeze parameters if they want to retain tuples for
longer.
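A sketch of that policy, under the stated assumptions (the one-hour cap and all names here are hypothetical, and in reality an early removal during aggressive vacuum would also emit a LOG message):

```python
from datetime import timedelta

# Assumed upper bound on the table-level setting, per the suggestion
# above; the actual value would be a design decision.
MAX_VACUUM_COMMITTS_AGE = timedelta(hours=1)

def effective_committs_age(configured, aggressive):
    """Age threshold actually honoured for a given vacuum run."""
    if aggressive:
        # Check ignored so relfrozenxid can advance to FreezeLimit;
        # tuples may be removed regardless of their commit timestamp.
        return timedelta(0)
    return min(configured, MAX_VACUUM_COMMITTS_AGE)

normal = effective_committs_age(timedelta(hours=5), aggressive=False)  # clamped to 1h
urgent = effective_committs_age(timedelta(hours=5), aggressive=True)   # 0
```

The clamp prevents a misconfigured retention window from blocking dead-tuple cleanup indefinitely, while the aggressive-vacuum override keeps XID-wraparound protection authoritative.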

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from two nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so it should be small, but
I could be missing something.

[1] - https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > > feedback_slots is non-configurable and thus irrelevant.
> > > > >
> > > > > Right.
> > > > >
> > > >
> > > > One possible idea to address the non-bidirectional case raised by
> > > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > > say vacuum_committs_age which would indicate that we will allow rows
> > > > to be removed only if the modified time of the tuple as indicated by
> > > > committs module is greater than the vacuum_committs_age. We could keep
> > > > this parameter a table-level option without introducing a GUC as this
> > > > may not apply to all tables. I checked and found that some other
> > > > replication solutions like GoldenGate also allowed similar parameters
> > > > (tombstone_deletes) to be specified at table level [2]. The other
> > > > advantage of allowing it at table level is that it won't hamper the
> > > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > > here because to decide whether to remove a dead tuple or not we need
> > > > to compare its committs_time both during hot-pruning and vacuum.
> > >
> > > +1 on the idea,
> >
> > I agree that this idea is much simpler than the idea originally
> > proposed in this thread.
> >
> > IIUC vacuum_committs_age specifies a time rather than an XID age.
> >
>
> Your understanding is correct that vacuum_committs_age specifies a time.
>
> >
> > But
> > how can we implement it? If it ends up affecting the vacuum cutoff, we
> > should be careful not to end up with the same result of
> > vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> > the implementation needs not to affect the performance of
> > ComputeXidHorizons().
> >
>
> I haven't thought about the implementation details yet but I think
> during pruning (for example in heap_prune_satisfies_vacuum()), apart
> from checking if the tuple satisfies
> HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> committs is greater than configured vacuum_committs_age (for the
> table) to decide whether tuple can be removed.

Sounds very costly. I think we need to do performance tests. Even if
the vacuum gets slower only on the particular table that has the
vacuum_committs_age setting, it would affect overall autovacuum
performance. Also, it would affect HOT pruning performance.

>
> > > but IIUC this value doesn’t need to be significant; it
> > > can be limited to just a few minutes. The one which is sufficient to
> > > handle replication delays caused by network lag or other factors,
> > > assuming clock skew has already been addressed.
> >
> > I think that in a non-bidirectional case the value could need to be a
> > large number. Is that right?
> >
>
> As per my understanding, even for non-bidirectional cases, the value
> should be small. For example, in the case, pointed out by Shveta [1],
> where the updates from 2 nodes are received by a third node, this
> setting is expected to be small. This setting primarily deals with
> concurrent transactions on multiple nodes, so it should be small but I
> could be missing something.
>

I might be missing something but the scenario I was thinking of is
something below.

Suppose that we setup uni-directional logical replication between Node
A and Node B (e.g., Node A -> Node B) and both nodes have the same row
with key = 1:

Node A:
    T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
      -> This change is applied on Node B at 10:01 AM.

Node B:
    T2: DELETE FROM t WHERE key = 1;         (05:00 AM)

If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
Node A would raise an "update_missing" conflict. On the other hand, if
a vacuum runs on Node B at 11:00 AM, the change would raise an
"update_deleted" conflict. It looks whether we detect an
"update_deleted" or an "updated_missing" depends on the timing of
vacuum, and to avoid such a situation, we would need to set
vacuum_committs_age to more than 5 hours.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I haven't thought about the implementation details yet but I think
> > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > from checking if the tuple satisfies
> > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > committs is greater than configured vacuum_committs_age (for the
> > table) to decide whether tuple can be removed.
>
> Sounds very costly. I think we need to do performance tests. Even if
> the vacuum gets slower only on the particular table having the
> vacuum_committs_age setting, it would affect overall autovacuum
> performance. Also, it would affect HOT pruning performance.
>

Agreed that we should do some performance testing and additionally
think of a better way to implement it. I think the cost won't be much
if the tuples to be removed are from a single transaction because the
required commit_ts information would be cached but when the tuples are
from different transactions, we could see a noticeable impact. We need
to test to say anything concrete on this.

> >
> > > > but IIUC this value doesn’t need to be significant; it
> > > > can be limited to just a few minutes. The one which is sufficient to
> > > > handle replication delays caused by network lag or other factors,
> > > > assuming clock skew has already been addressed.
> > >
> > > I think that in a non-bidirectional case the value could need to be a
> > > large number. Is that right?
> > >
> >
> > As per my understanding, even for non-bidirectional cases, the value
> > should be small. For example, in the case, pointed out by Shveta [1],
> > where the updates from 2 nodes are received by a third node, this
> > setting is expected to be small. This setting primarily deals with
> > concurrent transactions on multiple nodes, so it should be small but I
> > could be missing something.
> >
>
> I might be missing something but the scenario I was thinking of is
> something below.
>
> Suppose that we setup uni-directional logical replication between Node
> A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> with key = 1:
>
> Node A:
>     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
>       -> This change is applied on Node B at 10:01 AM.
>
> Node B:
>     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
>
> If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> Node A would raise an "update_missing" conflict. On the other hand, if
> a vacuum runs on Node B at 11:00 AM, the change would raise an
> "update_deleted" conflict. It looks whether we detect an
> "update_deleted" or an "updated_missing" depends on the timing of
> vacuum, and to avoid such a situation, we would need to set
> vacuum_committs_age to more than 5 hours.
>

Yeah, in this case, it would detect a different conflict (if we don't
set vacuum_committs_age to greater than 5 hours) but as per my
understanding, the primary purpose of conflict detection and
resolution is to avoid data inconsistency in a bi-directional setup.
Assume, in the above case it is a bi-directional setup, then we want
to have the same data in both nodes. Now, if there are other cases
like the one you mentioned that require detecting the conflict
reliably, then I agree this value could be large and this is probably
not the best way to achieve it. I think we can mention in the docs that the
primary purpose of this is to achieve data consistency among
bi-directional kind of setups.

Having said that even in the above case, the result should be the same
whether the vacuum has removed the row or not. Say, if the vacuum has
not yet removed the row (due to vacuum_committs_age or otherwise) then
also because the incoming update has a later timestamp, we will
convert the update to insert as per last_update_wins resolution
method, so the conflict will be considered as update_missing. And,
say, the vacuum has removed the row and the conflict detected is
update_missing, then also we will convert the update to insert. In
short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
has higher commit-ts, UPDATE should win.

So, we can expect data consistency in bidirectional cases and expect a
deterministic behavior in other cases (e.g. the final data in a table
does not depend on the order of applying the transactions from other
nodes).
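The resolution rule in the last paragraph can be written down as a tiny decision function. The enum and function names below are illustrative only, not the actual implementation:

```c
#include <stdint.h>

/* Outcome of resolving a remote UPDATE against a local DELETE, whether
 * the conflict surfaced as update_deleted or as update_missing. */
typedef enum
{
    APPLY_SKIP_UPDATE,          /* DELETE wins; drop the incoming UPDATE */
    APPLY_CONVERT_TO_INSERT     /* UPDATE wins; apply it as an INSERT */
} Resolution;

/* last_update_wins: the side with the later commit timestamp wins,
 * independently of whether vacuum has already removed the dead tuple. */
static Resolution
resolve_update_vs_delete(int64_t update_commit_ts, int64_t delete_commit_ts)
{
    return (update_commit_ts > delete_commit_ts)
           ? APPLY_CONVERT_TO_INSERT
           : APPLY_SKIP_UPDATE;
}
```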

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I haven't thought about the implementation details yet but I think
> > > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > > from checking if the tuple satisfies
> > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > > committs is greater than configured vacuum_committs_age (for the
> > > table) to decide whether tuple can be removed.
> >
> > Sounds very costly. I think we need to do performance tests. Even if
> > the vacuum gets slower only on the particular table having the
> > vacuum_committs_age setting, it would affect overall autovacuum
> > performance. Also, it would affect HOT pruning performance.
> >
>
> Agreed that we should do some performance testing and additionally
> think of any better way to implement. I think the cost won't be much
> if the tuples to be removed are from a single transaction because the
> required commit_ts information would be cached but when the tuples are
> from different transactions, we could see a noticeable impact. We need
> to test to say anything concrete on this.

Agreed.

>
> > >
> > > > > but IIUC this value doesn’t need to be significant; it
> > > > > can be limited to just a few minutes. The one which is sufficient to
> > > > > handle replication delays caused by network lag or other factors,
> > > > > assuming clock skew has already been addressed.
> > > >
> > > > I think that in a non-bidirectional case the value could need to be a
> > > > large number. Is that right?
> > > >
> > >
> > > As per my understanding, even for non-bidirectional cases, the value
> > > should be small. For example, in the case, pointed out by Shveta [1],
> > > where the updates from 2 nodes are received by a third node, this
> > > setting is expected to be small. This setting primarily deals with
> > > concurrent transactions on multiple nodes, so it should be small but I
> > > could be missing something.
> > >
> >
> > I might be missing something but the scenario I was thinking of is
> > something below.
> >
> > Suppose that we setup uni-directional logical replication between Node
> > A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> > with key = 1:
> >
> > Node A:
> >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> >       -> This change is applied on Node B at 10:01 AM.
> >
> > Node B:
> >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> >
> > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > Node A would raise an "update_missing" conflict. On the other hand, if
> > a vacuum runs on Node B at 11:00 AM, the change would raise an
> > "update_deleted" conflict. It looks whether we detect an
> > "update_deleted" or an "updated_missing" depends on the timing of
> > vacuum, and to avoid such a situation, we would need to set
> > vacuum_committs_age to more than 5 hours.
> >
>
> Yeah, in this case, it would detect a different conflict (if we don't
> set vacuum_committs_age to greater than 5 hours) but as per my
> understanding, the primary purpose of conflict detection and
> resolution is to avoid data inconsistency in a bi-directional setup.
> Assume, in the above case it is a bi-directional setup, then we want
> to have the same data in both nodes. Now, if there are other cases
> like the one you mentioned that require to detect the conflict
> reliably than I agree this value could be large and probably not the
> best way to achieve it. I think we can mention in the docs that the
> primary purpose of this is to achieve data consistency among
> bi-directional kind of setups.
>
> Having said that even in the above case, the result should be the same
> whether the vacuum has removed the row or not. Say, if the vacuum has
> not yet removed the row (due to vacuum_committs_age or otherwise) then
> also because the incoming update has a later timestamp, we will
> convert the update to insert as per last_update_wins resolution
> method, so the conflict will be considered as update_missing. And,
> say, the vacuum has removed the row and the conflict detected is
> update_missing, then also we will convert the update to insert. In
> short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> has higher commit-ts, UPDATE should win.
>
> So, we can expect data consistency in bidirectional cases and expect a
> deterministic behavior in other cases (e.g. the final data in a table
> does not depend on the order of applying the transactions from other
> nodes).

Agreed.

I think that such a time-based configuration parameter would be a
reasonable solution. The current concerns are that it might affect
vacuum performance and lead to a similar bug we had with
vacuum_defer_cleanup_age.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila <amit.kapila16@gmail.com>
> Cc: shveta malik <shveta.malik@gmail.com>; Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
> 
> Agreed.
> 
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > >       -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency in bidirectional cases and expect a
> > deterministic behavior in other cases (e.g. the final data in a table
> > does not depend on the order of applying the transactions from other
> > nodes).
> 
> Agreed.
> 
> I think that such a time-based configuration parameter would be a reasonable
> solution. The current concerns are that it might affect vacuum performance and
> lead to a similar bug we had with vacuum_defer_cleanup_age.

Thanks for the feedback!

I am working on the POC patch and doing some initial performance tests on this idea.
I will share the results after finishing.

Apart from the vacuum_defer_cleanup_age idea, we’ve given more thought to our
approach for retaining dead tuples and have come up with another idea that can
reliably detect conflicts without requiring users to choose a wise value for
the vacuum_committs_age. This new idea could also reduce the performance
impact. Thanks a lot to Amit for off-list discussion.

The concept of the new idea is that the dead tuples are only useful to detect
conflicts when applying *concurrent* transactions from remotes. Any subsequent
UPDATE from a remote node after removing the dead tuples should have a later
timestamp, meaning it's reasonable to detect an update_missing scenario and
convert the UPDATE to an INSERT when applying it.

To achieve above, we can create an additional replication slot on the
subscriber side, maintained by the apply worker. This slot is used to retain
the dead tuples. The apply worker will advance the slot.xmin after confirming
that all the concurrent transactions on the publisher have been applied locally.

The process of advancing the slot.xmin could be:

1) The apply worker calls GetRunningTransactionData() to get the
'oldestRunningXid' and considers this as 'candidate_xmin'.
2) The apply worker sends a new message to the walsender to request the latest
WAL flush position (GetFlushRecPtr) on the publisher, and saves it to
'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
extend the existing keepalive message (e.g., extend the requestReply bit in the
keepalive message to add a 'request_wal_position' value).
3) The apply worker can continue to apply changes. After applying all the WAL
up to 'candidate_remote_wal_lsn', the apply worker can then advance the
slot.xmin to 'candidate_xmin'.

This approach ensures that dead tuples are not removed until all concurrent
transactions have been applied. It can be effective for both bidirectional and
non-bidirectional replication cases.
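The three steps above can be sketched as a small state machine. Every name here (candidate_xmin, last_applied_remote_lsn, and so on) is a hypothetical stand-in, not the actual patch:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of the slot.xmin advancement procedure. */
typedef struct ApplyWorkerState
{
    uint32_t slot_xmin;                 /* xmin currently held by the slot */
    uint32_t candidate_xmin;            /* step 1: local oldestRunningXid */
    uint64_t candidate_remote_wal_lsn;  /* step 2: publisher flush LSN */
    uint64_t last_applied_remote_lsn;   /* how far we have applied */
} ApplyWorkerState;

/* Steps 1 and 2: record a candidate xmin and the publisher's current
 * WAL flush position. */
static void
start_xmin_advance(ApplyWorkerState *st, uint32_t oldest_running_xid,
                   uint64_t publisher_flush_lsn)
{
    st->candidate_xmin = oldest_running_xid;
    st->candidate_remote_wal_lsn = publisher_flush_lsn;
}

/* Step 3: once everything up to candidate_remote_wal_lsn has been
 * applied locally, it is safe to advance slot.xmin, because no
 * concurrent remote transaction can still need the retained tuples. */
static bool
maybe_advance_xmin(ApplyWorkerState *st)
{
    if (st->last_applied_remote_lsn < st->candidate_remote_wal_lsn)
        return false;
    st->slot_xmin = st->candidate_xmin;
    return true;
}
```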

We could introduce a boolean subscription option (retain_dead_tuples) to
control whether this feature is enabled. Each subscription intending to detect
update_deleted conflicts should set retain_dead_tuples to true.

The following explains how it works in different cases to achieve data
consistency:

--
2 nodes, bidirectional case 1:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.01 AM

subscription retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

--
2 nodes, bidirectional case 2:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

After executing T2, the apply worker on Node A will request the latest WAL
flush location on Node B. And T3 is either running concurrently or has not
started. In both cases, T3 must have a later timestamp. So, even if the
dead tuple is removed in this case and update_missing is detected, the default
resolution is to convert the UPDATE to an INSERT, which is OK because the data
are still consistent on Nodes A and B.

--
3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
--

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

Node C:
    apply T1, T2, T3

After applying T2, the apply worker on Node C will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

Your feedback on this idea would be greatly appreciated.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> On Friday, September 20, 2024 2:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > 
> >
> > I think that such a time-based configuration parameter would be a
> > reasonable solution. The current concerns are that it might affect
> > vacuum performance and lead to a similar bug we had with
> vacuum_defer_cleanup_age.
> 
> Thanks for the feedback!
> 
> I am working on the POC patch and doing some initial performance tests on
> this idea.
> I will share the results after finishing.
> 
> Apart from the vacuum_defer_cleanup_age idea. we’ve given more thought to
> our approach for retaining dead tuples and have come up with another idea that
> can reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
> 
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any
> subsequent UPDATE from a remote node after removing the dead tuples
> should have a later timestamp, meaning it's reasonable to detect an
> update_missing scenario and convert the UPDATE to an INSERT when
> applying it.
> 
> To achieve above, we can create an additional replication slot on the subscriber
> side, maintained by the apply worker. This slot is used to retain the dead tuples.
> The apply worker will advance the slot.xmin after confirming that all the
> concurrent transaction on publisher has been applied locally.
> 
> The process of advancing the slot.xmin could be:
> 
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or extend the existing keepalive message(e,g extends the
> requestReply bit in keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
> 
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
> 
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
> 
> The following explains how it works in different cases to achieve data
> consistency:
...
> --
> 3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
> --

Sorry for a typo here: the times of T2 and T3 were reversed.
Please see the following correction:

> 
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Here T2 should be at ts=10.02 AM

> 
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

T3 should be at ts=10.01 AM

> 
> Node C:
>     apply T1, T2, T3
> 
> After applying T2, the apply worker on Node C will check the latest wal flush
> location on Node B. Till that time, the T3 should have finished, so the xmin will
> be advanced only after applying the WALs that is later than T3. So, the dead
> tuple will not be removed before applying the T3, which means the
> update_delete can be detected.
> 
> Your feedback on this idea would be greatly appreciated.
> 

Best Regards,
Hou zj 


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Apart from the vacuum_defer_cleanup_age idea.
>

I think you meant to say vacuum_committs_age idea.

> we’ve given more thought to our
> approach for retaining dead tuples and have come up with another idea that can
> reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
>
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any subsequent
> UPDATE from a remote node after removing the dead tuples should have a later
> timestamp, meaning it's reasonable to detect an update_missing scenario and
> convert the UPDATE to an INSERT when applying it.
>
> To achieve above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transaction on publisher has been applied locally.
>
> The process of advancing the slot.xmin could be:
>
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> extend the existing keepalive message(e,g extends the requestReply bit in
> keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
>
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
>
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
>

As each apply worker needs a separate slot to retain deleted rows, the
requirement for slots will increase. The other possibility is to
maintain one slot via the launcher or some other central process that
traverses all subscriptions and remembers the ones marked with
retain_dead_tuples (let's call this list retain_sub_list). Then, using
running transactions, get the oldest running xact, and then get the
remote flush location from the other node (the publisher) and store
those as candidate values (candidate_xmin and
candidate_remote_wal_lsn) in the slot. We can probably reuse the existing
candidate variables of the slot. Next, we can check the remote flush
locations from all the origins corresponding to subscriptions in
retain_sub_list, and if all are ahead of candidate_remote_wal_lsn, we
can update the slot's xmin to candidate_xmin.

I think in the above idea we can add an optimization to combine the
requests for the remote WAL LSN from different subscriptions pointing to
the same node to avoid sending multiple requests to the same node. I
am not sure if using pg_subscription.subconninfo is sufficient for
this; if not, we can probably leave this optimization.

If this idea is feasible then it would reduce the number of slots
required to retain the deleted rows but the launcher needs to get the
remote wal location corresponding to each publisher node. There are
two ways to achieve that (a) launcher requests one of the apply
workers corresponding to subscriptions pointing to the same publisher
node to get this information; (b) launcher launches another worker to
get the remote wal flush location.
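The launcher-side condition described above, i.e. advancing the shared slot's xmin only once every subscription in retain_sub_list has applied past the candidate remote LSN, can be sketched as follows (illustrative names only, not actual code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-subscription state for the shared-slot scheme. */
typedef struct RetainSub
{
    uint64_t origin_remote_flush_lsn;   /* remote flush LSN this sub has applied */
} RetainSub;

/* The shared slot's xmin may advance to candidate_xmin only when every
 * subscription in retain_sub_list has applied past the candidate
 * remote WAL LSN recorded for its publisher. */
static bool
all_origins_caught_up(const RetainSub *subs, int nsubs,
                      uint64_t candidate_remote_wal_lsn)
{
    for (int i = 0; i < nsubs; i++)
    {
        if (subs[i].origin_remote_flush_lsn < candidate_remote_wal_lsn)
            return false;
    }
    return true;
}
```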

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information to the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are the following:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
(i.e., the logical replication is delayed by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e. the changes are
applied in the order of T2->T3->(vacuum)->T4), it converts into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
resulting in the table having a row with value = 3 (assuming
latest_committs_win is the default resolver for this conflict).

Please confirm this scenario as I might be missing something.
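The race in steps 1-6 can be sketched as a toy timeline (hypothetical
variable names chosen just to illustrate the ordering; this is not
actual PostgreSQL code):

```python
# Step 1/2: candidates captured by the apply worker on Node-A.
candidate_xmin = 301             # oldest running XID observed on Node-A
candidate_remote_wal_lsn = 7000  # Node-B's flush LSN at request time
slot_xmin = None                 # dead tuples retained while unset

# Step 4/5: once everything up to the candidate LSN is applied,
# the proposal advances slot.xmin.
applied_remote_lsn = 7100
if applied_remote_lsn >= candidate_remote_wal_lsn:
    slot_xmin = candidate_xmin

# T4 ("UPDATE t SET value = 3 WHERE id = 1") committed on Node-B at LSN
# 7200, after the candidate was captured, but it still targets the row
# that T2 (XID 100) deleted on Node-A.
t4_commit_lsn = 7200
t2_xid = 100

# The dead tuple is no longer protected by slot.xmin, so vacuum may
# remove it before T4 arrives, which makes the choice between
# update_deleted and update_missing nondeterministic.
vacuumable = slot_xmin is not None and t2_xid < slot_xmin
print(vacuumable)  # True
```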

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote_flush locations from all the origins?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our
> > > approach for retaining dead tuples and have come up with another idea
> that can
> > > reliably detect conflicts without requiring users to choose a wise value for
> > > the vacuum_committs_age. This new idea could also reduce the
> performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that, the dead tuples are only useful to
> detect
> > > conflicts when applying *concurrent* transactions from remotes. Any
> subsequent
> > > UPDATE from a remote node after removing the dead tuples should have a
> later
> > > timestamp, meaning it's reasonable to detect an update_missing scenario
> and
> > > convert the UPDATE to an INSERT when applying it.
> > >
> > > To achieve above, we can create an additional replication slot on the
> > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > the dead tuples. The apply worker will advance the slot.xmin after
> confirming
> > > that all the concurrent transaction on publisher has been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn invalid because we don't need to retain
the WALs for decoding purposes.

> 
> > > 2) the apply worker send a new message to walsender to request the latest
> wal
> > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or
> > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > keepalive message to add a 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
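A minimal sketch of such an interval-driven request check
(`request_interval` is a hypothetical knob, not an existing GUC):

```python
def should_request_flush_lsn(now, last_request_time, request_interval):
    """Fire the flush-position request on its own fixed cadence, rather
    than piggybacking on keepalives, which only go out after the
    receiver has been idle for wal_receiver_timeout / 2."""
    return now - last_request_time >= request_interval

# With a 10-second cadence: due at t=25 if last sent at t=10, not at t=15.
assert should_request_flush_lsn(25.0, 10.0, 10.0)
assert not should_request_flush_lsn(15.0, 10.0, 10.0)
```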

> 
> > > 3) The apply worker can continue to apply changes. After applying all the
> WALs
> > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> concurrent
> > > transactions have been applied. It can be effective for both bidirectional
> and
> > > non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > control whether this feature is enabled. Each subscription intending to
> detect
> > > update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
> 
> Here, on Node-B, there might be a transaction between LSN 7100 and 8000
> that requires the tuple deleted by T2.
> 
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> On Node-A, whether we detect "update_deleted" or "update_missing"
> still depends on when vacuum removes the tuple deleted by T2.

I think in this case, no matter whether we detect "update_deleted" or
"update_missing", the final data is the same. Because T4's commit
timestamp should be later than T2's on node A, in the case of
"update_deleted" we will compare the commit timestamp of the deleted
tuple's xmax with T4's timestamp, and T4 should win, which means we will
convert the update into an insert and apply it. Even if the dead tuple
has already been removed and "update_missing" is detected, the update
will still be converted into an insert and applied. So, the result is
the same.
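The equivalence described above can be sketched as a small last-update-wins
resolver (hypothetical helper names, not the actual PostgreSQL resolver):

```python
from datetime import datetime

def resolve_update(conflict, incoming_ts, deleted_ts=None):
    """For update_deleted, compare the incoming UPDATE's commit timestamp
    with the delete's (the deleted tuple's xmax); for update_missing the
    delete's timestamp is gone with the tuple, so the UPDATE wins by
    default and is converted to an insert."""
    if conflict == "update_deleted" and incoming_ts <= deleted_ts:
        return "skip"
    return "convert_to_insert"

t2_delete_ts = datetime(2024, 9, 24, 10, 2)  # T2's commit on Node-A
t4_update_ts = datetime(2024, 9, 24, 10, 6)  # T4 on Node-B, later

# Same outcome whether vacuum has already run (update_missing) or not:
assert resolve_update("update_deleted", t4_update_ts, t2_delete_ts) == "convert_to_insert"
assert resolve_update("update_missing", t4_update_ts) == "convert_to_insert"
```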

> 
> If applying T4 raises an "update_missing" (i.e. the changes are
> applied in the order of T2->T3->(vacuum)->T4), it converts into an
> insert, resulting in the table having a row with value = 3.
> 
> If applying T4 raises an "update_deleted" (i.e. the changes are
> applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
> in the table having no row.
> 
> On the other hand, in this scenario, Node-B applies changes in the
> order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
> resulting in the table having a row with val=3 (assuming
> latest_committs_win is the default resolver for this confliction).
> 
> Please confirm this scenario as I might be missing something.

As explained above, I think the data can be consistent in this case as well.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 2:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > As each apply worker needs a separate slot to retain deleted rows, the
> > requirement for slots will increase. The other possibility is to
> > maintain one slot by launcher or some other central process that
> > traverses all subscriptions, remember the ones marked with
> > retain_dead_rows (let's call this list as retain_sub_list). Then using
> > running_transactions get the oldest running_xact, and then get the
> > remote flush location from the other node (publisher node) and store
> > those as candidate values (candidate_xmin and
> > candidate_remote_wal_lsn) in slot. We can probably reuse existing
> > candidate variables of the slot. Next, we can check the remote_flush
> > locations from all the origins corresponding in retain_sub_list and if
> > all are ahead of candidate_remote_wal_lsn, we can update the slot's
> > xmin to candidate_xmin.
>
> Does it mean that we use one candiate_remote_wal_lsn in a slot for all
> subscriptions (in retain_sub_list)? IIUC candiate_remote_wal_lsn is a
> LSN of one of publishers, so other publishers could have completely
> different LSNs. How do we compare the candidate_remote_wal_lsn to
> remote_flush locations from all the origins?
>

This should be an array/list with one element per publisher. We can
copy candidate_xmin to the actual xmin only when the
candidate_remote_wal_lsns corresponding to all publishers have been
applied, aka their remote_flush locations (present in origins) are
ahead. The advantages I see with this are (a) it reduces the number of
slots required to achieve the retention of deleted rows for conflict
detection, and (b) in some cases we can avoid sending messages to the
publisher, because with this we only need to send a message to a
particular publisher once rather than from all the apply workers
corresponding to the same publisher node.
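The per-publisher advancement check could look roughly like this
(hypothetical names; a sketch of the condition, not actual slot code):

```python
def can_advance_xmin(candidate_lsns, origin_flush_lsns):
    """Advance the shared slot's xmin only once every publisher's remote
    flush location (as tracked via its replication origin) has passed
    that publisher's own candidate LSN."""
    return all(origin_flush_lsns[pub] >= lsn
               for pub, lsn in candidate_lsns.items())

# Candidates captured when candidate_xmin was chosen, one per publisher:
candidates = {"pub1": 7000, "pub2": 4200}

# Not all origins caught up yet -> keep retaining dead tuples.
assert not can_advance_xmin(candidates, {"pub1": 7100, "pub2": 4100})

# All origins ahead -> safe to copy candidate_xmin into slot.xmin.
assert can_advance_xmin(candidates, {"pub1": 7100, "pub2": 4300})
```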

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 9:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>

As per my understanding, one of the main reasons to keep it logical is
to allow syncing it to standbys (slotsync functionality). It is
required because, after promotion, the subscriptions replicated to the
standby could be enabled to make it a subscriber. If that is not
possible for any reason, then we can consider making it a physical slot
as well.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>
> >
> > > > 2) the apply worker send a new message to walsender to request the latest
> > wal
> > > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > message or
> > > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > > keepalive message to add a 'request_wal_position' value)
> >
> > The apply worker sends a keepalive message when it didn't receive
> > anything more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information to the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent of
> wal_receiver_timeout.
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all the
> > WALs
> > > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > > slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > concurrent
> > > > transactions have been applied. It can be effective for both bidirectional
> > and
> > > > non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > > control whether this feature is enabled. Each subscription intending to
> > detect
> > > > update-delete conflicts should set retain_dead_tuples to true.
> > > >
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN; 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., the logical replication is delaying for 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transaction between LSN 7100 and
> > 8000 that requires the tuple deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
>
> I think in this case, no matter we detect "update_delete" or "update_missing",
> the final data is the same. Because T4's commit timestamp should be later than
> T2 on node A, so in the case of "update_deleted", it will compare the commit
> timestamp of the deleted tuple's xmax with T4's timestamp, and T4 should win,
> which means we will convert the update into insert and apply. Even if the
> deleted tuple is deleted and "update_missing" is detected, the update will
> still be converted into insert and applied. So, the result is the same.

Is "latest_timestamp_wins" the default resolution method for
"update_deleted"? When I checked the wiki page[1], "skip" was the
default resolution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transaction between LSN 7100 and
> > > 8000 that requires the tuple deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter we detect "update_delete" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than
> > T2 on node A, so in the case of "update_deleted", it will compare the
> > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > T4 should win, which means we will convert the update into insert and
> > apply. Even if the deleted tuple is deleted and "update_missing" is
> > detected, the update will still be converted into insert and applied. So, the
> result is the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some updates.

I think using 'skip' as the default for update_deleted could easily
cause data divergence when the dead tuple was deleted by an old
transaction while the UPDATE has a newer timestamp, like the case you
mentioned. It's necessary to follow the last-update-wins strategy when
the incoming update has a later timestamp, which is to convert the
update to an insert.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > I'm still studying this idea but let me confirm the following scenario.
> > > >
> > > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > > and XIDs and commit LSNs of T2 and T3 are the following:
> > > >
> > > > Node A
> > > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> > commit-LSN:1000
> > > >
> > > > Node B
> > > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > > commit-LSN:5000
> > > >
> > > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > > latest flush WAL position of Node-A and Node-B are following:
> > > >
> > > > Node A
> > > >   current XID: 300
> > > >   latest flush LSN; 3000
> > > >
> > > > Node B
> > > >   current XID: 700
> > > >   latest flush LSN: 7000
> > > >
> > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > > (i.e., the logical replication is delaying for 5 min).
> > > >
> > > > Consider the following scenario:
> > > >
> > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > > gets 301 (set as candidate_xmin).
> > > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> > 8000.
> > > > 4. The apply worker on Node-A continues applying changes, and
> > > > applies the transactions up to remote (commit) LSN 7100.
> > > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > > 301 (candidate_xmin).
> > > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > > deleted by T2.
> > > >
> > > > Here, on Node-B, there might be a transaction between LSN 7100 and
> > > > 8000 that requires the tuple deleted by T2.
> > > >
> > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > > still depends on when vacuum removes the tuple deleted by T2.
> > >
> > > I think in this case, no matter we detect "update_delete" or
> > > "update_missing", the final data is the same. Because T4's commit
> > > timestamp should be later than
> > > T2 on node A, so in the case of "update_deleted", it will compare the
> > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > > T4 should win, which means we will convert the update into insert and
> > > apply. Even if the deleted tuple is deleted and "update_missing" is
> > > detected, the update will still be converted into insert and applied. So, the
> > result is the same.
> >
> > The "latest_timestamp_wins" is the default resolution method for
> > "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> > solution method for that.
>
> Right, I think the wiki needs some update.
>
> I think using 'skip' as default for update_delete could easily cause data
> divergence when the dead tuple is deleted by an old transaction while the
> UPDATE has a newer timestamp like the case you mentioned. It's necessary to
> follow the last update win strategy when the incoming update has later
> timestamp, which is to convert update to insert.

Right. If "latest_timestamp_wins" is the default resolution for
"update_deleted", I think your idea works fine unless I'm missing
corner cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com