Thread: Re: Conflict detection for update_deleted in logical replication

Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose the conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs when
> the apply worker cannot find the target tuple to be updated, as the tuple might
> have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an update,
> an update_missing conflict is logged. However, to facilitate future automatic
> conflict resolution, it has been agreed[1][2] that we need to detect both
> update_missing and update_deleted conflicts. Specifically, we will detect an
> update_deleted conflict if any dead tuple matching the old key value of the
> update operation is found; otherwise, it will be classified as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible solution might be converting the update to
> an insert and applying it, while for an update_deleted conflict, the preferred
> approach could be to skip the update, or to compare the timestamp of the delete
> transaction with that of the remote update transaction and choose the most
> recent one. For additional context, please refer to [3], which gives examples
> of how these differences could lead to data divergence.
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position such
> that the position has been replayed on the logical standby *AND* the changes
> occurring on the logical standby before that WAL position have also been
> replayed to the walsender's node (where the walsender is running). After
> receiving the new feedback message, the walsender will advance the slot.xmin
> based on the flush info, similar to the advancement of catalog_xmin.
> Currently, the effective_xmin/xmin of a logical slot are unused during logical
> replication, so I think it's safe and won't cause side effects to reuse the
> xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also considered making this automatic, e.g.
> letting the apply worker select the slots acquired by the walsenders that
> connect to the same remote server (e.g. if the apply worker's connection info
> or some other flag is the same as the walsender's connection info). But it
> seems tricky because if some slots are inactive, meaning the walsenders are
> not there, the apply worker could not find the correct slots to check unless
> we save the host along with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target row
> cannot be found, we perform an additional scan of the table using SnapshotAny.
> This scan aims to locate the most recently deleted row that matches the old
> column values from the remote update operation and has not yet been removed by
> VACUUM. If any such tuples are found, we report the update_deleted conflict
> along with the origin and transaction information that deleted the tuple.
>
> Please refer to the attached POC patch set, which implements the above design.
> The patch set is split into several parts to make the initial review easier.
> Please note that the patches are interdependent and cannot work independently.
>
> Thanks a lot to Kuroda-San and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated !
>
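The detection rule described above can be sketched roughly as follows. This is a minimal illustration in Python, not the patch's C code; the tuple representation and the field names (`key`, `delete_xid`, `origin`) are invented for the example:

```python
# Illustrative sketch of classifying a failed remote UPDATE lookup.
# Field names and structures are invented; the real patch works on
# heap tuples found via a SnapshotAny table scan.

def classify_update_conflict(live_tuple, dead_tuples, remote_old_key):
    """Return the conflict type for a remote UPDATE whose target lookup failed.

    live_tuple  -- the matching live tuple, or None if none was found
    dead_tuples -- dead tuples still visible to a SnapshotAny-style scan,
                   each carrying its key and delete-origin information
    """
    if live_tuple is not None:
        return "no_conflict"
    # A dead tuple matching the old key value means the row was deleted
    # (possibly by another origin) and VACUUM has not yet removed it.
    matches = [t for t in dead_tuples if t["key"] == remote_old_key]
    if matches:
        # Report the most recently deleted matching row.
        latest = max(matches, key=lambda t: t["delete_xid"])
        return ("update_deleted", latest["origin"], latest["delete_xid"])
    return ("update_missing",)
```

If VACUUM removes the dead tuple before this scan runs, the same situation degrades to update_missing, which is exactly the hazard the retention scheme above is meant to prevent.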

Thank you, Hou-San, for explaining the design. But to make it easier to
understand, would you be able to explain the sequence/timeline of the
*new* actions performed by the walsender and the apply processes for
the given example, along with the new feedback_slots configuration needed?

Node A: (Procs: walsenderA, applyA)
  T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
  T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)"
On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table using
> snapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements above
> > design. The patch set is split into some parts to make it easier for the initial
> review.
> > Please note that each patch is interdependent and cannot work
> independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On Node A, feedback_slots should include the logical slot that is used to
replicate changes from Node A to Node B. On Node B, feedback_slots should
include the logical slot that replicates changes from Node B to Node A.

Assume slot.xmin on Node A has been initialized to a valid number (740) before
the following flow:

Node A executed T1                                                       - 10.00 AM
T1 replicated and applied on Node B                                      - 10.0001 AM
Node B executed T3                                                       - 10.01 AM
Node A executed T2 (741)                                                 - 10.02 AM
T2 replicated and applied on Node B (delete_missing)                     - 10.03 AM
T3 replicated and applied on Node A (new action, detect update_deleted)  - 10.04 AM

(new action) The apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= LSN of the locally
replayed T2), and thus sends the new feedback message to Node A.         - 10.05 AM

(new action) The walsender on Node A receives the message and advances
slot.xmin.                                                               - 10.06 AM

Then, after slot.xmin is advanced to a number greater than 741, VACUUM is able
to remove the dead tuple on Node A.
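The two new actions can be mimicked with a toy model. This is Python with invented names, purely to illustrate the described protocol; the real logic lives in the apply worker and walsender, and the LSN-to-xid mapping is hand-waved here:

```python
# Toy model of the new feedback protocol described above.

def maybe_send_feedback(local_replayed_lsn, feedback_slots):
    """Apply-worker side: compute the oldest confirmed_flush_lsn among the
    configured feedback slots, and report it only once it has caught up to
    the locally replayed position of interest (e.g. the remote DELETE)."""
    oldest = min(s["confirmed_flush_lsn"] for s in feedback_slots)
    if oldest >= local_replayed_lsn:
        return oldest  # this LSN goes into the new feedback message
    return None        # not safe to let the remote node advance xmin yet

def advance_slot_xmin(slot, feedback_lsn, xid_for_lsn):
    """Walsender side: on receiving the feedback message, advance slot.xmin
    so VACUUM may remove dead tuples older than the reported position.
    xid_for_lsn is an assumed mapping from an LSN to a safe xmin value."""
    if feedback_lsn is not None:
        slot["xmin"] = max(slot["xmin"], xid_for_lsn(feedback_lsn))
    return slot["xmin"]
```

In the example, the delete transaction's xid (741) stays protected by slot.xmin until the reported position proves both conditions (a) and (b) hold.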

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > > ---
> > > THE DESIGN
> > > ---
> > >
> > > To achieve the above, we plan to allow the logical walsender to
> > > maintain and advance the slot.xmin to protect the data in the user
> > > table and introduce a new logical standby feedback message. This
> > > message reports the WAL position that has been replayed on the logical
> > > standby *AND* the changes occurring on the logical standby before the
> > > WAL position are also replayed to the walsender's node (where the
> > > walsender is running). After receiving the new feedback message, the
> > > walsender will advance the slot.xmin based on the flush info, similar
> > > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > > of logical slot are unused during logical replication, so I think it's safe and
> > won't cause side-effect to reuse the xmin for this feature.
> > >
> > > We have introduced a new subscription option
> > > (feedback_slots='slot1,...'), where these slots will be used to check
> > > condition (b): the transactions on logical standbys occurring before
> > > the replay of Node A's DELETE are replayed on Node A as well.
> > > Therefore, on Node B, users should specify the slots corresponding to
> > > Node A in this option. The apply worker will get the oldest confirmed
> > > flush LSN among the specified slots and send the LSN as a feedback
> > message to the walsender. -- I also thought of making it an automaic way, e.g.
> > > let apply worker select the slots that acquired by the walsenders
> > > which connect to the same remote server(e.g. if apply worker's
> > > connection info or some other flags is same as the walsender's
> > > connection info). But it seems tricky because if some slots are
> > > inactive which means the walsenders are not there, the apply worker
> > > could not find the correct slots to check unless we save the host along with
> > the slot's persistence data.
> > >
> > > The new feedback message is sent only if feedback_slots is not NULL.
> > > If the slots in feedback_slots are removed, a final message containing
> > > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > > the slot.xmin.
> > >
> > > To detect update_deleted conflicts during update operations, if the
> > > target row cannot be found, we perform an additional scan of the table using
> > snapshotAny.
> > > This scan aims to locate the most recently deleted row that matches
> > > the old column values from the remote update operation and has not yet
> > > been removed by VACUUM. If any such tuples are found, we report the
> > > update_deleted conflict along with the origin and transaction information
> > that deleted the tuple.
> > >
> > > Please refer to the attached POC patch set which implements above
> > > design. The patch set is split into some parts to make it easier for the initial
> > review.
> > > Please note that each patch is interdependent and cannot work
> > independently.
> > >
> > > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> > >
> > > Suggestions and comments are highly appreciated !
> > >
> >
> > Thank You Hou-San for explaining the design. But to make it easier to
> > understand, would you be able to explain the sequence/timeline of the
> > *new* actions performed by the walsender and the apply processes for the
> > given example along with new feedback_slot config needed
> >
> > Node A: (Procs: walsenderA, applyA)
> >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> >
> > Node B: (Procs: walsenderB, applyB)
> >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
>
> Thanks for reviewing! Let me elaborate further on the example:
>
> On node A, feedback_slots should include the logical slot that used to replicate changes
> from Node A to Node B. On node B, feedback_slots should include the logical
> slot that replicate changes from Node B to Node A.
>
> Assume the slot.xmin on Node A has been initialized to a valid number(740) before the
> following flow:
>
> Node A executed T1                                                                      - 10.00 AM
> T1 replicated and applied on Node B                                                     - 10.0001 AM
> Node B executed T3                                                                      - 10.01 AM
> Node A executed T2 (741)                                                                - 10.02 AM
> T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM

Not related to this feature, but do you mean delete_origin_differ here?

> T3 replicated and applied on Node A     (new action, detect update_deleted)             - 10.04 AM
>
> (new action) Apply worker on Node B has confirmed that T2 has been applied
> locally and the transactions before T2 (e.g., T3) has been replicated and
> applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the local
> replayed T2), thus send the new feedback message to Node A.                             - 10.05 AM
>
> (new action) Walsender on Node A received the message and would advance the slot.xmin.- 10.06 AM
>
> Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
> remove the dead tuple on Node A.
>

Thanks for the example. Could you please review the points below and let me
know if my understanding is correct?

1)
In a bidirectional replication setup, the user has to configure slots in such
a way that Node A's subscription's slot is Node B's feedback slot and Node B's
subscription's slot is Node A's feedback slot. Only then will this feature
work well; is that correct to say?

2)
Now coming back to multiple feedback_slots in a subscription, is the
below correct:

Say Node A has publications and subscriptions as follow:
------------------
A_pub1

A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)


Say Node B has publications and subscriptions as follow:
------------------
B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)

B_pub1
B_pub2
B_pub3

Then what will be the feedback_slots configuration for all
subscriptions of A and B? Is the below correct:
------------------
A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
B_sub1: feedback_slots=A_sub1, A_sub2, A_sub3

3)
If the above is true, then do we have a way to make sure that the user
has given this configuration exactly the above way? If users end up
giving feedback_slots as some random slot (say A_slot4) or an incomplete
list, do we validate that? (I have not looked at the code yet; I am just
trying to understand the design first.)

4)
Now coming to this:

> The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender.

There will be one apply worker on B, which will be due to B_sub1, so
will it check the confirmed_lsn of all slots A_sub1, A_sub2, A_sub3? Won't
it be sufficient to check the confirmed_lsn of, say, slot A_sub1 alone, which
has subscribed to table 't' on which the delete has been performed? The rest
of the slots (A_sub2, A_sub3) might have subscribed to different
tables?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)"
On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicate changes from Node B to Node A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid
> > number(740) before the following flow:
> >
> > Node A executed T1                                                                      - 10.00 AM
> > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > Node B executed T3                                                                      - 10.01 AM
> > Node A executed T2 (741)                                                                - 10.02 AM
> > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a mistake. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A     (new action, detect
> update_deleted)             - 10.04 AM
> >
> > (new action) Apply worker on Node B has confirmed that T2 has been
> > applied locally and the transactions before T2 (e.g., T3) has been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> >= lsn of the local
> > replayed T2), thus send the new feedback message to Node A.
> - 10.05 AM
> >
> > (new action) Walsender on Node A received the message and would
> > advance the slot.xmin.- 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> ------------------
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> ------------------
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A and
> B? Is below correct:
> ------------------
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.
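The rule implicit in the confirmed configuration can be written down as a tiny helper. This is purely illustrative (invented names and data layout); in the proposal, feedback_slots remains a manually set subscription option:

```python
# Illustrative derivation of the feedback_slots setting from the topology.
# Each subscription record is {"slot": ..., "sub_node": ..., "pub_node": ...},
# meaning a subscription on sub_node pulls changes from pub_node via slot.

def feedback_slots_for(sub_node, pub_node, subscriptions):
    """Slots that a subscription on sub_node (pulling from pub_node) should
    list in feedback_slots: the slots of subscriptions on pub_node that
    replicate changes back from sub_node."""
    return sorted(s["slot"] for s in subscriptions
                  if s["sub_node"] == pub_node and s["pub_node"] == sub_node)
```

For the example above, this yields feedback_slots=B_sub1 for A_sub1/A_sub2/A_sub3 and feedback_slots=A_sub1, A_sub2, A_sub3 for B_sub1.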

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand design
> first).

The patch doesn't validate whether the feedback slots belong to the correct
subscriptions on the remote server. It only validates that each slot is an
existing, valid, logical slot. I think there are a few challenges to
validating it further. E.g., we would need a way to identify which server a
slot is replicating changes to, which could be tricky as the slot currently
doesn't carry any info identifying the remote server. Besides, the slot could
be temporarily inactive due to some subscriber-side error, in which case we
cannot verify the subscription that uses it.
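The local-only validation described here could be sketched like this (a hypothetical helper with invented names; the patch implements the check in C against the slot data in shared memory):

```python
# Sketch of the local feedback_slots validation described above:
# each named slot must exist and be a valid logical slot. Whether it
# actually replicates to the intended remote server cannot be checked,
# since slots carry no remote-server identity.

def validate_feedback_slots(slot_names, existing_slots):
    for name in slot_names:
        slot = existing_slots.get(name)
        if slot is None:
            raise ValueError("feedback slot %r does not exist" % name)
        if slot["type"] != "logical" or slot["invalidated"]:
            raise ValueError("feedback slot %r is not a valid logical slot"
                             % name)
```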

> 
> 4)
> Now coming to this:
> 
> > The apply worker will get the oldest
> > confirmed flush LSN among the specified slots and send the LSN as a
> > feedback message to the walsender.
> 
>  There will be one apply worker on B which will be due to B_sub1, so will it
> check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> sufficient to check confimed_lsn of say slot A_sub1 alone which has
> subscribed to table 't' on which delete has been performed? Rest of the  lots
> (A_sub2, A_sub3) might have subscribed to different tables?

I think it's theoretically correct to check only A_sub1. We could document
that users can do this by identifying the tables that each subscription
replicates, but it may not be user friendly.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

From: shveta malik
On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 2:45 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thank You Hou-San for explaining the design. But to make it easier
> > > > to understand, would you be able to explain the sequence/timeline of
> > > > the
> > > > *new* actions performed by the walsender and the apply processes for
> > > > the given example along with new feedback_slot config needed
> > > >
> > > > Node A: (Procs: walsenderA, applyA)
> > > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > > >
> > > > Node B: (Procs: walsenderB, applyB)
> > > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> > >
> > > Thanks for reviewing! Let me elaborate further on the example:
> > >
> > > On node A, feedback_slots should include the logical slot that used to
> > > replicate changes from Node A to Node B. On node B, feedback_slots
> > > should include the logical slot that replicate changes from Node B to Node A.
> > >
> > > Assume the slot.xmin on Node A has been initialized to a valid
> > > number(740) before the following flow:
> > >
> > > Node A executed T1                                                                      - 10.00 AM
> > > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > > Node B executed T3                                                                      - 10.01 AM
> > > Node A executed T2 (741)                                                                - 10.02 AM
> > > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> >
> > Not related to this feature, but do you mean delete_origin_differ here?
>
> Oh sorry, It's a miss. I meant delete_origin_differ.
>
> >
> > > T3 replicated and applied on Node A     (new action, detect
> > update_deleted)             - 10.04 AM
> > >
> > > (new action) Apply worker on Node B has confirmed that T2 has been
> > > applied locally and the transactions before T2 (e.g., T3) has been
> > > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the local
> > > replayed T2), thus send the new feedback message to Node A.
> > - 10.05 AM
> > >
> > > (new action) Walsender on Node A received the message and would
> > > advance the slot.xmin.- 10.06 AM
> > >
> > > Then, after the slot.xmin is advanced to a number greater than 741,
> > > the VACUUM would be able to remove the dead tuple on Node A.
> > >
> >
> > Thanks for the example. Can you please review below and let me know if my
> > understanding is correct.
> >
> > 1)
> > In a bidirectional replication setup, the user has to create slots in a way that
> > NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> > A's feedback slot. And then only this feature will work well, is it correct to say?
>
> Yes, your understanding is correct.
>
> >
> > 2)
> > Now coming back to multiple feedback_slots in a subscription, is the below
> > correct:
> >
> > Say Node A has publications and subscriptions as follow:
> > ------------------
> > A_pub1
> >
> > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> >
> >
> > Say Node B has publications and subscriptions as follow:
> > ------------------
> > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> >
> > B_pub1
> > B_pub2
> > B_pub3
> >
> > Then what will be the feedback_slot configuration for all subscriptions of A and
> > B? Is below correct:
> > ------------------
> > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
>
> Right. The above configurations are correct.

Okay. It seems difficult to understand this configuration from the user's perspective.

> >
> > 3)
> > If the above is true, then do we have a way to make sure that the user  has
> > given this configuration exactly the above way? If users end up giving
> > feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> > validate that? (I have not looked at code yet, just trying to understand design
> > first).
>
> The patch doesn't validate if the feedback slots belong to the correct
> subscriptions on remote server. It only validates if the slot is an existing,
> valid, logical slot. I think there are few challenges to validate it further.
> E.g. We need a way to identify the which server the slot is replicating
> changes to, which could be tricky as the slot currently doesn't have any info
> to identify the remote server. Besides, the slot could be inactive temporarily
> due to some subscriber side error, in which case we cannot verify the
> subscription that used it.

Okay, I understand the challenges here.

> >
> > 4)
> > Now coming to this:
> >
> > > The apply worker will get the oldest
> > > confirmed flush LSN among the specified slots and send the LSN as a
> > > feedback message to the walsender.
> >
> >  There will be one apply worker on B which will be due to B_sub1, so will it
> > check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> > sufficient to check confimed_lsn of say slot A_sub1 alone which has
> > subscribed to table 't' on which delete has been performed? Rest of the  lots
> > (A_sub2, A_sub3) might have subscribed to different tables?
>
> I think it's theoretically correct to only check the A_sub1. We could document
> that user can do this by identifying the tables that each subscription
> replicates, but it may not be user friendly.
>

Sorry, I fail to understand how a user can identify the tables and give
feedback_slots accordingly. I thought feedback_slots was a one-time
configuration when replication is set up (or when the setup changes in
the future); it cannot keep changing with each query. Or am I missing
something?

IMO, it is something which should be identified internally. Since the
query is on table 't1', the feedback slot which is for 't1' should be used
to check the LSN. But on rethinking, this optimization may not be worth the
effort; the identification part could be tricky, so it might be okay
to check all the slots.

~~

Another query is about a 3-node setup. I couldn't figure out what the
feedback_slots setting would be when the setup is not bidirectional. Consider
the case where there are three nodes A, B, C. Node C is subscribing to
both Node A and Node B. Node A and Node B are the ones doing the
concurrent "update" and "delete", which will both be replicated to Node
C. In this case, what will be the feedback_slots setting on Node C? We
don't have any slots here which replicate changes from Node C to Node A
or from Node C to Node B. This is given in [3] in your first
email ([1]).

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > ------------------
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > ------------------
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > ------------------
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are few challenges to validate it
> further.
> > E.g. We need a way to identify the which server the slot is
> > replicating changes to, which could be tricky as the slot currently
> > doesn't have any info to identify the remote server. Besides, the slot
> > could be inactive temporarily due to some subscriber side error, in
> > which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > alone which has subscribed to table 't' on which delete has been
> > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one time
> configuration when replication is setup (or say setup changes in future); it can
> not keep on changing with each query. Or am I missing something?

I meant that the user has all the publication information (including
the tables added in a publication) that the subscription subscribes
to, and could also have the slot_name, so I think it's possible to
identify the tables that each subscription includes and add the
feedback_slots correspondingly before starting the replication. It
would be pretty complicated although possible, so I prefer not to
mention it in the first place if it does not bring much benefit.

> 
> IMO, it is something which should be identified internally. Since the query is on
> table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> rethinking,this optimization may not be worth the effort, the identification part
> could be tricky, so it might be okay to check all the slots.

I agree that identifying these internally would add complexity.

> 
> ~~
> 
> Another query is about 3 node setup. I couldn't figure out what would be
> feedback_slots setting when it is not bidirectional, as in consider the case
> where there are three nodes A,B,C. Node C is subscribing to both Node A and
> Node B. Node A and Node B are the ones doing concurrent "update" and
> "delete" which will both be replicated to Node C. In this case what will be the
> feedback_slots setting on Node C? We don't have any slots here which will be
> replicating changes from Node C to Node A and Node C to Node B. This is given
> in [3] in your first email ([1])

Thanks for pointing this out; the link was a bit misleading. I think
the solution proposed in this thread can only be used to detect
update_deleted reliably in a bidirectional cluster. For
non-bidirectional cases, it would be trickier to predict until when we
should retain the dead tuples.


> 
> [1]:
> https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 5:56 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thanks for the example. Can you please review below and let me know
> > > > if my understanding is correct.
> > > >
> > > > 1)
> > > > In a bidirectional replication setup, the user has to create slots
> > > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > > B's sub's slot is Node A's feedback slot. And then only this feature will
> > work well, is it correct to say?
> > >
> > > Yes, your understanding is correct.
> > >
> > > >
> > > > 2)
> > > > Now coming back to multiple feedback_slots in a subscription, is the
> > > > below
> > > > correct:
> > > >
> > > > Say Node A has publications and subscriptions as follow:
> > > > ------------------
> > > > A_pub1
> > > >
> > > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > > >
> > > >
> > > > Say Node B has publications and subscriptions as follow:
> > > > ------------------
> > > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > > >
> > > > B_pub1
> > > > B_pub2
> > > > B_pub3
> > > >
> > > > Then what will be the feedback_slot configuration for all
> > > > subscriptions of A and B? Is below correct:
> > > > ------------------
> > > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> > >
> > > Right. The above configurations are correct.
> >
> > Okay. It seems difficult to understand configuration from user's perspective.
>
> Right. I think we could give an example in the document to make it clear.
>
> >
> > > >
> > > > 3)
> > > > If the above is true, then do we have a way to make sure that the
> > > > user  has given this configuration exactly the above way? If users
> > > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > > incomplete list), do we validate that? (I have not looked at code
> > > > yet, just trying to understand design first).
> > >
> > > The patch doesn't validate if the feedback slots belong to the correct
> > > subscriptions on remote server. It only validates if the slot is an
> > > existing, valid, logical slot. I think there are few challenges to validate it
> > further.
> > > E.g. We need a way to identify the which server the slot is
> > > replicating changes to, which could be tricky as the slot currently
> > > doesn't have any info to identify the remote server. Besides, the slot
> > > could be inactive temporarily due to some subscriber side error, in
> > > which case we cannot verify the subscription that used it.
> >
> > Okay, I understand the challenges here.
> >
> > > >
> > > > 4)
> > > > Now coming to this:
> > > >
> > > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > > specified slots and send the LSN as a feedback message to the
> > > > > walsender.
> > > >
> > > >  There will be one apply worker on B which will be due to B_sub1, so
> > > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > > alone which has subscribed to table 't' on which delete has been
> > > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> > different tables?
> > >
> > > I think it's theoretically correct to only check the A_sub1. We could
> > > document that user can do this by identifying the tables that each
> > > subscription replicates, but it may not be user friendly.
> > >
> >
> > Sorry, I fail to understand how user can identify the tables and give
> > feedback_slots accordingly? I thought feedback_slots is a one time
> > configuration when replication is setup (or say setup changes in future); it can
> > not keep on changing with each query. Or am I missing something?
>
> I meant that user have all the publication information(including the tables
> added in a publication) that the subscription subscribes to, and could also
> have the slot_name, so I think it's possible to identify the tables that each
> subscription includes and add the feedback_slots correspondingly before
> starting the replication. It would be pretty complicate although possible, so I
> prefer to not mention it in the first place if it could not bring much
> benefits.
>
> >
> > IMO, it is something which should be identified internally. Since the query is on
> > table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> > rethinking,this optimization may not be worth the effort, the identification part
> > could be tricky, so it might be okay to check all the slots.
>
> I agree that identifying these internally would add complexity.
>
> >
> > ~~
> >
> > Another query is about 3 node setup. I couldn't figure out what would be
> > feedback_slots setting when it is not bidirectional, as in consider the case
> > where there are three nodes A,B,C. Node C is subscribing to both Node A and
> > Node B. Node A and Node B are the ones doing concurrent "update" and
> > "delete" which will both be replicated to Node C. In this case what will be the
> > feedback_slots setting on Node C? We don't have any slots here which will be
> > replicating changes from Node C to Node A and Node C to Node B. This is given
> > in [3] in your first email ([1])
>
> Thanks for pointing this, the link was a bit misleading. I think the solution
> proposed in this thread is only used to allow detecting update_deleted reliably
> in a bidirectional cluster.  For non- bidirectional cases, it would be more
> tricky to predict the timing till when should we retain the dead tuples.
>

So in brief, this solution is only for a bidirectional setup? For
non-bidirectional setups, feedback_slots is non-configurable and thus
irrelevant.

Irrespective of the above, if the user ends up setting feedback_slots
to some random but existing slot that is not consuming changes at all,
then it may so happen that the node never sends a feedback message to
the other node, resulting in the accumulation of dead tuples on that
node. Is that a possibility?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this, the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster.  For non-
> > bidirectional cases, it would be more tricky to predict the timing till when
> should we retain the dead tuples.
> >
> 
> So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of above, if user ends up setting feedback_slot to some random but
> existing slot which is not at all consuming changes, then it may so happen that
> the node will never send feedback msg to another node resulting in
> accumulation of dead tuples on another node. Is that a possibility?

Yes, it's possible. I think this is a common situation for this kind
of user-specified option. For example, user DML will be blocked if any
inactive standby names are added to synchronous_standby_names.

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter,
say vacuum_committs_age, which would indicate that we will allow rows
to be removed only if the modified time of the tuple, as indicated by
the committs module, is older than vacuum_committs_age. We could keep
this parameter a table-level option without introducing a GUC, as it
may not apply to all tables. I checked and found that some other
replication solutions like GoldenGate also allow similar parameters
(tombstone_deletes) to be specified at the table level [2]. The other
advantage of allowing it at the table level is that it won't hamper
the performance of hot-pruning or vacuum in general. Note, I am
careful here because to decide whether to remove a dead tuple or not,
we need to compare its committs time during both hot-pruning and
vacuum.
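As a rough model of the proposed check (illustrative only; the real change would live in the C pruning/vacuum code, and all names here are assumptions): a tuple becomes removable only when it is dead by the usual visibility rules and its commit timestamp is older than the table's vacuum_committs_age.

```python
from datetime import datetime, timedelta

def tuple_removable(is_dead, commit_ts, now, vacuum_committs_age):
    """Return True if hot-pruning/vacuum may remove the tuple.

    is_dead models the existing visibility check; the committs-age
    test is the additional condition proposed above.
    """
    if not is_dead:
        return False
    return (now - commit_ts) > vacuum_committs_age

now = datetime(2024, 9, 13, 10, 30)
age = timedelta(minutes=5)
# Deleted 2 minutes ago: retained so update_deleted stays detectable.
keep = tuple_removable(True, datetime(2024, 9, 13, 10, 28), now, age)  # False
# Deleted 20 minutes ago: old enough to be removed.
gone = tuple_removable(True, datetime(2024, 9, 13, 10, 10), now, age)  # True
```

This also shows why the per-tuple cost matters: the timestamp comparison would run for every dead tuple examined during pruning on tables that set the option.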

Note that tombstone_deletes is a general concept used by replication
solutions to detect the update_deleted conflict, and time-based
purging is recommended. See [3][4]. We previously discussed having
tombstone tables to keep the deleted records' information, but it was
suggested to instead prevent the vacuum from removing the required
dead tuples, as that would be simpler than inventing a new kind of
table/store for tombstone_deletes [5]. So we came up with the idea of
feedback slots discussed in this email, but that didn't work out in
all cases and appears difficult to configure, as pointed out by
Shveta. So now we are back to one of the other ideas [1] discussed
previously to solve this problem.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
    schema_name       => 'hr',
    table_name        => 'employees',
    tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] -
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > >
> > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > feedback_slots is non-configurable and thus irrelevant.
> >
> > Right.
> >
>
> One possible idea to address the non-bidirectional case raised by
> Shveta is to use a time-based cut-off to remove dead tuples. As
> mentioned earlier in my email [1], we can define a new GUC parameter
> say vacuum_committs_age which would indicate that we will allow rows
> to be removed only if the modified time of the tuple as indicated by
> committs module is greater than the vacuum_committs_age. We could keep
> this parameter a table-level option without introducing a GUC as this
> may not apply to all tables. I checked and found that some other
> replication solutions like GoldenGate also allowed similar parameters
> (tombstone_deletes) to be specified at table level [2]. The other
> advantage of allowing it at table level is that it won't hamper the
> performance of hot-pruning or vacuum in general. Note, I am careful
> here because to decide whether to remove a dead tuple or not we need
> to compare its committs_time both during hot-pruning and vacuum.

+1 on the idea, but IIUC this value doesn't need to be significant; it
can be limited to just a few minutes, enough to handle replication
delays caused by network lag or other factors, assuming clock skew has
already been addressed.

This new parameter is necessary only for cases where an UPDATE and
DELETE on the same row occur concurrently, but the replication order
to a third node is not preserved, which could result in data
divergence. Consider the following example:

Node A:
   T1: INSERT INTO t (id, value) VALUES (1,1);  (10.01 AM)
   T2: DELETE FROM t WHERE id = 1;             (10.03 AM)

Node B:
   T3: UPDATE t SET value = 2 WHERE id = 1;    (10.02 AM)

Assume a third node (Node C) subscribes to both Node A and Node B. The
"correct" order of messages received by Node C would be T1-T3-T2, but
it could also receive them in the order T1-T2-T3, wherein T3 is
received with a lag of, say, 2 minutes. In such a scenario, T3 should
be able to recognize that the row was deleted by T2 on Node C, thereby
detecting the update_deleted conflict and skipping the apply.

The 'vacuum_committs_age' parameter should account for this lag, which
could lead to the order reversal of UPDATE and DELETE operations.

Any subsequent attempt to update the same row after conflict detection
and resolution should not pose an issue. For example, if Node A
triggers the following at 10:20 AM:
UPDATE t SET value = 3 WHERE id = 1;

Since the row has already been deleted, the UPDATE will not proceed
and therefore will not generate a replication operation on the other
nodes, indicating that vacuum need not preserve the dead row for that
long.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > feedback_slots is non-configurable and thus irrelevant.
> > >
> > > Right.
> > >
> >
> > One possible idea to address the non-bidirectional case raised by
> > Shveta is to use a time-based cut-off to remove dead tuples. As
> > mentioned earlier in my email [1], we can define a new GUC parameter
> > say vacuum_committs_age which would indicate that we will allow rows
> > to be removed only if the modified time of the tuple as indicated by
> > committs module is greater than the vacuum_committs_age. We could keep
> > this parameter a table-level option without introducing a GUC as this
> > may not apply to all tables. I checked and found that some other
> > replication solutions like GoldenGate also allowed similar parameters
> > (tombstone_deletes) to be specified at table level [2]. The other
> > advantage of allowing it at table level is that it won't hamper the
> > performance of hot-pruning or vacuum in general. Note, I am careful
> > here because to decide whether to remove a dead tuple or not we need
> > to compare its committs_time both during hot-pruning and vacuum.
>
> +1 on the idea,

I agree that this idea is much simpler than the idea originally
proposed in this thread.

IIUC vacuum_committs_age specifies a time rather than an XID age. But
how can we implement it? If it ends up affecting the vacuum cutoff, we
should be careful not to end up with the same result of
vacuum_defer_cleanup_age that was discussed before [1]. Also, I think
the implementation should not affect the performance of
ComputeXidHorizons().

> but IIUC this value doesn’t need to be significant; it
> can be limited to just a few minutes. The one which is sufficient to
> handle replication delays caused by network lag or other factors,
> assuming clock skew has already been addressed.

I think that in a non-bidirectional case the value could need to be a
large number. Is that right?

Regards,

[1] https://www.postgresql.org/message-id/20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet, but I think
that during pruning (for example, in heap_prune_satisfies_vacuum()),
apart from checking whether the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check whether the
tuple's committs is older than the configured vacuum_committs_age (for
the table) to decide whether the tuple can be removed. One thing to
consider is what to do in the case of an aggressive vacuum, where we
expect relfrozenxid to be advanced to FreezeLimit (at a minimum). We
may want to just ignore vacuum_committs_age during aggressive vacuum
and LOG if we end up removing some tuple. This will allow users to
retain deleted tuples while respecting the freeze limits, which also
avoids XID wraparound. I think we can't retain tuples forever if the
user misconfigures vacuum_committs_age, and to avoid that we can keep
a maximum limit on this parameter of, say, an hour or so. Also, users
can tune the freeze parameters if they want to retain tuples for
longer.
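A sketch of that policy, under the stated assumptions (the one-hour cap and all names here are hypothetical, and in reality an early removal during aggressive vacuum would also emit a LOG message):

```python
from datetime import timedelta

# Assumed upper bound on the table-level setting, per the suggestion
# above; the actual value would be a design decision.
MAX_VACUUM_COMMITTS_AGE = timedelta(hours=1)

def effective_committs_age(configured, aggressive):
    """Age threshold actually honoured for a given vacuum run."""
    if aggressive:
        # Check ignored so relfrozenxid can advance to FreezeLimit;
        # tuples may be removed regardless of their commit timestamp.
        return timedelta(0)
    return min(configured, MAX_VACUUM_COMMITTS_AGE)

normal = effective_committs_age(timedelta(hours=5), aggressive=False)  # clamped to 1h
urgent = effective_committs_age(timedelta(hours=5), aggressive=True)   # 0
```

The clamp prevents a misconfigured retention window from blocking dead-tuple cleanup indefinitely, while the aggressive-vacuum override keeps XID-wraparound protection authoritative.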

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from two nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so it should be small, but
I could be missing something.

[1] - https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > > feedback_slots is non-configurable and thus irrelevant.
> > > > >
> > > > > Right.
> > > > >
> > > >
> > > > One possible idea to address the non-bidirectional case raised by
> > > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > > say vacuum_committs_age which would indicate that we will allow rows
> > > > to be removed only if the modified time of the tuple as indicated by
> > > > committs module is greater than the vacuum_committs_age. We could keep
> > > > this parameter a table-level option without introducing a GUC as this
> > > > may not apply to all tables. I checked and found that some other
> > > > replication solutions like GoldenGate also allowed similar parameters
> > > > (tombstone_deletes) to be specified at table level [2]. The other
> > > > advantage of allowing it at table level is that it won't hamper the
> > > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > > here because to decide whether to remove a dead tuple or not we need
> > > > to compare its committs_time both during hot-pruning and vacuum.
> > >
> > > +1 on the idea,
> >
> > I agree that this idea is much simpler than the idea originally
> > proposed in this thread.
> >
> > IIUC vacuum_committs_age specifies a time rather than an XID age.
> >
>
> Your understanding is correct that vacuum_committs_age specifies a time.
>
> >
> > But
> > how can we implement it? If it ends up affecting the vacuum cutoff, we
> > should be careful not to end up with the same result of
> > vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> > the implementation needs not to affect the performance of
> > ComputeXidHorizons().
> >
>
> I haven't thought about the implementation details yet but I think
> during pruning (for example in heap_prune_satisfies_vacuum()), apart
> from checking if the tuple satisfies
> HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> committs is greater than configured vacuum_committs_age (for the
> table) to decide whether tuple can be removed.

Sounds very costly. I think we need to do performance tests. Even if
the vacuum gets slower only on the particular table that has the
vacuum_committs_age setting, it would affect overall autovacuum
performance. Also, it would affect HOT pruning performance.

>
> > > but IIUC this value doesn’t need to be significant; it
> > > can be limited to just a few minutes. The one which is sufficient to
> > > handle replication delays caused by network lag or other factors,
> > > assuming clock skew has already been addressed.
> >
> > I think that in a non-bidirectional case the value could need to be a
> > large number. Is that right?
> >
>
> As per my understanding, even for non-bidirectional cases, the value
> should be small. For example, in the case, pointed out by Shveta [1],
> where the updates from 2 nodes are received by a third node, this
> setting is expected to be small. This setting primarily deals with
> concurrent transactions on multiple nodes, so it should be small but I
> could be missing something.
>

I might be missing something but the scenario I was thinking of is
something below.

Suppose that we setup uni-directional logical replication between Node
A and Node B (e.g., Node A -> Node B) and both nodes have the same row
with key = 1:

Node A:
    T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
      -> This change is applied on Node B at 10:01 AM.

Node B:
    T2: DELETE FROM t WHERE key = 1;         (05:00 AM)

If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
Node A would raise an "update_missing" conflict. On the other hand, if
a vacuum runs on Node B at 11:00 AM, the change would raise an
"update_deleted" conflict. It looks whether we detect an
"update_deleted" or an "updated_missing" depends on the timing of
vacuum, and to avoid such a situation, we would need to set
vacuum_committs_age to more than 5 hours.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I haven't thought about the implementation details yet but I think
> > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > from checking if the tuple satisfies
> > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > committs is greater than configured vacuum_committs_age (for the
> > table) to decide whether tuple can be removed.
>
> Sounds very costly. I think we need to do performance tests. Even if
> the vacuum gets slower only on the particular table having the
> vacuum_committs_age setting, it would affect overall autovacuum
> performance. Also, it would affect HOT pruning performance.
>

Agreed that we should do some performance testing and additionally
think of a better way to implement it. I think the cost won't be much
if the tuples to be removed are from a single transaction because the
required commit_ts information would be cached but when the tuples are
from different transactions, we could see a noticeable impact. We need
to test to say anything concrete on this.

> >
> > > > but IIUC this value doesn’t need to be significant; it
> > > > can be limited to just a few minutes. The one which is sufficient to
> > > > handle replication delays caused by network lag or other factors,
> > > > assuming clock skew has already been addressed.
> > >
> > > I think that in a non-bidirectional case the value could need to be a
> > > large number. Is that right?
> > >
> >
> > As per my understanding, even for non-bidirectional cases, the value
> > should be small. For example, in the case, pointed out by Shveta [1],
> > where the updates from 2 nodes are received by a third node, this
> > setting is expected to be small. This setting primarily deals with
> > concurrent transactions on multiple nodes, so it should be small but I
> > could be missing something.
> >
>
> I might be missing something but the scenario I was thinking of is
> something below.
>
> Suppose that we setup uni-directional logical replication between Node
> A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> with key = 1:
>
> Node A:
>     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
>       -> This change is applied on Node B at 10:01 AM.
>
> Node B:
>     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
>
> If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> Node A would raise an "update_missing" conflict. On the other hand, if
> a vacuum runs on Node B at 11:00 AM, the change would raise an
> "update_deleted" conflict. It looks whether we detect an
> "update_deleted" or an "updated_missing" depends on the timing of
> vacuum, and to avoid such a situation, we would need to set
> vacuum_committs_age to more than 5 hours.
>

Yeah, in this case, it would detect a different conflict (if we don't
set vacuum_committs_age to greater than 5 hours) but as per my
understanding, the primary purpose of conflict detection and
resolution is to avoid data inconsistency in a bi-directional setup.
Assume, in the above case it is a bi-directional setup, then we want
to have the same data in both nodes. Now, if there are other cases
like the one you mentioned that require detecting the conflict
reliably, then I agree this value could be large and this is probably
not the best way to achieve it. I think we can mention in the docs that the
primary purpose of this is to achieve data consistency among
bi-directional kind of setups.

Having said that even in the above case, the result should be the same
whether the vacuum has removed the row or not. Say, if the vacuum has
not yet removed the row (due to vacuum_committs_age or otherwise) then
also because the incoming update has a later timestamp, we will
convert the update to insert as per last_update_wins resolution
method, so the conflict will be considered as update_missing. And,
say, the vacuum has removed the row and the conflict detected is
update_missing, then also we will convert the update to insert. In
short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
has higher commit-ts, UPDATE should win.

So, we can expect data consistency in bidirectional cases and expect a
deterministic behavior in other cases (e.g. the final data in a table
does not depend on the order of applying the transactions from other
nodes).
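The resolution rule in the last paragraph can be written down as a tiny decision function. The enum and function names below are illustrative only, not the actual implementation:

```c
#include <stdint.h>

/* Outcome of resolving a remote UPDATE against a local DELETE, whether
 * the conflict surfaced as update_deleted or as update_missing. */
typedef enum
{
    APPLY_SKIP_UPDATE,          /* DELETE wins; drop the incoming UPDATE */
    APPLY_CONVERT_TO_INSERT     /* UPDATE wins; apply it as an INSERT */
} Resolution;

/* last_update_wins: the side with the later commit timestamp wins,
 * independently of whether vacuum has already removed the dead tuple. */
static Resolution
resolve_update_vs_delete(int64_t update_commit_ts, int64_t delete_commit_ts)
{
    return (update_commit_ts > delete_commit_ts)
           ? APPLY_CONVERT_TO_INSERT
           : APPLY_SKIP_UPDATE;
}
```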

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I haven't thought about the implementation details yet but I think
> > > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > > from checking if the tuple satisfies
> > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > > committs is greater than configured vacuum_committs_age (for the
> > > table) to decide whether tuple can be removed.
> >
> > Sounds very costly. I think we need to do performance tests. Even if
> > the vacuum gets slower only on the particular table having the
> > vacuum_committs_age setting, it would affect overall autovacuum
> > performance. Also, it would affect HOT pruning performance.
> >
>
> Agreed that we should do some performance testing and additionally
> think of any better way to implement. I think the cost won't be much
> if the tuples to be removed are from a single transaction because the
> required commit_ts information would be cached but when the tuples are
> from different transactions, we could see a noticeable impact. We need
> to test to say anything concrete on this.

Agreed.

>
> > >
> > > > > but IIUC this value doesn’t need to be significant; it
> > > > > can be limited to just a few minutes. The one which is sufficient to
> > > > > handle replication delays caused by network lag or other factors,
> > > > > assuming clock skew has already been addressed.
> > > >
> > > > I think that in a non-bidirectional case the value could need to be a
> > > > large number. Is that right?
> > > >
> > >
> > > As per my understanding, even for non-bidirectional cases, the value
> > > should be small. For example, in the case, pointed out by Shveta [1],
> > > where the updates from 2 nodes are received by a third node, this
> > > setting is expected to be small. This setting primarily deals with
> > > concurrent transactions on multiple nodes, so it should be small but I
> > > could be missing something.
> > >
> >
> > I might be missing something but the scenario I was thinking of is
> > something below.
> >
> > Suppose that we setup uni-directional logical replication between Node
> > A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> > with key = 1:
> >
> > Node A:
> >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> >       -> This change is applied on Node B at 10:01 AM.
> >
> > Node B:
> >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> >
> > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > Node A would raise an "update_missing" conflict. On the other hand, if
> > a vacuum runs on Node B at 11:00 AM, the change would raise an
> > "update_deleted" conflict. It looks whether we detect an
> > "update_deleted" or an "updated_missing" depends on the timing of
> > vacuum, and to avoid such a situation, we would need to set
> > vacuum_committs_age to more than 5 hours.
> >
>
> Yeah, in this case, it would detect a different conflict (if we don't
> set vacuum_committs_age to greater than 5 hours) but as per my
> understanding, the primary purpose of conflict detection and
> resolution is to avoid data inconsistency in a bi-directional setup.
> Assume, in the above case it is a bi-directional setup, then we want
> to have the same data in both nodes. Now, if there are other cases
> like the one you mentioned that require to detect the conflict
> reliably than I agree this value could be large and probably not the
> best way to achieve it. I think we can mention in the docs that the
> primary purpose of this is to achieve data consistency among
> bi-directional kind of setups.
>
> Having said that even in the above case, the result should be the same
> whether the vacuum has removed the row or not. Say, if the vacuum has
> not yet removed the row (due to vacuum_committs_age or otherwise) then
> also because the incoming update has a later timestamp, we will
> convert the update to insert as per last_update_wins resolution
> method, so the conflict will be considered as update_missing. And,
> say, the vacuum has removed the row and the conflict detected is
> update_missing, then also we will convert the update to insert. In
> short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> has higher commit-ts, UPDATE should win.
>
> So, we can expect data consistency in bidirectional cases and expect a
> deterministic behavior in other cases (e.g. the final data in a table
> does not depend on the order of applying the transactions from other
> nodes).

Agreed.

I think that such a time-based configuration parameter would be a
reasonable solution. The current concerns are that it might affect
vacuum performance and lead to a similar bug we had with
vacuum_defer_cleanup_age.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila <amit.kapila16@gmail.com>
> Cc: shveta malik <shveta.malik@gmail.com>; Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
> 
> Agreed.
> 
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > >       -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency in bidirectional cases and expect a
> > deterministic behavior in other cases (e.g. the final data in a table
> > does not depend on the order of applying the transactions from other
> > nodes).
> 
> Agreed.
> 
> I think that such a time-based configuration parameter would be a reasonable
> solution. The current concerns are that it might affect vacuum performance and
> lead to a similar bug we had with vacuum_defer_cleanup_age.

Thanks for the feedback!

I am working on the POC patch and doing some initial performance tests on this idea.
I will share the results after finishing.

Apart from the vacuum_defer_cleanup_age idea, we’ve given more thought to our
approach for retaining dead tuples and have come up with another idea that can
reliably detect conflicts without requiring users to choose a wise value for
the vacuum_committs_age. This new idea could also reduce the performance
impact. Thanks a lot to Amit for off-list discussion.

The concept of the new idea is that the dead tuples are only useful to detect
conflicts when applying *concurrent* transactions from remotes. Any subsequent
UPDATE from a remote node after removing the dead tuples should have a later
timestamp, meaning it's reasonable to detect an update_missing scenario and
convert the UPDATE to an INSERT when applying it.

To achieve above, we can create an additional replication slot on the
subscriber side, maintained by the apply worker. This slot is used to retain
the dead tuples. The apply worker will advance the slot.xmin after confirming
that all the concurrent transactions on the publisher have been applied locally.

The process of advancing the slot.xmin could be:

1) The apply worker calls GetRunningTransactionData() to get the
'oldestRunningXid' and considers this as 'candidate_xmin'.
2) The apply worker sends a new message to the walsender to request the latest
WAL flush position (GetFlushRecPtr) on the publisher, and saves it to
'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
extend the existing keepalive message (e.g., extend the requestReply bit in the
keepalive message to add a 'request_wal_position' value).
3) The apply worker can continue to apply changes. After applying all the WAL
up to 'candidate_remote_wal_lsn', the apply worker can then advance the
slot.xmin to 'candidate_xmin'.

This approach ensures that dead tuples are not removed until all concurrent
transactions have been applied. It can be effective for both bidirectional and
non-bidirectional replication cases.
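The three steps above can be sketched as a small state machine. Every name here (candidate_xmin, last_applied_remote_lsn, and so on) is a hypothetical stand-in, not the actual patch:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of the slot.xmin advancement procedure. */
typedef struct ApplyWorkerState
{
    uint32_t slot_xmin;                 /* xmin currently held by the slot */
    uint32_t candidate_xmin;            /* step 1: local oldestRunningXid */
    uint64_t candidate_remote_wal_lsn;  /* step 2: publisher flush LSN */
    uint64_t last_applied_remote_lsn;   /* how far we have applied */
} ApplyWorkerState;

/* Steps 1 and 2: record a candidate xmin and the publisher's current
 * WAL flush position. */
static void
start_xmin_advance(ApplyWorkerState *st, uint32_t oldest_running_xid,
                   uint64_t publisher_flush_lsn)
{
    st->candidate_xmin = oldest_running_xid;
    st->candidate_remote_wal_lsn = publisher_flush_lsn;
}

/* Step 3: once everything up to candidate_remote_wal_lsn has been
 * applied locally, it is safe to advance slot.xmin, because no
 * concurrent remote transaction can still need the retained tuples. */
static bool
maybe_advance_xmin(ApplyWorkerState *st)
{
    if (st->last_applied_remote_lsn < st->candidate_remote_wal_lsn)
        return false;
    st->slot_xmin = st->candidate_xmin;
    return true;
}
```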

We could introduce a boolean subscription option (retain_dead_tuples) to
control whether this feature is enabled. Each subscription intending to detect
update_deleted conflicts should set retain_dead_tuples to true.

The following explains how it works in different cases to achieve data
consistency:

--
2 nodes, bidirectional case 1:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.01 AM

subscription retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

--
2 nodes, bidirectional case 2:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

After executing T2, the apply worker on Node A will request the latest WAL
flush location on Node B. And T3 is either running concurrently or has not
started. In both cases, T3 must have a later timestamp. So, even if the
dead tuple is removed in this case and update_missing is detected, the default
resolution is to convert the UPDATE to an INSERT, which is OK because the data
are still consistent on Nodes A and B.

--
3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
--

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

Node C:
    apply T1, T2, T3

After applying T2, the apply worker on Node C will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

Your feedback on this idea would be greatly appreciated.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> On Friday, September 20, 2024 2:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > 
> >
> > I think that such a time-based configuration parameter would be a
> > reasonable solution. The current concerns are that it might affect
> > vacuum performance and lead to a similar bug we had with
> vacuum_defer_cleanup_age.
> 
> Thanks for the feedback!
> 
> I am working on the POC patch and doing some initial performance tests on
> this idea.
> I will share the results after finishing.
> 
> Apart from the vacuum_defer_cleanup_age idea. we’ve given more thought to
> our approach for retaining dead tuples and have come up with another idea that
> can reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
> 
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any
> subsequent UPDATE from a remote node after removing the dead tuples
> should have a later timestamp, meaning it's reasonable to detect an
> update_missing scenario and convert the UPDATE to an INSERT when
> applying it.
> 
> To achieve above, we can create an additional replication slot on the subscriber
> side, maintained by the apply worker. This slot is used to retain the dead tuples.
> The apply worker will advance the slot.xmin after confirming that all the
> concurrent transaction on publisher has been applied locally.
> 
> The process of advancing the slot.xmin could be:
> 
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or extend the existing keepalive message(e,g extends the
> requestReply bit in keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
> 
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
> 
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
> 
> The following explains how it works in different cases to achieve data
> consistency:
...
> --
> 3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
> --

Sorry for a typo here: the times of T2 and T3 were reversed.
Please see the following correction:

> 
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Here T2 should be at ts=10.02 AM

> 
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

T3 should be at ts=10.01 AM

> 
> Node C:
>     apply T1, T2, T3
> 
> After applying T2, the apply worker on Node C will check the latest wal flush
> location on Node B. Till that time, the T3 should have finished, so the xmin will
> be advanced only after applying the WALs that is later than T3. So, the dead
> tuple will not be removed before applying the T3, which means the
> update_delete can be detected.
> 
> Your feedback on this idea would be greatly appreciated.
> 

Best Regards,
Hou zj 


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Apart from the vacuum_defer_cleanup_age idea.
>

I think you meant to say vacuum_committs_age idea.

> we’ve given more thought to our
> approach for retaining dead tuples and have come up with another idea that can
> reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
>
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any subsequent
> UPDATE from a remote node after removing the dead tuples should have a later
> timestamp, meaning it's reasonable to detect an update_missing scenario and
> convert the UPDATE to an INSERT when applying it.
>
> To achieve above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transaction on publisher has been applied locally.
>
> The process of advancing the slot.xmin could be:
>
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> extend the existing keepalive message(e,g extends the requestReply bit in
> keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
>
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
>
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
>

As each apply worker needs a separate slot to retain deleted rows, the
requirement for slots will increase. The other possibility is to
maintain one slot via the launcher or some other central process that
traverses all subscriptions and remembers the ones marked with
retain_dead_tuples (let's call this list retain_sub_list). Then, using
running transactions, get the oldest running xact, and then get the
remote flush location from the other node (the publisher) and store
those as candidate values (candidate_xmin and
candidate_remote_wal_lsn) in the slot. We can probably reuse the existing
candidate variables of the slot. Next, we can check the remote flush
locations from all the origins corresponding to subscriptions in
retain_sub_list, and if all are ahead of candidate_remote_wal_lsn, we
can update the slot's xmin to candidate_xmin.

I think in the above idea we can add an optimization to combine the
requests for the remote WAL LSN from different subscriptions pointing to
the same node to avoid sending multiple requests to the same node. I
am not sure if using pg_subscription.subconninfo is sufficient for
this; if not, we can probably leave this optimization.

If this idea is feasible then it would reduce the number of slots
required to retain the deleted rows but the launcher needs to get the
remote wal location corresponding to each publisher node. There are
two ways to achieve that (a) launcher requests one of the apply
workers corresponding to subscriptions pointing to the same publisher
node to get this information; (b) launcher launches another worker to
get the remote wal flush location.
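The launcher-side condition described above, i.e. advancing the shared slot's xmin only once every subscription in retain_sub_list has applied past the candidate remote LSN, can be sketched as follows (illustrative names only, not actual code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-subscription state for the shared-slot scheme. */
typedef struct RetainSub
{
    uint64_t origin_remote_flush_lsn;   /* remote flush LSN this sub has applied */
} RetainSub;

/* The shared slot's xmin may advance to candidate_xmin only when every
 * subscription in retain_sub_list has applied past the candidate
 * remote WAL LSN recorded for its publisher. */
static bool
all_origins_caught_up(const RetainSub *subs, int nsubs,
                      uint64_t candidate_remote_wal_lsn)
{
    for (int i = 0; i < nsubs; i++)
    {
        if (subs[i].origin_remote_flush_lsn < candidate_remote_wal_lsn)
            return false;
    }
    return true;
}
```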

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information to the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are the following:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
(i.e., the logical replication is delayed by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e. the changes are
applied in the order of T2->T3->(vacuum)->T4), it converts into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
resulting in the table having a row with value = 3 (assuming
latest_committs_win is the default resolver for this conflict).

Please confirm this scenario as I might be missing something.
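The race in steps 1-6 can be sketched as a toy timeline (hypothetical
variable names chosen just to illustrate the ordering; this is not
actual PostgreSQL code):

```python
# Step 1/2: candidates captured by the apply worker on Node-A.
candidate_xmin = 301             # oldest running XID observed on Node-A
candidate_remote_wal_lsn = 7000  # Node-B's flush LSN at request time
slot_xmin = None                 # dead tuples retained while unset

# Step 4/5: once everything up to the candidate LSN is applied,
# the proposal advances slot.xmin.
applied_remote_lsn = 7100
if applied_remote_lsn >= candidate_remote_wal_lsn:
    slot_xmin = candidate_xmin

# T4 ("UPDATE t SET value = 3 WHERE id = 1") committed on Node-B at LSN
# 7200, after the candidate was captured, but it still targets the row
# that T2 (XID 100) deleted on Node-A.
t4_commit_lsn = 7200
t2_xid = 100

# The dead tuple is no longer protected by slot.xmin, so vacuum may
# remove it before T4 arrives, which makes the choice between
# update_deleted and update_missing nondeterministic.
vacuumable = slot_xmin is not None and t2_xid < slot_xmin
print(vacuumable)  # True
```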

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote_flush locations from all the origins?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our
> > > approach for retaining dead tuples and have come up with another idea
> that can
> > > reliably detect conflicts without requiring users to choose a wise value for
> > > the vacuum_committs_age. This new idea could also reduce the
> performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that, the dead tuples are only useful to
> detect
> > > conflicts when applying *concurrent* transactions from remotes. Any
> subsequent
> > > UPDATE from a remote node after removing the dead tuples should have a
> later
> > > timestamp, meaning it's reasonable to detect an update_missing scenario
> and
> > > convert the UPDATE to an INSERT when applying it.
> > >
> > > To achieve above, we can create an additional replication slot on the
> > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > the dead tuples. The apply worker will advance the slot.xmin after
> confirming
> > > that all the concurrent transaction on publisher has been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn invalid because we don't need to retain
the WALs for decoding purposes.

> 
> > > 2) the apply worker send a new message to walsender to request the latest
> wal
> > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or
> > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > keepalive message to add a 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
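A minimal sketch of such an interval-driven request check
(`request_interval` is a hypothetical knob, not an existing GUC):

```python
def should_request_flush_lsn(now, last_request_time, request_interval):
    """Fire the flush-position request on its own fixed cadence, rather
    than piggybacking on keepalives, which only go out after the
    receiver has been idle for wal_receiver_timeout / 2."""
    return now - last_request_time >= request_interval

# With a 10-second cadence: due at t=25 if last sent at t=10, not at t=15.
assert should_request_flush_lsn(25.0, 10.0, 10.0)
assert not should_request_flush_lsn(15.0, 10.0, 10.0)
```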

> 
> > > 3) The apply worker can continue to apply changes. After applying all the
> WALs
> > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> concurrent
> > > transactions have been applied. It can be effective for both bidirectional
> and
> > > non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > control whether this feature is enabled. Each subscription intending to
> detect
> > > update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
> 
> Here, on Node-B, there might be a transaction between LSN 7100 and 8000
> that requires the tuple deleted by T2.
> 
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> On Node-A, whether we detect "update_deleted" or "update_missing"
> still depends on when vacuum removes the tuple deleted by T2.

I think in this case, no matter whether we detect "update_deleted" or
"update_missing", the final data is the same. Because T4's commit
timestamp should be later than T2's on node A, in the case of
"update_deleted" we will compare the commit timestamp of the deleted
tuple's xmax with T4's timestamp, and T4 should win, which means we will
convert the update into an insert and apply it. Even if the dead tuple
has already been removed and "update_missing" is detected, the update
will still be converted into an insert and applied. So, the result is
the same.
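The equivalence described above can be sketched as a small last-update-wins
resolver (hypothetical helper names, not the actual PostgreSQL resolver):

```python
from datetime import datetime

def resolve_update(conflict, incoming_ts, deleted_ts=None):
    """For update_deleted, compare the incoming UPDATE's commit timestamp
    with the delete's (the deleted tuple's xmax); for update_missing the
    delete's timestamp is gone with the tuple, so the UPDATE wins by
    default and is converted to an insert."""
    if conflict == "update_deleted" and incoming_ts <= deleted_ts:
        return "skip"
    return "convert_to_insert"

t2_delete_ts = datetime(2024, 9, 24, 10, 2)  # T2's commit on Node-A
t4_update_ts = datetime(2024, 9, 24, 10, 6)  # T4 on Node-B, later

# Same outcome whether vacuum has already run (update_missing) or not:
assert resolve_update("update_deleted", t4_update_ts, t2_delete_ts) == "convert_to_insert"
assert resolve_update("update_missing", t4_update_ts) == "convert_to_insert"
```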

> 
> If applying T4 raises an "update_missing" (i.e. the changes are
> applied in the order of T2->T3->(vacuum)->T4), it converts into an
> insert, resulting in the table having a row with value = 3.
> 
> If applying T4 raises an "update_deleted" (i.e. the changes are
> applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
> in the table having no row.
> 
> On the other hand, in this scenario, Node-B applies changes in the
> order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
> resulting in the table having a row with val=3 (assuming
> latest_committs_win is the default resolver for this confliction).
> 
> Please confirm this scenario as I might be missing something.

As explained above, I think the data can be consistent in this case as well.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 2:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > As each apply worker needs a separate slot to retain deleted rows, the
> > requirement for slots will increase. The other possibility is to
> > maintain one slot by launcher or some other central process that
> > traverses all subscriptions, remember the ones marked with
> > retain_dead_rows (let's call this list as retain_sub_list). Then using
> > running_transactions get the oldest running_xact, and then get the
> > remote flush location from the other node (publisher node) and store
> > those as candidate values (candidate_xmin and
> > candidate_remote_wal_lsn) in slot. We can probably reuse existing
> > candidate variables of the slot. Next, we can check the remote_flush
> > locations from all the origins corresponding in retain_sub_list and if
> > all are ahead of candidate_remote_wal_lsn, we can update the slot's
> > xmin to candidate_xmin.
>
> Does it mean that we use one candiate_remote_wal_lsn in a slot for all
> subscriptions (in retain_sub_list)? IIUC candiate_remote_wal_lsn is a
> LSN of one of publishers, so other publishers could have completely
> different LSNs. How do we compare the candidate_remote_wal_lsn to
> remote_flush locations from all the origins?
>

This should be an array/list with one element per publisher. We can
copy candidate_xmin to the actual xmin only when the
candidate_remote_wal_lsns corresponding to all publishers have been
applied, aka their remote_flush locations (present in origins) are
ahead. The advantages I see with this are (a) it reduces the number of
slots required to achieve the retention of deleted rows for conflict
detection, and (b) in some cases we can avoid sending messages to the
publisher, because with this we only need to send a message to a
particular publisher once rather than from all the apply workers
corresponding to the same publisher node.
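The per-publisher advancement check could look roughly like this
(hypothetical names; a sketch of the condition, not actual slot code):

```python
def can_advance_xmin(candidate_lsns, origin_flush_lsns):
    """Advance the shared slot's xmin only once every publisher's remote
    flush location (as tracked via its replication origin) has passed
    that publisher's own candidate LSN."""
    return all(origin_flush_lsns[pub] >= lsn
               for pub, lsn in candidate_lsns.items())

# Candidates captured when candidate_xmin was chosen, one per publisher:
candidates = {"pub1": 7000, "pub2": 4200}

# Not all origins caught up yet -> keep retaining dead tuples.
assert not can_advance_xmin(candidates, {"pub1": 7100, "pub2": 4100})

# All origins ahead -> safe to copy candidate_xmin into slot.xmin.
assert can_advance_xmin(candidates, {"pub1": 7100, "pub2": 4300})
```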

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 9:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>

As per my understanding, one of the main reasons to keep it logical is
to allow syncing it to standbys (slotsync functionality). It is
required because, after promotion, the subscriptions replicated to the
standby could be enabled to make it a subscriber. If that is not
possible for any reason, then we can consider making it a physical slot
as well.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>
> >
> > > > 2) the apply worker send a new message to walsender to request the latest
> > wal
> > > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > message or
> > > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > > keepalive message to add a 'request_wal_position' value)
> >
> > The apply worker sends a keepalive message when it didn't receive
> > anything more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information to the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent of
> wal_receiver_timeout.
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all the
> > WALs
> > > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > > slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > concurrent
> > > > transactions have been applied. It can be effective for both bidirectional
> > and
> > > > non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > > control whether this feature is enabled. Each subscription intending to
> > detect
> > > > update-delete conflicts should set retain_dead_tuples to true.
> > > >
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN; 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., the logical replication is delaying for 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transaction between LSN 7100 and
> > 8000 that requires the tuple deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
>
> I think in this case, no matter we detect "update_delete" or "update_missing",
> the final data is the same. Because T4's commit timestamp should be later than
> T2 on node A, so in the case of "update_deleted", it will compare the commit
> timestamp of the deleted tuple's xmax with T4's timestamp, and T4 should win,
> which means we will convert the update into insert and apply. Even if the
> deleted tuple is deleted and "update_missing" is detected, the update will
> still be converted into insert and applied. So, the result is the same.

Is "latest_timestamp_wins" the default resolution method for
"update_deleted"? When I checked the wiki page[1], "skip" was the
default resolution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transaction between LSN 7100 and
> > > 8000 that requires the tuple deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter we detect "update_delete" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than
> > T2 on node A, so in the case of "update_deleted", it will compare the
> > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > T4 should win, which means we will convert the update into insert and
> > apply. Even if the deleted tuple is deleted and "update_missing" is
> > detected, the update will still be converted into insert and applied. So, the
> result is the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some updates.

I think using 'skip' as the default for update_deleted could easily
cause data divergence when the dead tuple was deleted by an old
transaction while the UPDATE has a newer timestamp, like the case you
mentioned. It's necessary to follow the last-update-wins strategy when
the incoming update has a later timestamp, which is to convert the
update to an insert.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > I'm still studying this idea but let me confirm the following scenario.
> > > >
> > > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > > and XIDs and commit LSNs of T2 and T3 are the following:
> > > >
> > > > Node A
> > > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> > commit-LSN:1000
> > > >
> > > > Node B
> > > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > > commit-LSN:5000
> > > >
> > > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > > latest flush WAL position of Node-A and Node-B are following:
> > > >
> > > > Node A
> > > >   current XID: 300
> > > >   latest flush LSN; 3000
> > > >
> > > > Node B
> > > >   current XID: 700
> > > >   latest flush LSN: 7000
> > > >
> > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > > (i.e., the logical replication is delaying for 5 min).
> > > >
> > > > Consider the following scenario:
> > > >
> > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > > gets 301 (set as candidate_xmin).
> > > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> > 8000.
> > > > 4. The apply worker on Node-A continues applying changes, and
> > > > applies the transactions up to remote (commit) LSN 7100.
> > > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > > 301 (candidate_xmin).
> > > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > > deleted by T2.
> > > >
> > > > Here, on Node-B, there might be a transaction between LSN 7100 and
> > > > 8000 that requires the tuple deleted by T2.
> > > >
> > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > > still depends on when vacuum removes the tuple deleted by T2.
> > >
> > > I think in this case, no matter we detect "update_delete" or
> > > "update_missing", the final data is the same. Because T4's commit
> > > timestamp should be later than
> > > T2 on node A, so in the case of "update_deleted", it will compare the
> > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > > T4 should win, which means we will convert the update into insert and
> > > apply. Even if the deleted tuple is deleted and "update_missing" is
> > > detected, the update will still be converted into insert and applied. So, the
> > result is the same.
> >
> > The "latest_timestamp_wins" is the default resolution method for
> > "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> > solution method for that.
>
> Right, I think the wiki needs some update.
>
> I think using 'skip' as default for update_delete could easily cause data
> divergence when the dead tuple is deleted by an old transaction while the
> UPDATE has a newer timestamp like the case you mentioned. It's necessary to
> follow the last update win strategy when the incoming update has later
> timestamp, which is to convert update to insert.

Right. If "latest_timestamp_wins" is the default resolution for
"update_deleted", I think your idea works fine unless I'm missing
corner cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com