Thread: Re: Conflict detection for update_deleted in logical replication

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose the conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs when
> the apply worker cannot find the target tuple to be updated, as the tuple might
> have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an update,
> an update_missing conflict is logged. However, to facilitate future automatic
> conflict resolution, it has been agreed[1][2] that we need to detect both
> update_missing and update_deleted conflicts. Specifically, we will detect an
> update_deleted conflict if any dead tuple matching the old key value of the
> update operation is found; otherwise, it will be classified as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible solution might be converting the update to
> an insert and applying it. While for an update_deleted conflict, the preferred
> approach could be to skip the update or compare the timestamps of the delete
> transactions with the remote update transaction's and choose the most recent
> one. For additional context, please refer to [3], which gives examples about
> how these differences could lead to data divergence.
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position such that
> the position has been replayed on the logical standby *AND* the changes occurring
> on the logical standby before that WAL position have also been replayed to the
> walsender's node (where the walsender is running). After receiving the new feedback
> message, the walsender will advance the slot.xmin based on the flush info,
> similar to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> of a logical slot are unused during logical replication, so I think it's safe and
> won't cause side effects to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making this automatic, e.g.,
> letting the apply worker select the slots acquired by the walsenders that connect
> to the same remote server (e.g., if the apply worker's connection info or some
> other flag matches the walsender's connection info). But it seems tricky because,
> if some slots are inactive, which means the walsenders are not there, the apply
> worker could not find the correct slots to check unless we save the host along
> with the slot's persistent data.
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target row
> cannot be found, we perform an additional scan of the table using snapshotAny.
> This scan aims to locate the most recently deleted row that matches the old
> column values from the remote update operation and has not yet been removed by
> VACUUM. If any such tuples are found, we report the update_deleted conflict
> along with the origin and transaction information that deleted the tuple.
>
> Please refer to the attached POC patch set which implements the above design. The
> patch set is split into several parts to make the initial review easier.
> Please note that the patches are interdependent and cannot work independently.
>
> Thanks a lot to Kuroda-San and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated !
>

Thank you, Hou-San, for explaining the design. But to make it easier to
understand, would you be able to explain the sequence/timeline of the
*new* actions performed by the walsender and the apply processes for
the given example, along with the new feedback_slots configuration needed?

Node A: (Procs: walsenderA, applyA)
  T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
  T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table using
> snapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements above
> > design. The patch set is split into some parts to make it easier for the initial
> review.
> > Please note that each patch is interdependent and cannot work
> independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On Node A, feedback_slots should include the logical slot that is used to replicate changes
from Node A to Node B. On Node B, feedback_slots should include the logical
slot that replicates changes from Node B to Node A.
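
For example, assuming a two-node setup where Node A's subscription to Node B is
named sub_a and Node B's subscription to Node A is named sub_b (both using the
default slot names), the configuration might look like the sketch below. This is
only illustrative, using the option syntax from the POC patch; the connection
strings and names are placeholders, and origin = none is the usual setting for
bidirectional setups:

-- On Node A: slot "sub_b" (used by Node B's subscription) carries changes
-- from Node A to Node B, so it is the feedback slot for Node A's subscription.
CREATE SUBSCRIPTION sub_a CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION pub_b WITH (origin = none, feedback_slots = 'sub_b');

-- On Node B: slot "sub_a" carries changes from Node B to Node A.
CREATE SUBSCRIPTION sub_b CONNECTION 'host=nodeA dbname=postgres'
    PUBLICATION pub_a WITH (origin = none, feedback_slots = 'sub_a');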

Assume the slot.xmin on Node A has been initialized to a valid number (740) before the
following flow:

Node A executed T1                                                           - 10.00 AM
T1 replicated and applied on Node B                                          - 10.0001 AM
Node B executed T3                                                           - 10.01 AM
Node A executed T2 (741)                                                     - 10.02 AM
T2 replicated and applied on Node B   (delete_missing)                       - 10.03 AM
T3 replicated and applied on Node A   (new action, detect update_deleted)    - 10.04 AM

(new action) The apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (i.e., feedback_slot.confirmed_flush_lsn >= LSN of the locally
replayed T2), so it sends the new feedback message to Node A.                - 10.05 AM
                    
 

(new action) The walsender on Node A receives the message and advances the slot.xmin.  - 10.06 AM

Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
remove the dead tuple on Node A.
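
To make the two new steps concrete, the relevant values are all visible in
pg_replication_slots (the view and its columns already exist; only the
feedback_slots option and the new feedback message come from the POC patch).
Reusing the sub_a/sub_b names from the sketch above:

-- On Node B: the LSN the apply worker would send in the new feedback message is
-- conceptually the oldest confirmed flush LSN among its feedback slots.
SELECT min(confirmed_flush_lsn)
FROM pg_replication_slots
WHERE slot_name IN ('sub_a');   -- feedback_slots of Node B's subscription

-- On Node A: the xmin advancement performed by the walsender can be observed here.
SELECT slot_name, xmin, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub_b';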

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > > ---
> > > THE DESIGN
> > > ---
> > >
> > > To achieve the above, we plan to allow the logical walsender to
> > > maintain and advance the slot.xmin to protect the data in the user
> > > table and introduce a new logical standby feedback message. This
> > > message reports the WAL position that has been replayed on the logical
> > > standby *AND* the changes occurring on the logical standby before the
> > > WAL position are also replayed to the walsender's node (where the
> > > walsender is running). After receiving the new feedback message, the
> > > walsender will advance the slot.xmin based on the flush info, similar
> > > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > > of logical slot are unused during logical replication, so I think it's safe and
> > won't cause side-effect to reuse the xmin for this feature.
> > >
> > > We have introduced a new subscription option
> > > (feedback_slots='slot1,...'), where these slots will be used to check
> > > condition (b): the transactions on logical standbys occurring before
> > > the replay of Node A's DELETE are replayed on Node A as well.
> > > Therefore, on Node B, users should specify the slots corresponding to
> > > Node A in this option. The apply worker will get the oldest confirmed
> > > flush LSN among the specified slots and send the LSN as a feedback
> > message to the walsender. -- I also thought of making it an automaic way, e.g.
> > > let apply worker select the slots that acquired by the walsenders
> > > which connect to the same remote server(e.g. if apply worker's
> > > connection info or some other flags is same as the walsender's
> > > connection info). But it seems tricky because if some slots are
> > > inactive which means the walsenders are not there, the apply worker
> > > could not find the correct slots to check unless we save the host along with
> > the slot's persistence data.
> > >
> > > The new feedback message is sent only if feedback_slots is not NULL.
> > > If the slots in feedback_slots are removed, a final message containing
> > > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > > the slot.xmin.
> > >
> > > To detect update_deleted conflicts during update operations, if the
> > > target row cannot be found, we perform an additional scan of the table using
> > snapshotAny.
> > > This scan aims to locate the most recently deleted row that matches
> > > the old column values from the remote update operation and has not yet
> > > been removed by VACUUM. If any such tuples are found, we report the
> > > update_deleted conflict along with the origin and transaction information
> > that deleted the tuple.
> > >
> > > Please refer to the attached POC patch set which implements above
> > > design. The patch set is split into some parts to make it easier for the initial
> > review.
> > > Please note that each patch is interdependent and cannot work
> > independently.
> > >
> > > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> > >
> > > Suggestions and comments are highly appreciated !
> > >
> >
> > Thank You Hou-San for explaining the design. But to make it easier to
> > understand, would you be able to explain the sequence/timeline of the
> > *new* actions performed by the walsender and the apply processes for the
> > given example along with new feedback_slot config needed
> >
> > Node A: (Procs: walsenderA, applyA)
> >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> >
> > Node B: (Procs: walsenderB, applyB)
> >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
>
> Thanks for reviewing! Let me elaborate further on the example:
>
> On node A, feedback_slots should include the logical slot that used to replicate changes
> from Node A to Node B. On node B, feedback_slots should include the logical
> slot that replicate changes from Node B to Node A.
>
> Assume the slot.xmin on Node A has been initialized to a valid number(740) before the
> following flow:
>
> Node A executed T1                                                                      - 10.00 AM
> T1 replicated and applied on Node B                                                     - 10.0001 AM
> Node B executed T3                                                                      - 10.01 AM
> Node A executed T2 (741)                                                                - 10.02 AM
> T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM

Not related to this feature, but do you mean delete_origin_differ here?

> T3 replicated and applied on Node A     (new action, detect update_deleted)             - 10.04 AM
>
> (new action) Apply worker on Node B has confirmed that T2 has been applied
> locally and the transactions before T2 (e.g., T3) has been replicated and
> applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the local
> replayed T2), thus send the new feedback message to Node A.                             - 10.05 AM
>
> (new action) Walsender on Node A received the message and would advance the slot.xmin.- 10.06 AM
>
> Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
> remove the dead tuple on Node A.
>

Thanks for the example. Can you please review below and let me know if
my understanding is correct.

1)
In a bidirectional replication setup, the user has to create slots in
such a way that Node A's subscription's slot is Node B's feedback_slot and Node
B's subscription's slot is Node A's feedback_slot. Only then will this feature
work correctly, is that right?

2)
Now coming back to multiple feedback_slots in a subscription, is the
below correct:

Say Node A has publications and subscriptions as follow:
------------------
A_pub1

A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)


Say Node B has publications and subscriptions as follow:
------------------
B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)

B_pub1
B_pub2
B_pub3

Then what will be the feedback_slot configuration for all
subscriptions of A and B? Is below correct:
------------------
A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

3)
If the above is true, then do we have a way to make sure that the user
has given this configuration exactly the above way? If users end up
giving feedback_slots as some random slot (say A_slot4) or an incomplete
list, do we validate that? (I have not looked at the code yet, just
trying to understand the design first.)

4)
Now coming to this:

> The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender.

There will be one apply worker on B, which will be due to B_sub1, so
will it check the confirmed_lsn of all the slots A_sub1, A_sub2, A_sub3? Won't
it be sufficient to check the confirmed_lsn of, say, slot A_sub1 alone, which
has subscribed to table 't' on which the delete has been performed? The rest
of the slots (A_sub2, A_sub3) might have subscribed to different
tables?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicate changes from Node B to Node A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid
> > number(740) before the following flow:
> >
> > Node A executed T1                                                                      - 10.00 AM
> > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > Node B executed T3                                                                      - 10.01 AM
> > Node A executed T2 (741)                                                                - 10.02 AM
> > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a mistake. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A     (new action, detect
> update_deleted)             - 10.04 AM
> >
> > (new action) Apply worker on Node B has confirmed that T2 has been
> > applied locally and the transactions before T2 (e.g., T3) has been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> >= lsn of the local
> > replayed T2), thus send the new feedback message to Node A.
> - 10.05 AM
> >
> > (new action) Walsender on Node A received the message and would
> > advance the slot.xmin.- 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> ------------------
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> ------------------
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A and
> B? Is below correct:
> ------------------
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand design
> first).

The patch doesn't validate whether the feedback slots belong to the correct
subscriptions on the remote server. It only validates that each slot is an existing,
valid, logical slot. I think there are a few challenges to validating it further.
E.g., we need a way to identify which server the slot is replicating
changes to, which could be tricky as the slot currently doesn't carry any info
that identifies the remote server. Besides, the slot could be temporarily inactive
due to some subscriber-side error, in which case we cannot verify the
subscription that uses it.

> 
> 4)
> Now coming to this:
> 
> > The apply worker will get the oldest
> > confirmed flush LSN among the specified slots and send the LSN as a
> > feedback message to the walsender.
> 
>  There will be one apply worker on B which will be due to B_sub1, so will it
> check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> sufficient to check confimed_lsn of say slot A_sub1 alone which has
> subscribed to table 't' on which delete has been performed? Rest of the  lots
> (A_sub2, A_sub3) might have subscribed to different tables?

I think it's theoretically correct to check only A_sub1. We could document
that the user can do this by identifying the tables that each subscription
replicates, but it may not be user-friendly.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 2:45 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thank You Hou-San for explaining the design. But to make it easier
> > > > to understand, would you be able to explain the sequence/timeline of
> > > > the
> > > > *new* actions performed by the walsender and the apply processes for
> > > > the given example along with new feedback_slot config needed
> > > >
> > > > Node A: (Procs: walsenderA, applyA)
> > > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > > >
> > > > Node B: (Procs: walsenderB, applyB)
> > > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> > >
> > > Thanks for reviewing! Let me elaborate further on the example:
> > >
> > > On node A, feedback_slots should include the logical slot that used to
> > > replicate changes from Node A to Node B. On node B, feedback_slots
> > > should include the logical slot that replicate changes from Node B to Node A.
> > >
> > > Assume the slot.xmin on Node A has been initialized to a valid
> > > number(740) before the following flow:
> > >
> > > Node A executed T1                                                                      - 10.00 AM
> > > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > > Node B executed T3                                                                      - 10.01 AM
> > > Node A executed T2 (741)                                                                - 10.02 AM
> > > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> >
> > Not related to this feature, but do you mean delete_origin_differ here?
>
> Oh sorry, It's a miss. I meant delete_origin_differ.
>
> >
> > > T3 replicated and applied on Node A     (new action, detect
> > update_deleted)             - 10.04 AM
> > >
> > > (new action) Apply worker on Node B has confirmed that T2 has been
> > > applied locally and the transactions before T2 (e.g., T3) has been
> > > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the local
> > > replayed T2), thus send the new feedback message to Node A.
> > - 10.05 AM
> > >
> > > (new action) Walsender on Node A received the message and would
> > > advance the slot.xmin.- 10.06 AM
> > >
> > > Then, after the slot.xmin is advanced to a number greater than 741,
> > > the VACUUM would be able to remove the dead tuple on Node A.
> > >
> >
> > Thanks for the example. Can you please review below and let me know if my
> > understanding is correct.
> >
> > 1)
> > In a bidirectional replication setup, the user has to create slots in a way that
> > NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> > A's feedback slot. And then only this feature will work well, is it correct to say?
>
> Yes, your understanding is correct.
>
> >
> > 2)
> > Now coming back to multiple feedback_slots in a subscription, is the below
> > correct:
> >
> > Say Node A has publications and subscriptions as follow:
> > ------------------
> > A_pub1
> >
> > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> >
> >
> > Say Node B has publications and subscriptions as follow:
> > ------------------
> > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> >
> > B_pub1
> > B_pub2
> > B_pub3
> >
> > Then what will be the feedback_slot configuration for all subscriptions of A and
> > B? Is below correct:
> > ------------------
> > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
>
> Right. The above configurations are correct.

Okay. It seems difficult to understand the configuration from the user's perspective.

> >
> > 3)
> > If the above is true, then do we have a way to make sure that the user  has
> > given this configuration exactly the above way? If users end up giving
> > feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> > validate that? (I have not looked at code yet, just trying to understand design
> > first).
>
> The patch doesn't validate if the feedback slots belong to the correct
> subscriptions on remote server. It only validates if the slot is an existing,
> valid, logical slot. I think there are few challenges to validate it further.
> E.g. We need a way to identify the which server the slot is replicating
> changes to, which could be tricky as the slot currently doesn't have any info
> to identify the remote server. Besides, the slot could be inactive temporarily
> due to some subscriber side error, in which case we cannot verify the
> subscription that used it.

Okay, I understand the challenges here.

> >
> > 4)
> > Now coming to this:
> >
> > > The apply worker will get the oldest
> > > confirmed flush LSN among the specified slots and send the LSN as a
> > > feedback message to the walsender.
> >
> >  There will be one apply worker on B which will be due to B_sub1, so will it
> > check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> > sufficient to check confimed_lsn of say slot A_sub1 alone which has
> > subscribed to table 't' on which delete has been performed? Rest of the  lots
> > (A_sub2, A_sub3) might have subscribed to different tables?
>
> I think it's theoretically correct to only check the A_sub1. We could document
> that user can do this by identifying the tables that each subscription
> replicates, but it may not be user friendly.
>

Sorry, I fail to understand how the user can identify the tables and set
feedback_slots accordingly. I thought feedback_slots is a one-time
configuration done when replication is set up (or when the setup changes in
the future); it cannot keep changing with each query. Or am I missing
something?

IMO, it is something which should be identified internally. Since the
query is on table 't1', the feedback slot which is for 't1' should be used
to check the LSN. But on rethinking, this optimization may not be worth the
effort; the identification part could be tricky, so it might be okay
to check all the slots.

~~

Another query is about a 3-node setup. I couldn't figure out what the
feedback_slots setting would be when the setup is not bidirectional. Consider
the case where there are three nodes A, B, C. Node C is subscribing to
both Node A and Node B. Node A and Node B are the ones doing the
concurrent "update" and "delete", which will both be replicated to Node
C. In this case, what will the feedback_slots setting be on Node C? We
don't have any slots here that replicate changes from Node
C to Node A or from Node C to Node B. This scenario is given in [3] in your first
email ([1]).

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > ------------------
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > ------------------
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > ------------------
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.
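
For instance, for the topology above, the documented example could look roughly
like this (hypothetical syntax, since feedback_slots exists only in the POC
patch):

-- On Node A
ALTER SUBSCRIPTION A_sub1 SET (feedback_slots = 'B_sub1');
ALTER SUBSCRIPTION A_sub2 SET (feedback_slots = 'B_sub1');
ALTER SUBSCRIPTION A_sub3 SET (feedback_slots = 'B_sub1');

-- On Node B
ALTER SUBSCRIPTION B_sub1 SET (feedback_slots = 'A_sub1,A_sub2,A_sub3');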

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are few challenges to validate it
> further.
> > E.g. We need a way to identify the which server the slot is
> > replicating changes to, which could be tricky as the slot currently
> > doesn't have any info to identify the remote server. Besides, the slot
> > could be inactive temporarily due to some subscriber side error, in
> > which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > alone which has subscribed to table 't' on which delete has been
> > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one time
> configuration when replication is setup (or say setup changes in future); it can
> not keep on changing with each query. Or am I missing something?

I meant that the user has all the publication information (including the tables
added in each publication) that the subscription subscribes to, and could also
have the slot_name, so I think it's possible to identify the tables that each
subscription includes and set feedback_slots correspondingly before
starting the replication. It would be pretty complicated, although possible, so I
prefer not to mention it in the first place since it would not bring much
benefit.
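
(To be clear, the raw pieces are already exposed in the catalogs, so the mapping
is possible in principle, e.g.:

-- On the publisher: which tables each publication contains
SELECT pubname, schemaname, tablename FROM pg_publication_tables;

-- On the subscriber: which publications and slot each subscription uses
SELECT subname, subslotname, subpublications FROM pg_subscription;

But stitching that together per table and keeping feedback_slots up to date as
publications change is the part I'd rather not ask users to do.)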

> 
> IMO, it is something which should be identified internally. Since the query is on
> table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> rethinking,this optimization may not be worth the effort, the identification part
> could be tricky, so it might be okay to check all the slots.

I agree that identifying these internally would add complexity.

> 
> ~~
> 
> Another query is about 3 node setup. I couldn't figure out what would be
> feedback_slots setting when it is not bidirectional, as in consider the case
> where there are three nodes A,B,C. Node C is subscribing to both Node A and
> Node B. Node A and Node B are the ones doing concurrent "update" and
> "delete" which will both be replicated to Node C. In this case what will be the
> feedback_slots setting on Node C? We don't have any slots here which will be
> replicating changes from Node C to Node A and Node C to Node B. This is given
> in [3] in your first email ([1])

Thanks for pointing this out; the link was a bit misleading. I think the solution
proposed in this thread is only intended to allow detecting update_deleted reliably
in a bidirectional cluster. For non-bidirectional cases, it would be more
tricky to predict how long we should retain the dead tuples.


> 
> [1]:
> https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A
> 6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 5:56 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thanks for the example. Can you please review below and let me know
> > > > if my understanding is correct.
> > > >
> > > > 1)
> > > > In a bidirectional replication setup, the user has to create slots
> > > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > > B's sub's slot is Node A's feedback slot. And then only this feature will
> > work well, is it correct to say?
> > >
> > > Yes, your understanding is correct.
> > >
> > > >
> > > > 2)
> > > > Now coming back to multiple feedback_slots in a subscription, is the
> > > > below
> > > > correct:
> > > >
> > > > Say Node A has publications and subscriptions as follow:
> > > > ------------------
> > > > A_pub1
> > > >
> > > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > > >
> > > >
> > > > Say Node B has publications and subscriptions as follow:
> > > > ------------------
> > > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > > >
> > > > B_pub1
> > > > B_pub2
> > > > B_pub3
> > > >
> > > > Then what will be the feedback_slot configuration for all
> > > > subscriptions of A and B? Is below correct:
> > > > ------------------
> > > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> > >
> > > Right. The above configurations are correct.
> >
> > Okay. It seems difficult to understand configuration from user's perspective.
>
> Right. I think we could give an example in the document to make it clear.
>
> >
> > > >
> > > > 3)
> > > > If the above is true, then do we have a way to make sure that the
> > > > user  has given this configuration exactly the above way? If users
> > > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > > incomplete list), do we validate that? (I have not looked at code
> > > > yet, just trying to understand design first).
> > >
> > > The patch doesn't validate if the feedback slots belong to the correct
> > > subscriptions on remote server. It only validates if the slot is an
> > > existing, valid, logical slot. I think there are few challenges to validate it
> > further.
> > > E.g. We need a way to identify the which server the slot is
> > > replicating changes to, which could be tricky as the slot currently
> > > doesn't have any info to identify the remote server. Besides, the slot
> > > could be inactive temporarily due to some subscriber side error, in
> > > which case we cannot verify the subscription that used it.
> >
> > Okay, I understand the challenges here.
> >
> > > >
> > > > 4)
> > > > Now coming to this:
> > > >
> > > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > > specified slots and send the LSN as a feedback message to the
> > > > > walsender.
> > > >
> > > >  There will be one apply worker on B which will be due to B_sub1, so
> > > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > > alone which has subscribed to table 't' on which delete has been
> > > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> > different tables?
> > >
> > > I think it's theoretically correct to only check the A_sub1. We could
> > > document that user can do this by identifying the tables that each
> > > subscription replicates, but it may not be user friendly.
> > >
> >
> > Sorry, I fail to understand how user can identify the tables and give
> > feedback_slots accordingly? I thought feedback_slots is a one time
> > configuration when replication is setup (or say setup changes in future); it can
> > not keep on changing with each query. Or am I missing something?
>
> I meant that user have all the publication information(including the tables
> added in a publication) that the subscription subscribes to, and could also
> have the slot_name, so I think it's possible to identify the tables that each
> subscription includes and add the feedback_slots correspondingly before
> starting the replication. It would be pretty complicate although possible, so I
> prefer to not mention it in the first place if it could not bring much
> benefits.
>
> >
> > IMO, it is something which should be identified internally. Since the query is on
> > table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> > rethinking,this optimization may not be worth the effort, the identification part
> > could be tricky, so it might be okay to check all the slots.
>
> I agree that identifying these internally would add complexity.
>
> >
> > ~~
> >
> > Another query is about 3 node setup. I couldn't figure out what would be
> > feedback_slots setting when it is not bidirectional, as in consider the case
> > where there are three nodes A,B,C. Node C is subscribing to both Node A and
> > Node B. Node A and Node B are the ones doing concurrent "update" and
> > "delete" which will both be replicated to Node C. In this case what will be the
> > feedback_slots setting on Node C? We don't have any slots here which will be
> > replicating changes from Node C to Node A and Node C to Node B. This is given
> > in [3] in your first email ([1])
>
> Thanks for pointing this, the link was a bit misleading. I think the solution
> proposed in this thread is only used to allow detecting update_deleted reliably
> in a bidirectional cluster.  For non- bidirectional cases, it would be more
> tricky to predict the timing till when should we retain the dead tuples.
>

So, in brief, this solution is only for a bidirectional setup? For
non-bidirectional setups, feedback_slots is non-configurable and thus
irrelevant.

Irrespective of the above, if the user ends up setting feedback_slots to some
random but existing slot which is not consuming changes at all, then
it may happen that the node never sends a feedback message to the other
node, resulting in an accumulation of dead tuples on that node. Is that
a possibility?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this, the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster.  For non-
> > bidirectional cases, it would be more tricky to predict the timing till when
> should we retain the dead tuples.
> >
> 
> So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of above, if user ends up setting feedback_slot to some random but
> existing slot which is not at all consuming changes, then it may so happen that
> the node will never send feedback msg to another node resulting in
> accumulation of dead tuples on another node. Is that a possibility?

Yes, it's possible. I think this is a common situation for this kind of
user-specified option. For example, user DML will be blocked if any inactive standby
names are added to synchronous_standby_names.

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter,
say vacuum_committs_age, which would indicate that we allow rows
to be removed only if the time since the tuple's modification, as indicated by
the committs module, is greater than vacuum_committs_age. We could make
this parameter a table-level option without introducing a GUC, as this
may not apply to all tables. I checked and found that some other
replication solutions like GoldenGate also allow similar parameters
(tombstone_deletes) to be specified at the table level [2]. The other
advantage of allowing it at the table level is that it won't hamper the
performance of hot-pruning or vacuum in general. Note, I am careful
here because, to decide whether to remove a dead tuple or not, we need
to compare its committs_time both during hot-pruning and vacuum.
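
If we go the table-level route, the usage could end up looking something like
the following. This is purely illustrative; neither the reloption nor the
behavior exists yet, and it would depend on commit timestamps being tracked:

-- commit timestamps must be recorded for committs-based retention to work
ALTER SYSTEM SET track_commit_timestamp = on;   -- requires a server restart

-- hypothetical table-level option: keep dead tuples in "t" from being removed
-- until they are at least 5 minutes old
ALTER TABLE t SET (vacuum_committs_age = '5min');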

Note that tombstone_deletes is a general concept used by replication
solutions to detect the update_deleted conflict, and time-based purging is
recommended. See [3][4]. We previously discussed having tombstone
tables to keep the deleted records' information, but it was suggested to
instead prevent vacuum from removing the required dead tuples, as that
would be simpler than inventing a new kind of table/store for
tombstone_deletes [5]. So, we came up with the idea of feedback slots
discussed in this thread, but that didn't work out in all cases and
appears difficult to configure, as pointed out by Shveta. So, now, we
are back to one of the other ideas [1] discussed previously to solve
this problem.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
    schema_name       => 'hr',
    table_name        => 'employees',
    tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] -
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > >
> > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > feedback_slots is non-configurable and thus irrelevant.
> >
> > Right.
> >
>
> One possible idea to address the non-bidirectional case raised by
> Shveta is to use a time-based cut-off to remove dead tuples. As
> mentioned earlier in my email [1], we can define a new GUC parameter
> say vacuum_committs_age which would indicate that we will allow rows
> to be removed only if the modified time of the tuple as indicated by
> committs module is greater than the vacuum_committs_age. We could keep
> this parameter a table-level option without introducing a GUC as this
> may not apply to all tables. I checked and found that some other
> replication solutions like GoldenGate also allowed similar parameters
> (tombstone_deletes) to be specified at table level [2]. The other
> advantage of allowing it at table level is that it won't hamper the
> performance of hot-pruning or vacuum in general. Note, I am careful
> here because to decide whether to remove a dead tuple or not we need
> to compare its committs_time both during hot-pruning and vacuum.

+1 on the idea, but IIUC this value doesn't need to be large; it
can be limited to just a few minutes, i.e., whatever is sufficient to
handle replication delays caused by network lag or other factors,
assuming clock skew has already been addressed.

This new parameter is necessary only for cases where an UPDATE and
DELETE on the same row occur concurrently, but the replication order
to a third node is not preserved, which could result in data
divergence. Consider the following example:

Node A:
   T1: INSERT INTO t (id, value) VALUES (1,1);  (10.01 AM)
   T2: DELETE FROM t WHERE id = 1;             (10.03 AM)

Node B:
   T3: UPDATE t SET value = 2 WHERE id = 1;    (10.02 AM)

Assume a third node (Node C) subscribes to both Node A and Node B. The
"correct" order of messages received by Node C would be T1-T3-T2, but
it could also receive them in the order T1-T2-T3, where, say, T3 is
received with a lag of 2 minutes. In such a scenario, when applying T3 we should be
able to recognize that the row was deleted by T2 on Node C, thereby
detecting the update_deleted conflict and skipping the apply.

The 'vacuum_committs_age' parameter should account for this lag, which
could lead to the order reversal of UPDATE and DELETE operations.

Any subsequent attempt to update the same row after conflict detection
and resolution should not pose an issue. For example, if Node A
triggers the following at 10:20 AM:
UPDATE t SET value = 3 WHERE id = 1;

Since the row has already been deleted, the UPDATE will not proceed
and therefore will not generate a replication operation on the other
nodes, indicating that vacuum need not preserve the dead row for
that long.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > feedback_slots is non-configurable and thus irrelevant.
> > >
> > > Right.
> > >
> >
> > One possible idea to address the non-bidirectional case raised by
> > Shveta is to use a time-based cut-off to remove dead tuples. As
> > mentioned earlier in my email [1], we can define a new GUC parameter
> > say vacuum_committs_age which would indicate that we will allow rows
> > to be removed only if the modified time of the tuple as indicated by
> > committs module is greater than the vacuum_committs_age. We could keep
> > this parameter a table-level option without introducing a GUC as this
> > may not apply to all tables. I checked and found that some other
> > replication solutions like GoldenGate also allowed similar parameters
> > (tombstone_deletes) to be specified at table level [2]. The other
> > advantage of allowing it at table level is that it won't hamper the
> > performance of hot-pruning or vacuum in general. Note, I am careful
> > here because to decide whether to remove a dead tuple or not we need
> > to compare its committs_time both during hot-pruning and vacuum.
>
> +1 on the idea,

I agree that this idea is much simpler than the idea originally
proposed in this thread.

IIUC vacuum_committs_age specifies a time rather than an XID age. But
how can we implement it? If it ends up affecting the vacuum cutoff, we
should be careful not to end up with the same problems as
vacuum_defer_cleanup_age that were discussed before [1]. Also, I think
the implementation should not affect the performance of
ComputeXidHorizons().

> but IIUC this value doesn’t need to be significant; it
> can be limited to just a few minutes. The one which is sufficient to
> handle replication delays caused by network lag or other factors,
> assuming clock skew has already been addressed.

I think that in a non-bidirectional case the value might need to be a
large number. Is that right?

Regards,

[1] https://www.postgresql.org/message-id/20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet but I think
during pruning (for example in heap_prune_satisfies_vacuum()), apart
from checking if the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check whether the age
of the tuple's committs exceeds the configured vacuum_committs_age (for
the table) to decide whether the tuple can be removed. One thing to
consider is what to do in case of aggressive vacuum, where we expect
relfrozenxid to be advanced to FreezeLimit (at a minimum). We may want
to just ignore vacuum_committs_age during aggressive vacuum and LOG if
we end up removing some tuple. This will allow users to retain deleted
tuples while respecting the freeze limits, which also avoids XID
wraparound. I think we can't retain tuples forever if the user
misconfigures vacuum_committs_age, and to avoid that we can cap this
parameter at, say, an hour or so. Also, users can tune the freeze
parameters if they want to retain tuples for longer.
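
To make the intended check concrete, here is a minimal sketch in Python
(a conceptual model of the above, not the actual change to
heap_prune_satisfies_vacuum(); the commit_ts value and the table-level
option are simply passed in as arguments):

from datetime import datetime, timedelta
from typing import Optional

def can_remove_dead_tuple(dead_per_horizon: bool,
                          tuple_commit_ts: datetime,
                          vacuum_committs_age: Optional[timedelta],
                          now: datetime,
                          aggressive: bool) -> bool:
    # Equivalent of the HeapTupleSatisfiesVacuumHorizon() check: if the
    # tuple is not dead, it is not removable anyway.
    if not dead_per_horizon:
        return False
    # Option not set for this table: behave as today.
    if vacuum_committs_age is None:
        return True
    # Aggressive vacuum ignores the option (and would LOG the removal),
    # so freeze limits and wraparound protection still apply.
    if aggressive:
        return True
    # Otherwise keep the dead tuple until its commit is older than
    # vacuum_committs_age, so update_deleted can still be detected.
    return (now - tuple_commit_ts) > vacuum_committs_age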

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from 2 nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so it should be small, but I
could be missing something.

[1] - https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > > feedback_slots is non-configurable and thus irrelevant.
> > > > >
> > > > > Right.
> > > > >
> > > >
> > > > One possible idea to address the non-bidirectional case raised by
> > > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > > say vacuum_committs_age which would indicate that we will allow rows
> > > > to be removed only if the modified time of the tuple as indicated by
> > > > committs module is greater than the vacuum_committs_age. We could keep
> > > > this parameter a table-level option without introducing a GUC as this
> > > > may not apply to all tables. I checked and found that some other
> > > > replication solutions like GoldenGate also allowed similar parameters
> > > > (tombstone_deletes) to be specified at table level [2]. The other
> > > > advantage of allowing it at table level is that it won't hamper the
> > > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > > here because to decide whether to remove a dead tuple or not we need
> > > > to compare its committs_time both during hot-pruning and vacuum.
> > >
> > > +1 on the idea,
> >
> > I agree that this idea is much simpler than the idea originally
> > proposed in this thread.
> >
> > IIUC vacuum_committs_age specifies a time rather than an XID age.
> >
>
> Your understanding is correct that vacuum_committs_age specifies a time.
>
> >
> > But
> > how can we implement it? If it ends up affecting the vacuum cutoff, we
> > should be careful not to end up with the same result of
> > vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> > the implementation needs not to affect the performance of
> > ComputeXidHorizons().
> >
>
> I haven't thought about the implementation details yet but I think
> during pruning (for example in heap_prune_satisfies_vacuum()), apart
> from checking if the tuple satisfies
> HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> committs is greater than configured vacuum_committs_age (for the
> table) to decide whether tuple can be removed.

Sounds very costly. I think we need to do performance tests. Even if
the vacuum gets slower only on the particular table having the
vacuum_committs_age setting, it would affect overall autovacuum
performance. Also, it would affect HOT pruning performance.

>
> > > but IIUC this value doesn’t need to be significant; it
> > > can be limited to just a few minutes. The one which is sufficient to
> > > handle replication delays caused by network lag or other factors,
> > > assuming clock skew has already been addressed.
> >
> > I think that in a non-bidirectional case the value could need to be a
> > large number. Is that right?
> >
>
> As per my understanding, even for non-bidirectional cases, the value
> should be small. For example, in the case, pointed out by Shveta [1],
> where the updates from 2 nodes are received by a third node, this
> setting is expected to be small. This setting primarily deals with
> concurrent transactions on multiple nodes, so it should be small but I
> could be missing something.
>

I might be missing something, but the scenario I was thinking of is
something like below.

Suppose that we set up uni-directional logical replication between Node
A and Node B (e.g., Node A -> Node B) and both nodes have the same row
with key = 1:

Node A:
    T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
      -> This change is applied on Node B at 10:01 AM.

Node B:
    T2: DELETE FROM t WHERE key = 1;         (05:00 AM)

If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
Node A would raise an "update_missing" conflict. On the other hand, if
a vacuum runs on Node B at 11:00 AM, the change would raise an
"update_deleted" conflict. It looks whether we detect an
"update_deleted" or an "updated_missing" depends on the timing of
vacuum, and to avoid such a situation, we would need to set
vacuum_committs_age to more than 5 hours.
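
Stated as a toy model (my illustration, not code from any patch): which
conflict Node B reports for the incoming change of T1 depends only on
whether VACUUM has already removed the row deleted by T2:

def detect_conflict(dead_row_still_present: bool) -> str:
    return "update_deleted" if dead_row_still_present else "update_missing"

# vacuum at 06:00 AM removed the row deleted at 05:00 AM:
assert detect_conflict(dead_row_still_present=False) == "update_missing"
# vacuum only at 11:00 AM, so the dead row is still there at 10:01 AM:
assert detect_conflict(dead_row_still_present=True) == "update_deleted"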

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I haven't thought about the implementation details yet but I think
> > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > from checking if the tuple satisfies
> > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > committs is greater than configured vacuum_committs_age (for the
> > table) to decide whether tuple can be removed.
>
> Sounds very costly. I think we need to do performance tests. Even if
> the vacuum gets slower only on the particular table having the
> vacuum_committs_age setting, it would affect overall autovacuum
> performance. Also, it would affect HOT pruning performance.
>

Agreed that we should do some performance testing and additionally
think of a better way to implement it. I think the cost won't be much
if the tuples to be removed are from a single transaction, because the
required commit_ts information would be cached, but when the tuples are
from different transactions, we could see a noticeable impact. We need
to test before saying anything concrete on this.

> >
> > > > but IIUC this value doesn’t need to be significant; it
> > > > can be limited to just a few minutes. The one which is sufficient to
> > > > handle replication delays caused by network lag or other factors,
> > > > assuming clock skew has already been addressed.
> > >
> > > I think that in a non-bidirectional case the value could need to be a
> > > large number. Is that right?
> > >
> >
> > As per my understanding, even for non-bidirectional cases, the value
> > should be small. For example, in the case, pointed out by Shveta [1],
> > where the updates from 2 nodes are received by a third node, this
> > setting is expected to be small. This setting primarily deals with
> > concurrent transactions on multiple nodes, so it should be small but I
> > could be missing something.
> >
>
> I might be missing something but the scenario I was thinking of is
> something below.
>
> Suppose that we setup uni-directional logical replication between Node
> A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> with key = 1:
>
> Node A:
>     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
>       -> This change is applied on Node B at 10:01 AM.
>
> Node B:
>     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
>
> If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> Node A would raise an "update_missing" conflict. On the other hand, if
> a vacuum runs on Node B at 11:00 AM, the change would raise an
> "update_deleted" conflict. It looks whether we detect an
> "update_deleted" or an "updated_missing" depends on the timing of
> vacuum, and to avoid such a situation, we would need to set
> vacuum_committs_age to more than 5 hours.
>

Yeah, in this case, it would detect a different conflict (if we don't
set vacuum_committs_age to greater than 5 hours), but as per my
understanding, the primary purpose of conflict detection and
resolution is to avoid data inconsistency in a bi-directional setup.
Assume the above case is a bi-directional setup; then we want to have
the same data on both nodes. Now, if there are other cases like the one
you mentioned that require detecting the conflict reliably, then I
agree this value could be large and is probably not the best way to
achieve it. I think we can mention in the docs that the primary purpose
of this is to achieve data consistency among bi-directional kinds of
setups.

Having said that, even in the above case, the result should be the same
whether the vacuum has removed the row or not. Say the vacuum has not
yet removed the row (due to vacuum_committs_age or otherwise); then,
because the incoming update has a later timestamp, we will convert the
update to an insert as per the last_update_wins resolution method, so
the outcome is the same as for an update_missing conflict. And say the
vacuum has removed the row and the conflict detected is update_missing;
then also we will convert the update to an insert. In short, if the
UPDATE has the lower commit-ts, the DELETE should win, and if the
UPDATE has the higher commit-ts, the UPDATE should win.

So, we can expect data consistency in bidirectional cases and expect a
deterministic behavior in other cases (e.g. the final data in a table
does not depend on the order of applying the transactions from other
nodes).
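
As a tiny sketch of that rule (my illustration; the resolution names
follow this thread rather than any committed API), the later commit
timestamp decides the outcome regardless of whether the dead row has
already been vacuumed away:

def resolve_remote_update_vs_local_delete(update_commit_ts, delete_commit_ts):
    # last_update_wins-style outcome for an UPDATE arriving after a local DELETE.
    if update_commit_ts > delete_commit_ts:
        return "apply UPDATE (converted to INSERT if the row is gone)"
    return "skip UPDATE (the DELETE wins)"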

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I haven't thought about the implementation details yet but I think
> > > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > > from checking if the tuple satisfies
> > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > > committs is greater than configured vacuum_committs_age (for the
> > > table) to decide whether tuple can be removed.
> >
> > Sounds very costly. I think we need to do performance tests. Even if
> > the vacuum gets slower only on the particular table having the
> > vacuum_committs_age setting, it would affect overall autovacuum
> > performance. Also, it would affect HOT pruning performance.
> >
>
> Agreed that we should do some performance testing and additionally
> think of any better way to implement. I think the cost won't be much
> if the tuples to be removed are from a single transaction because the
> required commit_ts information would be cached but when the tuples are
> from different transactions, we could see a noticeable impact. We need
> to test to say anything concrete on this.

Agreed.

>
> > >
> > > > > but IIUC this value doesn’t need to be significant; it
> > > > > can be limited to just a few minutes. The one which is sufficient to
> > > > > handle replication delays caused by network lag or other factors,
> > > > > assuming clock skew has already been addressed.
> > > >
> > > > I think that in a non-bidirectional case the value could need to be a
> > > > large number. Is that right?
> > > >
> > >
> > > As per my understanding, even for non-bidirectional cases, the value
> > > should be small. For example, in the case, pointed out by Shveta [1],
> > > where the updates from 2 nodes are received by a third node, this
> > > setting is expected to be small. This setting primarily deals with
> > > concurrent transactions on multiple nodes, so it should be small but I
> > > could be missing something.
> > >
> >
> > I might be missing something but the scenario I was thinking of is
> > something below.
> >
> > Suppose that we setup uni-directional logical replication between Node
> > A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> > with key = 1:
> >
> > Node A:
> >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> >       -> This change is applied on Node B at 10:01 AM.
> >
> > Node B:
> >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> >
> > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > Node A would raise an "update_missing" conflict. On the other hand, if
> > a vacuum runs on Node B at 11:00 AM, the change would raise an
> > "update_deleted" conflict. It looks whether we detect an
> > "update_deleted" or an "updated_missing" depends on the timing of
> > vacuum, and to avoid such a situation, we would need to set
> > vacuum_committs_age to more than 5 hours.
> >
>
> Yeah, in this case, it would detect a different conflict (if we don't
> set vacuum_committs_age to greater than 5 hours) but as per my
> understanding, the primary purpose of conflict detection and
> resolution is to avoid data inconsistency in a bi-directional setup.
> Assume, in the above case it is a bi-directional setup, then we want
> to have the same data in both nodes. Now, if there are other cases
> like the one you mentioned that require to detect the conflict
> reliably than I agree this value could be large and probably not the
> best way to achieve it. I think we can mention in the docs that the
> primary purpose of this is to achieve data consistency among
> bi-directional kind of setups.
>
> Having said that even in the above case, the result should be the same
> whether the vacuum has removed the row or not. Say, if the vacuum has
> not yet removed the row (due to vacuum_committs_age or otherwise) then
> also because the incoming update has a later timestamp, we will
> convert the update to insert as per last_update_wins resolution
> method, so the conflict will be considered as update_missing. And,
> say, the vacuum has removed the row and the conflict detected is
> update_missing, then also we will convert the update to insert. In
> short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> has higher commit-ts, UPDATE should win.
>
> So, we can expect data consistency in bidirectional cases and expect a
> deterministic behavior in other cases (e.g. the final data in a table
> does not depend on the order of applying the transactions from other
> nodes).

Agreed.

I think that such a time-based configuration parameter would be a
reasonable solution. The current concerns are that it might affect
vacuum performance and lead to a similar bug we had with
vacuum_defer_cleanup_age.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila <amit.kapila16@gmail.com>
> Cc: shveta malik <shveta.malik@gmail.com>; Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
> 
> Agreed.
> 
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > >       -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency in bidirectional cases and expect a
> > deterministic behavior in other cases (e.g. the final data in a table
> > does not depend on the order of applying the transactions from other
> > nodes).
> 
> Agreed.
> 
> I think that such a time-based configuration parameter would be a reasonable
> solution. The current concerns are that it might affect vacuum performance and
> lead to a similar bug we had with vacuum_defer_cleanup_age.

Thanks for the feedback!

I am working on the POC patch and doing some initial performance tests on this idea.
I will share the results after finishing.

Apart from the vacuum_defer_cleanup_age idea, we’ve given more thought to our
approach for retaining dead tuples and have come up with another idea that can
reliably detect conflicts without requiring users to choose a wise value for
vacuum_committs_age. This new idea could also reduce the performance
impact. Thanks a lot to Amit for the off-list discussion.

The concept of the new idea is that the dead tuples are only useful to detect
conflicts when applying *concurrent* transactions from remotes. Any subsequent
UPDATE from a remote node after removing the dead tuples should have a later
timestamp, meaning it's reasonable to detect an update_missing scenario and
convert the UPDATE to an INSERT when applying it.

To achieve the above, we can create an additional replication slot on the
subscriber side, maintained by the apply worker. This slot is used to retain
the dead tuples. The apply worker will advance the slot.xmin after confirming
that all the concurrent transactions on the publisher have been applied
locally.

The process of advancing the slot.xmin could be (a rough sketch of this cycle
follows the list):

1) The apply worker calls GetRunningTransactionData() to get the
'oldestRunningXid' and considers it as 'candidate_xmin'.
2) The apply worker sends a new message to the walsender to request the latest
WAL flush position (GetFlushRecPtr) on the publisher, and saves it as
'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
extend the existing keepalive message (e.g. extend the requestReply bit in the
keepalive message to add a 'request_wal_position' value).
3) The apply worker can continue to apply changes. After applying all the WAL
up to 'candidate_remote_wal_lsn', the apply worker can then advance the
slot.xmin to 'candidate_xmin'.
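
A rough Python model of this cycle (an illustration of the proposal, not
PostgreSQL code; oldest_running_xid and publisher_flush_lsn stand in for
GetRunningTransactionData()'s oldestRunningXid and the newly requested
GetFlushRecPtr() value from the publisher):

class DeadTupleRetentionSlot:
    def __init__(self):
        self.xmin = None                    # nothing retained yet
        self.candidate_xmin = None
        self.candidate_remote_wal_lsn = None

    def take_candidates(self, oldest_running_xid, publisher_flush_lsn):
        # Steps 1) and 2): remember a local xmin candidate and the remote
        # flush position it has to wait for.
        self.candidate_xmin = oldest_running_xid
        self.candidate_remote_wal_lsn = publisher_flush_lsn

    def maybe_advance(self, last_applied_remote_lsn):
        # Step 3): once everything up to the remembered remote position has
        # been applied locally, the candidate becomes the real xmin and
        # older dead tuples become removable by (auto)vacuum again.
        if (self.candidate_xmin is not None
                and last_applied_remote_lsn >= self.candidate_remote_wal_lsn):
            self.xmin = self.candidate_xmin
            self.candidate_xmin = None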

This approach ensures that dead tuples are not removed until all concurrent
transactions have been applied. It can be effective for both bidirectional and
non-bidirectional replication cases.

We could introduce a boolean subscription option (retain_dead_tuples) to
control whether this feature is enabled. Each subscription intending to detect
update-delete conflicts should set retain_dead_tuples to true.

The following explains how it works in different cases to achieve data
consistency:

--
2 nodes, bidirectional case 1:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.01 AM

subscription retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before T3 is applied, which means the
update_deleted conflict can be detected.

--
2 nodes, bidirectional case 2:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

After executing T2, the apply worker on Node A will request the latest WAL
flush location on Node B. At that point, T3 is either running concurrently or
has not started. In both cases, T3 must have a later timestamp. So, even if the
dead tuple is removed in this case and update_missing is detected, the default
resolution is to convert the UPDATE to an INSERT, which is OK because the data
are still consistent on Node A and B.

--
3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
--

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

Node C:
    apply T1, T2, T3

After applying T2, the apply worker on Node C will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before T3 is applied, which means the
update_deleted conflict can be detected.

Your feedback on this idea would be greatly appreciated.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> On Friday, September 20, 2024 2:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > 
> >
> > I think that such a time-based configuration parameter would be a
> > reasonable solution. The current concerns are that it might affect
> > vacuum performance and lead to a similar bug we had with
> vacuum_defer_cleanup_age.
> 
> Thanks for the feedback!
> 
> I am working on the POC patch and doing some initial performance tests on
> this idea.
> I will share the results after finishing.
> 
> Apart from the vacuum_defer_cleanup_age idea. we’ve given more thought to
> our approach for retaining dead tuples and have come up with another idea that
> can reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
> 
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any
> subsequent UPDATE from a remote node after removing the dead tuples
> should have a later timestamp, meaning it's reasonable to detect an
> update_missing scenario and convert the UPDATE to an INSERT when
> applying it.
> 
> To achieve above, we can create an additional replication slot on the subscriber
> side, maintained by the apply worker. This slot is used to retain the dead tuples.
> The apply worker will advance the slot.xmin after confirming that all the
> concurrent transaction on publisher has been applied locally.
> 
> The process of advancing the slot.xmin could be:
> 
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or extend the existing keepalive message(e,g extends the
> requestReply bit in keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
> 
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
> 
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
> 
> The following explains how it works in different cases to achieve data
> consistency:
...
> --
> 3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
> --

Sorry for a typo here; the times of T2 and T3 were reversed.
Please see the following correction:

> 
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Here T2 should be at ts=10.02 AM

> 
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

T3 should be at ts=10.01 AM

> 
> Node C:
>     apply T1, T2, T3
> 
> After applying T2, the apply worker on Node C will check the latest wal flush
> location on Node B. Till that time, the T3 should have finished, so the xmin will
> be advanced only after applying the WALs that is later than T3. So, the dead
> tuple will not be removed before applying the T3, which means the
> update_delete can be detected.
> 
> Your feedback on this idea would be greatly appreciated.
> 

Best Regards,
Hou zj 


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Apart from the vacuum_defer_cleanup_age idea.
>

I think you meant to say vacuum_committs_age idea.

> we’ve given more thought to our
> approach for retaining dead tuples and have come up with another idea that can
> reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
>
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any subsequent
> UPDATE from a remote node after removing the dead tuples should have a later
> timestamp, meaning it's reasonable to detect an update_missing scenario and
> convert the UPDATE to an INSERT when applying it.
>
> To achieve above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transaction on publisher has been applied locally.
>
> The process of advancing the slot.xmin could be:
>
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> extend the existing keepalive message(e,g extends the requestReply bit in
> keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
>
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
>
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
>

As each apply worker needs a separate slot to retain deleted rows, the
requirement for slots will increase. The other possibility is to
maintain one slot by the launcher or some other central process that
traverses all subscriptions and remembers the ones marked with
retain_dead_rows (let's call this list retain_sub_list). Then, using
running_transactions, get the oldest running xact, and then get the
remote flush location from the other node (publisher node) and store
those as candidate values (candidate_xmin and
candidate_remote_wal_lsn) in the slot. We can probably reuse the
existing candidate variables of the slot. Next, we can check the remote
flush locations from all the origins corresponding to retain_sub_list,
and if all are ahead of candidate_remote_wal_lsn, we can update the
slot's xmin to candidate_xmin.

I think in the above idea we can add an optimization to combine the
requests for the remote WAL LSN from different subscriptions pointing
to the same node, to avoid sending multiple requests to the same node.
I am not sure if using pg_subscription.subconninfo is sufficient for
this; if not, we can probably leave this optimization.

If this idea is feasible then it would reduce the number of slots
required to retain the deleted rows, but the launcher needs to get the
remote WAL location corresponding to each publisher node. There are
two ways to achieve that: (a) the launcher requests one of the apply
workers corresponding to subscriptions pointing to the same publisher
node to get this information; (b) the launcher launches another worker
to get the remote WAL flush location.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information onto the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are the following:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
(i.e., the logical replication is lagging by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple that was deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e. the changes are
applied in the order of T2->T3->(vacuum)->T4), it converts into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ"
conflict, resulting in the table having a row with val=3 (assuming
latest_committs_win is the default resolver for this conflict).

Please confirm this scenario as I might be missing something.

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote flush locations from all the origins?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our
> > > approach for retaining dead tuples and have come up with another idea
> that can
> > > reliably detect conflicts without requiring users to choose a wise value for
> > > the vacuum_committs_age. This new idea could also reduce the
> performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that, the dead tuples are only useful to
> detect
> > > conflicts when applying *concurrent* transactions from remotes. Any
> subsequent
> > > UPDATE from a remote node after removing the dead tuples should have a
> later
> > > timestamp, meaning it's reasonable to detect an update_missing scenario
> and
> > > convert the UPDATE to an INSERT when applying it.
> > >
> > > To achieve above, we can create an additional replication slot on the
> > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > the dead tuples. The apply worker will advance the slot.xmin after
> confirming
> > > that all the concurrent transaction on publisher has been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn invalid because we don't need to retain the
WAL for decoding purposes.

> 
> > > 2) the apply worker send a new message to walsender to request the latest
> wal
> > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or
> > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > keepalive message to add a 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
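
For instance, a minimal sketch of what I mean (an assumption about how it
could be wired up, not existing code; REQUEST_INTERVAL is a hypothetical
knob): the apply loop would fire the new request on its own timer rather
than reusing the keepalive timing.

import time

REQUEST_INTERVAL = 10.0  # seconds, hypothetical

def maybe_request_remote_flush_lsn(last_request_at, send_request):
    # Called periodically from the apply loop; returns the time of the
    # most recent request so the caller can carry it forward.
    now = time.monotonic()
    if now - last_request_at >= REQUEST_INTERVAL:
        send_request()   # the new feedback message to the walsender
        return now
    return last_request_at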

> 
> > > 3) The apply worker can continue to apply changes. After applying all the
> WALs
> > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> concurrent
> > > transactions have been applied. It can be effective for both bidirectional
> and
> > > non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > control whether this feature is enabled. Each subscription intending to
> detect
> > > update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
> 
> Here, on Node-B, there might be a transition between LSN 7100 and 8000
> that might require the tuple that is deleted by T2.
> 
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> On Node-A, whether we detect "update_deleted" or "update_missing"
> still depends on when vacuum removes the tuple deleted by T2.

I think in this case, no matter whether we detect "update_deleted" or
"update_missing", the final data is the same, because T4's commit timestamp
should be later than T2's on node A. In the case of "update_deleted", we will
compare the commit timestamp of the deleted tuple's xmax with T4's timestamp,
and T4 should win, which means we will convert the update into an insert and
apply it. Even if the dead tuple has already been removed and "update_missing"
is detected, the update will still be converted into an insert and applied.
So, the result is the same.

> 
> If applying T4 raises an "update_missing" (i.e. the changes are
> applied in the order of T2->T3->(vacuum)->T4), it converts into an
> insert, resulting in the table having a row with value = 3.
> 
> If applying T4 raises an "update_deleted" (i.e. the changes are
> applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
> in the table having no row.
> 
> On the other hand, in this scenario, Node-B applies changes in the
> order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
> resulting in the table having a row with val=3 (assuming
> latest_committs_win is the default resolver for this confliction).
> 
> Please confirm this scenario as I might be missing something.

As explained above, I think the data can be consistent in this case as well.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 2:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > As each apply worker needs a separate slot to retain deleted rows, the
> > requirement for slots will increase. The other possibility is to
> > maintain one slot by launcher or some other central process that
> > traverses all subscriptions, remember the ones marked with
> > retain_dead_rows (let's call this list as retain_sub_list). Then using
> > running_transactions get the oldest running_xact, and then get the
> > remote flush location from the other node (publisher node) and store
> > those as candidate values (candidate_xmin and
> > candidate_remote_wal_lsn) in slot. We can probably reuse existing
> > candidate variables of the slot. Next, we can check the remote_flush
> > locations from all the origins corresponding in retain_sub_list and if
> > all are ahead of candidate_remote_wal_lsn, we can update the slot's
> > xmin to candidate_xmin.
>
> Does it mean that we use one candiate_remote_wal_lsn in a slot for all
> subscriptions (in retain_sub_list)? IIUC candiate_remote_wal_lsn is a
> LSN of one of publishers, so other publishers could have completely
> different LSNs. How do we compare the candidate_remote_wal_lsn to
> remote_flush locations from all the origins?
>

This should be an array/list with one element per publisher. We can
copy candidate_xmin to the actual xmin only when the
candidate_remote_wal_lsn's corresponding to all publishers have been
applied, i.e. their remote flush locations (present in origins) are
ahead. The advantages I see with this are (a) it reduces the number of
slots required to achieve the retention of deleted rows for conflict
detection, and (b) in some cases we can avoid sending messages to the
publisher, because with this we only need to send a message to a
particular publisher once rather than having all the apply workers
corresponding to the same publisher node send it.
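
A minimal model of that per-publisher variant (my sketch of the idea above,
not existing code): the shared slot's xmin advances only once the remote
flush location of every publisher in retain_sub_list has passed its own
remembered candidate LSN.

class SharedRetentionSlot:
    def __init__(self):
        self.xmin = None
        self.candidate_xmin = None
        self.candidate_remote_wal_lsn = {}   # publisher -> LSN to wait for

    def take_candidates(self, oldest_running_xid, publisher_flush_lsns):
        # publisher_flush_lsns: publisher -> flush LSN reported by that node.
        self.candidate_xmin = oldest_running_xid
        self.candidate_remote_wal_lsn = dict(publisher_flush_lsns)

    def maybe_advance(self, origin_remote_flush):
        # origin_remote_flush: publisher -> remote flush location already
        # applied locally (as tracked by the replication origins).
        if self.candidate_xmin is None:
            return
        if all(origin_remote_flush.get(pub, 0) >= lsn
               for pub, lsn in self.candidate_remote_wal_lsn.items()):
            self.xmin = self.candidate_xmin
            self.candidate_xmin = None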

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 9:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>

As per my understanding, one of the main reasons to keep it logical is
to allow syncing it to standbys (slotsync functionality). That is
required because, after promotion, the subscriptions replicated to the
standby could be enabled to make it a subscriber. If that is not
possible for any reason then we can consider making it a physical
slot as well.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>
> >
> > > > 2) the apply worker send a new message to walsender to request the latest
> > wal
> > > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > message or
> > > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > > keepalive message to add a 'request_wal_position' value)
> >
> > The apply worker sends a keepalive message when it didn't receive
> > anything more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information to the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent of
> wal_receiver_timeout.
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all the
> > WALs
> > > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > > slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > concurrent
> > > > transactions have been applied. It can be effective for both bidirectional
> > and
> > > > non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > > control whether this feature is enabled. Each subscription intending to
> > detect
> > > > update-delete conflicts should set retain_dead_tuples to true.
> > > >
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN; 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., the logical replication is delaying for 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transition between LSN 7100 and 8000
> > that might require the tuple that is deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
>
> I think in this case, no matter we detect "update_delete" or "update_missing",
> the final data is the same. Because T4's commit timestamp should be later than
> T2 on node A, so in the case of "update_deleted", it will compare the commit
> timestamp of the deleted tuple's xmax with T4's timestamp, and T4 should win,
> which means we will convert the update into insert and apply. Even if the
> deleted tuple is deleted and "update_missing" is detected, the update will
> still be converted into insert and applied. So, the result is the same.

The "latest_timestamp_wins" is the default resolution method for
"update_deleted"? When I checked the wiki page[1], the "skip" was the
default solution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > 8000 that might require the tuple that is deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter we detect "update_delete" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than
> > T2 on node A, so in the case of "update_deleted", it will compare the
> > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > T4 should win, which means we will convert the update into insert and
> > apply. Even if the deleted tuple is deleted and "update_missing" is
> > detected, the update will still be converted into insert and applied. So, the
> result is the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some updates.

I think using 'skip' as the default for update_deleted could easily cause data
divergence when the dead tuple was deleted by an old transaction while the
UPDATE has a newer timestamp, as in the case you mentioned. It's necessary to
follow the last-update-wins strategy when the incoming update has a later
timestamp, which means converting the UPDATE to an INSERT.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > I'm still studying this idea but let me confirm the following scenario.
> > > >
> > > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > > and XIDs and commit LSNs of T2 and T3 are the following:
> > > >
> > > > Node A
> > > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> > commit-LSN:1000
> > > >
> > > > Node B
> > > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > > commit-LSN:5000
> > > >
> > > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > > latest flush WAL position of Node-A and Node-B are following:
> > > >
> > > > Node A
> > > >   current XID: 300
> > > >   latest flush LSN; 3000
> > > >
> > > > Node B
> > > >   current XID: 700
> > > >   latest flush LSN: 7000
> > > >
> > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > > (i.e., the logical replication is delaying for 5 min).
> > > >
> > > > Consider the following scenario:
> > > >
> > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > > gets 301 (set as candidate_xmin).
> > > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> > 8000.
> > > > 4. The apply worker on Node-A continues applying changes, and
> > > > applies the transactions up to remote (commit) LSN 7100.
> > > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > > 301 (candidate_xmin).
> > > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > > deleted by T2.
> > > >
> > > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > > 8000 that might require the tuple that is deleted by T2.
> > > >
> > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > > still depends on when vacuum removes the tuple deleted by T2.
> > >
> > > I think in this case, no matter we detect "update_delete" or
> > > "update_missing", the final data is the same. Because T4's commit
> > > timestamp should be later than
> > > T2 on node A, so in the case of "update_deleted", it will compare the
> > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > > T4 should win, which means we will convert the update into insert and
> > > apply. Even if the deleted tuple is deleted and "update_missing" is
> > > detected, the update will still be converted into insert and applied. So, the
> > result is the same.
> >
> > The "latest_timestamp_wins" is the default resolution method for
> > "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> > solution method for that.
>
> Right, I think the wiki needs some update.
>
> I think using 'skip' as default for update_delete could easily cause data
> divergence when the dead tuple is deleted by an old transaction while the
> UPDATE has a newer timestamp like the case you mentioned. It's necessary to
> follow the last update win strategy when the incoming update has later
> timestamp, which is to convert update to insert.

Right. If "latest_timestamp_wins" is the default resolution for
"update_deleted", I think your idea works fine unless I'm missing
corner cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.
> >
> > The process of advancing the slot.xmin could be:
> >
> > 1) the apply worker call GetRunningTransactionData() to get the
> > 'oldestRunningXid' and consider this as 'candidate_xmin'.
> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)
> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >
>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Yeah, I think such an idea to reduce the number of required slots
would be necessary.

>
> I think in the above idea we can an optimization to combine the
> request for remote wal LSN from different subscriptions pointing to
> the same node to avoid sending multiple requests to the same node. I
> am not sure if using pg_subscription.subconninfo is sufficient for
> this, if not we can probably leave this optimization.
>
> If this idea is feasible then it would reduce the number of slots
> required to retain the deleted rows but the launcher needs to get the
> remote wal location corresponding to each publisher node. There are
> two ways to achieve that (a) launcher requests one of the apply
> workers corresponding to subscriptions pointing to the same publisher
> node to get this information; (b) launcher launches another worker to
> get the remote wal flush location.

I think the remote WAL flush location is requested over the replication
protocol. Therefore, if a new worker were responsible for asking for the WAL
flush location from multiple publishers (like idea (b)), the
corresponding process would need to be launched on the publisher side and
logical replication would also need to start on each connection. I
think it would be better to get the remote WAL flush location using
the existing logical replication connection (i.e., between the logical
walsender and the apply worker), and advertise the locations in
shared memory. Then, the central process that holds the slot to retain
the deleted row versions traverses them and increases slot.xmin if
possible.

The cost of requesting the remote WAL flush location would not be huge
if we don't ask for it very frequently. So we can probably start by having
each apply worker (in the retain_sub_list) ask for the remote WAL flush
location, and leave the optimization of avoiding duplicate requests to
the same publisher for later.
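
To make the above a bit more concrete, here is a rough sketch of what the
central process could do once each apply worker advertises, in shared memory,
the candidate xmin it computed, the remote flush LSN it received, and its own
local flush progress. All struct and field names below are invented for
illustration; this is not the patch's code.

```c
#include "postgres.h"

#include "access/transam.h"
#include "access/xlogdefs.h"
#include "replication/slot.h"
#include "storage/spin.h"

/* Invented per-worker state advertised in shared memory (illustration only) */
typedef struct RetainWorkerState
{
	bool		valid;					/* has this worker advertised values? */
	TransactionId candidate_xmin;		/* oldest running XID when it asked */
	XLogRecPtr	candidate_remote_lsn;	/* publisher flush LSN it received */
	XLogRecPtr	local_flush_lsn;		/* what the worker has applied/flushed */
} RetainWorkerState;

/*
 * Advance the conflict-detection slot's xmin once every worker in
 * retain_sub_list has applied and flushed everything up to the remote flush
 * LSN it captured after computing its candidate xmin.
 */
static void
advance_conflict_slot_xmin(RetainWorkerState *workers, int nworkers)
{
	TransactionId newxmin = InvalidTransactionId;

	for (int i = 0; i < nworkers; i++)
	{
		RetainWorkerState *w = &workers[i];

		/* some worker has not caught up yet; keep the current xmin */
		if (!w->valid || w->local_flush_lsn < w->candidate_remote_lsn)
			return;

		/* keep the oldest candidate across all workers */
		if (!TransactionIdIsValid(newxmin) ||
			TransactionIdPrecedes(w->candidate_xmin, newxmin))
			newxmin = w->candidate_xmin;
	}

	if (!TransactionIdIsValid(newxmin))
		return;

	/* only ever move the slot's xmin forward */
	SpinLockAcquire(&MyReplicationSlot->mutex);
	if (!TransactionIdIsValid(MyReplicationSlot->data.xmin) ||
		TransactionIdPrecedes(MyReplicationSlot->data.xmin, newxmin))
		MyReplicationSlot->data.xmin = newxmin;
	SpinLockRelease(&MyReplicationSlot->mutex);

	ReplicationSlotsComputeRequiredXmin(false);
}
```

The patch's details will of course differ (e.g. effective_xmin handling), but
the gist is that the slot's xmin only moves once the slowest worker has caught
up to the remote flush position it captured.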

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Sep 30, 2024 at 12:02 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 25, 2024 2:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I think the remote wal flush location is asked using a replication protocol.
> > Therefore, if a new worker is responsible for asking wal flush location from
> > multiple publishers (like the idea (b)), the corresponding process would need
> > to be launched on publisher sides and logical replication would also need to
> > start on each connection. I think it would be better to get the remote wal flush
> > location using the existing logical replication connection (i.e., between the
> > logical wal sender and the apply worker), and advertise the locations on the
> > shared memory. Then, the central process who holds the slot to retain the
> > deleted row versions traverses them and increases slot.xmin if possible.
> >
> > The cost of requesting the remote wal flush location would not be huge if we
> > don't ask it very frequently. So probably we can start by having each apply
> > worker (in the retain_sub_list) ask the remote wal flush location and can leave
> > the optimization of avoiding sending the request for the same publisher.
>
> Agreed. Here is the POC patch set based on this idea.
>
> The implementation is as follows:
>
> A subscription option is added to allow users to specify whether dead
> tuples on the subscriber, which are useful for detecting update_deleted
> conflicts, should be retained. The default setting is false. If set to true,
> the detection of update_deleted will be enabled,
>

I find the option name retain_dead_tuples a bit misleading because one
can't make out its purpose from the name alone. It would be better to name
it detect_update_deleted or something along those lines.

> and an additional replication
> slot named pg_conflict_detection will be created on the subscriber to prevent
> dead tuples from being removed. Note that if multiple subscriptions on one node
> enable this option, only one replication slot will be created.
>

In general, we should have done this by default, but as detecting
update_deleted-type conflicts has some overhead in terms of retaining
dead tuples for longer, having an option seems reasonable. But I
suggest keeping this as a separate, last patch. If we can make the core
idea work by default, then we can enable it via an option at the end.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch! Here are my comments.
They do not track which file contains each change, and the ordering may
be random.

1.
```
+       and <link
linkend="sql-createsubscription-params-with-detect-update-deleted"><literal>detect_conflict</literal></link>
+       are enabled.
```
"detect_conflict" still exists, it should be "detect_update_deleted".

2. maybe_advance_nonremovable_xid
```
+        /* Send a wal position request message to the server */
+        walrcv_send(LogRepWorkerWalRcvConn, "x", sizeof(uint8))
```
I think the character was used for PoC purposes, so it's about time we change it.
How about:

- 'W', because it requests the WAL location, or
- 'S', because it is associated with the 's' message.

3. maybe_advance_nonremovable_xid
```
+        if (!AllTablesyncsReady())
+            return;
```
If we do not update oldest_nonremovable_xid during the sync, why do we send
the status message? I feel we can return in any phase if !AllTablesyncsReady().

4. advance_conflict_slot_xmin
```
+            ReplicationSlotCreate(CONFLICT_DETECTION_SLOT, false,
+                                  RS_PERSISTENT, false, false, false);
```
Hmm. You said the slot would be logical, but now it is physical. Which is correct?

5. advance_conflict_slot_xmin
```
+            xmin_horizon = GetOldestSafeDecodingTransactionId(true);
```
Since the slot won't do logical decoding, we do not have to use the oldest
safe-decoding XID. I feel it is OK to use the latest XID.

6. advance_conflict_slot_xmin
```
+    /* No need to update xmin if the slot has been invalidated */
+    if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
```
I feel the slot won't be invalidated. According to
InvalidatePossiblyObsoleteSlot(), the physical slot cannot be invalidated if it
has invalid restart_lsn.

7. ApplyLauncherMain
```
+            retain_dead_tuples |= sub->detectupdatedeleted;
```
Can you tell me why it must be updated even if the sub is disabled?

8. ApplyLauncherMain

If a subscription with detect_update_deleted = true exists but wal_receiver_status_interval = 0,
the slot won't be advanced and dead tuples are retained forever... is that right? Can we
avoid that somehow?

9. FindMostRecentlyDeletedTupleInfo

It looks to me like the scan does not use an index even if one exists, but I feel it should.
Am I missing something, or is there a reason?

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716E0A283D1B66954CDF5A694682%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, October 11, 2024 4:35 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> Attach the V4 patch set which addressed above comments.
> 

While reviewing the patch, I noticed that the current design cannot work in
a non-bidirectional cluster (publisher -> subscriber) when the publisher is
also a physical standby. (We recently added support for logical decoding on a
physical standby, so it's possible to use a physical standby as a logical
publisher.)

The cluster looks like:

    physical primary -> physical standby (also publisher) -> logical subscriber (detect_update_deleted)

The issue arises because the physical standby (acting as the publisher) might
lag behind its primary. As a result, the logical walsender on the standby might
not be able to get the latest WAL position when requested by the logical
subscriber. We can only get the WAL replay position, but there may be more WAL
still being replicated from the primary, and that WAL could carry older commit
timestamps. (Note that a transaction has the same commit timestamp on both the
primary and the standby.)

So, the logical walsender might send an outdated WAL position as feedback.
This, in turn, can cause the replication slot on the subscriber to advance
prematurely, leading to the unwanted removal of dead tuples. As a result, the
apply worker may fail to correctly detect update-delete conflicts.

We thought of a few options to fix this:

1) Add a Time-Based Subscription Option:

We could add a new time-based option for subscriptions, such as
retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
candidate XID, the launcher will wait for the specified time before advancing
the slot.xmin. This ensures that deleted tuples are retained for at least the
duration defined by this new option.

This approach is designed to simulate the functionality of the GUC
(vacuum_committs_age), but with a simpler implementation that does not impact
vacuum performance. We can maintain both this time-based method and the current
automatic method. If a user does not specify the time-based option, we will
continue using the existing approach to retain dead tuples until all concurrent
transactions from the remote node have been applied.

2) Modification to the Logical Walsender

In the logical walsender, which runs on the physical standby, we can open an
additional connection to the physical primary to obtain the latest WAL
position. This position would then be sent as feedback to the logical
subscriber.

A potential concern is that this requires the walsender to use the walreceiver
API, which may seem a bit unnatural. And, it starts an additional walsender
process on the primary, as the logical walsender on the physical standby will
need to communicate with this walsender to fetch the WAL position.

3) Documentation of Restrictions

As an alternative, we could simply document the restriction that detecting
update_deleted is not supported if the publisher is also acting as a physical
standby.


Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> While reviewing the patch, I noticed that the current design could not work in
> a non-bidirectional cluster (publisher -> subscriber) when the publisher is
> also a physical standby. (We supported logical decoding on a physical standby
> recently, so it's possible to take a physical standby as a logical publisher).
>
> The cluster looks like:
>
>         physical primary -> physical standby (also publisher) -> logical subscriber (detect_update_deleted)
>
> The issue arises because the physical standby (acting as the publisher) might
> lag behind its primary. As a result, the logical walsender on the standby might
> not be able to get the latest WAL position when requested by the logical
> subscriber. We can only get the WAL replay position but there may be more WALs
> that are being replicated from the primary and those WALs could have older
> commit timestamp. (Note that transactions on both primary and standby have
> the same commit timestamp).
>
> So, the logical walsender might send an outdated WAL position as feedback.
> This, in turn, can cause the replication slot on the subscriber to advance
> prematurely, leading to the unwanted removal of dead tuples. As a result, the
> apply worker may fail to correctly detect update-delete conflicts.
>
> We thought of few options to fix this:
>
> 1) Add a Time-Based Subscription Option:
>
> We could add a new time-based option for subscriptions, such as
> retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
> candidate XID, the launcher will wait for the specified time before advancing
> the slot.xmin. This ensures that deleted tuples are retained for at least the
> duration defined by this new option.
>
> This approach is designed to simulate the functionality of the GUC
> (vacuum_committs_age), but with a simpler implementation that does not impact
> vacuum performance. We can maintain both this time-based method and the current
> automatic method. If a user does not specify the time-based option, we will
> continue using the existing approach to retain dead tuples until all concurrent
> transactions from the remote node have been applied.
>
> 2) Modification to the Logical Walsender
>
> On the logical walsender, which is as a physical standby, we can build an
> additional connection to the physical primary to obtain the latest WAL
> position. This position will then be sent as feedback to the logical
> subscriber.
>
> A potential concern is that this requires the walsender to use the walreceiver
> API, which may seem a bit unnatural. And, it starts an additional walsender
> process on the primary, as the logical walsender on the physical standby will
> need to communicate with this walsender to fetch the WAL position.
>

This idea is worth considering, but I think it may not be a good
approach if the physical standby is cascading. We would need to restrict
update_deleted conflict detection if the standby is cascading, right?

The other approach is that the subscriber sends its current_timestamp, and we
somehow check whether the physical standby has applied the commit_lsn up to
that commit_ts; if so, it can send that WAL position to the subscriber,
otherwise it waits for it to be applied. If we do this then we don't need to
add a restriction for cascaded physical standbys. I think the subscriber
anyway needs to wait for such an LSN to be applied on the standby before
advancing the xmin, even if we get it from the primary. This is because the
subscriber can only apply and flush the WAL once it has been applied on the
standby. Am I missing something?

This approach has the disadvantage that we rely on the clocks of both nodes
being synced, which we anyway need for conflict resolution as discussed in
the thread [1]. We also need to consider the Commit Timestamp and LSN
inversion issue discussed in another thread [2] if we want to pursue this
approach, because we may miss an LSN that has a prior timestamp.

> 3) Documentation of Restrictions
>
> As an alternative, we could simply document the restriction that detecting
> update_delete is not supported if the publisher is also acting as a physical
> standby.
>

If we don't want to go for something along the lines of the approach
mentioned in (2), then I think we can do a combination of (1) and (3),
where we error out if the user has not provided retain_dead_tuples
and the publisher is a physical standby.

[1] - https://www.postgresql.org/message-id/CABdArM4%3D152B9PoyF4kggQ4LniYtjBCdUjL%3DqBwD-jcogP2BPQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAJpy0uBxEJnabEp3JS%3Dn9X19Vx2ZK3k5AR7N0h-cSMtOwYV3fA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > We thought of few options to fix this:
> >
> > 1) Add a Time-Based Subscription Option:
> >
> > We could add a new time-based option for subscriptions, such as
> > retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
> > candidate XID, the launcher will wait for the specified time before advancing
> > the slot.xmin. This ensures that deleted tuples are retained for at least the
> > duration defined by this new option.
> >
> > This approach is designed to simulate the functionality of the GUC
> > (vacuum_committs_age), but with a simpler implementation that does not impact
> > vacuum performance. We can maintain both this time-based method and the current
> > automatic method. If a user does not specify the time-based option, we will
> > continue using the existing approach to retain dead tuples until all concurrent
> > transactions from the remote node have been applied.
> >
> > 2) Modification to the Logical Walsender
> >
> > On the logical walsender, which is as a physical standby, we can build an
> > additional connection to the physical primary to obtain the latest WAL
> > position. This position will then be sent as feedback to the logical
> > subscriber.
> >
> > A potential concern is that this requires the walsender to use the walreceiver
> > API, which may seem a bit unnatural. And, it starts an additional walsender
> > process on the primary, as the logical walsender on the physical standby will
> > need to communicate with this walsender to fetch the WAL position.
> >
>
> This idea is worth considering, but I think it may not be a good
> approach if the physical standby is cascading. We need to restrict the
> update_delete conflict detection, if the standby is cascading, right?
>
> The other approach is that we send current_timestamp from the
> subscriber and somehow check if the physical standby has applied
> commit_lsn up to that commit_ts, if so, it can send that WAL position
> to the subscriber, otherwise, wait for it to be applied. If we do this
> then we don't need to add a restriction for cascaded physical standby.
> I think the subscriber anyway needs to wait for such an LSN to be
> applied on standby before advancing the xmin even if we get it from
> the primary. This is because the subscriber can only be able to apply
> and flush the WAL once it is applied on the standby. Am, I missing
> something?
>
> This approach has a disadvantage that we are relying on clocks to be
> synced on both nodes which we anyway need for conflict resolution as
> discussed in the thread [1]. We also need to consider the Commit
> Timestamp and LSN inversion issue as discussed in another thread [2]
> if we want to pursue this approach because we may miss an LSN that has
> a prior timestamp.
>

The problem due to Commit Timestamp and LSN inversion is that the
standby may not consider the WAL LSN from an earlier timestamp, which
could lead to the removal of required dead rows on the subscriber.

The other problem due to Commit Timestamp and LSN inversion, pointed out
offlist by Hou-San, is that we could miss sending the WAL LSN that the
subscriber requires in order to retain dead rows for update_deleted conflict
detection. For example, consider the following two-node, bidirectional
setup:

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1; ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

Say subscription is created with retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL
flush location on Node B. By that time, T3 should have finished, so the
xmin will be advanced only after applying the WAL that is later than T3.
So, the dead tuple will not be removed before applying T3, which means
update_deleted can be detected.

However, as there is a gap between when we acquire the commit timestamp and
when we acquire the commit LSN, it is possible that T3 has not yet flushed
its WAL even though it committed earlier than T2. If this happens, then we
won't be able to detect the update_deleted conflict reliably.

Now, one simpler idea is to acquire the commit timestamp and reserve the
WAL (LSN) under the same spinlock in ReserveXLogInsertLocation(), but that
could be costly, as discussed in the thread [1]. The other, more localized
solution is to acquire the timestamp after reserving the commit WAL LSN,
outside the lock, which would solve this particular problem.

[1] - https://www.postgresql.org/message-id/CAJpy0uBxEJnabEp3JS%3Dn9X19Vx2ZK3k5AR7N0h-cSMtOwYV3fA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Oct 11, 2024 at 2:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attach the V4 patch set which addressed above comments.
>

A few minor comments:
1.
+ * Retaining the dead tuples for this period is sufficient because any
+ * subsequent transaction from the publisher will have a later timestamp.
+ * Therefore, it is acceptable if dead tuples are removed by vacuum and an
+ * update_missing conflict is detected, as the correct resolution for the
+ * last-update-wins strategy in this case is to convert the UPDATE to an INSERT
+ * and apply it anyway.
+ *
+ * The 'remote_wal_pos' will be reset after sending a new request to walsender.
+ */
+static void
+maybe_advance_nonremovable_xid(XLogRecPtr *remote_wal_pos,
+    DeadTupleRetainPhase *phase)

We should cover the key point of retaining dead tuples, which is to
avoid wrongly converting updates to inserts (by treating the conflict as
update_missing), in the comments above and also in the commit message.

2. In maybe_advance_nonremovable_xid(), all three phases are handled by
separate if blocks, but as per my understanding the phase value will be
unique within one call to the function. So, shouldn't this be handled
with else if?
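
For reference, the shape being asked about would look roughly like the sketch
below (phase names and the function signature as in the patch; the bodies are
elided). Whether this works depends on whether the current code intentionally
falls through from one phase to the next within a single call.

```c
/* Sketch of the else-if structure being suggested; not the patch's code. */
static void
maybe_advance_nonremovable_xid(XLogRecPtr *remote_wal_pos,
							   DeadTupleRetainPhase *phase)
{
	if (*phase == DTR_REQUEST_WALSENDER_WAL_POS)
	{
		/* compute candidate xmin and request the remote WAL flush position */
	}
	else if (*phase == DTR_WAIT_FOR_WALSENDER_WAL_POS)
	{
		/* wait until the primary status update message arrives */
	}
	else if (*phase == DTR_WAIT_FOR_LOCAL_FLUSH)
	{
		/* advance oldest_nonremovable_xid once the local flush catches up */
	}
}
```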

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Peter Smith
Date:
Hi Hou-san, here are my review comments for patch v5-0001.

======
General

1.
Sometimes in the commit message and code comments the patch refers to
"transaction id" and other times to "transaction ID". The patch should
use the same wording everywhere.

======
Commit message.

2.
"While for concurrent remote transactions with earlier timestamps,..."

I think this means:
"But, for concurrent remote transactions with earlier timestamps than
the DELETE,..."

Maybe expressed this way is clearer.

~~~

3.
... the resolution would be to convert the update to an insert.

Change this to uppercase like done elsewhere:
"... the resolution would be to convert the UPDATE to an INSERT.

======
doc/src/sgml/protocol.sgml

4.
            +       <varlistentry
id="protocol-replication-primary-wal-status-update">
+        <term>Primary WAL status update (B)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('s')</term>
+           <listitem>
+            <para>
+             Identifies the message as a primary WAL status update.
+            </para>
+           </listitem>
+          </varlistentry>

I felt it would be better if this is described as just a "Primary
status update" instead of a "Primary WAL status update". Doing this
makes it more flexible in case there is a future requirement to put
more status values in here which may not be strictly WAL related.

~~~

5.
+       <varlistentry id="protocol-replication-standby-wal-status-request">
+        <term>Standby WAL status request (F)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('W')</term>
+           <listitem>
+            <para>
+             Identifies the message as a request for the WAL status
on the primary.
+            </para>
+           </listitem>
+          </varlistentry>
+         </variablelist>
+        </listitem>
+       </varlistentry>

5a.
Ditto the previous comment #4. Perhaps you should just call this a
"Primary status request".

~

5b.
Also, the letter 'W' seems to have been chosen because of WAL. But it might be
more flexible if those identifiers were more generic.

e.g.
's' = the request for primary status update
'S' = the primary status update

======
src/backend/replication/logical/worker.c

6.
+ else if (c == 's')
+ {
+ TimestampTz timestamp;
+
+ remote_lsn = pq_getmsgint64(&s);
+ timestamp = pq_getmsgint64(&s);
+
+ maybe_advance_nonremovable_xid(&remote_lsn, &phase);
+ UpdateWorkerStats(last_received, timestamp, false);
+ }

Since there's no equivalent #define or enum value, IMO it is too hard
to know the intent of this code without already knowing the meaning of
the magic letter 's'. At least there could be a comment here to
explain that this is for handling an incoming "Primary status update"
message.
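
To illustrate (the macro name here is invented), the quoted block could become:

```c
/* hypothetical name for the 'Primary status update' message byte */
#define PRIMARY_STATUS_UPDATE_MSG 's'

	else if (c == PRIMARY_STATUS_UPDATE_MSG)
	{
		/* Handle an incoming "Primary status update" message. */
		TimestampTz timestamp;

		remote_lsn = pq_getmsgint64(&s);
		timestamp = pq_getmsgint64(&s);

		maybe_advance_nonremovable_xid(&remote_lsn, &phase);
		UpdateWorkerStats(last_received, timestamp, false);
	}
```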

~~~

maybe_advance_nonremovable_xid:

7.
+ * The oldest_nonremovable_xid is maintained in shared memory to prevent dead
+ * rows from being removed prematurely when the apply worker still needs them
+ * to detect update-delete conflicts.

/update-delete/update_deleted/

~

8.
+ * applied and flushed locally. The process involves:
+ *
+ * DTR_REQUEST_WALSENDER_WAL_POS - Call GetOldestActiveTransactionId() to get
+ * the candidate xmin and send a message to request the remote WAL position
+ * from the walsender.
+ *
+ * DTR_WAIT_FOR_WALSENDER_WAL_POS - Wait for receiving the WAL position from
+ * the walsender.
+ *
+ * DTR_WAIT_FOR_LOCAL_FLUSH - Advance the non-removable transaction ID if the
+ * current flush location has reached or surpassed the received WAL position.

8a.
This part would be easier to read if those 3 phases were indented from
the rest of this function comment.

~

8b.
/Wait for receiving/Wait to receive/

~

9.
+ * Retaining the dead tuples for this period is sufficient for ensuring
+ * eventual consistency using last-update-wins strategy, which involves
+ * converting an UPDATE to an INSERT and applying it if remote transactions

The commit message referred to a "latest_timestamp_wins". I suppose
that is the same as what this function comment called
"last-update-wins". The patch should use consistent terminology.

It would be better if the commit message and (parts of) this function
comment were just cut/pasted to be identical. Currently, they seem to
be saying the same thing, but using slightly different wording.

~

10.
+ static TimestampTz xid_advance_attemp_time = 0;
+ static FullTransactionId candidate_xid;

typo in var name - "attemp"

~

11.
+ *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+
+ /*
+ * Do not return here because the apply worker might have already
+ * applied all changes up to remote_wal_pos. Proceeding to the next
+ * phase to check if we can immediately advance the transaction ID.
+ */

11a.
IMO this comment should be above the *phase assignment.

11b.
/Proceeding to the next phase to check.../Instead, proceed to the next
phase to check.../

~

12.
+ /*
+ * Advance the non-removable transaction id if the remote wal position
+ * has been received, and all transactions up to that position on the
+ * publisher have been applied and flushed locally.
+ */

Some minor re-wording would help clarify this comment.

SUGGESTION
Reaching here means the remote wal position has been received, and all
transactions up to that position on the
publisher have been applied and flushed locally. So, now we can
advance the non-removable transaction id.

~

13.
+ *phase = DTR_REQUEST_WALSENDER_WAL_POS;
+
+ /*
+ * Do not return here as enough time might have passed since the last
+ * wal position request. Proceeding to the next phase to determine if
+ * we can send the next request.
+ */

13a.
IMO this comment should be above the *phase assignment.

~

13b.
This comment should have the same wording here as in the previous
if-block (see #11b).

/Proceeding to the next phase to determine.../Instead, proceed to the
next phase to check.../

~

14.
+ FullTransactionId next_full_xix;
+ FullTransactionId full_xid;

You probably mean 'next_full_xid' (not xix)

~

15.
+ /*
+ * Exit early if the user has disabled sending messages to the
+ * publisher.
+ */
+ if (wal_receiver_status_interval <= 0)
+ return;

What are the implications of this early exit? If the update request is
not possible, then I guess the update status is never received, which
I suppose means none of this update_deleted logic is possible. If that
is correct, will there be some documented warning/caution about the
conflict-handling implications of disabling that GUC?

======
src/backend/replication/walsender.c

16.
+/*
+ * Process the standby message requesting the latest WAL write position.
+ */
+static void
+ProcessStandbyWalPosRequestMessage(void)

Ideally, this function comment should refer to the message we are creating
by the same name it is given in the documentation.
For example, something like:

"Process the request for a primary status update message."

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
> Here is the V5 patch set which addressed above comments.
>
Here are a couple of comments on v5 patch-set -

1) In FindMostRecentlyDeletedTupleInfo(),

+ /* Try to find the tuple */
+ while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+ {
+ Assert(tuples_equal(scanslot, searchslot, eq));
+ update_recent_dead_tuple_info(scanslot, oldestXmin, delete_xid,
+   delete_time, delete_origin);
+ }

In my tests, I found that the above Assert() triggers during
unidirectional replication of an update on a table. When doing the
replica identity index scan, we can only ensure that the indexed column
values match, but the current Assert() assumes that all column values
match, which seems wrong (see the sketch after these comments).

2) Since update_deleted requires both 'track_commit_timestamp' and the
'detect_update_deleted' to be enabled, should we raise an error in the
CREATE and ALTER subscription commands when track_commit_timestamp=OFF
but the user specifies detect_update_deleted=true?
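
Regarding (1), one possible adjustment (just a sketch; the helper name is
hypothetical) would be to assert equality only on the replica identity key
columns, since that is all the index scan can guarantee:

```c
	/* Try to find the tuple */
	while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
	{
		/*
		 * Hypothetical replacement for the failing Assert: compare only the
		 * replica identity key columns; the non-key columns of a dead tuple
		 * may legitimately differ from the search slot.
		 */
		Assert(key_columns_equal(scanslot, searchslot));

		update_recent_dead_tuple_info(scanslot, oldestXmin, delete_xid,
									  delete_time, delete_origin);
	}
```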



Re: Conflict detection for update_deleted in logical replication

From
Michail Nikolaev
Date:
Hello, Hayato!

> Thanks for updating the patch! While reviewing yours, I found a corner case that
> a recently deleted tuple cannot be detected when index scan is chosen.
> This can happen when indices are re-built during the replication.
> Unfortunately, I don't have any solutions for it.

I just happened to see your message, so I could be wrong or missing context; sorry in advance.

But as far as I know, to solve this problem we need to wait for slot.xmin during [0] (WaitForOlderSnapshots) while creating an index concurrently.


Best regards,
Mikhail.

Re: Conflict detection for update_deleted in logical replication

From
Peter Smith
Date:
Hi Hou-San, here are a few trivial comments remaining for patch v6-0001.

======
General.

1.
There are multiple comments in this patch mentioning 'wal' which
probably should say 'WAL' (uppercase).

~~~

2.
There are multiple comments in this patch missing periods (.)

======
doc/src/sgml/protocol.sgml

3.
+        <term>Primary status update (B)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('s')</term>

Currently, there are identifiers 's' for the "Primary status update"
message, and 'S' for the "Primary status request" message.

As mentioned in the previous review ([1] #5b) I preferred it to be the
other way around:
'S' = status from primary
's' = request status from primary

Of course, it doesn't make any difference, but "S" seems more
important than "s", so therefore "S" being the main msg and coming
from the *primary* seemed more natural to me.

~~~

4.
+       <varlistentry id="protocol-replication-standby-wal-status-request">
+        <term>Primary status request (F)</term>

Is it better to call this slightly differently to emphasise this is
only the request?

/Primary status request/Request primary status update/

======
src/backend/replication/logical/worker.c

5.
+ * Retaining the dead tuples for this period is sufficient for ensuring
+ * eventual consistency using last-update-wins strategy, as dead tuples are
+ * useful for detecting conflicts only during the application of concurrent

As mentioned in review [1] #9, this is still called "last-update-wins
strategy" here in the comment, but was called the "last update win
strategy" strategy in the commit message. Those terms should be the
same -- e.g. the 'last-update-wins' strategy.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPs3sgXh2%3DrHDaqjU%3Dp28CK5rCgCLJZgPByc6Tr7_P2imw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for giving comments!

> But as far as I know, to solve this problem, we need to wait for slot.xmin during the [0]
> (WaitForOlderSnapshots) while creating index concurrently.

WaitForOlderSnapshots() waits for other transactions that can access older tuples
than the specified (= current) transaction, right? I think it does not solve our issue.

Assuming the same workloads as [1] are executed, slot.xmin on node2 is arbitrarily
older than the noted SQL, and WaitForOlderSnapshots(slot.xmin) is added in
ReindexRelationConcurrently(). In this case, no transaction older than slot.xmin
exists at step 5, so the REINDEX will finish immediately. Then the worker
receives changes at step 7, so it is problematic if the worker uses the reindexed index.

From another point of view... this approach would have to modify the REINDEX
code, but we should avoid modifying other components as much as possible. This
feature is related to replication, so the changes should stay within the
replication subdirectory.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB5692541820BCC365C69442FFF54F2%40TYAPR01MB5692.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

From
Michail Nikolaev
Date:
Hello Hayato,

> WaitForOlderSnapshots() waits other transactions which can access older tuples
> than the specified (=current) transaction, right? I think it does not solve our issue.

Oh, I actually described the idea a bit incorrectly. The goal isn’t simply to call WaitForOlderSnapshots(slot.xmin);
rather, it’s to ensure that we wait for slot.xmin in the same way we wait for regular snapshots (xmin).
The reason WaitForOlderSnapshots is used in ReindexConcurrently and DefineIndex is to guarantee that any transaction
needing to view rows not included in the index has completed before the index is marked as valid.
The same logic should apply here: we need to wait for the xmin of the slot used in conflict detection as well.

> From another point of view... this approach must fix REINDEX code, but we should
> not modify other component of codes as much as possible. This feature is related
> with the replication so that changes should be closed within the replication subdir.

One possible solution here would be to register a snapshot with slot.xmin for the worker backend.
This way, WaitForOlderSnapshots will account for it.

By the way, WaitForOlderSnapshots is also used in partitioning and other areas for similar reasons,
so these might be good places to check for any related issues.

Best regards,
Mikhail,

RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for describing more detail!

> Oh, I actually described the idea a bit incorrectly. The goal isn’t simply to call WaitForOlderSnapshots(slot.xmin);
> rather, it’s to ensure that we wait for slot.xmin in the same way we wait for regular snapshots (xmin).
> ...
> One possible solution here would be to register a snapshot with slot.xmin for the worker backend.
> This way, WaitForOlderSnapshots will account for it.

Note that apply workers can stop for several reasons (e.g., the subscription is
disabled, an error occurs, a deadlock...). In that case, the snapshot cannot be
registered by the worker, and the index can be rebuilt during that period.

If we do not assume the existence of workers, we must somehow check slot.xmin
directly and wait until it has advanced past the REINDEXing transaction. I
still think this is risky and a separate topic.

Anyway, this topic introduces huge complexity and is not mandatory for update_deleted
detection. We can work on it in later versions based on the needs.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch! Here are my comments.

01. CreateSubscription
```
+    if (opts.detectupdatedeleted && !track_commit_timestamp)
+        ereport(ERROR,
+                errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("detecting update_deleted conflicts requires \"%s\" to be enabled",
+                       "track_commit_timestamp"));
```

I don't think this guard is sufficient. I found two cases:

* Creates a subscription with detect_update_deleted = false while track_commit_timestamp = false,
  then alters detect_update_deleted to true.
* Creates a subscription with detect_update_deleted = true and track_commit_timestamp = true,
  then sets track_commit_timestamp to false and restarts the instance.

Based on that, how about detecting the inconsistency in the apply worker? It
would check the parameters and error out when it starts or re-reads the
catalog. If we want to detect this in SQL commands, it can be done in
parse_subscription_options().
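
For instance, a rough sketch of such a worker-side check (errcode and message
wording chosen here only for illustration):

```c
	/*
	 * Sketch: re-verify the combination when the apply worker starts or
	 * re-reads pg_subscription, so that ALTER SUBSCRIPTION or a GUC change
	 * plus restart cannot leave the two settings inconsistent.
	 */
	if (MySubscription->detectupdatedeleted && !track_commit_timestamp)
		ereport(ERROR,
				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				errmsg("cannot detect update_deleted conflicts for subscription \"%s\" because \"%s\" is disabled",
					   MySubscription->name, "track_commit_timestamp"));
```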

02. AlterSubscription()
```
+                    ApplyLauncherWakeupAtCommit();
```

The reason the launcher should wake up here is different from the other places.
Can we add a comment that this is needed to track/untrack the xmin?

03. build_index_column_bitmap()
```
+    for (int i = 0; i < indexinfo->ii_NumIndexAttrs; i++)
+    {
+        int         keycol = indexinfo->ii_IndexAttrNumbers[i];
+
+        index_bitmap = bms_add_member(index_bitmap, keycol);
+    }
```

I feel we can assert that each ii_IndexAttrNumbers entry is valid, because the passed index is a replica identity key.
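
That is, something along these lines (the assertion added to the quoted loop):

```c
	for (int i = 0; i < indexinfo->ii_NumIndexAttrs; i++)
	{
		int			keycol = indexinfo->ii_IndexAttrNumbers[i];

		/* a replica identity index cannot contain expression columns */
		Assert(AttributeNumberIsValid(keycol));

		index_bitmap = bms_add_member(index_bitmap, keycol);
	}
```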

04. LogicalRepApplyLoop()

Can we move the definition of "phase" to the maybe_advance_nonremovable_xid() and
make it static? The variable is used only by the function.

05. LogicalRepApplyLoop()
```
+                        UpdateWorkerStats(last_received, timestamp, false);
```

The statistics seems not correct. last_received is not sent at "timestamp", it had
already been sent earlier. Do we really have to update here?

06. ErrorOnReservedSlotName()

I feel we should document that the slot name 'pg_conflict_detection' cannot be specified
unconditionally.

07. General

update_deleted can happen without any DELETE command. Should we rename the
conflict reason to something like 'update_target_modified'?

E.g., there is a 2-way replication system and below transactions happen:

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1); ts = 10.00
  T2: UPDATE t SET id = 2 WHERE id = 1; ts = 10.02
Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1; ts = 10.01

Then, T3 arrives on Node A after T2 has been executed. T3 tries to find id = 1
but finds a dead tuple instead. In this case, 'update_deleted' happens without any DELETE.

08. Others

Also, here is an analysis related to the truncation of commit timestamps. I was
worried about the case where commit timestamp entries might be removed so that
the detection would not work well. But it seems entries can only be removed once
they are behind GetOldestNonRemovableTransactionId(NULL), i.e.,
horizons.shared_oldest_nonremovable. That value is affected by replication
slots, so the commit_ts entries of interest to us are not removed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Conflict detection for update_deleted in logical replication

From
Michail Nikolaev
Date:
Hello Hayato!

> Note that apply workers can stop due to some reasons (e.g., disabling subscriptions,
> error out, deadlock...). In this case, the snapshot cannot eb registered by the
> worker and index can be re-built during the period.

However, the xmin of a slot affects replication_slot_xmin in ProcArrayStruct, so it might
be straightforward to wait for it during concurrent index builds. We could consider adding
a separate conflict_resolution_replication_slot_xmin to wait only for that.

> Anyway, this topic introduces huge complexity and is not mandatory for update_deleted
> detection. We can work on it in later versions based on the needs.

From my perspective, this is critical for databases. REINDEX CONCURRENTLY is typically run
in production databases on a regular basis, so any master-master system should be unaffected by it.

Best regards,
Mikhail.

RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for the reply!

> > Anyway, this topic introduces huge complexity and is not mandatory for update_deleted
> > detection. We can work on it in later versions based on the needs.
>
> From my perspective, this is critical for databases. REINDEX CONCURRENTLY is typically run
> in production databases on regular basic, so any master-master system should be unaffected by it.

I think there may be a misunderstanding of what I said. The main point here is
that an index scan is not needed to detect update_deleted; in the first version,
workers can do a normal sequential scan instead. This workaround definitely does
not affect REINDEX CONCURRENTLY.
Once the patch is in good shape or has been pushed, we can add support for using
an index to find the dead tuple; at that point we can consider how to ensure the
index contains entries for dead tuples.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, October 28, 2024 1:40 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Hi Hou-San, here are a few trivial comments remaining for patch v6-0001.

Thanks for the comments!

> 
> ======
> doc/src/sgml/protocol.sgml
> 
> 3.
> +        <term>Primary status update (B)</term>
> +        <listitem>
> +         <variablelist>
> +          <varlistentry>
> +           <term>Byte1('s')</term>
> 
> Currently, there are identifiers 's' for the "Primary status update"
> message, and 'S' for the "Primary status request" message.
> 
> As mentioned in the previous review ([1] #5b) I preferred it to be the other way
> around:
> 'S' = status from primary
> 's' = request status from primary
> 
> Of course, it doesn't make any difference, but "S" seems more important than
> "s", so therefore "S" being the main msg and coming from the *primary*
> seemed more natural to me.

I am not sure if one message is more important than another, so I prefer to
keep the current style. Since this is a minor issue, we can easily revise it in
future version patches if we receive additional feedback.

The other comments look good to me, and I will address them in the V7 patch set.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Nov 12, 2024 at 2:19 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, October 18, 2024 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > We thought of few options to fix this:
> > > >
> > > > 1) Add a Time-Based Subscription Option:
> > > >
> > > > We could add a new time-based option for subscriptions, such as
> > > > retain_dead_tuples = '5s'. In the logical launcher, after obtaining
> > > > the candidate XID, the launcher will wait for the specified time
> > > > before advancing the slot.xmin. This ensures that deleted tuples are
> > > > retained for at least the duration defined by this new option.
> > > >
> > > > This approach is designed to simulate the functionality of the GUC
> > > > (vacuum_committs_age), but with a simpler implementation that does
> > > > not impact vacuum performance. We can maintain both this time-based
> > > > method and the current automatic method. If a user does not specify
> > > > the time-based option, we will continue using the existing approach
> > > > to retain dead tuples until all concurrent transactions from the remote node
> > have been applied.
> > > >
> > > > 2) Modification to the Logical Walsender
> > > >
> > > > On the logical walsender, which runs on a physical standby, we can
> > > > build an additional connection to the physical primary to obtain the
> > > > latest WAL position. This position will then be sent as feedback to
> > > > the logical subscriber.
> > > >
> > > > A potential concern is that this requires the walsender to use the
> > > > walreceiver API, which may seem a bit unnatural. And, it starts an
> > > > additional walsender process on the primary, as the logical
> > > > walsender on the physical standby will need to communicate with this
> > walsender to fetch the WAL position.
> > > >
> > >
> > > This idea is worth considering, but I think it may not be a good
> > > approach if the physical standby is cascading. We need to restrict the
> > > update_delete conflict detection, if the standby is cascading, right?
> > >
> > > The other approach is that we send current_timestamp from the
> > > subscriber and somehow check if the physical standby has applied
> > > commit_lsn up to that commit_ts, if so, it can send that WAL position
> > > to the subscriber, otherwise, wait for it to be applied. If we do this
> > > then we don't need to add a restriction for cascaded physical standby.
> > > I think the subscriber anyway needs to wait for such an LSN to be
> > > applied on standby before advancing the xmin even if we get it from
> > > the primary. This is because the subscriber can only be able to apply
> > > and flush the WAL once it is applied on the standby. Am, I missing
> > > something?
> > >
> > > This approach has a disadvantage that we are relying on clocks to be
> > > synced on both nodes which we anyway need for conflict resolution as
> > > discussed in the thread [1]. We also need to consider the Commit
> > > Timestamp and LSN inversion issue as discussed in another thread [2]
> > > if we want to pursue this approach because we may miss an LSN that has
> > > a prior timestamp.
> > >
>
> For the "publisher is also a standby" issue, I have modified the V8 patch to
> report a warning in this case. As I personally feel this is not the main use case
> for conflict detection, we can revisit it later after pushing the main patches and
> receiving some user feedback.
>
> >
> > The problem due to Commit Timestamp and LSN inversion is that the standby
> > may not consider the WAL LSN from an earlier timestamp, which could lead to
> > the removal of required dead rows on the subscriber.
> >
> > The other problem pointed out by Hou-San offlist due to Commit Timestamp
> > and LSN inversion is that we could miss sending the WAL LSN that the
> > subscriber requires to retain dead rows for update_delete conflict. For example,
> > consider the following case 2 node, bidirectional setup:
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1; ts=10.02 AM
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> >
> > Say subscription is created with retain_dead_tuples = true/false
> >
> > After executing T2, the apply worker on Node A will check the latest wal flush
> > location on Node B. Till that time, the T3 should have finished, so the xmin will
> > be advanced only after applying the WALs that is later than T3. So, the dead
> > tuple will not be removed before applying the T3, which means the
> > update_delete can be detected.
> >
> > As there is a gap between when we acquire the commit_timestamp and the
> > commit LSN, it is possible that T3 has not yet flushed its WAL even
> > though it committed earlier than T2. If this happens then we won't be able to
> > detect update_deleted conflict reliably.
> >
> > Now, the one simpler idea is to acquire the commit timestamp and reserve WAL
> > (LSN) under the same spinlock in
> > ReserveXLogInsertLocation() but that could be costly as discussed in the
> > thread [1]. The other more localized solution is to acquire a timestamp after
> > reserving the commit WAL LSN outside the lock which will solve this particular
> > problem.
>
> Since the discussion of the WAL/LSN inversion issue is ongoing, I also thought
> about another approach that can fix the issue independently. This idea is to
> delay the non-removable xid advancement until all the remote concurrent
> transactions that may have been assigned earlier timestamp have been committed.
>
> The implementation is:
>
> On the walsender, after receiving a request, it can send the oldest xid and
> next xid along with the
>
> In response, the apply worker can safely advance the non-removable XID if
> oldest_committing_xid == nextXid, indicating that there are no race conditions.
>
> Alternatively, if oldest_committing_xid != nextXid, the apply worker might send
> a second request after some interval. If the subsequently obtained
> oldest_committing_xid surpasses or equals the initial nextXid, it indicates
> that all previously risky transactions have committed, and therefore the
> non-removable transaction ID can be advanced.
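
A minimal standalone sketch of the rule quoted above (stand-in types and names; not the
patch's code):

```
#include <stdint.h>

typedef uint64_t FullTransactionId;   /* stand-in for the real type */

typedef enum
{
    ADVANCE_NOW,     /* every previously committing transaction has finished */
    REQUEST_AGAIN,   /* send another status request after an interval */
} AdvanceDecision;

/*
 * The candidate xid can be advanced once oldest_committing_xid has caught
 * up with (or passed) the nextXid captured with the first status reply.
 */
static AdvanceDecision
decide_advance(FullTransactionId oldest_committing_xid,
               FullTransactionId next_xid_at_first_reply)
{
    return (oldest_committing_xid >= next_xid_at_first_reply)
        ? ADVANCE_NOW
        : REQUEST_AGAIN;
}
```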
>
>
> Attach the V8 patch set. Note that I put the new approach for above race
> condition in a temp patch " v8-0001_2-Maintain-xxx.patch.txt", because the
> approach may or may not be accepted based on the discussion in WAL/LSN
> inversion thread.

I've started to review these patch series. I've reviewed only 0001
patch for now but let me share some comments:

---
+        if (*phase == DTR_WAIT_FOR_WALSENDER_WAL_POS)
+        {
+                Assert(xid_advance_attempt_time);

What is this assertion for? If we want to check here that we have sent
a request message for the publisher, I think it's clearer if we have
"Assert(xid_advance_attempt_time > 0)". I'm not sure we really need
this assertion though since it's never false once we set
xid_advance_attempt_time.

---
+                /*
+                 * Do not return here because the apply worker might have
+                 * already applied all changes up to remote_lsn. Instead,
+                 * proceed to the next phase to check if we can immediately
+                 * advance the transaction ID.
+                 */
+                *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+        }

If we always proceed to the next phase, is this phase really
necessary? IIUC even if we jump to DTR_WAIT_FOR_LOCAL_FLUSH phase
after DTR_REQUEST_WALSENDER_WAL_POS and have a check if we received
the remote WAL position in DTR_WAIT_FOR_LOCAL_FLUSH phase, it would
work fine.

---
+                /*
+                 * Reaching here means the remote WAL position has been
+                 * received, and all transactions up to that position on the
+                 * publisher have been applied and flushed locally. So, now we
+                 * can advance the non-removable transaction ID.
+                 */
+                SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+                MyLogicalRepWorker->oldest_nonremovable_xid = candidate_xid;
+                SpinLockRelease(&MyLogicalRepWorker->relmutex);

How about adding a debug log message showing the new
oldest_nonremovable_xid and the related LSN to make
debugging/investigation easier? For example,

elog(LOG, "confirmed remote flush up to %X/%X: new oldest_nonremovable_xid %u",
     LSN_FORMAT_ARGS(*remote_lsn),
     XidFromFullTransactionId(candidate_xid));

---
+                /*
+                 * Exit early if the user has disabled sending messages to the
+                 * publisher.
+                 */
+                if (wal_receiver_status_interval <= 0)
+                        return;

In send_feedback(), we send a feedback message if the publisher
requests, even if wal_receiver_status_interval is 0. On the other
hand, the above codes mean that we don't send a WAL position request
and therefore never update oldest_nonremovable_xid if
wal_receiver_status_interval is 0. I'm concerned it could be a pitfall
for users.

---
% git show | grep update_delete
    This set of patches aims to support the detection of
update_deleted conflicts,
    transactions with earlier timestamps than the DELETE, detecting
update_delete
    We assume that the appropriate resolution for update_deleted conflicts, to
    that when detecting the update_deleted conflict, and the remote update has a
+ * to detect update_deleted conflicts.
+ * update_deleted is necessary, as the UPDATEs in remote transactions should be
+        * to allow for the detection of update_delete conflicts when applying

There are mixed uses of 'update_delete' and 'update_deleted' in the commit
message and the code. I think it's better to use 'update_deleted' consistently.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Thu, Nov 14, 2024 at 8:24 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V9 patch set which addressed above comments.
>

Reviewed v9 patch-set and here are my comments for below changes:

@@ -1175,10 +1189,29 @@ ApplyLauncherMain(Datum main_arg)
  long elapsed;

  if (!sub->enabled)
+ {
+ can_advance_xmin = false;
+ xmin = InvalidFullTransactionId;
  continue;
+ }

  LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  w = logicalrep_worker_find(sub->oid, InvalidOid, false);
+
+ if (can_advance_xmin && w != NULL)
+ {
+ FullTransactionId nonremovable_xid;
+
+ SpinLockAcquire(&w->relmutex);
+ nonremovable_xid = w->oldest_nonremovable_xid;
+ SpinLockRelease(&w->relmutex);
+
+ if (!FullTransactionIdIsValid(xmin) ||
+ !FullTransactionIdIsValid(nonremovable_xid) ||
+ FullTransactionIdPrecedes(nonremovable_xid, xmin))
+ xmin = nonremovable_xid;
+ }
+

1) In Patch-0002, could you please add a comment above "+ if
(can_advance_xmin && w != NULL)" to briefly explain the purpose of
finding the minimum XID at this point?

2) In Patch-0004, with the addition of the 'detect_update_deleted'
option, I see the following two issues in the above code:
2a) Currently, all enabled subscriptions are considered when comparing
and finding the minimum XID, even if detect_update_deleted is disabled
for some subscriptions.
I suggest excluding the oldest_nonremovable_xid of subscriptions where
detect_update_deleted=false by updating the check as follows:

    if (sub->detectupdatedeleted && can_advance_xmin && w != NULL)

2b) I understand why advancing xmin is not allowed when a subscription
is disabled. However, the current check allows a disabled subscription
with detect_update_deleted=false to block xmin advancement, which
seems incorrect. Should the check also account for
detect_update_deleted? For example:
  if (sub->detectupdatedeleted &&  !sub->enabled)
+ {
+ can_advance_xmin = false;
+ xmin = InvalidFullTransactionId;
  continue;
+ }

However, I'm not sure if this is the right fix, as it could lead to
inconsistencies if detect_update_deleted is set to false after
disabling the subscription.
Thoughts?

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Thu, Nov 21, 2024 at 3:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V10 patch set which addressed above comments
> and fixed a CFbot warning due to un-initialized variable.
>

We should make v10_2-0001* the first main patch for review till
we have a consensus on how to resolve the LSN<->Timestamp inversion issue. This
is because v10_2 doesn't rely on the correctness of the LSN<->Timestamp
mapping. Now, say in some later release we fix the LSN<->Timestamp
inversion issue; then we can simply avoid sending remote_xact information
and it will behave the same as your v10_1 approach.

Comments on v10_2_0001*:
======================
1.
+/*
+ * The phases involved in advancing the non-removable transaction ID.
+ *
+ * Refer to maybe_advance_nonremovable_xid() for details on how the function
+ * transitions between these phases.
+ */
+typedef enum
+{
+ DTR_GET_CANDIDATE_XID,
+ DTR_REQUEST_PUBLISHER_STATUS,
+ DTR_WAIT_FOR_PUBLISHER_STATUS,
+ DTR_WAIT_FOR_LOCAL_FLUSH
+} DeadTupleRetainPhase;

First, can we have a better name for this enum, like
RetainConflictInfoPhase or something like that? Second, the phase
transitions are not very clear from the comments atop
maybe_advance_nonremovable_xid(). You can refer to the comments atop
tablesync.c or snapbuild.c to see other cases where we have explained
phase transitions.

2.
+ *   Wait for the status from the walsender. After receiving the first status
+ *   after acquiring a new candidate transaction ID, do not proceed if there
+ *   are ongoing concurrent remote transactions.

In this part of the comments: " .. after acquiring a new candidate
transaction ID ..." appears misplaced.

3. In maybe_advance_nonremovable_xid(), the handling of each phase
looks ad hoc, though I see that you have done it that way so that you can
handle the phase-change functionality immediately after changing the
phase. If we ever have to extend this functionality, it will be tricky to
handle the new phase, or at least the code will become complicated. How
about handling each phase in the order of occurrence and having a
separate function for each phase, as we have in apply_dispatch()? That
way it would be convenient to invoke the phase-handling functionality
even if it needs to be called multiple times in the same function.
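
To illustrate, a minimal standalone sketch of the suggested shape (stand-in types, with
handler bodies reduced to nothing but the phase transitions; names follow the
RetainConflictInfoPhase suggestion from comment 1):

```
typedef enum
{
    RCI_GET_CANDIDATE_XID,
    RCI_REQUEST_PUBLISHER_STATUS,
    RCI_WAIT_FOR_PUBLISHER_STATUS,
    RCI_WAIT_FOR_LOCAL_FLUSH,
} RetainConflictInfoPhase;

typedef struct RetainConflictInfoData
{
    RetainConflictInfoPhase phase;
    /* candidate xid, remote LSN, timestamps, ... elided */
} RetainConflictInfoData;

/* Each handler does its phase's work and only updates data->phase. */
static void
get_candidate_xid(RetainConflictInfoData *data)
{
    data->phase = RCI_REQUEST_PUBLISHER_STATUS;
}

static void
request_publisher_status(RetainConflictInfoData *data)
{
    data->phase = RCI_WAIT_FOR_PUBLISHER_STATUS;
}

static void
wait_for_publisher_status(RetainConflictInfoData *data, int concurrent_xacts)
{
    /* loop back while concurrent remote transactions persist */
    data->phase = concurrent_xacts ? RCI_REQUEST_PUBLISHER_STATUS
                                   : RCI_WAIT_FOR_LOCAL_FLUSH;
}

static void
wait_for_local_flush(RetainConflictInfoData *data)
{
    /* after advancing the xid, start over with a new candidate */
    data->phase = RCI_GET_CANDIDATE_XID;
}

/*
 * Single dispatcher in the style of apply_dispatch(): only this function
 * knows which handler serves which phase, so callers simply re-invoke it
 * whenever the phase may have changed.
 */
static void
maybe_advance_nonremovable_xid(RetainConflictInfoData *data, int concurrent_xacts)
{
    switch (data->phase)
    {
        case RCI_GET_CANDIDATE_XID:
            get_candidate_xid(data);
            break;
        case RCI_REQUEST_PUBLISHER_STATUS:
            request_publisher_status(data);
            break;
        case RCI_WAIT_FOR_PUBLISHER_STATUS:
            wait_for_publisher_status(data, concurrent_xacts);
            break;
        case RCI_WAIT_FOR_LOCAL_FLUSH:
            wait_for_local_flush(data);
            break;
    }
}
```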

4.
/*
+ * An invalid position indicates the publisher is also
+ * a physical standby. In this scenario, advancing the
+ * non-removable transaction ID is not supported. This
+ * is because the logical walsender on the standby can
+ * only get the WAL replay position but there may be
+ * more WALs that are being replicated from the
+ * primary and those WALs could have earlier commit
+ * timestamp. Refer to
+ * maybe_advance_nonremovable_xid() for details.
+ */
+ if (XLogRecPtrIsInvalid(remote_lsn))
+ {
+ ereport(WARNING,
+ errmsg("cannot get the latest WAL position from the publisher"),
+ errdetail("The connected publisher is also a standby server."));
+
+ /*
+ * Continuously revert to the request phase until
+ * the standby server (publisher) is promoted, at
+ * which point a valid WAL position will be
+ * received.
+ */
+ phase = DTR_REQUEST_PUBLISHER_STATUS;
+ }

Shouldn't this be an ERROR as the patch doesn't support this case? The
same should be true for later patches for the subscription option.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Nov 26, 2024 at 1:50 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>

Few comments on the latest 0001 patch:
1.
+ * - RCI_REQUEST_PUBLISHER_STATUS:
+ *   Send a message to the walsender requesting the publisher status, which
+ *   includes the latest WAL write position and information about running
+ *   transactions.

Shall we make the latter part of this comment (".. information about
running transactions.") accurate w.r.t. the latest changes of
requesting xacts that are known to be in the process of committing?

2.
+ * The overall state progression is: GET_CANDIDATE_XID ->
+ * REQUEST_PUBLISHER_STATUS -> WAIT_FOR_PUBLISHER_STATUS -> (loop to
+ * REQUEST_PUBLISHER_STATUS if concurrent remote transactions persist) ->
+ * WAIT_FOR_LOCAL_FLUSH.

This state machine progression fails to mention that after we have waited
for the flush, the state moves back to GET_CANDIDATE_XID.

3.
+request_publisher_status(RetainConflictInfoData *data)
+{
...
+ /* Send a WAL position request message to the server */
+ walrcv_send(LogRepWorkerWalRcvConn,
+ reply_message->data, reply_message->len);

This message requests more than a WAL write position but the comment
is incomplete.

4.
+/*
+ * Process the request for a primary status update message.
+ */
+static void
+ProcessStandbyPSRequestMessage(void)
...
+ /*
+ * Information about running transactions and the WAL write position is
+ * only available on a non-standby server.
+ */
+ if (!RecoveryInProgress())
+ {
+ oldestXidInCommit = GetOldestTransactionIdInCommit();
+ nextFullXid = ReadNextFullTransactionId();
+ lsn = GetXLogWriteRecPtr();
+ }

Shall we ever reach here in a standby case? If not, shouldn't that be an ERROR?

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch! Here are my comments mainly for 0001.

01. protocol.sgml

I think the ordering of attributes in "Primary status update" is not correct.
The second entry is the LSN, not the oldest running xid.

02. maybe_advance_nonremovable_xid

```
+        case RCI_REQUEST_PUBLISHER_STATUS:
+            request_publisher_status(data);
+            break;
```

I think this part is not reachable because the transition
RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS is done in
get_candidate_xid()->request_publisher_status().
Can we remove it?

03. RetainConflictInfoData

```
+    Timestamp   xid_advance_attempt_time;   /* when the candidate_xid is
+                                             * decided */
+    Timestamp   reply_time;     /* when the publisher responds with status */
+
+} RetainConflictInfoData;
```

The datatype should be TimestampTz.

04. get_candidate_xid

```
+    if (!TimestampDifferenceExceeds(data->xid_advance_attempt_time, now,
+                                    wal_receiver_status_interval * 1000))
+        return;
```

I think data->xid_advance_attempt_time can be accessed without initialization
on the first try. I found that the patch could not pass the tests for a 32-bit
build for this reason.


05. request_publisher_status

```
+    if (!reply_message)
+    {
+        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
+
+        reply_message = makeStringInfo();
+        MemoryContextSwitchTo(oldctx);
+    }
+    else
+        resetStringInfo(reply_message);
```

Same lines exist in two functions: can we provide an inline function?

06. wait_for_publisher_status

```
+    if (!FullTransactionIdIsValid(data->last_phase_at))
+        data->last_phase_at = FullTransactionIdFromEpochAndXid(data->remote_epoch,
+                                                               data->remote_nextxid);
+
```

I'm not sure: is there a possibility that data->last_phase_at is valid here? It is initialized
just before transitioning to RCI_WAIT_FOR_PUBLISHER_STATUS.

07. wait_for_publisher_status

I think all the calculations and checks in the function could be done on the
walsender as well. Based on this, I came up with an idea to reduce the message size:
the walsender could just send a status (boolean) indicating whether there are any running
transactions, instead of the oldest xid, next xid, and their epoch. Or is it more
important to reduce the amount of calculation on the publisher side?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Nov 29, 2024 at 4:05 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> 07. wait_for_publisher_status
>
> I think all calculations and checking in the function can be done even on the
> walsender. Based on this, I come up with an idea to reduce the message size:
> walsender can just send a status (boolean) whether there are any running transactions
> instead of oldest xid, next xid and their epoch. Or, it is more important to reduce the
> amount of calc. on publisher side?
>

Won't it be tricky to implement this tracking on the publisher side?
We not only need to check that there is no running xact but
also that the oldest_running_xact present the last time the
status message arrived has finished. Won't this need more bookkeeping
on the publisher's side?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Nov 29, 2024 at 4:05 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> 02. maybe_advance_nonremovable_xid
>
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
>
> I think this part is not reachable because the transition
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS is done in
> get_candidate_xid()->request_publisher_status().
> Can we remove this?
>

After changing the phase to RCI_REQUEST_PUBLISHER_STATUS, we directly
invoke request_publisher_status(), and similarly, after changing the phase
to RCI_WAIT_FOR_LOCAL_FLUSH, we call wait_for_local_flush(). Won't it be
better if, in both these cases and other similar cases, we instead invoke
maybe_advance_nonremovable_xid()? This will make
maybe_advance_nonremovable_xid() the only function with the knowledge
to take action based on the phase, rather than spreading the knowledge of
phase-related actions across various functions. Then we should also add a
comment at the end of request_publisher_status(), where we change the
phase but don't do anything; the comment should explain the reason for
that.

One more point: it seems that on a busy server, the patch won't be able to
advance nonremovable_xid. We should call
maybe_advance_nonremovable_xid() at all the places where we call
send_feedback(), and additionally we should call it after
applying some threshold number (say 100) of messages. The latter avoids
cases where we would not invoke the required functionality on a
busy server with large sender/receiver timeout values.
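
A minimal sketch of the threshold part (the constant, callback, and counter below are
illustrative only):

```
#define CHANGES_THRESHOLD 100   /* the "say 100" threshold mentioned above */

/*
 * Besides the send_feedback() call sites, also try to advance the
 * non-removable xid after every CHANGES_THRESHOLD applied changes so a
 * busy apply worker with large timeouts still makes progress.
 */
static void
count_applied_change(void (*try_advance)(void))
{
    static int changes_since_last_attempt = 0;

    if (++changes_since_last_attempt >= CHANGES_THRESHOLD)
    {
        changes_since_last_attempt = 0;
        try_advance();
    }
}
```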

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, November 29, 2024 6:35 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> Thanks for updating the patch! Here are my comments mainly for 0001.

Thanks for the comments!

> 
> 02. maybe_advance_nonremovable_xid
> 
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
> 
> I think this part is not reachable because the transition
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS is done
> in get_candidate_xid()->request_publisher_status().
> Can we remove this?

I changed the code to call maybe_advance_nonremovable_xid() after changing the phase
in get_candidate_xid()/wait_for_publisher_status(), so that this code path is reachable.

> 
> 
> 05. request_publisher_status
> 
> ```
> +    if (!reply_message)
> +    {
> +        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
> +
> +        reply_message = makeStringInfo();
> +        MemoryContextSwitchTo(oldctx);
> +    }
> +    else
> +        resetStringInfo(reply_message);
> ```
> 
> Same lines exist in two functions: can we provide an inline function?

I personally feel this code may not be worth a separate function since it's simple,
so I didn't change it in this version.

> 
> 06. wait_for_publisher_status
> 
> ```
> +    if (!FullTransactionIdIsValid(data->last_phase_at))
> +        data->last_phase_at =
> FullTransactionIdFromEpochAndXid(data->remote_epoch,
> +
> + data->remote_nextxid);
> +
> ```
> 
> Not sure, is there a possibility that data->last_phase_at is valid here? It is
> initialized just before transiting to RCI_WAIT_FOR_PUBLISHER_STATUS.

Oh. I think last_phase_at should be initialized only in the first phase. Fixed.

Other comments look good to me and have been addressed in V13.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Dec 9, 2024 at 3:20 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Here is a summary of tests targeted to the Publisher node in a
> Publisher-Subscriber setup.
> (All tests done with v14 patch-set)
>
> ----------------------------
> Performance Tests:
> ----------------------------
> Test machine details:
> Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :120 - 800GB RAM
>
> Setup:
> - Created two nodes ( 'Pub' and 'Sub'), with logical replication.
> - Configurations for Both Nodes:
>
>     shared_buffers = 40GB
>     max_worker_processes = 32
>     max_parallel_maintenance_workers = 24
>     max_parallel_workers = 32
>     checkpoint_timeout = 1d
>     max_wal_size = 24GB
>     min_wal_size = 15GB
>     autovacuum = off
>
> - Additional setting on Sub: 'track_commit_timestamp = on' (required
> for the feature).
> - Initial data insertion via 'pgbench' with scale factor 100 on both nodes.
>
> Workload:
> - Ran pgbench with 60 clients for the publisher.
> - The duration was 120s, and the measurement was repeated 10 times.
>

You didn't mention whether these are READONLY or READWRITE tests, but I think it is
the latter. I feel it is better to run these tests for 15 minutes, repeat
them 3 times, and take the median of those runs. Also, try to run them
for lower client counts like 2, 16, 32. Overall, the conclusion may be
the same, but it will rule out the possibility of any anomaly.

With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, December 11, 2024 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 6, 2024 at 1:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, December 5, 2024 6:00 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > A few more comments:
> > > 1.
> > > +static void
> > > +wait_for_local_flush(RetainConflictInfoData *data)
> > > {
> > > ...
> > > +
> > > + data->phase = RCI_GET_CANDIDATE_XID;
> > > +
> > > + maybe_advance_nonremovable_xid(data);
> > > +}
> > >
> > > Isn't it better to reset all the fields of data before the next
> > > round of GET_CANDIDATE_XID phase? If we do that then we don't need
> > > to reset
> > > data->remote_lsn = InvalidXLogRecPtr; and data->last_phase_at =
> > > InvalidFullTransactionId; individually in request_publisher_status()
> > > and
> > > get_candidate_xid() respectively. Also, it looks clean and logical
> > > to me unless I am missing something.
> >
> > The remote_lsn was used to determine whether a status is received, so
> > was reset each time in request_publisher_status. To make it more
> > straightforward, I added a new function parameter 'status_received',
> > which would be set to true when calling
> maybe_advance_nonremovable_xid() on receiving the status. After this
> change, there is no need to reset the remote_lsn.
> >
> 
> As part of the above comment, I had asked for three things (a) avoid setting
> data->remote_lsn = InvalidXLogRecPtr; in request_publisher_status(); (b)
> avoid setting data->last_phase_at = InvalidFullTransactionId; in
> get_candidate_xid(); and (c) reset data in
> wait_for_local_flush() after the wait is over. You only did (a) in the patch and didn't
> mention anything about (b) or (c). Is that intentional? If so, what is the reason?

I think I misunderstood the intention, so I will address it in the next version.

> 
> *
> +static bool
> +can_advance_nonremovable_xid(RetainConflictInfoData *data) {
> +
> 
> Isn't it better to make this an inline function as it contains just one check?

Agreed. Will address in next version.

> 
> *
> + /*
> + * The non-removable transaction ID for a subscription is centrally
> + * managed by the main apply worker.
> + */
> + if (!am_leader_apply_worker())
> 
> I have tried to improve this comment in the attached.

Thanks, will check and merge the next version.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Wed, Dec 11, 2024 at 2:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V16 patch set which addressed above comments.
>
> There is a new 0002 patch where I tried to dynamically adjust the interval for
> advancing the transaction ID. Instead of always waiting for
> wal_receiver_status_interval, we can start with a short interval and increase
> it if there is no activity (no xid assigned on subscriber), but not beyond
> wal_receiver_status_interval.
>
> The intention is to more effectively advance xid to avoid retaining too much
> dead tuples. My colleague will soon share detailed performance data and
> analysis related to this enhancement.

I am starting to review the patches and trying to understand how you
are preventing vacuum from removing a dead tuple that might be required
by a concurrent remote update. I was looking at the commit message,
which explains the idea quite clearly, but I have one question.

The process of advancing the non-removable transaction ID in the apply worker
involves:

== copied from commit message of 0001 start==
1) Call GetOldestActiveTransactionId() to take oldestRunningXid as the
candidate xid.
2) Send a message to the walsender requesting the publisher status, which
includes the latest WAL write position and information about transactions
that are in the commit phase.
3) Wait for the status from the walsender. After receiving the first status, do
not proceed if there are concurrent remote transactions that are still in the
commit phase. These transactions might have been assigned an earlier commit
timestamp but have not yet written the commit WAL record. Continue to request
the publisher status until all these transactions have completed.
4) Advance the non-removable transaction ID if the current flush location has
reached or surpassed the last received WAL position.
== copied from commit message of 0001 end ==

So IIUC, in step 2) we send the message and get the list of all the
transactions which are in the commit phase? What exactly do you mean
by a transaction which is in the commit phase? Can I assume these are
transactions which are currently running on the publisher? And in
step 3) we wait for all the transactions we saw running (or in the
commit phase) to get committed, and we don't worry about newly started
transactions as they would not be problematic for us. And in step 4)
we wait for the local flush location to reach the "last received WAL
position"; my question here is what exactly the "last received WAL
position" will be. I assume it would be a position somewhere after the
commit WAL positions of all the transactions we were interested in on
the publisher?

At a high level, the overall idea looks promising to me, but I wanted to put
more thought into the lower-level details of exactly which transactions we
are waiting for and which WAL LSN we are waiting to get flushed.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, December 16, 2024 7:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> 
> On Wed, Dec 11, 2024 at 2:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the V16 patch set which addressed above comments.
> >
> > There is a new 0002 patch where I tried to dynamically adjust the interval for
> > advancing the transaction ID. Instead of always waiting for
> > wal_receiver_status_interval, we can start with a short interval and increase
> > it if there is no activity (no xid assigned on subscriber), but not beyond
> > wal_receiver_status_interval.
> >
> > The intention is to more effectively advance xid to avoid retaining too much
> > dead tuples. My colleague will soon share detailed performance data and
> > analysis related to this enhancement.
> 
> I am starting to review the patches, and trying to understand the
> concept that how you are preventing vacuum to remove the dead tuple
> which might required by the concurrent remote update, so I was looking
> at the commit message which explains the idea quite clearly but I have
> one question

Thanks for the review!

> 
> The process of advancing the non-removable transaction ID in the apply worker
> involves:
> 
> == copied from commit message of 0001 start==
> 1) Call GetOldestActiveTransactionId() to take oldestRunningXid as the
> candidate xid.
> 2) Send a message to the walsender requesting the publisher status, which
> includes the latest WAL write position and information about transactions
> that are in the commit phase.
> 3) Wait for the status from the walsender. After receiving the first status, do
> not proceed if there are concurrent remote transactions that are still in the
> commit phase. These transactions might have been assigned an earlier commit
> timestamp but have not yet written the commit WAL record. Continue to
> request
> the publisher status until all these transactions have completed.
> 4) Advance the non-removable transaction ID if the current flush location has
> reached or surpassed the last received WAL position.
> == copied from commit message of 0001 end ==
> 
> So IIUC in step 2) we send the message and get the list of all the
> transactions which are in the commit phase? What do you exactly mean by a
> transaction which is in the commit phase?

I was referring to transactions that are calling RecordTransactionCommit() and have
entered the commit critical section. In the patch, we check whether the proc has
set the new flag DELAY_CHKPT_IN_COMMIT in 'MyProc->delayChkptFlags'.

> Can I assume transactions which are currently running on the publisher?

I think it's a subset of the running transactions. We only get the transactions
in the commit phase, with the intention of avoiding delays caused by waiting for
long-running transactions to complete, which could result in long retention
of dead tuples.

We decided to wait for running (committing) transactions due to the WAL/LSN
inversion issue [1]. The original idea was to directly return the latest WAL
write position without checking running transactions. But since there is a gap
between when we acquire the commit_timestamp and the commit LSN, it's possible
that transactions have been assigned an earlier commit timestamp but have
not yet written their commit WAL record.

> And in step 3) we wait for all the transactions to get committed which we saw
> running (or in the commit phase) and we anyway don't worry about the newly
> started transactions as they would not be problematic for us. And in step 4)
> we would wait for all the flush location to reach "last received WAL
> position", here my question is what exactly will be the "last received WAL
> position" I assume it would be the position somewhere after the position of
> the commit WAL of all the transaction we were interested on the publisher?

Yes, your understanding is correct. It's a position after the positions of all
the interesting transactions. In the patch, the walsender gets the latest WAL write
position (GetXLogWriteRecPtr()) after all interesting transactions
have finished and sends it back to the apply worker.
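
For illustration, the final check on the subscriber side can be modeled as below
(stand-in types; not the actual patch code):

```
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* stand-in for the real type */

/*
 * The candidate xid becomes non-removable only once everything up to the
 * WAL write position reported by the publisher has been applied and
 * flushed locally.
 */
static bool
can_advance_candidate_xid(XLogRecPtr local_flush_lsn, XLogRecPtr remote_write_lsn)
{
    return local_flush_lsn >= remote_write_lsn;
}
```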

> At high level the overall idea looks promising to me but wanted to put
> more thought on lower level details about what transactions exactly we
> are waiting for and what WAL LSN we are waiting to get flushed.

Yeah, that makes sense, thanks.

[1]
https://www.postgresql.org/message-id/OS0PR01MB571628594B26B4CC2346F09294592%40OS0PR01MB5716.jpnprd01.prod.outlook.com>

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Tue, Dec 17, 2024 at 8:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 16, 2024 7:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > So IIUC in step 2) we send the message and get the list of all the
> > transactions which are in the commit phase? What do you exactly mean by a
> > transaction which is in the commit phase?
>
> I was referring to transactions calling RecordTransactionCommit() and have
> entered the commit critical section. In the patch, we checked if the proc has
> marked the new flag DELAY_CHKPT_IN_COMMIT in 'MyProc->delayChkptFlags'.
>
> > Can I assume transactions which are currently running on the publisher?
>
> I think it's a subset of the running transactions. We only get the transactions
> in commit phase with the intention to avoid delays caused by waiting for
> long-running transactions to complete, which can result in the long retention
> of dead tuples.

Ok

> We decided to wait for running(committing) transactions due to the WAL/LSN
> inversion issue[1]. The original idea is to directly return the latest WAL
> write position without checking running transactions. But since there is a gap
> between when we acquire the commit_timestamp and the commit LSN, it's possible
> the transactions might have been assigned an earlier commit timestamp but have
> not yet written the commit WAL record.

Yes, that makes sense.

> > And in step 3) we wait for all the transactions to get committed which we saw
> > running (or in the commit phase) and we anyway don't worry about the newly
> > started transactions as they would not be problematic for us. And in step 4)
> > we would wait for all the flush location to reach "last received WAL
> > position", here my question is what exactly will be the "last received WAL
> > position" I assume it would be the position somewhere after the position of
> > the commit WAL of all the transaction we were interested on the publisher?
>
> Yes, your understanding is correct. It's a position after the position of all
> the interesting transactions. In the patch, we get the latest WAL write
> position(GetXLogWriteRecPtr()) in walsender after all interesting transactions
> have finished and reply it to apply worker.

Got it, thanks.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch. Few comments:

01. worker.c

```
+/*
+ * The minimum (100ms) and maximum (3 minutes) intervals for advancing
+ * non-removable transaction IDs.
+ */
+#define MIN_XID_ADVANCEMENT_INTERVAL 100
+#define MAX_XID_ADVANCEMENT_INTERVAL 180000L
```

Since the max_interval is an integer variable, it can be s/180000L/180000/.


02.  ErrorOnReservedSlotName()

Currently the function is called from three places - create_physical_replication_slot(),
create_logical_replication_slot() and CreateReplicationSlot().
Can we move the check into ReplicationSlotCreate(), or combine it into ReplicationSlotValidateName()?

03. advance_conflict_slot_xmin()

```
    Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
```

Assume the case where the launcher crashed just after ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
After the restart, the slot can be acquired since SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
is true, but the process would fail the assert because data.xmin is still invalid.

I think we should re-create the slot when the xmin is invalid. Thoughts?

04. documentation

Should we update "Configuration Settings" section in logical-replication.sgml
because an additional slot is required?

05. check_remote_recovery()

Can we add a test case related to this?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Thu, Dec 19, 2024 at 4:34 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Sunday, December 15, 2024 9:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > 5. The apply worker needs to at least twice get the publisher status message to
> > advance oldest_nonremovable_xid once. It then uses the remote_lsn of the last
> > such message to ensure that it has been applied locally. Such a remote_lsn
> > could be a much later value than required leading to delay in advancing
> > oldest_nonremovable_xid. How about if while first time processing the
> > publisher_status message on walsender, we get the
> > latest_transaction_in_commit by having a function
> > GetLatestTransactionIdInCommit() instead of
> > GetOldestTransactionIdInCommit() and then simply wait till that proc has
> > written commit WAL (aka wait till it clears DELAY_CHKPT_IN_COMMIT)?
> > Then get the latest LSN wrote and send that to apply worker waiting for the
> > publisher_status message. If this is feasible then we should be able to
> > advance oldest_nonremovable_xid with just one publisher_status message.
> > Won't that be an improvement over current? If so, we can even further try to
> > improve it by just using commit_LSN of the transaction returned by
> > GetLatestTransactionIdInCommit(). One idea is that we can try to use
> > MyProc->waitLSN which we are using in synchronous replication for our
> > purpose. See SyncRepWaitForLSN.
>
> I will do more performance tests on this and address if it improves
> the performance.
>

Did you check this idea? Thinking about this again, I see a downside
to the new proposal: the walsender needs to somehow wait for the
transactions that are committing, which essentially means it may delay
decoding and sending the decoded WAL.
But it is still worth checking the impact of such a change; if nothing
else, we can add a short comment in the code noting that such an
improvement is not worthwhile.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions, because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, and I am not sure if it's worth it at this
> stage.
>

But why would it prevent the launcher from creating the slot? I think
we should add this check in the function
ReplicationSlotValidateName(). Another related point:

+ErrorOnReservedSlotName(const char *name)
+{
+ if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_RESERVED_NAME),
+ errmsg("replication slot name \"%s\" is reserved",
+    name));

Won't it be sufficient to check using the existing IsReservedName()?
Even if not, we should still keep that as part of the check,
similar to what we are doing in pg_replication_origin_create().

> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so I changed it like
> that.
>

Sounds reasonable, but OTOH all the other places that create physical
slots (which is what we are doing here) don't use this trick. So, don't they
need similar reliability? Also, add some comments as to why we are
initially creating the slot as RS_EPHEMERAL, as we have at other places.
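
For reference, the create-as-ephemeral-then-persist pattern being discussed can be modeled
like this (stand-in types only; not the actual slot code):

```
typedef enum { RS_EPHEMERAL, RS_PERSISTENT } SlotPersistency;

typedef struct ConflictSlot
{
    SlotPersistency persistency;
    unsigned int    xmin;   /* 0 means "not yet initialized" */
} ConflictSlot;

/*
 * Create the slot as ephemeral first, initialize its xmin, and only then
 * mark it persistent: a crash in between leaves an ephemeral slot that is
 * dropped at restart, rather than a persistent slot with an invalid xmin.
 */
static void
create_conflict_slot(ConflictSlot *slot, unsigned int initial_xmin)
{
    slot->persistency = RS_EPHEMERAL;
    slot->xmin = initial_xmin;
    slot->persistency = RS_PERSISTENT;
}
```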

Few other comments on 0003
=======================
1.
+ if (sublist)
+ {
+ bool updated;
+
+ if (!can_advance_xmin)
+ xmin = InvalidFullTransactionId;
+
+ updated = advance_conflict_slot_xmin(xmin);

How will it help to try advancing slot_xmin when xmin is invalid?

2.
@@ -1167,14 +1181,43 @@ ApplyLauncherMain(Datum main_arg)
  long elapsed;

  if (!sub->enabled)
+ {
+ can_advance_xmin = false;

In ApplyLauncherMain(), if one of the subscriptions is disabled (say
the last one in sublist), then can_advance_xmin will become false in
the above code. Now, later, as quoted in comment-1, the patch
overrides xmin to InvalidFullTransactionId if can_advance_xmin is
false. Won't that lead to the wrong computation of xmin?

3.
+ slot_maybe_exist = true;
+ }
+
+ /*
+ * Drop the slot if we're no longer retaining dead tuples.
+ */
+ else if (slot_maybe_exist)
+ {
+ drop_conflict_slot_if_exists();
+ slot_maybe_exist = false;

Can't we use MyReplicationSlot instead of introducing a new boolean
slot_maybe_exist?

In any case, how does the above code deal with the case where the
launcher is restarted for some reason and there is no subscription
after that? Will it be possible to drop the slot in that case?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to set up
> a standby to test the ERROR case, so I didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better
> if the apply worker could avoid reporting an ERROR when the publisher's clock
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, any subsequent transactions
> on the publisher should be assigned a commit timestamp later than reply_time.
> Similarly, we have the time (candidate_xid_time) when the oldest active xid is
> determined. Any old transactions on the publisher that have finished should
> have a commit timestamp earlier than candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than reply_time, then it's OK to advance the xid
> immediately. If candidate_xid_time is greater than reply_time, it means the
> publisher's clock is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
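
A minimal standalone sketch of that comparison (stand-in types; not the 0007 patch itself):

```
#include <stdint.h>

typedef int64_t TimestampTz;   /* microseconds, as in PostgreSQL */

/*
 * Returns how long to wait before advancing the xid: zero when the
 * candidate xid was chosen after the publisher's reply_time (nothing to
 * wait for), otherwise the skew to wait out before sending a new status
 * request.
 */
static TimestampTz
clock_skew_wait(TimestampTz candidate_xid_time, TimestampTz reply_time)
{
    if (candidate_xid_time <= reply_time)
        return 0;                              /* advance immediately */
    return candidate_xid_time - reply_time;    /* wait, then re-request */
}
```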
>
> Since this is considered as an improvement, we can focus on this after
> pushing the main patches.

An update of a row removed by TRUNCATE is detected as update_missing,
while an update of a deleted row is detected as update_deleted. I was not sure if
updates of truncated rows should also be detected as update_deleted, as the
documentation says of the truncate operation that "It has the same effect as an
unqualified DELETE on each table" at [1].

I tried with the following three-node (N1, N2 & N3) setup, with a
subscriber on N3 subscribing to publisher pub1 on N1 and publisher
pub2 on N2:
N1 - pub1
N2 - pub2
N3 - sub1 -> pub1(N1) and sub2 -> pub2(N2)

-- Insert a record in N1
insert into t1 values(1);

-- Insert a record in N2
insert into t1 values(1);

-- Now N3 has the above inserts from N1 and N2
N3=# select * from t1;
 c1
----
  1
  1
(2 rows)

-- Truncate t1 from N2
N2=# truncate t1;
TRUNCATE TABLE

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 2;
UPDATE 1
N1=# select * from t1;
 c1
----
  2
(1 row)

--- N3 logs the conflict as update_missing
2025-01-02 12:21:37.388 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-02 12:21:37.388 IST [24803] DETAIL:  Could not find the row to
be updated.
        Remote tuple (2); replica identity full (1).
2025-01-02 12:21:37.388 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 757, finished
at 0/17478D0

-- Insert a record with value 2 in N2
N2=# insert into t1 values(2);
INSERT 0 1

-- Now N3 has the above inserted records:
N3=# select * from t1;
 c1
----
  2
(1 row)

-- Delete this record from N2:
N2=# delete from t1;
DELETE 1

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 3;
UPDATE 1

--- N3 logs the conflict as update_deleted
2025-01-02 12:22:38.036 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_deleted
2025-01-02 12:22:38.036 IST [24803] DETAIL:  The row to be updated was
deleted by a different origin "pg_16388" in transaction 764 at
2025-01-02 12:22:29.025347+05:30.
        Remote tuple (3); replica identity full (2).
2025-01-02 12:22:38.036 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 758, finished
at 0/174D240

I'm not sure if this behavior is expected or not. If it is expected,
can we mention it in the documentation so that the user can handle
conflict resolution accordingly in these cases?
Thoughts?

[1] - https://www.postgresql.org/docs/devel/sql-truncate.html

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

A few suggestions:
1) If we have a subscription with the detect_update_deleted option and we
try to upgrade it with default settings (in case the DBA forgot to set
track_commit_timestamp), the upgrade will fail after doing a lot of
steps, like those shown below:
Setting locale and encoding for new cluster                   ok
Analyzing all rows in the new cluster                         ok
Freezing all rows in the new cluster                          ok
Deleting files from new pg_xact                               ok
Copying old pg_xact to new server                             ok
Setting oldest XID for new cluster                            ok
Setting next transaction ID and epoch for new cluster         ok
Deleting files from new pg_multixact/offsets                  ok
Copying old pg_multixact/offsets to new server                ok
Deleting files from new pg_multixact/members                  ok
Copying old pg_multixact/members to new server                ok
Setting next multixact ID and offset for new cluster          ok
Resetting WAL archives                                        ok
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
  postgres
*failure*

We should detect this at an earlier point, somewhere like
check_new_cluster_subscription_configuration(), and throw an error from
there.

2) Also, should we account for an additional slot for the
pg_conflict_detection slot while checking max_replication_slots?
Though this error will occur only after the upgrade is completed, it may be
better to include the slot during the upgrade itself so that the DBA need
not handle this error separately after the upgrade is completed.

3) We have reserved the pg_conflict_detection name in this version, so
if there was a replication slot named pg_conflict_detection in
the older version, the upgrade will fail at a very late stage, as in the
earlier upgrade shown above. I feel we should check whether the old cluster has
any slot named pg_conflict_detection and throw an error
earlier:
+void
+ErrorOnReservedSlotName(const char *name)
+{
+       if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0)
+               ereport(ERROR,
+                               errcode(ERRCODE_RESERVED_NAME),
+                               errmsg("replication slot name \"%s\" is reserved",
+                                          name));
+}
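
Similarly, a pg_upgrade-side sketch of that early check (an assumption
about where it would live, not the posted patch) could be:

    static void
    check_old_cluster_for_reserved_slot_name(void)
    {
        PGconn     *conn = connectToServer(&old_cluster, "template1");
        PGresult   *res;

        prep_status("Checking for replication slots named \"pg_conflict_detection\"");

        res = executeQueryOrDie(conn,
                                "SELECT slot_name FROM pg_replication_slots "
                                "WHERE slot_name = 'pg_conflict_detection'");

        if (PQntuples(res) > 0)
            pg_fatal("replication slot name \"pg_conflict_detection\" is reserved; "
                     "drop or rename the slot before upgrading");

        PQclear(res);
        PQfinish(conn);
        check_ok();
    }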

4) We should also mention something like the following in the
documentation so the user is aware of it:
A replication slot named pg_conflict_detection cannot be created, as this
name is reserved for logical replication conflict detection.

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new version patch set which addressed all other comments.
>

Some more miscellaneous comments:
=============================
1.
@@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
  * modifying it.  This makes checkpoint's determination of which xacts
  * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
  */
- Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
  START_CRIT_SECTION();
- MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+ MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

  /*
  * Insert the commit XLOG record.
@@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
  */
  if (markXidCommitted)
  {
- MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
  END_CRIT_SECTION();

The comments related to this change should be updated in EndPrepare()
and RecordTransactionCommitPrepared(); they still refer to the
DELAY_CHKPT_START flag. We should also update the comments to explain
why a similar change is not required for prepare or commit_prepared, if
there is such a reason.

2.
 static bool
 tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
- TypeCacheEntry **eq)
+ TypeCacheEntry **eq, Bitmapset *columns)
 {
  int attrnum;

@@ -337,6 +340,14 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
  if (att->attisdropped || att->attgenerated)
  continue;

+ /*
+ * Ignore columns that are not listed for checking.
+ */
+ if (columns &&
+ !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber,
+    columns))
+ continue;

Update the comment atop tuples_equal to reflect this change.

3.
+FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
+ TransactionId *delete_xid,
+ RepOriginId *delete_origin,
+ TimestampTz *delete_time)
...
...
+ /* Try to find the tuple */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
+ {
+ bool dead = false;
+ TransactionId xmax;
+ TimestampTz localts;
+ RepOriginId localorigin;
+
+ if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
+ continue;
+
+ tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
+ buf = hslot->buffer;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
HEAPTUPLE_RECENTLY_DEAD)
+ dead = true;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!dead)
+ continue;

Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find a live
tuple, so whether the tuple is DEAD or RECENTLY_DEAD, why should it
matter for detecting the update_deleted conflict?
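
If both statuses are meant to qualify, the check could simply be widened,
e.g. (a sketch, not the patch's code):

    HTSV_Result res;

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    res = HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* Treat both dead and recently-dead tuple versions as candidates. */
    dead = (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD);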

4. In FindMostRecentlyDeletedTupleInfo(), add comments to state why we
need to use SnapshotAny.

5.
+
+      <varlistentry
id="sql-createsubscription-params-with-detect-update-deleted">
+        <term><literal>detect_update_deleted</literal>
(<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether the detection of <xref
linkend="conflict-update-deleted"/>
+          is enabled. The default is <literal>false</literal>. If set to
+          true, the dead tuples on the subscriber that are still useful for
+          detecting <xref linkend="conflict-update-deleted"/>
+          are retained,

One of the purposes of retaining dead tuples is to detect
update_delete conflict. But, I also see the following in 0001's commit
message: "Since the mechanism relies on a single replication slot, it
not only assists in retaining dead tuples but also preserves commit
timestamps and origin data. These information will be displayed in the
additional logs generated for logical replication conflicts.
Furthermore, the preserved commit timestamps and origin data are
essential for consistently detecting update_origin_differs conflicts."
which indicates there are other cases where retaining dead tuples can
help. So, I was thinking about whether to name this new option as
retain_dead_tuples or something along those lines?

BTW, it is not clear how retaining dead tuples will help the detection
of update_origin_differs. Will it happen when the tuple is inserted or
updated on the subscriber, and then, when we try to update the same
tuple due to a remote update, the commit_ts information of the xact is
not available because it has already been removed by vacuum? This
should happen in the update case for the new row generated by the
update operation, as that row will be used in the comparison. Can you
please demonstrate it with a test case, even a manual one?

Can't it happen for delete_origin_differs as well for the same reason?

6. I feel we should keep 0004 as a later patch. We can ideally
consider committing 0001, 0002, 0003, 0005, and 0006 (or part of 0006
to get some tests that are relevant) as one unit, and then the patch to
detect and report the update_deleted conflict. What do you think?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Dec 24, 2024 at 6:43 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better
> if the apply worker can avoid reporting an ERROR if the publisher's clock's
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, it means any subsequent transactions
> on the publisher should be assigned a commit timestamp later then reply_time.
> And the (candidate_xid_time) when it determines the oldest active xid. Any old
> transactions on the publisher that have finished should have a commit timestamp
> earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than the reply_time, then it's OK to advance the xid
> immdidately. If candidate_xid_time is greater than reply_time, it means the
> clock of publisher is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
>
> Since this is considered as an improvement, we can focus on this after
> pushing the main patches.
>
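
To make the comparison described above concrete, a minimal sketch of the
decision (not the patch's code; the RetainConflictInfoData field names and
helper calls are assumptions based on names mentioned in this thread)
could look like:

    if (data->candidate_xid_time <= data->reply_time)
    {
        /* Publisher clock is not behind; safe to advance immediately. */
        maybe_advance_nonremovable_xid(data);
    }
    else
    {
        /* Publisher clock lags; wait out the skew, then ask again. */
        long    skew_ms = TimestampDifferenceMilliseconds(data->reply_time,
                                                          data->candidate_xid_time);

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         skew_ms, WAIT_EVENT_LOGICAL_APPLY_MAIN);
        ResetLatch(MyLatch);
        request_publisher_status(data);
    }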

Thank you for updating the patches!

I have one comment on the 0001 patch:

+       /*
+        * The changes made by this and later transactions are still
non-removable
+        * to allow for the detection of update_deleted conflicts when applying
+        * changes in this logical replication worker.
+        *
+        * Note that this info cannot directly protect dead tuples from being
+        * prematurely frozen or removed. The logical replication launcher
+        * asynchronously collects this info to determine whether to advance the
+        * xmin value of the replication slot.
+        *
+        * Therefore, FullTransactionId that includes both the
transaction ID and
+        * its epoch is used here instead of a single Transaction ID. This is
+        * critical because without considering the epoch, the transaction ID
+        * alone may appear as if it is in the future due to transaction ID
+        * wraparound.
+        */
+       FullTransactionId oldest_nonremovable_xid;

The last paragraph of the comment mentions that we need to use
FullTransactionId to properly compare XIDs even after XID wraparound
happens. But once we set the oldest-nonremovable-xid, it prevents XIDs
from wrapping around, no? I mean that workers' oldest-nonremovable-xid
values and the slot's non-removal xid (i.e., its xmin) are never more
than 2^31 XIDs apart.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

Few comments:
1) In case there are no free logical replication worker slots, the
launcher process just logs a warning "out of logical replication worker
slots" and continues, whereas if the "pg_conflict_detection" replication
slot cannot be created, the launcher throws an error and exits. Can we
throw a warning in this case too?
2025-01-02 10:24:41.899 IST [4280] ERROR:  all replication slots are in use
2025-01-02 10:24:41.899 IST [4280] HINT:  Free one or increase
"max_replication_slots".
2025-01-02 10:24:42.148 IST [4272] LOG:  background worker "logical
replication launcher" (PID 4280) exited with exit code 1

2) Currently, we do not detect immediately after the worker starts that
the track_commit_timestamp setting is disabled for a subscription;
instead, it is detected later during conflict detection. Since changing
the track_commit_timestamp GUC requires a server restart, it would be
beneficial for DBAs if the error were raised immediately. This way, DBAs
would be aware of the issue while they are already monitoring the server
restart and could take the necessary corrective actions without delay.
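
For illustration, such a check could live near the start of the apply
worker, along these lines (a sketch; the MySubscription field name is an
assumption):

    /* Fail fast at worker startup instead of only at conflict time. */
    if (MySubscription->detectupdatedeleted && !track_commit_timestamp)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("detection of update_deleted conflicts requires \"track_commit_timestamp\" to be enabled")));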

3) Tab completion for CREATE SUBSCRIPTION does not include the
detect_update_deleted option:
postgres=# create subscription sub3 CONNECTION 'dbname=postgres
host=localhost port=5432' publication pub1 with (
BINARY              COPY_DATA           DISABLE_ON_ERROR    FAILOVER
         PASSWORD_REQUIRED   SLOT_NAME           SYNCHRONOUS_COMMIT
CONNECT             CREATE_SLOT         ENABLED             ORIGIN
         RUN_AS_OWNER        STREAMING           TWO_PHASE

4) Similarly, tab completion for ALTER SUBSCRIPTION does not include the
detect_update_deleted option:
ALTER SUBSCRIPTION sub3 SET (
BINARY              FAILOVER            PASSWORD_REQUIRED   SLOT_NAME
         SYNCHRONOUS_COMMIT
DISABLE_ON_ERROR    ORIGIN              RUN_AS_OWNER        STREAMING
         TWO_PHASE
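
For both items, the new keyword would simply be added to the existing
COMPLETE_WITH() lists in src/bin/psql/tab-complete.c, roughly as below
(a sketch; the surrounding Matches() conditions stay as they are today):

    /* CREATE SUBSCRIPTION ... WITH ( */
    COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
                  "detect_update_deleted", "disable_on_error", "enabled",
                  "failover", "origin", "password_required", "run_as_owner",
                  "slot_name", "streaming", "synchronous_commit", "two_phase");

    /* ALTER SUBSCRIPTION <name> SET ( */
    COMPLETE_WITH("binary", "detect_update_deleted", "disable_on_error",
                  "failover", "origin", "password_required", "run_as_owner",
                  "slot_name", "streaming", "synchronous_commit", "two_phase");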

5) Copyright year can be updated to 2025:
+++ b/src/test/subscription/t/035_confl_update_deleted.pl
@@ -0,0 +1,169 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test the CREATE SUBSCRIPTION 'detect_update_deleted' parameter and its
+# interaction with the xmin value of replication slots.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;

6) This include is not required; I was able to compile without it:
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -173,12 +173,14 @@
 #include "replication/logicalrelation.h"
 #include "replication/logicalworker.h"
 #include "replication/origin.h"
+#include "replication/slot.h"
 #include "replication/walreceiver.h"

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Jan 3, 2025 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I have one comment on the 0001 patch:
>
> +       /*
> +        * The changes made by this and later transactions are still
> non-removable
> +        * to allow for the detection of update_deleted conflicts when applying
> +        * changes in this logical replication worker.
> +        *
> +        * Note that this info cannot directly protect dead tuples from being
> +        * prematurely frozen or removed. The logical replication launcher
> +        * asynchronously collects this info to determine whether to advance the
> +        * xmin value of the replication slot.
> +        *
> +        * Therefore, FullTransactionId that includes both the
> transaction ID and
> +        * its epoch is used here instead of a single Transaction ID. This is
> +        * critical because without considering the epoch, the transaction ID
> +        * alone may appear as if it is in the future due to transaction ID
> +        * wraparound.
> +        */
> +       FullTransactionId oldest_nonremovable_xid;
>
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID
> wraparound happens. But once we set the oldest-nonremovable-xid it
> prevents XIDs from being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.
>

I also think that the slot's non-removal-xid should ensure that we
never allow xid to advance to a level where it can cause a wraparound
for the oldest-nonremovable-xid value stored in each worker because
the slot's value is the minimum of all workers. Now, if both of us are
missing something then it is probably better to write some more
detailed comments as to how this can happen.

Along the same lines, I was thinking about whether
RetainConflictInfoData->last_phase_at should be a FullTransactionId, but
I think that is correct because we can't stop wraparound from happening
on the remote node, right?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Jan 3, 2025 at 2:34 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Few comments:
> 1) In case there are no logical replication workers, the launcher
> process just logs a warning "out of logical replication worker slots"
> and continues. Whereas in case of "pg_conflict_detection" replication
> slot creation launcher throw an error and the launcher exits, can we
> throw a warning in this case too:
> 2025-01-02 10:24:41.899 IST [4280] ERROR:  all replication slots are in use
> 2025-01-02 10:24:41.899 IST [4280] HINT:  Free one or increase
> "max_replication_slots".
> 2025-01-02 10:24:42.148 IST [4272] LOG:  background worker "logical
> replication launcher" (PID 4280) exited with exit code 1
>

This case is not the same, because if we give just a WARNING and allow
the launcher to proceed, then we won't be able to protect dead rows from
removal. We don't want the apply workers to keep working and making
progress until this slot is created. Am I missing something? If not, we
probably need to ensure this, if it is not already ensured. Also, we
should mention in the docs that the 'max_replication_slots' setting
should account for this additional slot.

> 2) Currently, we do not detect when the track_commit_timestamp setting
> is disabled for a subscription immediately after the worker starts.
> Instead, it is detected later during conflict detection.
>

I am not sure if an ERROR is required in the first place. Shouldn't we
simply not detect update_deleted in that case? It should be documented
that 'track_commit_timestamp' must be enabled to detect this conflict.
Don't we do the same thing for the *_origin_differs type of conflicts?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Thu, Jan 2, 2025 at 2:57 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Conflict detection of truncated updates is detected as update_missing
> and deleted update is detected as update_deleted. I was not sure if
> truncated updates should also be detected as update_deleted, as the
> document says truncate operation is "It has the same effect as an
> unqualified DELETE on each table" at [1].
>

This is expected behavior because TRUNCATE would immediately reclaim
space and remove all the data. So, there is no way to retain the
removed row.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Dec 20, 2024 at 12:41 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub
setup, no performance degradation was observed on either node.
>
>
>
> In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS performance
was reduced due to dead tuple accumulation. This performance drop depended on the wal_receiver_status_interval—larger
intervals resulted in more dead tuple accumulation on the subscriber node. However, after the improvement in patch
v16-0002, which dynamically tunes the status request, the default TPS reduction was limited to only 1%.
>
>
>
> We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber,
focusing on TPS performance. To summarize the key observations:
>
>  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.
>
>  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3] ) due to dead tuple retention for the
conflict detection when detect_update_deleted=on.
>
>  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the
apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before
advancing the non-removable XID to remove dead tuples.
>
>  - To validate this further, we modified the patch to check only each transaction's commit_time and advance the
non-removable XID if the commit_time is greater than candidate_xid_time. The benchmark results[4] remained consistent,
showing similar performance reduction. This confirms that the performance impact on the subscriber side is a reasonable
behavior if we want to detect the update_deleted conflict reliably.
>
>
>
> We have also tested similar scenarios in physical streaming replication, to see the effect of enabling the
hot_standby_feedback and recovery_min_apply_delay. The benchmark results[5] showed performance reduction in these cases
as well, though the impact was less compared to the update_deleted scenario because the physical walreceiver does not need
to wait for specified WAL to be applied before sending the hot standby feedback message. However, as the
recovery_min_apply_delay increased, a similar TPS reduction (~50%) was observed, aligning with the behavior seen in the
update_deleted case.
>

The first impression after seeing such a performance dip would be not
to use such a setting, but since the primary reason is that one
purposefully wants to retain dead tuples, both in physical replication
and in a pub-sub environment, it is an expected outcome. Now, it is
possible that in the real world people may not use exactly the setup we
have used to check the worst-case performance. For example, for a
pub-sub setup, one could imagine that writes happen on two nodes N1 and
N2 (both publisher nodes) and then all the changes from both nodes are
assembled on a third node N3 (a subscriber node). Or the subscriber node
may not be set up for aggressive writes, or one would be okay with not
detecting update_deleted conflicts with complete accuracy.

>
>
> Based on the above, I think the performance reduction observed with the update_deleted patch is expected and
necessary because the patch's main goal is to retain dead tuples for reliable conflict detection. Reducing this
retention period would compromise the accuracy of update_deleted detection.
>

The point related to dead tuple accumulation (or database bloat) with
this setting should be documented similarly to what we document for
hot_standby_feedback. See hot_standby_feedback description in docs
[1].

[1] - https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> I have one comment on the 0001 patch:

Thanks for the comments!

> 
> +       /*
> +        * The changes made by this and later transactions are still
> non-removable
> +        * to allow for the detection of update_deleted conflicts when
> applying
> +        * changes in this logical replication worker.
> +        *
> +        * Note that this info cannot directly protect dead tuples from being
> +        * prematurely frozen or removed. The logical replication launcher
> +        * asynchronously collects this info to determine whether to advance
> the
> +        * xmin value of the replication slot.
> +        *
> +        * Therefore, FullTransactionId that includes both the
> transaction ID and
> +        * its epoch is used here instead of a single Transaction ID. This is
> +        * critical because without considering the epoch, the transaction ID
> +        * alone may appear as if it is in the future due to transaction ID
> +        * wraparound.
> +        */
> +       FullTransactionId oldest_nonremovable_xid;
> 
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID wraparound
> happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.

I think the issue is that the launcher may create the replication slot
after the apply worker has already set 'oldest_nonremovable_xid', because
the launcher does that asynchronously. So, before the slot is created,
there's a window where transaction IDs might wrap around. If the apply
worker has initially computed a candidate_xid (755) and the XIDs wrap
around before the launcher creates the slot, causing the new current xid
to be (740), then the old candidate_xid (755) looks like an xid in the
future, and the launcher could advance the xmin to 755, which causes the
dead tuples to be removed prematurely. (We are trying to reproduce this
to ensure that it's a real issue and will share the result once finished.)
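
To illustrate why the epoch matters with the numbers above, here is a toy
comparison (purely illustrative, not from the patch):

    TransactionId       cand32 = 755;   /* candidate computed before wraparound */
    TransactionId       next32 = 740;   /* current xid after wraparound */
    FullTransactionId   cand64 = FullTransactionIdFromEpochAndXid(0, 755);
    FullTransactionId   next64 = FullTransactionIdFromEpochAndXid(1, 740);

    /* 32-bit modulo comparison: 755 looks newer than 740 -- wrong here. */
    Assert(TransactionIdFollows(cand32, next32));

    /* 64-bit comparison: epoch 0's 755 correctly precedes epoch 1's 740. */
    Assert(FullTransactionIdPrecedes(cand64, next64));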

We thought of another approach, which is to create/drop this slot as
soon as one enables/disables detect_update_deleted (e.g., create/drop the
slot during the DDL). But it seems complicated to control concurrent slot
create/drop. For example, if one backend A enables detect_update_deleted,
it will create a slot. But if another backend B is disabling
detect_update_deleted at the same time, then the newly created slot may
be dropped by backend B. I thought about checking the number of
subscriptions that enable detect_update_deleted before dropping the slot
in backend B, but the subscription changes caused by backend A may not be
visible yet (e.g., not committed yet).

Does that make sense to you, or do you have some other ideas?

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > I have one comment on the 0001 patch:
>
> Thanks for the comments!
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)
>

I tried to reproduce the issue described above, where an
xid_wraparound occurs before the launcher creates the conflict slot,
and the apply worker retains a very old xid (from before the
wraparound) as its oldest_nonremovable_xid.

In this scenario, the launcher will not set the apply worker's
older-epoch xid (oldest_nonremovable_xid = 755) as the conflict slot's
xmin. This is because advance_conflict_slot_xmin() ensures proper
handling by comparing the full 64-bit xids. However, this could lead
to real issues if 32-bit TransactionIds were used instead of 64-bit
FullTransactionIds. The detailed test steps and results are below:

Setup:  A Publisher-Subscriber setup with logical replication.

Steps done to reproduce the test scenario -
On Sub -
1) Created a subscription with detect_update_deleted=off, so no
conflict slot to start with.
2) Attached gdb to the launcher and put a breakpoint at
advance_conflict_slot_xmin().
3) Run "alter subscription ..... (detect_update_deleted=ON);"
4) Stopped the launcher at the start of the
"advance_conflict_slot_xmin()",  and blocked the creation of the
conflict slot.
5) Attached another gdb session to the apply worker and made sure it
has set an oldest_nonremovable_xid . In
"maybe_advance_nonremovable_xid()" -

  (gdb) p MyLogicalRepWorker->oldest_nonremovable_xid
  $3 = {value = 760}
  -- so apply worker's oldest_nonremovable_xid = 760

6) Consumed ~4.2 billion xids to let the xid_wraparound happen. After
the wraparound, the next_xid was "705", which is less than "760".
7) Released the launcher from gdb, but the apply_worker still stopped in gdb.
8) The slot gets created with xmin=705 :

  postgres=# select slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn, inactive_since, confirmed_flush_lsn from pg_replication_slots;
        slot_name        | slot_type | active | xmin | catalog_xmin | restart_lsn | inactive_since | confirmed_flush_lsn
  -----------------------+-----------+--------+------+--------------+-------------+----------------+---------------------
   pg_conflict_detection | physical  | t      |  705 |              |             |                |
  (1 row)

Next, when the launcher tries to advance the slot's xmin in
advance_conflict_slot_xmin() with new_xmin set to the apply worker's
oldest_nonremovable_xid (760), it returns without updating the slot's
xmin because of the check below:
````
  if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
    return false;
````
We are comparing the full 64-bit xids in
FullTransactionIdPrecedesOrEquals(), and in this case the values are:
  new_xmin=760
  full_xmin=4294968001 (w.r.t. xid=705)

As "760 <= 4294968001", the launcher returns from here and does not
update the slot's xmin to "760". The above check will always be true in
such scenarios.
Note: The launcher would have updated the slot's xmin to 760 if 32-bit
XIDs were being compared, since "760 <= 705" would be false.

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > I have one comment on the 0001 patch:
>
> Thanks for the comments!
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)

The slot's first xmin is calculated by
GetOldestSafeDecodingTransactionId(false). Couldn't the initially
computed candidate_xid be newer than this xid?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > I have one comment on the 0001 patch:
> >
> > Thanks for the comments!
> >
> > >
> > > +       /*
> > > +        * The changes made by this and later transactions are still
> > > non-removable
> > > +        * to allow for the detection of update_deleted conflicts
> > > + when
> > > applying
> > > +        * changes in this logical replication worker.
> > > +        *
> > > +        * Note that this info cannot directly protect dead tuples from
> being
> > > +        * prematurely frozen or removed. The logical replication launcher
> > > +        * asynchronously collects this info to determine whether to
> > > + advance
> > > the
> > > +        * xmin value of the replication slot.
> > > +        *
> > > +        * Therefore, FullTransactionId that includes both the
> > > transaction ID and
> > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > +        * critical because without considering the epoch, the transaction
> ID
> > > +        * alone may appear as if it is in the future due to transaction ID
> > > +        * wraparound.
> > > +        */
> > > +       FullTransactionId oldest_nonremovable_xid;
> > >
> > > The last paragraph of the comment mentions that we need to use
> > > FullTransactionId to properly compare XIDs even after the XID
> > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > prevents XIDs from being wraparound, no? I mean that workers'
> > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > xmin) are never away from more than 2^31 XIDs.
> >
> > I think the issue is that the launcher may create the replication slot
> > after the apply worker has already set the 'oldest_nonremovable_xid'
> > because the launcher are doing that asynchronously. So, Before the
> > slot is created, there's a window where transaction IDs might wrap
> > around. If initially the apply worker has computed a candidate_xid
> > (755) and the xid wraparound before the launcher creates the slot,
> > causing the new current xid to be (740), then the old
> > candidate_xid(755) looks like a xid in the future, and the launcher
> > could advance the xmin to 755 which cause the dead tuples to be removed
> prematurely.
> > (We are trying to reproduce this to ensure that it's a real issue and
> > will share after finishing)
> 
> The slot's first xmin is calculated by
> GetOldestSafeDecodingTransactionId(false). The initial computed
> cancidate_xid could be newer than this xid?

I think the issue occurs when the slot is created after an XID wraparound. As a
result, GetOldestSafeDecodingTransactionId() returns the current XID
(after wraparound), which appears older than the computed candidate_xid (e.g.,
oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
issue in [1]. What do you think?

[1] https://www.postgresql.org/message-id/CABdArM6P0zoEVRN%2B3YHNET_oOaAVOKc-EPUnXiHkcBJ-uDKQVw%40mail.gmail.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > >
> > > > I have one comment on the 0001 patch:
> > >
> > > Thanks for the comments!
> > >
> > > >
> > > > +       /*
> > > > +        * The changes made by this and later transactions are still
> > > > non-removable
> > > > +        * to allow for the detection of update_deleted conflicts
> > > > + when
> > > > applying
> > > > +        * changes in this logical replication worker.
> > > > +        *
> > > > +        * Note that this info cannot directly protect dead tuples from
> > being
> > > > +        * prematurely frozen or removed. The logical replication launcher
> > > > +        * asynchronously collects this info to determine whether to
> > > > + advance
> > > > the
> > > > +        * xmin value of the replication slot.
> > > > +        *
> > > > +        * Therefore, FullTransactionId that includes both the
> > > > transaction ID and
> > > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > > +        * critical because without considering the epoch, the transaction
> > ID
> > > > +        * alone may appear as if it is in the future due to transaction ID
> > > > +        * wraparound.
> > > > +        */
> > > > +       FullTransactionId oldest_nonremovable_xid;
> > > >
> > > > The last paragraph of the comment mentions that we need to use
> > > > FullTransactionId to properly compare XIDs even after the XID
> > > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > > prevents XIDs from being wraparound, no? I mean that workers'
> > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > > xmin) are never away from more than 2^31 XIDs.
> > >
> > > I think the issue is that the launcher may create the replication slot
> > > after the apply worker has already set the 'oldest_nonremovable_xid'
> > > because the launcher are doing that asynchronously. So, Before the
> > > slot is created, there's a window where transaction IDs might wrap
> > > around. If initially the apply worker has computed a candidate_xid
> > > (755) and the xid wraparound before the launcher creates the slot,
> > > causing the new current xid to be (740), then the old
> > > candidate_xid(755) looks like a xid in the future, and the launcher
> > > could advance the xmin to 755 which cause the dead tuples to be removed
> > prematurely.
> > > (We are trying to reproduce this to ensure that it's a real issue and
> > > will share after finishing)
> >
> > The slot's first xmin is calculated by
> > GetOldestSafeDecodingTransactionId(false). The initial computed
> > cancidate_xid could be newer than this xid?
>
> I think the issue occurs when the slot is created after an XID wraparound. As a
> result, GetOldestSafeDecodingTransactionId() returns the current XID
> (after wraparound), which appears older than the computed candidate_xid (e.g.,
> oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
> issue in [1]. What do you think ?

I agree that the scenario Nisha shared could happen with the current
patch. On the other hand, I think that if the slot's initial xmin is
always newer than or equal to the initially computed non-removable xid
(i.e., the oldest of the workers' oldest_nonremovable_xid values), we can
always use the slot's first xmin. And I think that might be true, though
I'm concerned about the fact that the worker's oldest_nonremovable_xid
and the slot's initial xmin are calculated differently
(GetOldestActiveTransactionId() and GetOldestSafeDecodingTransactionId(),
respectively). That way, subsequent comparisons between the slot's xmin
and the computed candidate_xid won't need to take care of the epoch. IOW,
the workers' non-removable xid values are effectively not used until the
slot is created.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Jan 3, 2025 at 11:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 5.
> +
> +      <varlistentry
> id="sql-createsubscription-params-with-detect-update-deleted">
> +        <term><literal>detect_update_deleted</literal>
> (<type>boolean</type>)</term>
> +        <listitem>
> +         <para>
> +          Specifies whether the detection of <xref
> linkend="conflict-update-deleted"/>
> +          is enabled. The default is <literal>false</literal>. If set to
> +          true, the dead tuples on the subscriber that are still useful for
> +          detecting <xref linkend="conflict-update-deleted"/>
> +          are retained,
>
> One of the purposes of retaining dead tuples is to detect
> update_delete conflict. But, I also see the following in 0001's commit
> message: "Since the mechanism relies on a single replication slot, it
> not only assists in retaining dead tuples but also preserves commit
> timestamps and origin data. These information will be displayed in the
> additional logs generated for logical replication conflicts.
> Furthermore, the preserved commit timestamps and origin data are
> essential for consistently detecting update_origin_differs conflicts."
> which indicates there are other cases where retaining dead tuples can
> help. So, I was thinking about whether to name this new option as
> retain_dead_tuples or something along those lines?
>

The other possible option name could be retain_conflict_info.
Sawada-San, and others, do you have any preference for the name of
this option?

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, January 7, 2025 3:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > >
> > > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > >
> > > > > I have one comment on the 0001 patch:
> > > >
> > > > Thanks for the comments!
> > > >
> > > > >
> > > > > +       /*
> > > > > +        * The changes made by this and later transactions are still
> > > > > non-removable
> > > > > +        * to allow for the detection of update_deleted conflicts
> > > > > + when
> > > > > applying
> > > > > +        * changes in this logical replication worker.
> > > > > +        *
> > > > > +        * Note that this info cannot directly protect dead tuples from
> > > being
> > > > > +        * prematurely frozen or removed. The logical replication
> launcher
> > > > > +        * asynchronously collects this info to determine whether to
> > > > > + advance
> > > > > the
> > > > > +        * xmin value of the replication slot.
> > > > > +        *
> > > > > +        * Therefore, FullTransactionId that includes both the
> > > > > transaction ID and
> > > > > +        * its epoch is used here instead of a single Transaction ID.
> This is
> > > > > +        * critical because without considering the epoch, the
> transaction
> > > ID
> > > > > +        * alone may appear as if it is in the future due to transaction
> ID
> > > > > +        * wraparound.
> > > > > +        */
> > > > > +       FullTransactionId oldest_nonremovable_xid;
> > > > >
> > > > > The last paragraph of the comment mentions that we need to use
> > > > > FullTransactionId to properly compare XIDs even after the XID
> > > > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > > > prevents XIDs from being wraparound, no? I mean that workers'
> > > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > > > xmin) are never away from more than 2^31 XIDs.
> > > >
> > > > I think the issue is that the launcher may create the replication slot
> > > > after the apply worker has already set the 'oldest_nonremovable_xid'
> > > > because the launcher are doing that asynchronously. So, Before the
> > > > slot is created, there's a window where transaction IDs might wrap
> > > > around. If initially the apply worker has computed a candidate_xid
> > > > (755) and the xid wraparound before the launcher creates the slot,
> > > > causing the new current xid to be (740), then the old
> > > > candidate_xid(755) looks like a xid in the future, and the launcher
> > > > could advance the xmin to 755 which cause the dead tuples to be
> removed
> > > prematurely.
> > > > (We are trying to reproduce this to ensure that it's a real issue and
> > > > will share after finishing)
> > >
> > > The slot's first xmin is calculated by
> > > GetOldestSafeDecodingTransactionId(false). The initial computed
> > > cancidate_xid could be newer than this xid?
> >
> > I think the issue occurs when the slot is created after an XID wraparound. As
> a
> > result, GetOldestSafeDecodingTransactionId() returns the current XID
> > (after wraparound), which appears older than the computed candidate_xid
> (e.g.,
> > oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
> > issue in [1]. What do you think ?
> 
> I agree that the scenario Nisha shared could happen with the current
> patch. On the other hand, I think that if slot's initial xmin is
> always newer than or equal to the initial computed non-removable-xid
> (i.e., the oldest of workers' oldest_nonremovable_xid values), we can
> always use slot's first xmin. And I think it might be true while I'm
> concerned the fact that worker's oldest_nonremoable_xid and the slot's
> initial xmin is calculated differently (GetOldestActiveTransactionId()
> and GetOldestSafeDecodingTransactionId(), respectively). That way,
> subsequent comparisons between slot's xmin and computed candidate_xid
> won't need to take care of the epoch. IOW, the worker's
> non-removable-xid values effectively are not used until the slot is
> created.

I might be missing something, so could you please elaborate a bit more on this
idea?

Initially, I thought you meant delaying the initialization of slot.xmin until
after the worker computes the oldest_nonremovable_xid. However, I think the
same issue would occur with this approach as well [1], with the difference
being that the slot would directly use a future XID as xmin, which seems
inappropriate to me.

Or do you mean the opposite, that we delay the initialization of
oldest_nonremovable_xid until after the creation of the slot?

[1]
> So, Before the slot is created, there's a window where transaction IDs might
> wrap around. If initially the apply worker has computed a candidate_xid (755)
> and the xid wraparound before the launcher creates the slot, causing the new
> current xid to be (740), then the old candidate_xid(755) looks like a xid in
> the future, and the launcher could advance the xmin to 755 which cause the
> dead tuples to be removed prematurely.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)
>
> We thought of another approach, which is to create/drop this slot first as
> soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> during DDL). But it seems complicate to control the concurrent slot
> create/drop. For example, if one backend A enables detect_update_deteled, it
> will create a slot. But if another backend B is disabling the
> detect_update_deteled at the same time, then the newly created slot may be
> dropped by backend B. I thought about checking the number of subscriptions that
> enables detect_update_deteled before dropping the slot in backend B, but the
> subscription changes caused by backend A may not visable yet (e.g. not
> committed yet).
>

This means that for a transaction whose changes are not yet visible, we
may have already created the slot, and backend B would end up dropping
it. Is it possible that, while changing this new option via DDL, we take
an AccessExclusiveLock on pg_subscription, as we do in DropSubscription(),
to ensure that concurrent transactions can't drop the slot? Will that
help in solving the above scenario?
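
For instance, the ALTER SUBSCRIPTION path that toggles this option could
lock the catalog the same way DropSubscription() does (a sketch; where
exactly to place it is an assumption):

    Relation    rel;

    /*
     * Taking AccessExclusiveLock on pg_subscription serializes concurrent
     * enable/disable of the option, so the create/drop of the
     * pg_conflict_detection slot cannot interleave with a competing change
     * to the same option in another backend.
     */
    rel = table_open(SubscriptionRelationId, AccessExclusiveLock);

    /* ... perform the catalog update and the slot create/drop decision ... */

    table_close(rel, NoLock);   /* keep the lock until commit */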

The second idea could be that each worker first checks whether a slot
exists along with a subscription flag (the new option). Checking the
existence of a slot each time would be costly, so we would somehow need
to cache it. But if we do that, then we need to invent some cache
invalidation mechanism for the slot. I am not sure if we can design a
race-free mechanism for that. I mean, we need to think of a solution for
race conditions between the launcher and apply workers to ensure that
after dropping the slot, the launcher doesn't recreate the slot (say, if
some subscription enables this option) before all the workers have
cleared their existing values of oldest_nonremovable_xid.

The third idea to avoid the race condition could be that in
InitializeLogRepWorker(), after CommitTransactionCommand(), we check if
the retain_dead_tuples flag is true for MySubscription and, if so, check
whether the system slot exists. If it exists, then go ahead; otherwise,
wait till the slot is created. It could cost some additional cycles
during worker startup, but it is a one-time effort and only when the
flag is set. In addition to this, we anyway need to create the slot in
the launcher before launching the workers, and after re-reading the
subscription, a change in the retain_dead_tuples flag (off->on) should
cause a worker restart.

Now, in the third idea, the issue can still arise if, after waiting
for the slot to be created, the user sets retain_dead_tuples to false
and back to true again immediately. The launcher may have noticed the
"retain_dead_tuples=false" operation and dropped the slot, while the
apply worker has not noticed and still holds an old candidate_xid. The
XID may wrap around in this window before retain_dead_tuples is set
back to true. And the apply worker would not restart, because by the
time it calls maybe_reread_subscription(), retain_dead_tuples would
have been set back to true again. Again, to avoid this race condition,
the launcher can wait for each worker to reset its
oldest_nonremovable_xid before dropping the slot.

Even after doing the above, the third idea could still have another
race condition:
1. The launcher creates the replication slot and starts a worker with
retain_dead_tuples = true; the worker is waiting for the publisher
status and has not yet set oldest_nonremovable_xid.
2. The user sets the option retain_dead_tuples to false; the launcher
notices that and drops the replication slot.
3. The worker receives the status and sets oldest_nonremovable_xid to a
valid value (say 750).
4. XID wraparound happens at this point and, say, new_available_xid
becomes 740.
5. The user sets retain_dead_tuples = true again.

After the above steps, the apply worker holds an old
oldest_nonremovable_xid (750) and will not restart if it does not call
maybe_reread_subscription() before step 5. So, such a case can again
create the problem of an incorrect slot->xmin value. We can probably try
to find some way to avoid this race condition as well, but I haven't
thought more about it, as this idea sounds a bit risky and bug-prone
to me.

Among the above ideas, the first idea of taking an AccessExclusiveLock
on pg_subscription sounds safest to me. I haven't evaluated the changes
for the first approach, so I could be missing something that makes it
difficult to achieve, but I think it is worth investigating unless we
have better ideas or we think that the current approach used in the
patch, i.e., using FullTransactionId, is okay.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

I was doing a backward compatibility test by creating a publication on
PG17 and a subscription with the patch on HEAD.
Currently, we are able to create a subscription with the
detect_update_deleted option for a publication on PG17:
postgres=# create subscription sub1 connection 'dbname=postgres
host=localhost port=5432' publication pub1 with
(detect_update_deleted=true);
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION

This should not be allowed now, as the subscriber will request the
publisher status, for which no handling is available on a PG17
publisher:
+static void
+request_publisher_status(RetainConflictInfoData *data)
+{
...
+       pq_sendbyte(request_message, 'p');
+       pq_sendint64(request_message, GetCurrentTimestamp());
...
+}

I felt this should not be allowed.
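
For illustration, one possible guard on the subscriber side (an
assumption on my part, not something the posted patch does) would be to
check the publisher's version before sending the new request:

    /* Assumed sketch: refuse the new protocol interaction against an old publisher. */
    if (walrcv_server_version(LogRepWorkerWalRcvConn) < 180000)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot request publisher status: the publisher is older than PostgreSQL 18")));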

Regards,
Vignesh



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 2, 2025 6:34 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> Few suggestions:
> 1) If we have a subscription with detect_update_deleted option and we
> try to upgrade it with default settings(in case dba forgot to set
> track_commit_timestamp), the upgrade will fail after doing a lot of
> steps like that mentioned in ok below:
> Setting locale and encoding for new cluster                   ok
> Analyzing all rows in the new cluster                         ok
> Freezing all rows in the new cluster                          ok
> Deleting files from new pg_xact                               ok
> Copying old pg_xact to new server                             ok
> Setting oldest XID for new cluster                            ok
> Setting next transaction ID and epoch for new cluster         ok
> Deleting files from new pg_multixact/offsets                  ok
> Copying old pg_multixact/offsets to new server                ok
> Deleting files from new pg_multixact/members                  ok
> Copying old pg_multixact/members to new server                ok
> Setting next multixact ID and offset for new cluster          ok
> Resetting WAL archives                                        ok
> Setting frozenxid and minmxid counters in new cluster         ok
> Restoring global objects in the new cluster                   ok
> Restoring database schemas in the new cluster
>   postgres
> *failure*
> 
> We should detect this at an earlier point somewhere like in
> check_new_cluster_subscription_configuration and throw an error from
> there.
> 
> 2) Also should we include an additional slot for the
> pg_conflict_detection slot while checking max_replication_slots.
> Though this error will occur after the upgrade is completed, it may be
> better to include the slot during upgrade itself so that the DBA need
> not handle this error separately after the upgrade is completed.

Thanks for the comments!

I added the suggested changes, but didn't add more tests to verify each error
message in this version because it seems a rare case to me, and I am not sure
if it's worth increasing the testing time for these errors. But I am OK to add
them if people think it's worth the effort, and I will also test this locally.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 2, 2025 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> Sounds reasonable but OTOH, all other places that create physical
> slots (which we are doing here) don't use this trick. So, don't they
> need similar reliability?

I have not figured out the reason for the existing physical slots' handling,
but will think about it more.

> Also, add some comments as to why we are
> initially creating the RS_EPHEMERAL slot as we have at other places.

Added.

> 
> Few other comments on 0003
> =======================
> 1.
> + if (sublist)
> + {
> + bool updated;
> +
> + if (!can_advance_xmin)
> + xmin = InvalidFullTransactionId;
> +
> + updated = advance_conflict_slot_xmin(xmin);
> 
> How will it help to try advancing slot_xmin when xmin is invalid?

It was intended to create the slot without updating the xmin in this case,
but the function name seems misleading. So, I will think more on this and
modify it in the next version, because it may also be affected by the
discussion in [1].

> 
> 2.
> @@ -1167,14 +1181,43 @@ ApplyLauncherMain(Datum main_arg)
>   long elapsed;
> 
>   if (!sub->enabled)
> + {
> + can_advance_xmin = false;
> 
> In ApplyLauncherMain(), if one of the subscriptions is disabled (say
> the last one in sublist), then can_advance_xmin will become false in
> the above code. Now, later, as quoted in comment-1, the patch
> overrides xmin to InvalidFullTransactionId if can_advance_xmin is
> false. Won't that lead to the wrong computation of xmin?

advance_conflict_slot_xmin() would skip updating the slot.xmin
if the input value is invalid. But I will think about how to improve this
in the next version.

> 
> 3.
> + slot_maybe_exist = true;
> + }
> +
> + /*
> + * Drop the slot if we're no longer retaining dead tuples.
> + */
> + else if (slot_maybe_exist)
> + {
> + drop_conflict_slot_if_exists();
> + slot_maybe_exist = false;
> 
> Can't we use MyReplicationSlot instead of introducing a new boolean
> slot_maybe_exist?
> 
> In any case, how does the above code deal with the case where the
> launcher is restarted for some reason and there is no subscription
> after that? Will it be possible to drop the slot in that case?

Since the initial value of slot_maybe_exist is true, I think the launcher would
always check the slot once and drop it if not needed, even if the
launcher restarted.

[1] https://www.postgresql.org/message-id/CAA4eK1Li8XLJ5f-pYvPJ8pXxyA3G-QsyBLNzHY940amF7jm%3D3A%40mail.gmail.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Tue, 7 Jan 2025 at 18:04, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 1:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new version patch set which addressed all other comments.
> > >
> >
> > Some more miscellaneous comments:
>
> Thanks for the comments!
>
> > =============================
> > 1.
> > @@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
> >   * modifying it.  This makes checkpoint's determination of which xacts
> >   * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
> >   */
> > - Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> > + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
> >   START_CRIT_SECTION();
> > - MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> >
> >   /*
> >   * Insert the commit XLOG record.
> > @@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
> >   */
> >   if (markXidCommitted)
> >   {
> > - MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
> >   END_CRIT_SECTION();
> >
> > The comments related to this change should be updated in EndPrepare()
> > and RecordTransactionCommitPrepared(). They still refer to the
> > DELAY_CHKPT_START flag. We should update the comments explaining why
> > a
> > similar change is not required for prepare or commit_prepare, if there
> > is one.
>
> After considering more, I think we need to use the new flag in
> RecordTransactionCommitPrepared() as well, because it is assigned a commit
> timestamp and would be replicated as normal transaction if sub's two_phase is
> not enabled.
>
> > 3.
> > +FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
> > + TransactionId *delete_xid,
> > + RepOriginId *delete_origin,
> > + TimestampTz *delete_time)
> > ...
> > ...
> > + /* Try to find the tuple */
> > + while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
> > + {
> > + bool dead = false;
> > + TransactionId xmax;
> > + TimestampTz localts;
> > + RepOriginId localorigin;
> > +
> > + if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
> > + continue;
> > +
> > + tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
> > + buf = hslot->buffer;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_SHARE);
> > +
> > + if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
> > HEAPTUPLE_RECENTLY_DEAD)
> > + dead = true;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> > +
> > + if (!dead)
> > + continue;
> >
> > Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
> > HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live
> > tuple, now whether the tuple is DEAD or RECENTLY_DEAD, why should it
> > matter to detect update_delete conflict?
>
> The HEAPTUPLE_DEAD could indicate tuples whose inserting transaction was
> aborted, in which case we could not get the commit timestamp or origin for the
> transaction. Or it could indicate tuples deleted by a transaction older than
> oldestXmin(we would take the new replication slot's xmin into account when
> computing this value), which means any subsequent transaction would have commit
> timestamp later than that old delete transaction, so I think it's OK to ignore
> this dead tuple and even detect update_missing because the resolution is to
> apply the subsequent UPDATEs anyway (assuming we are using last update win
> strategy). I added some comments along these lines in the patch.
>
> >
> > 5.
> > +
> > +      <varlistentry
> > id="sql-createsubscription-params-with-detect-update-deleted">
> > +        <term><literal>detect_update_deleted</literal>
> > (<type>boolean</type>)</term>
> > +        <listitem>
> > +         <para>
> > +          Specifies whether the detection of <xref
> > linkend="conflict-update-deleted"/>
> > +          is enabled. The default is <literal>false</literal>. If set to
> > +          true, the dead tuples on the subscriber that are still useful for
> > +          detecting <xref linkend="conflict-update-deleted"/>
> > +          are retained,
> >
> > One of the purposes of retaining dead tuples is to detect
> > update_delete conflict. But, I also see the following in 0001's commit
> > message: "Since the mechanism relies on a single replication slot, it
> > not only assists in retaining dead tuples but also preserves commit
> > timestamps and origin data. These information will be displayed in the
> > additional logs generated for logical replication conflicts.
> > Furthermore, the preserved commit timestamps and origin data are
> > essential for consistently detecting update_origin_differs conflicts."
> > which indicates there are other cases where retaining dead tuples can
> > help. So, I was thinking about whether to name this new option as
> > retain_dead_tuples or something along those lines?
>
> I used the retain_conflict_info in this version as it looks more general and we
> are already using similar name in patch(RetainConflictInfoData), but we can
> change it later if people have better ideas.
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].

A few comments:
1) All other options are ordered; we can mention retain_conflict_info
after password_required to keep it consistent. I think it got
misplaced because of the name change from detect_update_deleted to
retain_conflict_info:
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index bbd08770c3..9d07fbf07a 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -2278,9 +2278,10 @@ match_previous_words(int pattern_id,
                COMPLETE_WITH("(", "PUBLICATION");
        /* ALTER SUBSCRIPTION <name> SET ( */
        else if (Matches("ALTER", "SUBSCRIPTION", MatchAny, MatchAnyN,
"SET", "("))
-               COMPLETE_WITH("binary", "disable_on_error",
"failover", "origin",
-                                         "password_required",
"run_as_owner", "slot_name",
-                                         "streaming",
"synchronous_commit", "two_phase");
+               COMPLETE_WITH("binary", "retain_conflict_info",
"disable_on_error",
+                                         "failover", "origin",
"password_required",
+                                         "run_as_owner", "slot_name",
"streaming",
+                                         "synchronous_commit", "two_phase");

2) Similarly here too:
        /* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
        else if (Matches("CREATE", "SUBSCRIPTION", MatchAnyN, "WITH", "("))
                COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-                                         "disable_on_error", "enabled", "failover", "origin",
-                                         "password_required", "run_as_owner", "slot_name",
-                                         "streaming", "synchronous_commit", "two_phase");
+                                         "retain_conflict_info", "disable_on_error", "enabled",

3) Now that the option detect_update_deleted has been renamed to
retain_conflict_info, we can change this to "Retain conflict info":
+               if (pset.sversion >= 180000)
+                       appendPQExpBuffer(&buf,
+                                                         ", subretainconflictinfo AS \"%s\"\n",
+                                                         gettext_noop("Detect update deleted"));

4) The corresponding test changes should also be updated:
+++ b/src/test/regress/expected/subscription.out
@@ -116,18 +116,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION
'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the
replication slot, enable the subscription, and refresh the
subscription.
 \dRs+ regress_testsub4
-                                                                                  List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Password required | Run as owner? | Failover | Synchronous commit |          Conninfo           | Skip LSN
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-------------------+---------------+----------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | none   | t                 | f             | f        | off                | dbname=regress_doesnotexist | 0/0
+                                                                                              List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Password required | Run as owner? | Failover | Detect update deleted | Synchronous commit |          Conninfo           | Skip LSN
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-------------------+---------------+----------+-----------------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | none   | t                 | f             | f        | f                     | off                | dbname=regress_doesnotexist | 0/0

5) It is not easy to understand that this part of the code is there to
handle wraparound; could we add some comments here? (A restated sketch
follows the fragment below.)
+       if (!TimestampDifferenceExceeds(data->candidate_xid_time, now,
+                                       data->xid_advance_interval))
+               return;
+
+       data->candidate_xid_time = now;
+
+       oldest_running_xid = GetOldestActiveTransactionId();
+       next_full_xid = ReadNextFullTransactionId();
+       epoch = EpochFromFullTransactionId(next_full_xid);
+
+       /* Compute the epoch of the oldest_running_xid */
+       if (oldest_running_xid > XidFromFullTransactionId(next_full_xid))
+               epoch--;
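
For what it's worth, restating the fragment together with the
composition step that would presumably follow it (the last line is an
assumed continuation, not copied from the patch):

       /*
        * If oldest_running_xid is numerically larger than the next XID, it
        * must belong to the previous epoch, so decrement the epoch before
        * composing the full XID used for wraparound-safe comparisons.
        */
       epoch = EpochFromFullTransactionId(next_full_xid);
       if (oldest_running_xid > XidFromFullTransactionId(next_full_xid))
               epoch--;
       full_xid = FullTransactionIdFromEpochAndXid(epoch, oldest_running_xid);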

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Tue, Jan 7, 2025 at 6:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].
>

Here are a couple of initial review comments on v19 patch set:

1) The subscription option 'retain_conflict_info' remains set to
"true" for a subscription even after restarting the server with
'track_commit_timestamp=off', which can lead to incorrect behavior.
  Steps to reproduce:
   1. Start the server with 'track_commit_timestamp=ON'.
   2. Create a subscription with (retain_conflict_info=ON).
   3. Restart the server with 'track_commit_timestamp=OFF'.

 - The apply worker starts successfully, and the subscription retains
'retain_conflict_info=true'. However, in this scenario, the
update_deleted conflict detection will not function correctly without
'track_commit_timestamp'.
```
postgres=# show track_commit_timestamp;
 track_commit_timestamp
------------------------
 off
(1 row)

postgres=# select subname, subretainconflictinfo from pg_subscription;
 subname | subretainconflictinfo
---------+-----------------------
 sub21   | t
 sub22   | t
```

2) With the parameter renamed to "retain_conflict_info", the
error message for both the 'CREATE SUBSCRIPTION' and 'ALTER
SUBSCRIPTION' commands needs to be updated accordingly.

  postgres=# create subscription sub11 connection 'dbname=postgres'
publication pub1 with (retain_conflict_info=on);
  ERROR:  detecting update_deleted conflicts requires
"track_commit_timestamp" to be enabled
  postgres=# alter subscription sub12 set (retain_conflict_info=on);
  ERROR:  detecting update_deleted conflicts requires
"track_commit_timestamp" to be enabled

 - Change the message to something like "retaining conflict info
requires "track_commit_timestamp" to be enabled" (see the sketch below).

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > +       /*
> > > +        * The changes made by this and later transactions are still
> > > non-removable
> > > +        * to allow for the detection of update_deleted conflicts when
> > > applying
> > > +        * changes in this logical replication worker.
> > > +        *
> > > +        * Note that this info cannot directly protect dead tuples from being
> > > +        * prematurely frozen or removed. The logical replication launcher
> > > +        * asynchronously collects this info to determine whether to advance
> > > the
> > > +        * xmin value of the replication slot.
> > > +        *
> > > +        * Therefore, FullTransactionId that includes both the
> > > transaction ID and
> > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > +        * critical because without considering the epoch, the transaction ID
> > > +        * alone may appear as if it is in the future due to transaction ID
> > > +        * wraparound.
> > > +        */
> > > +       FullTransactionId oldest_nonremovable_xid;
> > >
> > > The last paragraph of the comment mentions that we need to use
> > > FullTransactionId to properly compare XIDs even after the XID wraparound
> > > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > > being wraparound, no? I mean that workers'
> > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > xmin) are never away from more than 2^31 XIDs.
> >
> > I think the issue is that the launcher may create the replication slot after
> > the apply worker has already set the 'oldest_nonremovable_xid' because the
> > launcher are doing that asynchronously. So, Before the slot is created, there's
> > a window where transaction IDs might wrap around. If initially the apply worker
> > has computed a candidate_xid (755) and the xid wraparound before the launcher
> > creates the slot, causing the new current xid to be (740), then the old
> > candidate_xid(755) looks like a xid in the future, and the launcher could
> > advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> > (We are trying to reproduce this to ensure that it's a real issue and will
> > share after finishing)
> >
> > We thought of another approach, which is to create/drop this slot first as
> > soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> > during DDL). But it seems complicate to control the concurrent slot
> > create/drop. For example, if one backend A enables detect_update_deteled, it
> > will create a slot. But if another backend B is disabling the
> > detect_update_deteled at the same time, then the newly created slot may be
> > dropped by backend B. I thought about checking the number of subscriptions that
> > enables detect_update_deteled before dropping the slot in backend B, but the
> > subscription changes caused by backend A may not visable yet (e.g. not
> > committed yet).
> >
>
> This means that for the transaction whose changes are not yet visible,
> we may have already created the slot and the backend B would end up
> dropping it. Is it possible that during the change of this new option
> via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> DropSubscription() to ensure that concurrent transactions can't drop
> the slot? Will that help in solving the above scenario?

If we create/stop the slot during DDL, how do we support rollback DDLs?

>
> The second idea could be that each worker first checks whether a slot
> exists along with a subscription flag (new option). Checking the
> existence of a slot each time would be costly, so we somehow need to
> cache it. But if we do that then we need to invent some cache
> invalidation mechanism for the slot. I am not sure if we can design a
> race-free mechanism for that. I mean we need to think of a solution
> for race conditions between the launcher and apply workers to ensure
> that after dropping the slot, launcher doesn't recreate the slot (say
> if some subscription enables this option) before all the workers can
> clear their existing values of oldest_nonremovable_xid.
>
> The third idea to avoid the race condition could be that in the
> function InitializeLogRepWorker() after CommitTransactionCommand(), we
> check if the retain_dead_tuples flag is true for MySubscription then
> we check whether the system slot exists. If exits then go ahead,
> otherwise, wait till the slot is created. It could be some additional
> cycles during worker start up but it is a one-time effort and that too
> only when the flag is set. In addition to this, we anyway need to
> create the slot in the launcher before launching the workers, and
> after re-reading the subscription, the change in retain_dead_tuples
> flag (off->on) should cause the worker restart.
>
> Now, in the third idea, the issue can still arise if, after waiting
> for the slot to be created, the user sets the retain_dead_tuples to
> false and back to true again immediately. Because the launcher may
> have noticed the "retain_dead_tuples=false" operation and dropped the
> slot, while the apply worker has not noticed and still holds an old
> candidate_xid. The xid may wraparound in this window before setting
> the retain_dead_tuples back to true. And, the apply worker would not
> restart because after it calls maybe_reread_subscription(), the
> retain_dead_tuples would have been set back to true again. Again, to
> avoid this race condition, the launcher can wait for each worker to
> reset the oldest_nonremovamble_xid before dropping the slot.
>
> Even after doing the above, the third idea could still have another
> race condition:
> 1. The launcher creates the replication slot and starts a worker with
> retain_dead_tuples = true, the worker is waiting for publish status
> and has not set oldest_nonremovable_xid.
> 2. The user set the option retain_dead_tuples to false, the launcher
> noticed that and drop the replication slot.
> 3. The worker received the status and set oldest_nonremovable_xid to a
> valid value (say 750).
> 4. Xid wraparound happened at this point and say new_available_xid becomes 740
> 5. User set retain_dead_tuples = true again.
>
> After the above steps, the apply worker holds an old
> oldest_nonremovable_xid (750) and will not restart if it does not call
> maybe_reread_subscription() before step 5. So, such a case can again
> create a problem of incorrect slot->xmin value. We can probably try to
> find some way to avoid this race condition as well but I haven't
> thought more about this as this idea sounds a bit risky and bug-prone
> to me.
>
> Among the above ideas, the first idea of taking AccessExclusiveLock on
> pg_subscription sounds safest to me. I haven't evaluated the changes
> for the first approach so I could be missing something that makes it
> difficult to achieve but I think it is worth investigating unless we
> have better ideas or we think that the current approach used in patch
> to use FullTransactionId is okay.

Thank you for considering some ideas. As I mentioned above, we might
need to consider a case where 'CREATE SUBSCRIPTION ..
(retain_conflict_info = true)' is rolled back. Having said that, this
comment is just about simplifying the logic. If using TransactionId
instead makes other parts complex, it would not make sense. I'm okay
with leaving this part and improving the comment for
oldest_nonremovable_xid, say, by mentioning that there is a window for
XID wraparound to happen between workers computing their
oldest_nonremovable_xid and the pg_conflict_detection slot being created.

BTW, while reviewing the code, I realized that changing the
retain_conflict_info value doesn't relaunch the worker, and we don't
clear the worker's oldest_nonremovable_xid value in this case.
Is that okay? I'm concerned about a case where the
RetainConflictInfoPhase state transition is paused by disabling
retain_conflict_info and resumed by re-enabling it with an old
RetainConflictInfoData value.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jan 8, 2025 at 2:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > We thought of another approach, which is to create/drop this slot first as
> > > soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> > > during DDL). But it seems complicate to control the concurrent slot
> > > create/drop. For example, if one backend A enables detect_update_deteled, it
> > > will create a slot. But if another backend B is disabling the
> > > detect_update_deteled at the same time, then the newly created slot may be
> > > dropped by backend B. I thought about checking the number of subscriptions that
> > > enables detect_update_deteled before dropping the slot in backend B, but the
> > > subscription changes caused by backend A may not visable yet (e.g. not
> > > committed yet).
> > >
> >
> > This means that for the transaction whose changes are not yet visible,
> > we may have already created the slot and the backend B would end up
> > dropping it. Is it possible that during the change of this new option
> > via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> > DropSubscription() to ensure that concurrent transactions can't drop
> > the slot? Will that help in solving the above scenario?
>
> If we create/stop the slot during DDL, how do we support rollback DDLs?
>

We will prevent changing this setting in a transaction block, as we
already do for the slot-related cases. See the use of
PreventInTransactionBlock() in subscriptioncmds.c.
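
A minimal sketch of how that restriction typically looks in
AlterSubscription() (SUBOPT_RETAIN_CONFLICT_INFO and the option spelling
are assumptions here, not the patch's code):

    /* Assumed sketch: disallow toggling the option inside a transaction
     * block, mirroring the existing slot-related restrictions. */
    if (IsSet(opts.specified_opts, SUBOPT_RETAIN_CONFLICT_INFO))
        PreventInTransactionBlock(isTopLevel,
                                  "ALTER SUBSCRIPTION ... SET (retain_conflict_info)");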

>
> Thank you for considering some ideas. As I mentioned above, we might
> need to consider a case like where 'CREATE SUBSCRIPTION ..
> (retain_conflict_info = true)' is rolled back. Having said that, this
> comment is just for simplifying the logic. If using TransactionId
> instead makes other parts complex, it would not make sense. I'm okay
> with leaving this part and improving the comment for
> oldest_nonremovable_xid, say, by mentioning that there is a window for
> XID wraparound happening between workers computing their
> oldst_nonremovable_xid and pg_conflict_detection slot being created.
>

Fair enough. Let us see what you think about my above response first.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Here is further performance test analysis with v16 patch-set.
>
>
> In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub
> setup, no performance degradation was observed on either node.
>
>
>
> In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS performance
> was reduced due to dead tuple accumulation. This performance drop depended on the wal_receiver_status_interval—larger
> intervals resulted in more dead tuple accumulation on the subscriber node. However, after the improvement in patch
> v16-0002, which dynamically tunes the status request, the default TPS reduction was limited to only 1%.
>
>
>
> We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber,
> focusing on TPS performance. To summarize the key observations:
>
>  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.

Nice. It means that frequently getting in-commit-phase transactions by
the subscriber didn't have a negative impact on the publisher's
performance.

>
>  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3] ) due to dead tuple retention for the
> conflict detection when detect_update_deleted=on.
>
>  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the
> apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before
> advancing the non-removable XID to remove dead tuples.

Assuming that the performance dip happened due to dead tuple retention
for the conflict detection, would TPS on other databases also be
affected?

>
>
> [3] Test with pgbench run on both publisher and subscriber.
>
>
>
> Test setup:
>
> - Tests performed on pgHead + v16 patches
>
> - Created a pub-sub replication system.
>
> - Parameters for both instances were:
>
>
>
>    share_buffers = 30GB
>
>    min_wal_size = 10GB
>
>    max_wal_size = 20GB
>
>    autovacuum = false

Since you disabled autovacuum on the subscriber, dead tuples created
by non-hot updates accumulate anyway regardless of the
detect_update_deleted setting, is that right?

> Test Run:
>
> - Ran pgbench(read-write) on both the publisher and the subscriber with 30 clients for a duration of 120 seconds,
> collecting data over 5 runs.
>
> - Note that pgbench was running for different tables on pub and sub.
>
> (The scripts used for test "case1-2_measure.sh" and case1-2_setup.sh" are attached).
>
>
>
> Results:
>
>
>
> Run#                   pub TPS              sub TPS
>
> 1                         32209   13704
>
> 2                         32378   13684
>
> 3                         32720   13680
>
> 4                         31483   13681
>
> 5                         31773   13813
>
> median               32209   13684
>
> regression          7%         -53%

What was the TPS on the subscriber when detect_update_deleted = false?
And how much were the tables bloated compared to when
detect_update_deleted = false?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> >
> > Here is further performance test analysis with v16 patch-set.
> >
> >
> > In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub
> > setup, no performance degradation was observed on either node.
> >
> >
> >
> > In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS
> > performance was reduced due to dead tuple accumulation. This performance drop depended on the
> > wal_receiver_status_interval—larger intervals resulted in more dead tuple accumulation on the subscriber node. However,
> > after the improvement in patch v16-0002, which dynamically tunes the status request, the default TPS reduction was
> > limited to only 1%.
> >
> >
> >
> > We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber,
> > focusing on TPS performance. To summarize the key observations:
> >
> >  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.
>
> Nice. It means that frequently getting in-commit-phase transactions by
> the subscriber didn't have a negative impact on the publisher's
> performance.
>
> >
> >  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3] ) due to dead tuple retention for
> > the conflict detection when detect_update_deleted=on.
> >
> >  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the
> > apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before
> > advancing the non-removable XID to remove dead tuples.
>
> Assuming that the performance dip happened due to dead tuple retention
> for the conflict detection, would TPS on other databases also be
> affected?
>

As we use slot->xmin to retain dead tuples, shouldn't the impact be
global (meaning on all databases)? Or maybe I am missing something.

> >
> >
> > [3] Test with pgbench run on both publisher and subscriber.
> >
> >
> >
> > Test setup:
> >
> > - Tests performed on pgHead + v16 patches
> >
> > - Created a pub-sub replication system.
> >
> > - Parameters for both instances were:
> >
> >
> >
> >    share_buffers = 30GB
> >
> >    min_wal_size = 10GB
> >
> >    max_wal_size = 20GB
> >
> >    autovacuum = false
>
> Since you disabled autovacuum on the subscriber, dead tuples created
> by non-hot updates are accumulated anyway regardless of
> detect_update_deleted setting, is that right?
>

I think the hot-pruning mechanism during the update operation will
remove dead tuples even when autovacuum is disabled.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> > >
> > > Here is further performance test analysis with v16 patch-set.
> > >
> > >
> > > In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a
> > > pub-sub setup, no performance degradation was observed on either node.
> > >
> > >
> > >
> > > In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS
> > > performance was reduced due to dead tuple accumulation. This performance drop depended on the
> > > wal_receiver_status_interval—larger intervals resulted in more dead tuple accumulation on the subscriber node. However,
> > > after the improvement in patch v16-0002, which dynamically tunes the status request, the default TPS reduction was
> > > limited to only 1%.
> > >
> > >
> > >
> > > We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber,
> > > focusing on TPS performance. To summarize the key observations:
> > >
> > >  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.
> >
> > Nice. It means that frequently getting in-commit-phase transactions by
> > the subscriber didn't have a negative impact on the publisher's
> > performance.
> >
> > >
> > >  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3] ) due to dead tuple retention for
> > > the conflict detection when detect_update_deleted=on.
> > >
> > >  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the
> > > apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before
> > > advancing the non-removable XID to remove dead tuples.
> >
> > Assuming that the performance dip happened due to dead tuple retention
> > for the conflict detection, would TPS on other databases also be
> > affected?
> >
>
> As we use slot->xmin to retain dead tuples, shouldn't the impact be
> global (means on all databases)?

I think so too.

>
> > >
> > >
> > > [3] Test with pgbench run on both publisher and subscriber.
> > >
> > >
> > >
> > > Test setup:
> > >
> > > - Tests performed on pgHead + v16 patches
> > >
> > > - Created a pub-sub replication system.
> > >
> > > - Parameters for both instances were:
> > >
> > >
> > >
> > >    share_buffers = 30GB
> > >
> > >    min_wal_size = 10GB
> > >
> > >    max_wal_size = 20GB
> > >
> > >    autovacuum = false
> >
> > Since you disabled autovacuum on the subscriber, dead tuples created
> > by non-hot updates are accumulated anyway regardless of
> > detect_update_deleted setting, is that right?
> >
>
> I think hot-pruning mechanism during the update operation will remove
> dead tuples even when autovacuum is disabled.

True, but why was autovacuum disabled? It seems that
case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates
less likely to happen.

I understand that a certain performance dip happens due to dead tuple
retention, which is fine, but I'm surprised that the TPS decreased by
50% within 120 seconds. Does the TPS get even worse for a longer test?
I did a quick benchmark where I completely disabled removing dead
tuples (by autovacuum=off and a logical slot) and ran pgbench, but I
didn't see such a precipitous dip.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> <nisha.moond412@gmail.com> wrote:
> > > >
> > > >
> > > > [3] Test with pgbench run on both publisher and subscriber.
> > > >
> > > >
> > > >
> > > > Test setup:
> > > >
> > > > - Tests performed on pgHead + v16 patches
> > > >
> > > > - Created a pub-sub replication system.
> > > >
> > > > - Parameters for both instances were:
> > > >
> > > >
> > > >
> > > >    share_buffers = 30GB
> > > >
> > > >    min_wal_size = 10GB
> > > >
> > > >    max_wal_size = 20GB
> > > >
> > > >    autovacuum = false
> > >
> > > Since you disabled autovacuum on the subscriber, dead tuples created
> > > by non-hot updates are accumulated anyway regardless of
> > > detect_update_deleted setting, is that right?
> > >
> >
> > I think hot-pruning mechanism during the update operation will remove
> > dead tuples even when autovacuum is disabled.
> 
> True, but why did it disable autovacuum? It seems that case1-2_setup.sh
> doesn't specify fillfactor, which makes hot-updates less likely to happen.

IIUC, we disable autovacuum as a general practice in read-write tests for
stable TPS numbers.

> 
> I understand that a certain performance dip happens due to dead tuple
> retention, which is fine, but I'm surprised that the TPS decreased by 50% within
> 120 seconds. The TPS goes even worse for a longer test?

We will try to increase the time and run the test again.

> I did a quick
> benchmark where I completely disabled removing dead tuples (by
> autovacuum=off and a logical slot) and ran pgbench but I didn't see such a
> precipitous dip.

I think a logical slot only retains the dead tuples in the system catalogs,
so the TPS on user tables would not be affected that much.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Tue, 7 Jan 2025 at 18:04, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 1:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new version patch set which addressed all other comments.
> > >
> >
> > Some more miscellaneous comments:
>
> Thanks for the comments!
>
> > =============================
> > 1.
> > @@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
> >   * modifying it.  This makes checkpoint's determination of which xacts
> >   * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
> >   */
> > - Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> > + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
> >   START_CRIT_SECTION();
> > - MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> >
> >   /*
> >   * Insert the commit XLOG record.
> > @@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
> >   */
> >   if (markXidCommitted)
> >   {
> > - MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
> >   END_CRIT_SECTION();
> >
> > The comments related to this change should be updated in EndPrepare()
> > and RecordTransactionCommitPrepared(). They still refer to the
> > DELAY_CHKPT_START flag. We should update the comments explaining why
> > a
> > similar change is not required for prepare or commit_prepare, if there
> > is one.
>
> After considering more, I think we need to use the new flag in
> RecordTransactionCommitPrepared() as well, because it is assigned a commit
> timestamp and would be replicated as normal transaction if sub's two_phase is
> not enabled.
>
> > 3.
> > +FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
> > + TransactionId *delete_xid,
> > + RepOriginId *delete_origin,
> > + TimestampTz *delete_time)
> > ...
> > ...
> > + /* Try to find the tuple */
> > + while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
> > + {
> > + bool dead = false;
> > + TransactionId xmax;
> > + TimestampTz localts;
> > + RepOriginId localorigin;
> > +
> > + if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
> > + continue;
> > +
> > + tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
> > + buf = hslot->buffer;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_SHARE);
> > +
> > + if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
> > HEAPTUPLE_RECENTLY_DEAD)
> > + dead = true;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> > +
> > + if (!dead)
> > + continue;
> >
> > Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
> > HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live
> > tuple, now whether the tuple is DEAD or RECENTLY_DEAD, why should it
> > matter to detect update_delete conflict?
>
> The HEAPTUPLE_DEAD could indicate tuples whose inserting transaction was
> aborted, in which case we could not get the commit timestamp or origin for the
> transaction. Or it could indicate tuples deleted by a transaction older than
> oldestXmin(we would take the new replication slot's xmin into account when
> computing this value), which means any subsequent transaction would have commit
> timestamp later than that old delete transaction, so I think it's OK to ignore
> this dead tuple and even detect update_missing because the resolution is to
> apply the subsequent UPDATEs anyway (assuming we are using last update win
> strategy). I added some comments along these lines in the patch.
>
> >
> > 5.
> > +
> > +      <varlistentry
> > id="sql-createsubscription-params-with-detect-update-deleted">
> > +        <term><literal>detect_update_deleted</literal>
> > (<type>boolean</type>)</term>
> > +        <listitem>
> > +         <para>
> > +          Specifies whether the detection of <xref
> > linkend="conflict-update-deleted"/>
> > +          is enabled. The default is <literal>false</literal>. If set to
> > +          true, the dead tuples on the subscriber that are still useful for
> > +          detecting <xref linkend="conflict-update-deleted"/>
> > +          are retained,
> >
> > One of the purposes of retaining dead tuples is to detect
> > update_delete conflict. But, I also see the following in 0001's commit
> > message: "Since the mechanism relies on a single replication slot, it
> > not only assists in retaining dead tuples but also preserves commit
> > timestamps and origin data. These information will be displayed in the
> > additional logs generated for logical replication conflicts.
> > Furthermore, the preserved commit timestamps and origin data are
> > essential for consistently detecting update_origin_differs conflicts."
> > which indicates there are other cases where retaining dead tuples can
> > help. So, I was thinking about whether to name this new option as
> > retain_dead_tuples or something along those lines?
>
> I used the retain_conflict_info in this version as it looks more general and we
> are already using similar name in patch(RetainConflictInfoData), but we can
> change it later if people have better ideas.
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].

Consider a LR setup with retain_conflict_info=true for a table t1:
Publisher:
insert into t1 values(1);
-- Have a open transaction before delete operation in subscriber
begin;

Subscriber:
-- delete the record that was replicated
delete from t1;

-- Now commit the transaction in publisher
Publisher:
update t1 set c1 = 2;
commit;

In the normal case, the update_deleted conflict is detected:
2025-01-08 15:41:38.529 IST [112744] LOG:  conflict detected on
relation "public.t1": conflict=update_deleted
2025-01-08 15:41:38.529 IST [112744] DETAIL:  The row to be updated
was deleted locally in transaction 751 at 2025-01-08
15:41:29.811566+05:30.
        Remote tuple (2); replica identity full (1).
2025-01-08 15:41:38.529 IST [112744] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 747, finished
at 0/16FBCA0

Now execute the same case as above, but with a presetup that consumes
all the replication slots in the system by executing
pg_create_logical_replication_slot before the subscription is created;
in this case the conflict is not detected correctly:
2025-01-08 15:39:17.931 IST [112551] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-08 15:39:17.931 IST [112551] DETAIL:  Could not find the row
to be updated.
        Remote tuple (2); replica identity full (1).
2025-01-08 15:39:17.931 IST [112551] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 747, finished
at 0/16FBC68
2025-01-08 15:39:18.266 IST [112582] ERROR:  all replication slots are in use
2025-01-08 15:39:18.266 IST [112582] HINT:  Free one or increase
"max_replication_slots".

This is because, even though CREATE SUBSCRIPTION reports success,
the launcher has not yet created the replication slot.

There are a few observations from this test:
1) CREATE SUBSCRIPTION does not wait for the slot to be created by the
launcher and starts applying the changes. Should CREATE SUBSCRIPTION
wait till the slot is created by the launcher process?
2) Currently the launcher exits continuously and keeps retrying to
create the replication slot. Should the launcher wait for
wal_retrieve_retry_interval before trying to create the slot again,
instead of filling the logs continuously? (A rough sketch of such
rate-limiting follows after this list.)
3) If we try to create a similar subscription with the
retain_conflict_info and disable_on_error options and there is an error
in replication slot creation, currently the subscription does not get
disabled. Should we consider disable_on_error for these cases and
disable the subscription if we are not able to create the slots?
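
For observation 2, a rough sketch of such rate-limiting in the launcher
(assumed names on the launcher side; not the posted patch):

    /* Assumed sketch: retry slot creation at most once per
     * wal_retrieve_retry_interval instead of in a tight loop. */
    static TimestampTz last_slot_attempt = 0;
    TimestampTz now = GetCurrentTimestamp();

    if (TimestampDifferenceExceeds(last_slot_attempt, now,
                                   wal_retrieve_retry_interval))
    {
        last_slot_attempt = now;
        /* try to create the conflict-detection slot here */
    }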

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > <nisha.moond412@gmail.com> wrote:
> > > > >
> > > > >
> > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > >
> > > > >
> > > > >
> > > > > Test setup:
> > > > >
> > > > > - Tests performed on pgHead + v16 patches
> > > > >
> > > > > - Created a pub-sub replication system.
> > > > >
> > > > > - Parameters for both instances were:
> > > > >
> > > > >
> > > > >
> > > > >    share_buffers = 30GB
> > > > >
> > > > >    min_wal_size = 10GB
> > > > >
> > > > >    max_wal_size = 20GB
> > > > >
> > > > >    autovacuum = false
> > > >
> > > > Since you disabled autovacuum on the subscriber, dead tuples created
> > > > by non-hot updates are accumulated anyway regardless of
> > > > detect_update_deleted setting, is that right?
> > > >
> > >
> > > I think hot-pruning mechanism during the update operation will remove
> > > dead tuples even when autovacuum is disabled.
> >
> > True, but why did it disable autovacuum? It seems that case1-2_setup.sh
> > doesn't specify fillfactor, which makes hot-updates less likely to happen.
>
> IIUC, we disable autovacuum as a general practice in read-write tests for
> stable TPS numbers.

Okay. TBH I'm not sure what we can say with these results. At a
glance, in a typical bi-directional-like setup, we can interpret
these results as saying that if users turn retain_conflict_info on,
the TPS goes down by 50%. But I'm not sure this 50% dip is the worst
case that users could possibly face. It could be better in practice
thanks to autovacuum, or it could go even worse due to further bloat
if we run the test longer.

Suppose that users had a 50% performance dip due to dead tuple
retention for update_deleted detection; is there any way for users to
improve the situation? For example, trying to advance slot.xmin more
frequently might help to reduce dead tuple accumulation. I think it
would be good if we could have a way to balance the publisher
performance and the subscriber performance.

In test case 3, we observed a -53% performance dip, which is worse
than the results of test case 5 with wal_receiver_status_interval =
100s. Given that in test case 5 with wal_receiver_status_interval =
100s we cannot remove dead tuples for most of the whole 120s test
time, we probably could not remove dead tuples for a long time in test
case 3 either. I expected that the apply worker gets remote
transaction XIDs and tries to advance slot.xmin more frequently, so
this performance dip surprised me. I would like to know how many times
the apply worker gets remote transaction XIDs and succeeds in
advancing slot.xmin during the test.

>
> >
> > I understand that a certain performance dip happens due to dead tuple
> > retention, which is fine, but I'm surprised that the TPS decreased by 50% within
> > 120 seconds. The TPS goes even worse for a longer test?
>
> We will try to increase the time and run the test again.
>
> > I did a quick
> > benchmark where I completely disabled removing dead tuples (by
> > autovacuum=off and a logical slot) and ran pgbench but I didn't see such a
> > precipitous dip.
>
> I think a logical slot only retain the dead tuples on system catalog,
> so the TPS on user table would not be affected that much.

You're right, I missed it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > <nisha.moond412@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Test setup:
> > > > > >
> > > > > > - Tests performed on pgHead + v16 patches
> > > > > >
> > > > > > - Created a pub-sub replication system.
> > > > > >
> > > > > > - Parameters for both instances were:
> > > > > >
> > > > > >
> > > > > >
> > > > > >    share_buffers = 30GB
> > > > > >
> > > > > >    min_wal_size = 10GB
> > > > > >
> > > > > >    max_wal_size = 20GB
> > > > > >
> > > > > >    autovacuum = false
> > > > >
> > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > created by non-hot updates are accumulated anyway regardless of
> > > > > detect_update_deleted setting, is that right?
> > > > >
> > > >
> > > > I think hot-pruning mechanism during the update operation will
> > > > remove dead tuples even when autovacuum is disabled.
> > >
> > > True, but why did it disable autovacuum? It seems that
> > > case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates less
> likely to happen.
> >
> > IIUC, we disable autovacuum as a general practice in read-write tests
> > for stable TPS numbers.
> 
> Okay. TBH I'm not sure what we can say with these results. At a glance, in a
> typical bi-directional-like setup,  we can interpret these results as that if
> users turn retain_conflict_info on the TPS goes 50% down.  But I'm not sure
> this 50% dip is the worst case that users possibly face. It could be better in
> practice thanks to autovacuum, or it also could go even worse due to further
> bloats if we run the test longer.

I think it shouldn't get worse, because ideally the amount of bloat would not
increase beyond what we see here due to this patch, unless there is some
misconfiguration that leads to one of the nodes not working properly (say it is
down). However, my colleague is running longer tests and we will share the
results soon.

> Suppose that users had 50% performance dip due to dead tuple retention for
> update_deleted detection, is there any way for users to improve the situation?
> For example, trying to advance slot.xmin more frequently might help to reduce
> dead tuple accumulation. I think it would be good if we could have a way to
> balance between the publisher performance and the subscriber performance.

AFAICS, most of the time in each xid advancement is spent waiting for the
target remote_lsn to be applied and flushed, so increasing the frequency would
not help. Test case 4 shared by Nisha[1] supports this: in that test, we do not
request a remote_lsn but simply wait for the commit_ts of the incoming
transaction to exceed the candidate_xid_time, and the regression is still the
same. I think it indicates that we indeed need to wait for this amount of time
before applying all the transactions that have an earlier commit timestamp.
IOW, the performance impact on the subscriber side is reasonable behavior if we
want to detect the update_deleted conflict reliably.

[1] https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
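
To illustrate the point, here is a rough sketch (using the names mentioned in
this thread, not the patch's actual code) of the simplified wait used in that
test:

/*
 * Rough sketch only, not the patch's actual code: in test case 4 we skip
 * requesting a remote_lsn and merely wait until an incoming remote
 * transaction commits after the time at which the candidate xid was chosen.
 */
static bool
candidate_xid_waited_long_enough(TimestampTz remote_commit_ts,
                                 TimestampTz candidate_xid_time)
{
    return remote_commit_ts > candidate_xid_time;
}

Even this minimal condition shows the same regression, which is why the wait
itself, not the advancement frequency, is the cost.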

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
>
...
> In test case 3, we observed a -53% performance dip, which is worse than the
> results of test case 5 with wal_receiver_status_interval = 100s. Given that
> in test case 5 with wal_receiver_status_interval = 100s we cannot remove dead
> tuples for most of the 120s test time, dead tuples probably could not be
> removed for a long time in test case 3 either. I expected that the apply
> worker gets remote transaction XIDs and tries to advance slot.xmin more
> frequently, so this performance dip surprised me.
 
As noted in my previous email[1], the delay primarily occurs during the final
phase (RCI_WAIT_FOR_LOCAL_FLUSH), where we wait for concurrent transactions
from the publisher to be applied and flushed locally (e.g., last_flushpos <
data->remote_lsn). I think that while the interval between transaction ID
advancements is brief, the duration of each advancement itself is significant.
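
As a minimal sketch (illustrative only; the names follow this thread, not the
exact patch code), the check that gates this final phase is essentially:

/*
 * Illustrative sketch of the RCI_WAIT_FOR_LOCAL_FLUSH condition; not the
 * patch's actual code.  The non-removable xid can only be advanced once
 * everything the publisher sent up to remote_lsn has been applied and
 * flushed locally.
 */
static bool
rci_local_flush_done(XLogRecPtr last_flushpos, XLogRecPtr remote_lsn)
{
    return last_flushpos >= remote_lsn;
}

So even if we pick a new candidate xid very frequently, each advancement still
has to sit in this phase until the apply of concurrent remote transactions
catches up.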
 
> I would like to know how many times the apply worker gets remote transaction
> XIDs and succeeds in advancing slot.xmin during the test.
 
My colleague will collect and share the data soon.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57164C9A65F29875AE63F0BD94132%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jan 8, 2025 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2025 at 2:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > We thought of another approach, which is to create/drop this slot first as
> > > > soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> > > > during DDL). But it seems complicated to control the concurrent slot
> > > > create/drop. For example, if one backend A enables detect_update_deleted, it
> > > > will create a slot. But if another backend B is disabling
> > > > detect_update_deleted at the same time, then the newly created slot may be
> > > > dropped by backend B. I thought about checking the number of subscriptions that
> > > > enable detect_update_deleted before dropping the slot in backend B, but the
> > > > subscription changes caused by backend A may not be visible yet (e.g. not
> > > > committed yet).
> > > >
> > >
> > > This means that for the transaction whose changes are not yet visible,
> > > we may have already created the slot and the backend B would end up
> > > dropping it. Is it possible that during the change of this new option
> > > via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> > > DropSubscription() to ensure that concurrent transactions can't drop
> > > the slot? Will that help in solving the above scenario?
> >
> > If we create/drop the slot during DDL, how do we support rolling back the DDL?
> >
>
> We will prevent changing this setting in a transaction block as we
> already do for slot related case. See use of
> PreventInTransactionBlock() in subscriptioncmds.c.
>

On further thinking, even if we prevent this command in a transaction
block, there is still a small chance of rollback. Say, we created the
slot as the last operation after making database changes, but still,
the transaction can fail in the commit code path. So, it is still not
bulletproof. However, we already create a remote_slot at the end of
CREATE SUBSCRIPTION, so, if by any chance the transaction fails in the
commit code path, we will end up having a dangling slot on the remote
node. The same can happen in the DROP SUBSCRIPTION code path as well.
We can follow that, or the other option is to allow creation of the
slot by the backend and let the drop be handled by the launcher, which can
even take care of dangling slots. However, I feel it will be better to
give the responsibility to the launcher for creating and dropping the
slot as the patch is doing and use the FullTransactionId for each
worker. What do you think?
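
For reference, the existing slot-related pattern I mentioned above from
subscriptioncmds.c looks roughly like this (a sketch only;
SUBOPT_RETAIN_CONFLICT_INFO is an illustrative constant, not necessarily the
patch's actual symbol):

/*
 * Sketch only, following the existing slot-related pattern in
 * subscriptioncmds.c.
 */
if (IsSet(opts.specified_opts, SUBOPT_RETAIN_CONFLICT_INFO))
    PreventInTransactionBlock(isTopLevel,
                              "ALTER SUBSCRIPTION ... SET (retain_conflict_info)");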

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, January 8, 2025 7:03 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

> Consider an LR setup with retain_conflict_info=true for a table t1:
> Publisher:
> insert into t1 values(1);
> -- Have an open transaction before the delete operation in the subscriber
> begin;
> 
> Subscriber:
> -- delete the record that was replicated
> delete from t1;
> 
> -- Now commit the transaction in publisher
> Publisher:
> update t1 set c1 = 2;
> commit;
> 
> In normal case update_deleted conflict is detected
> 2025-01-08 15:41:38.529 IST [112744] LOG:  conflict detected on relation
> "public.t1": conflict=update_deleted
> 2025-01-08 15:41:38.529 IST [112744] DETAIL:  The row to be updated was
> deleted locally in transaction 751 at 2025-01-08 15:41:29.811566+05:30.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:41:38.529 IST [112744] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBCA0
> 
> Now execute the same above case by having a presetup to consume all the
> replication slots in the system by executing pg_create_logical_replication_slot
> before the subscription is created, in this case the conflict is not detected
> correctly.
> 2025-01-08 15:39:17.931 IST [112551] LOG:  conflict detected on relation
> "public.t1": conflict=update_missing
> 2025-01-08 15:39:17.931 IST [112551] DETAIL:  Could not find the row to be
> updated.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:39:17.931 IST [112551] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBC68
> 2025-01-08 15:39:18.266 IST [112582] ERROR:  all replication slots are in use
> 2025-01-08 15:39:18.266 IST [112582] HINT:  Free one or increase
> "max_replication_slots".
> 
> This is because even though we say create subscription is successful, the
> launcher has not yet created the replication slot.

I think some missed detections in the beginning, right after enabling the
option, are acceptable. Even if we let the launcher create the slot before
starting workers, some dead tuples could already have been removed during this
period, so update_missing could still be detected. I have added some
documentation to clarify that the information can be safely retained only after
the slot is created.

> 
> There are a few observations from this test:
> 1) CREATE SUBSCRIPTION does not wait for the slot to be created by the launcher
> and starts applying the changes. Should CREATE SUBSCRIPTION wait until the slot
> is created by the launcher process?

I think the DDL cannot wait for the slot creation, because the launcher would
not create the slot until the DDL is committed. Instead, I have changed the
code to create the slot before starting workers, so that at least the workers
would not unnecessarily maintain the oldest non-removable xid.
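
As a rough sketch of the new ordering (the helper names below are illustrative
placeholders, not the patch's actual symbols):

/*
 * Hypothetical sketch of the launcher-side ordering described above; the
 * helper names are placeholders.  The only point is that the
 * conflict-detection slot exists before any apply worker that relies on it
 * is started.
 */
static void
launch_subscription_workers(void)
{
    if (any_subscription_retains_conflict_info() &&
        !conflict_detection_slot_exists())
        create_conflict_detection_slot();   /* e.g. "pg_conflict_detection" */

    start_apply_workers();      /* workers can now rely on the slot */
}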

> 2) Currently the launcher exits continuously and keeps trying to create the
> replication slot. Should the launcher wait for the wal_retrieve_retry_interval
> configuration before trying to create the slot again, instead of filling the
> logs continuously?

Since the launcher already has a 5s restart interval (bgw_restart_time), I
feel it would not consume too many resources in this case.

> 3) If we create a similar subscription with the retain_conflict_info and
> disable_on_error options and there is an error in replication slot creation,
> currently the subscription does not get disabled. Should we consider
> disable_on_error for these cases and disable the subscription if we are not
> able to create the slots?

Currently, since only ERRORs in the apply worker trigger disable_on_error, I
am not sure it's worth the effort to teach the apply worker to catch the
launcher's errors, because it doesn't seem like a common scenario.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, January 8, 2025 3:49 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> 
> On Tue, Jan 7, 2025 at 6:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].
> >
> 
> Here are a couple of initial review comments on v19 patch set:
> 
> 1) The subscription option 'retain_conflict_info' remains set to "true" for a
> subscription even after restarting the server with
> 'track_commit_timestamp=off', which can lead to incorrect behavior.
>   Steps to reproduce:
>    1. Start the server with 'track_commit_timestamp=ON'.
>    2. Create a subscription with (retain_conflict_info=ON).
>    3. Restart the server with 'track_commit_timestamp=OFF'.
> 
>  - The apply worker starts successfully, and the subscription retains
> 'retain_conflict_info=true'. However, in this scenario, the update_deleted
> conflict detection will not function correctly without
> 'track_commit_timestamp'.

IIUC, track_commit_timestamp is a GUC designed mainly for conflict detection,
so it seems unreasonable behavior to me if a user enables it when creating the
subscription but disables it afterwards. Besides, we have documented that the
update_deleted conflict would not be detected when track_commit_timestamp is
not enabled, so I am not sure it's worth more effort to add checks for this
case.

> 
> 2) With the new parameter name change to "retain_conflict_info", the error
> message for both the 'CREATE SUBSCRIPTION' and 'ALTER SUBSCRIPTION'
> commands needs to be updated accordingly.
> 
>   postgres=# create subscription sub11 connection 'dbname=postgres'
> publication pub1 with (retain_conflict_info=on);
>   ERROR:  detecting update_deleted conflicts requires
> "track_commit_timestamp" to be enabled
>   postgres=# alter subscription sub12 set (retain_conflict_info=on);
>   ERROR:  detecting update_deleted conflicts requires
> "track_commit_timestamp" to be enabled
> 
>  - Change the message to something similar - "retaining conflict info requires
> "track_commit_timestamp" to be enabled".

After thinking more, I changed this to a warning for now, because to detect
all necessary conflicts, users must enable the option anyway, and the same has
been documented for the update/delete_origin_differs conflicts as well.
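
For illustration, the relaxed check looks roughly like this (a sketch, not the
exact patch hunk; opts.retainconflictinfo stands in for the patch's new
option, while track_commit_timestamp is the existing GUC variable):

/*
 * Sketch only, not the exact patch hunk: warn instead of erroring out when
 * conflict info retention is requested but track_commit_timestamp is off,
 * matching how update/delete_origin_differs are documented.
 */
if (opts.retainconflictinfo && !track_commit_timestamp)
    ereport(WARNING,
            errmsg("conflict detection may be incomplete because \"%s\" is disabled",
                   "track_commit_timestamp"));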

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Thu, Jan 23, 2025 at 3:47 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 22, 2025 7:54 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > On Saturday, January 18, 2025 11:45 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > > I think invalidating the slot is OK and we could also let the apply
> > > worker to automatic recovery as suggested in [1].
> > >
> > > Here is the V24 patch set. I modified 0004 patch to implement the slot
> > > Invalidation part. Since the automatic recovery could be an
> > > optimization and the discussion is in progress, I didn't implement that part.
> >
> > The implementation is in progress and I will include it in next version.
> >
> > Here is the V25 patch set that includes the following change:
> >
> > 0001
> >
> > * Per off-list discussion with Amit, I added few comments to mention the
> > reason of skipping advancing xid when table sync is in progress and to mention
> > that the advancement will not be delayed if changes are filtered out on
> > publisher via row/table filter.
> >
> > 0004
> >
> > * Fixed a bug that the launcher would advance the slot.xmin when some apply
> >   workers have not yet started.
> >
> > * Fixed a bug that the launcher did not advance the slot.xmin even if one of the
> >   apply worker has stopped conflict retention due to the lag.
> >
> > * Add a retain_conflict_info column in the pg_stat_subscription view to
> >   indicate whether the apply worker is effectively retaining conflict
> >   information. The value is set to true only if retain_conflict_info is enabled
> >   for the associated subscription, and the retention duration for conflict
> >   detection by the apply worker has not exceeded
> >   max_conflict_retention_duration. Thanks Kuroda-san for contributing codes
> >   off-list.
>
> Here is V25 patch set which includes the following changes:
>
> 0004
> * Addressed Nisha's comments[1].
> * Fixed a cfbot failure[2] in the doc.

I have one question about the 0004 patch; it implemented
max_conflict_retention_duration as a subscription parameter. But the
launcher invalidates the pg_conflict_detection slot only if all
subscriptions with retain_conflict_info stopped retaining dead tuples
due to the max_conflict_retention_duration parameter. Therefore, even
if a user sets the parameter to a low value on one subscription to avoid
table bloat, it would not take effect as long as other subscriptions set
it to a larger value. Is my understanding correct?
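
In other words, my reading of the launcher behavior is roughly the following
(a pseudo-code sketch of my understanding, not the patch's actual code; the
struct and field names are illustrative):

/*
 * Pseudo-code sketch of my understanding (not the patch's actual code):
 * the shared slot is invalidated only when every retain_conflict_info
 * subscription has exceeded its own max_conflict_retention_duration, so a
 * single subscription with a large value keeps dead tuples for all of them.
 */
typedef struct SubInfo
{
    bool    retain_conflict_info;           /* subscription option */
    bool    retention_duration_exceeded;    /* past max_conflict_retention_duration? */
} SubInfo;

static bool
should_invalidate_conflict_slot(const SubInfo *subs, int nsubs)
{
    for (int i = 0; i < nsubs; i++)
    {
        if (subs[i].retain_conflict_info && !subs[i].retention_duration_exceeded)
            return false;   /* at least one subscription still needs dead tuples */
    }
    return true;            /* all stopped retention; slot can be invalidated */
}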

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com