Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From vignesh C
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CALDaNm08M6CRZkK=BtVfS1=+zV2Qayg+fnYXQEPBiMOQ39m6sg@mail.gmail.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better
> if the apply worker can avoid reporting an ERROR if the publisher's clock's
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, it means any subsequent transactions
> on the publisher should be assigned a commit timestamp later then reply_time.
> And the (candidate_xid_time) when it determines the oldest active xid. Any old
> transactions on the publisher that have finished should have a commit timestamp
> earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than the reply_time, then it's OK to advance the xid
> immdidately. If candidate_xid_time is greater than reply_time, it means the
> clock of publisher is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
>
> Since this is considered as an improvement, we can focus on this after
> pushing the main patches.

Conflict detection of truncated updates is detected as update_missing
and deleted update is detected as update_deleted. I was not sure if
truncated updates should also be detected as update_deleted, as the
document says truncate operation is "It has the same effect as an
unqualified DELETE on each table" at [1].

I tried with the following three node(N1,N2 & N3) setup with
subscriber on N3 subscribing to the publisher pub1 in N1 and publisher
pub2 in N2:
N1 - pub1
N2 - pub2
N3 - sub1 -> pub1(N1) and sub2 -> pub2(N2)

-- Insert a record in N1
insert into t1 values(1);

-- Insert a record in N2
insert into t1 values(1);

-- Now N3 has the above inserts from N1 and N2
N3=# select * from t1;
 c1
----
  1
  1
(2 rows)

-- Truncate t1 from N2
N2=# truncate t1;
TRUNCATE TABLE

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generated a conflict
postgres=# update t1 set c1 = 2;
UPDATE 1
N1=# select * from t1;
 c1
----
  2
(1 row)

--- N3 logs the conflict as update_missing
2025-01-02 12:21:37.388 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-02 12:21:37.388 IST [24803] DETAIL:  Could not find the row to
be updated.
        Remote tuple (2); replica identity full (1).
2025-01-02 12:21:37.388 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 757, finished
at 0/17478D0

-- Insert a record with value 2 in N2
N2=# insert into t1 values(2);
INSERT 0 1

-- Now N3 has the above inserted records:
N3=# select * from t1;
 c1
----
  2
(1 row)

-- Delete this record from N2:
N2=# delete from t1;
DELETE 1

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 3;
UPDATE 1

--- N3 logs the conflict as update_deleted
2025-01-02 12:22:38.036 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_deleted
2025-01-02 12:22:38.036 IST [24803] DETAIL:  The row to be updated was
deleted by a different origin "pg_16388" in transaction 764 at
2025-01-02 12:22:29.025347+05:30.
        Remote tuple (3); replica identity full (2).
2025-01-02 12:22:38.036 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 758, finished
at 0/174D240

I'm not sure if this behavior is expected or not. If this is expected
can we mention this in the documentation for the user to handle the
conflict resolution accordingly in these cases.
Thoughts?

[1] - https://www.postgresql.org/docs/devel/sql-truncate.html

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: jian he
Date:
Subject: Re: Proposal: Progressive explain
Next
From: Richard Guo
Date:
Subject: Re: ERROR: corrupt MVNDistinct entry