RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id OS0PR01MB571694B5F7FFB9ECEF5FCB3294132@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Conflict detection for update_deleted in logical replication  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
On Wednesday, January 8, 2025 7:03 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

> Consider a LR setup with retain_conflict_info=true for a table t1:
> Publisher:
> insert into t1 values(1);
> -- Have a open transaction before delete operation in subscriber
> begin;
> 
> Subscriber:
> -- delete the record that was replicated
> delete from t1;
> 
> -- Now commit the transaction in publisher
> Publisher:
> update t1 set c1 = 2;
> commit;
> 
> In normal case update_deleted conflict is detected
> 2025-01-08 15:41:38.529 IST [112744] LOG:  conflict detected on relation
> "public.t1": conflict=update_deleted
> 2025-01-08 15:41:38.529 IST [112744] DETAIL:  The row to be updated was
> deleted locally in transaction 751 at 2025-01-08 15:41:29.811566+05:30.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:41:38.529 IST [112744] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBCA0
> 
> Now execute the same above case by having a presetup to consume all the
> replication slots in the system by executing pg_create_logical_replication_slot
> before the subscription is created, in this case the conflict is not detected
> correctly.
> 2025-01-08 15:39:17.931 IST [112551] LOG:  conflict detected on relation
> "public.t1": conflict=update_missing
> 2025-01-08 15:39:17.931 IST [112551] DETAIL:  Could not find the row to be
> updated.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:39:17.931 IST [112551] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBC68
> 2025-01-08 15:39:18.266 IST [112582] ERROR:  all replication slots are in use
> 2025-01-08 15:39:18.266 IST [112582] HINT:  Free one or increase
> "max_replication_slots".
> 
> This is because even though we say create subscription is successful, the
> launcher has not yet created the replication slot.

I think missing some detections right after the option is enabled is
acceptable. Even if we let the launcher create the slot before starting the
workers, some dead tuples could already have been removed during this period,
so update_missing could still be detected. I have added documentation to
clarify that the information can be safely retained only after the slot is
created.
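
As a rough way to check the situation from the report on the subscriber, the
queries below use only the standard pg_replication_slots view and the
max_replication_slots setting. This is a sketch, not part of the patch: the
name of the slot that the launcher creates is not assumed here, so the second
query simply lists every slot and its xmin.

```sql
-- How many replication slots are in use versus the configured limit.
-- free_slots = 0 matches the "all replication slots are in use" case.
SELECT count(*) AS slots_in_use,
       current_setting('max_replication_slots')::int AS max_slots,
       current_setting('max_replication_slots')::int - count(*) AS free_slots
FROM pg_replication_slots;

-- List the existing slots together with the xmin each one holds back.
SELECT slot_name, slot_type, active, xmin
FROM pg_replication_slots
ORDER BY slot_name;
```

If free_slots is 0 when the subscription is created, the launcher cannot
create its slot, and early conflicts may be reported as update_missing.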

> 
> There are few observations from this test:
> 1) Create subscription does not wait for the slot to be created by the launcher
> and starts applying the changes. Should create a subscription wait till the slot
> is created by the launcher process.

I think the DDL cannot wait for slot creation, because the launcher does not
create the slot until the DDL has committed. Instead, I have changed the
code to create the slot before starting workers, so that at least the worker
does not unnecessarily maintain the oldest non-removable xid.

> 2) Currently launcher is exiting continuously and trying to create replication
> slots. Should the launcher wait for wal_retrieve_retry_interval configuration
> before trying to create the slot instead of filling the logs continuously.

Since the launcher already has a 5s restart interval (bgw_restart_time), I
feel it would not consume too many resources in this case.

> 3) If we try to create a similar subscription with retain_conflict_info and
> disable_on_error option and there is an error in replication slot creation,
> currently the subscription does not get disabled. Should we consider
> disable_on_error for these cases and disable the subscription if we are not able
> to create the slots.

Currently, only ERRORs in the apply worker trigger disable_on_error, so I am
not sure it's worth the effort to teach the apply worker to catch the
launcher's error, because this doesn't seem like a common scenario.

Best Regards,
Hou zj
