Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load - Mailing list pgsql-bugs

From vignesh C
Subject Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Date
Msg-id CALDaNm3gj+ZiWO1MG4XijR+yBL95nPZvxH=PXRqpCG7dFWvBJg@mail.gmail.com
Whole thread Raw
In response to Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load  ("Euler Taveira" <euler@eulerto.com>)
Responses Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
List pgsql-bugs
On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote:
>
> On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
>
> I'm in the process of converting our databases from pglogical logical
> replication to the native logical replication implementation on PostgreSQL
> 17. One of the bugs we encountered and had to work around with pglogical was
> the plugin dropping records while converting to a streaming replica to
> logical via pglogical_create_subscriber (reported
> https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
> confirm that the native logical replication implementation did not have this
> problem, and I've found that it might have a different problem.
>
>
> pg_createsubscriber uses a different approach than pglogical. While pglogical
> uses a restore point, pg_createsubscriber uses the LSN from the latest
> replication slot as a replication start point. The restore point approach is
> usually suitable to physical replication but might not cover all scenarios for
> logical replication (such as when there are in progress transactions). Since
> creating a logical replication slot does find a consistent decoding start
> point, it is a natural choice to start the logical replication (that also needs
> to find a decoding start point).

I observed a difference between logical replication and
pg_createsubscriber in how the replication origin is set, which can
lead to different behaviors during replication restarts.

Consider the following WAL records:
rmgr: Transaction len (rec/tot):     46/    46, tx:        767, lsn:
0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942
IST
rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn:
0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768
latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755

Behavior in Logical Replication
In logical replication, after applying the COMMIT record at LSN
0/01756868, the replication origin is set to the end LSN of the commit
record, i.e., 0/01756898. On restart, replication resumes from the
next LSN after COMMIT records, which is 0/01756898 in this case.
As a result, the same transaction is not sent again, and duplicate
data is avoided.

Behavior in pg_createsubscriber
However, in the case of pg_createsubscriber, the consistent point used
for creating the replication slot on the publisher may be set exactly
at the commit LSN (0/01756868, xid 767). When promoting the standby,
this same LSN is used as recovery_target_lsn, so the standby recovers
up to and including the commit of transaction 767.

After promotion, if the replication origin is also set to this same
commit LSN, the subscriber will request changes starting from that
point. Since the origin doesn't reflect the commit as applied,
transaction 767 gets replicated again, leading to duplicate data and
possible replication failure.

If the issue is not reproducible using Zane's suggested steps, we can
try the following:
1) Stop the standby server.
2) Perform an insert transaction and note the commit LSN using pg_waldump.
3) Set up the publisher, replication slot, etc., simulating the
pg_createsubscriber behavior.
4) Restart the standby with promotion configuration from
setup_recovery(), setting recovery_target_lsn to the commit LSN from
step 2.
5) Create the subscription and set the replication origin to the same
commit LSN.
6) Enable the subscription.

This setup should reproduce the issue where the transaction gets
applied twice on the subscriber due to the replication origin being
aligned with the commit LSN rather than its end.

Thoughts?

Regards,
Vignesh



pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Disabled logical replication origin session causes primary key errors
Next
From: "李园园"
Date:
Subject: DSA refcnt overflow in pg_stat/could not attach to dynamic shared area