Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load - Mailing list pgsql-bugs
From | Shlok Kyal |
---|---|
Subject | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date | |
Msg-id | CANhcyEUeJ-Z39n-YiWYdKK6DZQmZmBwWLBjKr4XrShNPLyXChg@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
|
List | pgsql-bugs |
On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > I'm in the process of converting our databases from pglogical logical > > replication to the native logical replication implementation on PostgreSQL > > 17. One of the bugs we encountered and had to work around with pglogical was > > the plugin dropping records while converting to a streaming replica to > > logical via pglogical_create_subscriber (reported > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > confirm that the native logical replication implementation did not have this > > problem, and I've found that it might have a different problem. > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > replication slot as a replication start point. The restore point approach is > > usually suitable to physical replication but might not cover all scenarios for > > logical replication (such as when there are in progress transactions). Since > > creating a logical replication slot does find a consistent decoding start > > point, it is a natural choice to start the logical replication (that also needs > > to find a decoding start point). > > I observed a difference between logical replication and > pg_createsubscriber in how the replication origin is set, which can > lead to different behaviors during replication restarts. > > Consider the following WAL records: > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > IST > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > Behavior in Logical Replication > In logical replication, after applying the COMMIT record at LSN > 0/01756868, the replication origin is set to the end LSN of the commit > record, i.e., 0/01756898. On restart, replication resumes from the > next LSN after COMMIT records, which is 0/01756898 in this case. > As a result, the same transaction is not sent again, and duplicate > data is avoided. > > Behavior in pg_createsubscriber > However, in the case of pg_createsubscriber, the consistent point used > for creating the replication slot on the publisher may be set exactly > at the commit LSN (0/01756868, xid 767). When promoting the standby, > this same LSN is used as recovery_target_lsn, so the standby recovers > up to and including the commit of transaction 767. > > After promotion, if the replication origin is also set to this same > commit LSN, the subscriber will request changes starting from that > point. Since the origin doesn't reflect the commit as applied, > transaction 767 gets replicated again, leading to duplicate data and > possible replication failure. > > If the issue is not reproducible using Zane's suggested steps, we can > try the following: > 1) Stop the standby server. > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > 3) Set up the publisher, replication slot, etc., simulating the > pg_createsubscriber behavior. > 4) Restart the standby with promotion configuration from > setup_recovery(), setting recovery_target_lsn to the commit LSN from > step 2. > 5) Create the subscription and set the replication origin to the same > commit LSN. > 6) Enable the subscription. > > This setup should reproduce the issue where the transaction gets > applied twice on the subscriber due to the replication origin being > aligned with the commit LSN rather than its end. > > Thoughts? Hi Vignesh, I have verified the behaviour shared by you for Logical Replication and pg_createsubscriber and agree with your analysis. I have also tried the steps shared by you and am able to reproduce the issue. I am thinking of resolving it by introducing a new API, which can give us the next lsn to the lsn provided. During pg_createsusbcriber run where we are advancing the replication origin to 'consistent_lsn'. I think we should advance it to the next lsn of 'consistent_lsn' instead. I think this will resolve the issue. Thoughts? Thanks and Regards, Shlok Kyal
pgsql-bugs by date: