Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load - Mailing list pgsql-bugs

From Zane Duffield
Subject Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Date
Msg-id CACMiCkUm5gwcoS2=jap1vkrS_n+FbFLWY5XQJ8ssFc8BUCxGCg@mail.gmail.com
In response to Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load  ("Euler Taveira" <euler@eulerto.com>)
List pgsql-bugs
Hi Euler, thanks for your reply.

On Wed, Apr 23, 2025 at 11:58 AM Euler Taveira <euler@eulerto.com> wrote:
On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
I'm in the process of converting our databases from pglogical logical
replication to the native logical replication implementation on PostgreSQL
17. One of the bugs we encountered and had to work around with pglogical was
the plugin dropping records while converting a streaming replica to
logical via pglogical_create_subscriber (reported
confirm that the native logical replication implementation did not have this
problem, and I've found that it might have a different problem.

pg_createsubscriber uses a different approach than pglogical. While pglogical
uses a restore point, pg_createsubscriber uses the LSN from the latest
replication slot as a replication start point. The restore point approach is
usually suitable to physical replication but might not cover all scenarios for
logical replication (such as when there are in progress transactions). Since
creating a logical replication slot does find a consistent decoding start
point, it is a natural choice to start the logical replication (that also needs
to find a decoding start point).
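
To make the start-point idea concrete, here is a minimal Python sketch. It is not taken from pg_createsubscriber; the function names are hypothetical, and the rule that only transactions committing after the slot's start LSN arrive through logical replication is a simplified model of the behaviour described above.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3720058' into a 64-bit integer.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def expected_via_logical_replication(commit_lsn: str, start_lsn: str) -> bool:
    """Simplified model (an assumption, not the tool's actual code).

    A transaction that commits after the slot's start LSN should reach the
    subscriber through logical replication; anything earlier should already
    be present in the physical copy.
    """
    return parse_lsn(commit_lsn) > parse_lsn(start_lsn)
```

If both paths deliver the same row, the apply worker would hit a duplicate key, which is the symptom discussed later in this message.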

I should say that I've been operating under the assumption that
pg_createsubscriber is designed for use on a replica for a *live* primary
database, if this isn't correct then someone please let me know.

pg_createsubscriber expects a physical replica that is preferably stopped
before running it.

I think pg_createsubscriber actually gives you an error if the replica is not stopped. I was talking about the primary.
 
Your script does not wait long enough for the backlog to be applied. Unless
you are seeing a different symptom, there is no bug.

You should have used something similar to wait_for_subscription_sync routine
(Cluster.pm) before counting the rows. That's what is used in the
pg_createsubscriber tests. It guarantees the subscriber has caught up.
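
For a reproduction script that is not written in Perl, the same idea can be sketched in Python. This is only loosely modelled on the test helper, not a port of it: the two callables are hypothetical hooks that the caller would implement, for example by querying pg_current_wal_lsn() on the publisher and the replay_lsn column of pg_stat_replication for the subscription's walsender.

```python
import time
from typing import Callable


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3720058' into an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_catchup(
    publisher_lsn: Callable[[], str],
    subscriber_lsn: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the subscriber has replayed up to a fixed target LSN.

    The target is captured once, so a publisher that keeps writing under
    load does not move the goalposts. Returns False on timeout instead of
    waiting forever, which also covers an apply worker that is stuck.
    """
    target = parse_lsn(publisher_lsn())
    deadline = time.monotonic() + timeout
    while True:
        if parse_lsn(subscriber_lsn()) >= target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Row counts would then be compared only after this returns True. A False result means the subscriber never reached the target, which is worth investigating in the server log rather than fixing with a longer sleep.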


It may be true that the script doesn't wait long enough on all systems, but when I reproduced the issue on my machine(s) I confirmed that the logical replication apply worker was genuinely stuck on a conflicting primary key, rather than just catching up.

From the log file:
2025-04-16 09:17:16.090 AEST [3845786] port=5341 ERROR:  duplicate key value violates unique constraint "test_table_pkey"
2025-04-16 09:17:16.090 AEST [3845786] port=5341 DETAIL:  Key (f1)=(20700) already exists.
2025-04-16 09:17:16.090 AEST [3845786] port=5341 CONTEXT:  processing remote data for replication origin "pg_24576" during message type "INSERT" for replication target relation "public.test_table" in transaction 1581, finished at 0/3720058
2025-04-16 09:17:16.091 AEST [3816845] port=5341 LOG:  background worker "logical replication apply worker" (PID 3845786) exited with exit code 1
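
One way to tell this failure apart from ordinary lag is to scan the subscriber's log for the conflict. The following Python sketch assumes the message wording shown in the excerpt above; the function name and the returned dictionary layout are illustrative, not part of any PostgreSQL tooling.

```python
import re
from typing import Dict, Iterable, List, Optional

# Patterns match the ERROR and DETAIL lines as they appear in the excerpt.
ERROR_RE = re.compile(
    r'ERROR:\s+duplicate key value violates unique constraint "(?P<constraint>[^"]+)"'
)
DETAIL_RE = re.compile(
    r"DETAIL:\s+Key \((?P<columns>[^)]*)\)=\((?P<values>[^)]*)\) already exists\."
)


def find_unique_conflicts(log_lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Collect duplicate-key errors and the key reported on the DETAIL line."""
    conflicts: List[Dict[str, Optional[str]]] = []
    pending: Optional[Dict[str, Optional[str]]] = None
    for line in log_lines:
        error = ERROR_RE.search(line)
        if error:
            # Start a new record; the key is filled in if a DETAIL line follows.
            pending = {"constraint": error.group("constraint"),
                       "columns": None, "values": None}
            conflicts.append(pending)
            continue
        detail = DETAIL_RE.search(line)
        if detail and pending is not None:
            pending["columns"] = detail.group("columns")
            pending["values"] = detail.group("values")
            pending = None
    return conflicts
```

An empty result suggests the subscriber may simply be behind; a non-empty one means the apply worker is failing and restarting on the same row.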

wait_for_subscription_sync sounds like a better solution than what I have, but you might still be able to reproduce the problem if you increase the sleep interval on line 198.

I wonder if Shlok could confirm whether they found the conflicting primary key in their reproduction?

Thanks,
Zane
