Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load - Mailing list pgsql-bugs
From | Zane Duffield |
---|---|
Subject | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date | |
Msg-id | CACMiCkUm5gwcoS2=jap1vkrS_n+FbFLWY5XQJ8ssFc8BUCxGCg@mail.gmail.com |
In response to | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load ("Euler Taveira" <euler@eulerto.com>) |
Responses |
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
List | pgsql-bugs |
Hi Euler, thanks for your reply.
On Wed, Apr 23, 2025 at 11:58 AM Euler Taveira <euler@eulerto.com> wrote:
> On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
> > I'm in the process of converting our databases from pglogical logical replication to the native logical replication implementation on PostgreSQL 17. One of the bugs we encountered and had to work around with pglogical was the plugin dropping records while converting to a streaming replica to logical via pglogical_create_subscriber (reported https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to confirm that the native logical replication implementation did not have this problem, and I've found that it might have a different problem.
>
> pg_createsubscriber uses a different approach than pglogical. While pglogical uses a restore point, pg_createsubscriber uses the LSN from the latest replication slot as a replication start point. The restore point approach is usually suitable to physical replication but might not cover all scenarios for logical replication (such as when there are in progress transactions). Since creating a logical replication slot does find a consistent decoding start point, it is a natural choice to start the logical replication (that also needs to find a decoding start point).
>
> > I should say that I've been operating under the assumption that pg_createsubscriber is designed for use on a replica for a *live* primary database, if this isn't correct then someone please let me know.
>
> pg_createsubscriber expects a physical replica that is preferably stopped before running it.
I think pg_createsubscriber actually gives you an error if the replica is not stopped. I was talking about the primary.
> Your script is not waiting enough time until it applies the backlog. Unless, you are seeing a different symptom, there is no bug.
>
> You should have used something similar to wait_for_subscription_sync routine (Cluster.pm) before counting the rows. That's what is used in the pg_createsubscriber tests. It guarantees the subscriber has caught up.
It may be true that the script doesn't wait long enough for all systems, but when I reproduced the issue on my machine(s) I confirmed that the logical decoder process was properly stuck on a conflicting primary key, rather than just catching up.
From the log file:
2025-04-16 09:17:16.090 AEST [3845786] port=5341 ERROR: duplicate key value violates unique constraint "test_table_pkey"
2025-04-16 09:17:16.090 AEST [3845786] port=5341 DETAIL: Key (f1)=(20700) already exists.
2025-04-16 09:17:16.090 AEST [3845786] port=5341 CONTEXT: processing remote data for replication origin "pg_24576" during message type "INSERT" for replication target relation "public.test_table" in transaction 1581, finished at 0/3720058
2025-04-16 09:17:16.091 AEST [3816845] port=5341 LOG: background worker "logical replication apply worker" (PID 3845786) exited with exit code 1
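To tell a stuck apply worker apart from one that is merely catching up, the subscriber log can be scanned for this error pattern. The following is only a rough Python sketch: `find_apply_conflicts` is a hypothetical helper (not part of PostgreSQL), and it assumes the log message wording shown above. It also pairs each ERROR line with the next matching CONTEXT line, which is a simplification if several backends log at once.

```python
import re

# ERROR line logged when an applied INSERT/UPDATE hits a unique constraint.
UNIQUE_VIOLATION = re.compile(
    r'ERROR:\s+duplicate key value violates unique constraint "(?P<constraint>[^"]+)"'
)

# CONTEXT line logged by the apply worker while processing a remote change.
APPLY_CONTEXT = re.compile(
    r'CONTEXT:\s+processing remote data for replication origin "(?P<origin>[^"]+)"'
    r'.*?for replication target relation "(?P<relation>[^"]+)"'
    r'.*?finished at (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+)'
)


def find_apply_conflicts(log_text):
    """Return one dict per unique violation raised while applying remote data."""
    conflicts = []
    pending = None  # last unique-violation seen, waiting for its CONTEXT line
    for line in log_text.splitlines():
        match = UNIQUE_VIOLATION.search(line)
        if match:
            pending = {"constraint": match.group("constraint")}
            continue
        match = APPLY_CONTEXT.search(line)
        if match and pending is not None:
            pending.update(match.groupdict())
            conflicts.append(pending)
            pending = None
    return conflicts
```

A non-empty result suggests the apply worker is failing on a conflict and restarting, rather than still working through the backlog.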
wait_for_subscription_sync sounds like a better solution than what I have, but you might still be able to reproduce the problem if you increase the sleep interval on line 198.
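For reference, the idea of waiting for catch-up instead of sleeping for a fixed interval could look roughly like the sketch below. This is not the Cluster.pm implementation, only an illustration in Python. The two callables are hypothetical placeholders: in a real script they might run `SELECT pg_current_wal_lsn()` on the publisher and read `replay_lsn` from `pg_stat_replication` for the subscription's walsender.

```python
import time


def parse_lsn(text):
    """Convert an LSN such as '0/3720058' into an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_catchup(get_publisher_lsn, get_subscriber_lsn,
                     timeout=180.0, interval=0.1,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the subscriber has replayed up to the publisher's LSN.

    get_publisher_lsn / get_subscriber_lsn are hypothetical callables that
    return an LSN string (the subscriber one may return None if unknown).
    Returns True on catch-up, False if the timeout expires first.
    """
    # Capture the target once; later publisher writes are not waited for.
    target = parse_lsn(get_publisher_lsn())
    deadline = clock() + timeout
    while True:
        current = get_subscriber_lsn()
        if current is not None and parse_lsn(current) >= target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Note that a timeout here would not by itself distinguish a slow subscriber from an apply worker that keeps failing on a conflict; the log still has to be checked for that.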
I wonder if Shlok could confirm whether they found the conflicting primary key in their reproduction?
Thanks,
Zane