Re: Add an option to skip loading missing publication to avoid logical replication failure - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Add an option to skip loading missing publication to avoid logical replication failure |
Date | |
Msg-id | CALDaNm27gUnMG5-gdBLnWH_+4G+EZ_78MA2h8fbGPm9o5LjySA@mail.gmail.com |
In response to | Re: Add an option to skip loading missing publication to avoid logical replication failure (vignesh C <vignesh21@gmail.com>) |
List | pgsql-hackers |
On Fri, 2 May 2025 at 09:23, vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 2 May 2025 at 06:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > vignesh C <vignesh21@gmail.com> writes:
> > > I agree with your analysis. I was able to reproduce the issue by
> > > delaying the invalidation of the subscription until the walsender
> > > finished decoding the INSERT operation following the ALTER
> > > SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
> > > of the INSERT after the ALTER SUBSCRIPTION.
> >
> > Can you be a little more specific about how you reproduced this?
> > I tried inserting sleep() calls in various likely-looking spots
> > and could not get a failure that way.
>
> Test Steps:
> 1) Set up logical replication:
>    Create a publication on the publisher
>    Create a subscription on the subscriber
> 2) Create the following table on the publisher:
>    CREATE TABLE tab_3 (a int);
> 3) Create the same table on the subscriber:
>    CREATE TABLE tab_3 (a int);
> 4) On the subscriber, alter the subscription to refer to a
>    non-existent publication:
>    ALTER SUBSCRIPTION sub1 SET PUBLICATION tap_pub_3;
> 5) Insert data on the publisher:
>    INSERT INTO tab_3 VALUES (1);
>
> As expected, the publisher logs the following warning in normal case:
> 2025-05-02 08:56:45.350 IST [516197] WARNING: skipped loading publication: tap_pub_3
> 2025-05-02 08:56:45.350 IST [516197] DETAIL: The publication does not exist at this point in the WAL.
> 2025-05-02 08:56:45.350 IST [516197] HINT: Create the publication if it does not exist.
>
> To simulate a delay in subscription invalidation, I modified the
> maybe_reread_subscription() function as follows:
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 4151a4b2a96..0831784aca3 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -3970,6 +3970,10 @@ maybe_reread_subscription(void)
>         MemoryContext oldctx;
>         Subscription *newsub;
>         bool        started_tx = false;
> +       bool        test = true;
> +
> +       if (test)
> +               return;
>
> This change delays the subscription invalidation logic, preventing the
> apply worker from detecting the subscription change immediately.
>
> With the patch applied, repeat steps 1–5.
> Using pg_waldump, identify the LSN of the insert:
> rmgr: Heap len (rec/tot): 59/ 59, tx: 756, lsn: 0/01711848, prev 0/01711810, desc: INSERT+INIT off: 1
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 756, lsn: 0/01711888, prev 0/01711848, desc: COMMIT 2025-05-02 09:06:09.400926 IST
>
> Check the confirmed flush LSN of the walsender by attaching gdb to
> the walsender process:
> (gdb) p *MyReplicationSlot
> ...
> confirmed_flush = 24241928
> (gdb) p /x 24241928
> $4 = 0x171e708
>
> Now attach to the apply worker, set a breakpoint at
> maybe_reread_subscription, and continue execution. Once control
> reaches the function, set test = false. It will now detect that the
> subscription has been invalidated and restart the apply worker.
>
> Because the walsender's confirmed_flush position is already past the
> insert, the newly started apply worker misses the inserted row
> entirely. This leads to the CI failure. This issue can arise when
> the walsender advances more quickly than the apply worker is able to
> detect and react to the subscription change.
>
> I could not find a simpler way to reproduce this.

A simpler way to reproduce the issue is to add a 1-second sleep in the
LogicalRepApplyLoop function, just before the call to
WaitLatchOrSocket. This reproduces the test failure consistently for
me. The failure reason is the same as in [1].

[1] - https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com

Regards,
Vignesh
Attachment