Re: Add an option to skip loading missing publication to avoid logical replication failure - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Add an option to skip loading missing publication to avoid logical replication failure |
Date | |
Msg-id | CALDaNm3TH3J8fwj+NcdjdN8DZCdrnmm1kzBsHBW2nkN+h6up3A@mail.gmail.com Whole thread Raw |
In response to | Re: Add an option to skip loading missing publication to avoid logical replication failure (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Xuneng Zhou pointed out on Discord that the test case added by > > 7c99dc587 has caused repeated failures in CI --- though oddly, > > it's not failed in the buildfarm so far as I can find. The > > failures look like > > > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.plline 103. > > > > I analyzed the relevant publisher-side CI Logs [1]: > ... > 2025-04-19 08:24:14.096 UTC [21961][client backend] > [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 > values(1) > 2025-04-19 08:24:14.098 UTC [21961][client backend] > [024_add_drop_pub.pl][:0] LOG: disconnection: session time: > 0:00:00.003 user=postgres database=postgres host=[local] > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: > released logical replication slot "tap_sub" > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: > disconnection: session time: 0:00:00.329 user=postgres > database=postgres host=[local] > 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] > LOG: connection received: host=[local] > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: connection authenticated: user="postgres" method=trust > (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] > LOG: replication connection authorized: user=postgres > application_name=tap_sub > 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: > statement: SELECT pg_catalog.set_config('search_path', '', false); > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] > STATEMENT: IDENTIFY_SYSTEM > 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: > received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > 0/0 (proto_version '4', streaming 'parallel', origin 'any', > publication_names '"tap_pub_ > ... > > This shows that walsender restarts after the "INSERT INTO tab_3 > values(1)" is processed by the previous walsender ("released logical > replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). > So, it is possible that the old apply worker has sent the confirmation > of WAL received location after the Insert (due to keep_alive message > handling). So, after the restart, the new walsender will start > processing WAL after the INSERT and wait for the skipped message LOG > timed out. > > Considering the above theory is correct, after "ALTER SUBSCRIPTION > tap_sub SET PUBLICATION tap_pub_3", we should wait for the new > walsender to restart. We are already doing the same for a similar case > in 001_rep_changes.pl (See "# check that change of connection string > and/or publication list causes restart of subscription workers. We > check the state along with application_name to ensure that the > walsender is (re)started.). > > Unfortunately, I will be away for the rest of the week. In the > meantime, if you or someone else is able to reproduce and fix it, then > good; otherwise, I'll take care of it after I return. I agree with your analysis. I was able to reproduce the issue by delaying the invalidation of the subscription until the walsender finished decoding the INSERT operation following the ALTER SUBSCRIPTION through a debugger and using the lsn from the pg_waldump of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the confirmed_flush_lsn ends up pointing to a location after the INSERT. When the invalidation is eventually received and the apply worker/walsender is restarted, the restarted walsender begins decoding from that LSN—after the INSERT—which means the "skipped loading publication" warning is never triggered, causing the test to fail. Attached is a patch that ensures the walsender process is properly restarted after ALTER SUBSCRIPTION, preventing this race condition. Regards, Vignesh
Attachment
pgsql-hackers by date: