Re: Add an option to skip loading missing publication to avoid logical replication failure - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Add an option to skip loading missing publication to avoid logical replication failure |
Date | |
Msg-id | CAA4eK1Jz20hnPRDtPDo2BbSt0Xf8u2zY4Tc84R0OAQN8M=9iCQ@mail.gmail.com Whole thread Raw |
In response to | Re: Add an option to skip loading missing publication to avoid logical replication failure (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Xuneng Zhou pointed out on Discord that the test case added by > 7c99dc587 has caused repeated failures in CI --- though oddly, > it's not failed in the buildfarm so far as I can find. The > failures look like > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.plline 103. > I analyzed the relevant publisher-side CI Logs [1]: ... 2025-04-19 08:24:14.096 UTC [21961][client backend] [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3 values(1) 2025-04-19 08:24:14.098 UTC [21961][client backend] [024_add_drop_pub.pl][:0] LOG: disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG: released logical replication slot "tap_sub" 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG: disconnection: session time: 0:00:00.329 user=postgres database=postgres host=[local] 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0] LOG: connection received: host=[local] 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: connection authenticated: user="postgres" method=trust (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117) 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0] LOG: replication connection authorized: user=postgres application_name=tap_sub 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: IDENTIFY_SYSTEM 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] STATEMENT: IDENTIFY_SYSTEM 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '4', streaming 'parallel', origin 'any', publication_names '"tap_pub_ ... This shows that walsender restarts after the "INSERT INTO tab_3 values(1)" is processed by the previous walsender ("released logical replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)"). So, it is possible that the old apply worker has sent the confirmation of WAL received location after the Insert (due to keep_alive message handling). So, after the restart, the new walsender will start processing WAL after the INSERT and wait for the skipped message LOG timed out. Considering the above theory is correct, after "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_3", we should wait for the new walsender to restart. We are already doing the same for a similar case in 001_rep_changes.pl (See "# check that change of connection string and/or publication list causes restart of subscription workers. We check the state along with application_name to ensure that the walsender is (re)started.). Unfortunately, I will be away for the rest of the week. In the meantime, if you or someone else is able to reproduce and fix it, then good; otherwise, I'll take care of it after I return. [1] - https://api.cirrus-ci.com/v1/artifact/task/6561639182368768/testrun/build/testrun/subscription/024_add_drop_pub/log/024_add_drop_pub_publisher.log -- With Regards, Amit Kapila.
pgsql-hackers by date: