Re: Add an option to skip loading missing publication to avoid logical replication failure - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Add an option to skip loading missing publication to avoid logical replication failure
Date
Msg-id CAA4eK1Jz20hnPRDtPDo2BbSt0Xf8u2zY4Tc84R0OAQN8M=9iCQ@mail.gmail.com
Whole thread Raw
In response to Re: Add an option to skip loading missing publication to avoid logical replication failure  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Xuneng Zhou pointed out on Discord that the test case added by
> 7c99dc587 has caused repeated failures in CI --- though oddly,
> it's not failed in the buildfarm so far as I can find.  The
> failures look like
>
> timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at
/tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.plline 103. 
>

I analyzed the relevant publisher-side CI Logs [1]:
...
2025-04-19 08:24:14.096 UTC [21961][client backend]
[024_add_drop_pub.pl][7/4:0] LOG:  statement: INSERT INTO tab_3
values(1)
2025-04-19 08:24:14.098 UTC [21961][client backend]
[024_add_drop_pub.pl][:0] LOG:  disconnection: session time:
0:00:00.003 user=postgres database=postgres host=[local]
2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
released logical replication slot "tap_sub"
2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
disconnection: session time: 0:00:00.329 user=postgres
database=postgres host=[local]
2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
LOG:  connection received: host=[local]
2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
LOG:  connection authenticated: user="postgres" method=trust

(/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
LOG:  replication connection authorized: user=postgres
application_name=tap_sub
2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
statement: SELECT pg_catalog.set_config('search_path', '', false);
2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
received replication command: IDENTIFY_SYSTEM
2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
STATEMENT:  IDENTIFY_SYSTEM
2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/0 (proto_version '4', streaming 'parallel', origin 'any',
publication_names '"tap_pub_
...

This shows that walsender restarts after the "INSERT INTO tab_3
values(1)" is processed by the previous walsender ("released logical
replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
So, it is possible that the old apply worker has sent the confirmation
of WAL received location after the Insert (due to keep_alive message
handling). So, after the restart, the new walsender will start
processing WAL after the INSERT and wait for the skipped message LOG
timed out.

Considering the above theory is correct, after "ALTER SUBSCRIPTION
tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
walsender to restart. We are already doing the same for a similar case
in 001_rep_changes.pl (See "# check that change of connection string
and/or publication list causes restart of subscription workers. We
check the state along with application_name to ensure that the
walsender is (re)started.).

Unfortunately, I will be away for the rest of the week. In the
meantime, if you or someone else is able to reproduce and fix it, then
good; otherwise, I'll take care of it after I return.

[1] -
https://api.cirrus-ci.com/v1/artifact/task/6561639182368768/testrun/build/testrun/subscription/024_add_drop_pub/log/024_add_drop_pub_publisher.log

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: alphabetize long options in pg_dump[all] docs
Next
From: Tomas Vondra
Date:
Subject: Re: Parallel CREATE INDEX for GIN indexes