Re: Add an option to skip loading missing publication to avoid logical replication failure - Mailing list pgsql-hackers
From | Xuneng Zhou |
---|---|
Subject | Re: Add an option to skip loading missing publication to avoid logical replication failure |
Date | |
Msg-id | CABPTF7WjggBzCWy9BBWRen9H-Anq4Y9jsqCg=i=x+W_XFSR6wQ@mail.gmail.com Whole thread Raw |
In response to | Re: Add an option to skip loading missing publication to avoid logical replication failure (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Add an option to skip loading missing publication to avoid logical replication failure
Re: Add an option to skip loading missing publication to avoid logical replication failure |
List | pgsql-hackers |
Hi,
Is this an expected behavior?
Is this an expected behavior?
A race between subscriber LSN feedback and publisher subscription change processing allows the walsender to restart decoding past relevant WAL records, bypassing the updated subscription rules for those records.
On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Xuneng Zhou pointed out on Discord that the test case added by
> > 7c99dc587 has caused repeated failures in CI --- though oddly,
> > it's not failed in the buildfarm so far as I can find. The
> > failures look like
> >
> > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading publication: tap_pub_3) at /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103.
> >
>
> I analyzed the relevant publisher-side CI Logs [1]:
> ...
> 2025-04-19 08:24:14.096 UTC [21961][client backend]
> [024_add_drop_pub.pl][7/4:0] LOG: statement: INSERT INTO tab_3
> values(1)
> 2025-04-19 08:24:14.098 UTC [21961][client backend]
> [024_add_drop_pub.pl][:0] LOG: disconnection: session time:
> 0:00:00.003 user=postgres database=postgres host=[local]
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
> released logical replication slot "tap_sub"
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
> disconnection: session time: 0:00:00.329 user=postgres
> database=postgres host=[local]
> 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
> LOG: connection received: host=[local]
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: connection authenticated: user="postgres" method=trust
> (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG: replication connection authorized: user=postgres
> application_name=tap_sub
> 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
> statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
> STATEMENT: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/0 (proto_version '4', streaming 'parallel', origin 'any',
> publication_names '"tap_pub_
> ...
>
> This shows that walsender restarts after the "INSERT INTO tab_3
> values(1)" is processed by the previous walsender ("released logical
> replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
> So, it is possible that the old apply worker has sent the confirmation
> of WAL received location after the Insert (due to keep_alive message
> handling). So, after the restart, the new walsender will start
> processing WAL after the INSERT and wait for the skipped message LOG
> timed out.
>
> Considering the above theory is correct, after "ALTER SUBSCRIPTION
> tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
> walsender to restart. We are already doing the same for a similar case
> in 001_rep_changes.pl (See "# check that change of connection string
> and/or publication list causes restart of subscription workers. We
> check the state along with application_name to ensure that the
> walsender is (re)started.).
>
> Unfortunately, I will be away for the rest of the week. In the
> meantime, if you or someone else is able to reproduce and fix it, then
> good; otherwise, I'll take care of it after I return.
I agree with your analysis. I was able to reproduce the issue by
delaying the invalidation of the subscription until the walsender
finished decoding the INSERT operation following the ALTER
SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
of the INSERT after the ALTER SUBSCRIPTION. In this scenario, the
confirmed_flush_lsn ends up pointing to a location after the INSERT.
When the invalidation is eventually received and the apply
worker/walsender is restarted, the restarted walsender begins decoding
from that LSN—after the INSERT—which means the "skipped loading
publication" warning is never triggered, causing the test to fail.
Attached is a patch that ensures the walsender process is properly
restarted after ALTER SUBSCRIPTION, preventing this race condition.
Regards,
Vignesh
pgsql-hackers by date: