Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Build-farm - intermittent error in 031_column_list.pl
Date
Msg-id CAA4eK1LkvAHXWLtbuQ5JGhJZm_Rww_ukvoF6tqxBYEkRQ2DcVw@mail.gmail.com
In response to Re: Build-farm - intermittent error in 031_column_list.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Build-farm - intermittent error in 031_column_list.pl
RE: Build-farm - intermittent error in 031_column_list.pl
List pgsql-hackers
On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening in all of them looks like this: a publication that
> > CREATE PUBLICATION created without reporting any failure is missing
> > for a walsender that comes later. It seems that either CREATE PUBLICATION
> > can silently fail to create a publication, or the walsender somehow failed
> > to find an existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659, then
> > the session is disconnected. But the following request for the
> > same publication fails due to the absence of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be an
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing subscription to point to
> the new publication.
>

If the above theory is correct, then I think letting the subscriber
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem: if the publisher and the
subscriber are in sync before the ALTER, the restarted apply worker
should start from a location where the new publication already exists,
so the publication should be visible to the walsender.
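
To make the proposed ordering concrete, here is a minimal, untested
sketch of a standalone TAP test. It only illustrates where the
wait_for_catchup call would sit; it is not the actual change to
031_column_list.pl. The table "t" and the publications "pub_old" and
"pub_new" are made-up names, and whether this ordering really avoids
the failure depends on the theory above being correct.

```perl
# Sketch only: meant to be run from a PostgreSQL source tree as a TAP test.
# Table "t" and publications "pub_old"/"pub_new" are hypothetical names.
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
$node_publisher->start;

my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->start;

my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';

# Initial setup: one table, one publication, one subscription.
$node_publisher->safe_psql('postgres', 'CREATE TABLE t (a int PRIMARY KEY)');
$node_publisher->safe_psql('postgres', 'CREATE PUBLICATION pub_old FOR TABLE t');
$node_subscriber->safe_psql('postgres', 'CREATE TABLE t (a int PRIMARY KEY)');
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub_old");

# Wait for the initial table synchronization to finish.
$node_subscriber->poll_query_until('postgres',
	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's')")
  or die "Timed out while waiting for subscriber to synchronize data";

# Create the new publication on the publisher.
$node_publisher->safe_psql('postgres', 'CREATE PUBLICATION pub_new FOR TABLE t');

# Proposed step: let sub1 catch up with everything written so far, so that
# the apply worker restarted by ALTER SUBSCRIPTION does not resume from a
# WAL location that precedes the creation of pub_new.
$node_publisher->wait_for_catchup('sub1');

$node_subscriber->safe_psql('postgres',
	'ALTER SUBSCRIPTION sub1 SET PUBLICATION pub_new');

# Check that replication still works through the new publication.
$node_publisher->safe_psql('postgres', 'INSERT INTO t VALUES (1)');
$node_publisher->wait_for_catchup('sub1');

is($node_subscriber->safe_psql('postgres', 'SELECT a FROM t'),
	'1', 'row replicated after switching to pub_new');

done_testing();
```

The only part that matters for this thread is the wait_for_catchup
call placed between CREATE PUBLICATION and ALTER SUBSCRIPTION; the
rest is scaffolding.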

-- 
With Regards,
Amit Kapila.


