Re: Build-farm - intermittent error in 031_column_list.pl - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Build-farm - intermittent error in 031_column_list.pl |
Date | |
Msg-id | CAA4eK1LkvAHXWLtbuQ5JGhJZm_Rww_ukvoF6tqxBYEkRQ2DcVw@mail.gmail.com |
In response to | Re: Build-farm - intermittent error in 031_column_list.pl (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Build-farm - intermittent error in 031_column_list.pl
RE: Build-farm - intermittent error in 031_column_list.pl |
List | pgsql-hackers |
On Thu, May 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
> > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening for all of them looks like that the name of a
> > publication created by CREATE PUBLICATION without a failure report is
> > missing for a walsender came later. It seems like CREATE PUBLICATION
> > can silently fail to create a publication, or walsender somehow failed
> > to find existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client backend][4/270:0][031_column_list.pl] LOG: statement: CREATE PUBLICATION pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] LOG: disconnection: session time: 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > the session has been disconnected. But the following request for the
> > same publication fails due to the absense of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR: publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be the
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing publication to point to
> the new publication.
>

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
both publisher and subscriber are in sync then the new publication
should be visible to WALSender.

-- 
With Regards,
Amit Kapila.
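
The theory above can be sketched with a small toy model. This is not PostgreSQL code: the record layout, the function names (publications_visible_at, start_replication) and the LSN values are all made up for illustration. The only assumption carried over from the message is that the walsender looks up the publication as of the WAL position it is decoding, so a start position older than CREATE PUBLICATION pub9 hits the error, while a caught-up start position does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int       # toy WAL position (hypothetical values, not real LSNs)
    kind: str      # "insert" or "create_publication"
    payload: str   # row label or publication name


# Hypothetical WAL stream: pub9 is created only at LSN 20.
wal = [
    WalRecord(10, "insert", "row1"),
    WalRecord(20, "create_publication", "pub9"),
    WalRecord(30, "insert", "row2"),
]


def publications_visible_at(records, lsn):
    """Publications whose creation record is at or before lsn."""
    return {
        r.payload
        for r in records
        if r.kind == "create_publication" and r.lsn <= lsn
    }


def start_replication(records, start_lsn, publication):
    """Toy walsender: decode inserts from start_lsn onward, checking the
    publication as of each decoded record's position."""
    sent = []
    for rec in records:
        if rec.lsn < start_lsn or rec.kind != "insert":
            continue
        if publication not in publications_visible_at(records, rec.lsn):
            raise LookupError(f'publication "{publication}" does not exist')
        sent.append(rec.payload)
    return sent


def try_start(start_lsn):
    """Return the rows sent, or the error text if decoding fails."""
    try:
        return start_replication(wal, start_lsn, "pub9")
    except LookupError as exc:
        return str(exc)


stale_result = try_start(10)      # origin not advanced before the restart
caught_up_result = try_start(21)  # origin advanced past CREATE PUBLICATION
```

In this model the stale start position fails on every retry, which matches the observation that the apply worker can never proceed, and waiting for catchup before ALTER SUBSCRIPTION corresponds to the second call.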