RE: Test failures of 100_bugs.pl - Mailing list pgsql-hackers

From Yu Shi (Fujitsu)
Subject RE: Test failures of 100_bugs.pl
Date
Msg-id OSZPR01MB6310D6F48372F52F1D85E1C5FD609@OSZPR01MB6310.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Test failures of 100_bugs.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses RE: Test failures of 100_bugs.pl
List pgsql-hackers
On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0]
> LOG:  statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG:  logical
> replication table synchronization worker for subscription "testsub", table "t" has
> finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0]
> LOG:  statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being
> executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG:
> statement: SELECT DISTINCT t.schemaname, t.tablename
> >         , t.attnames
> >         FROM pg_catalog.pg_publication_tables t
> >          WHERE t.pubname IN ('testpub')
> >
> >
> > And that's it, the rest of the time is just polling.
> >
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to
> make
> > something like this more debuggable?
> >
> >
> > The fact that the synchronization for t finished just before the refresh makes
> > me wonder if a wakeup or a cache invalidation got lost?
> >
> 
> From the LOGs, the only thing one could draw is lost invalidation
> because the nap time of the apply worker is 1s, so it should process
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel because the test is still polling for
> rels to be in 'ready' or 'done' state.
> 
> I think we can do three things to debug (a) as you suggest dump the
> rel state in wait_for_subscription_sync; (b) add some DEBUG log in
> invalidate_syncing_table_states() to ensure that invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.
> 

There was a similar buildfarm failure on francolin recently[1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.

--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at
/home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.plline 269.
 

014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR:  could not receive data from WAL stream: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
--

I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in code later.

[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09

Regards,
Shi Yu

Attachment

pgsql-hackers by date:

Previous
From: Gurjeet Singh
Date:
Subject: Minor code de-duplication in fe-connect.c
Next
From: Amit Kapila
Date:
Subject: Re: Support logical replication of DDLs