RE: Test failures of 100_bugs.pl - Mailing list pgsql-hackers
From | Yu Shi (Fujitsu) |
---|---|
Subject | RE: Test failures of 100_bugs.pl |
Date | |
Msg-id | OSZPR01MB6310D6F48372F52F1D85E1C5FD609@OSZPR01MB6310.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Test failures of 100_bugs.pl (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
RE: Test failures of 100_bugs.pl
|
List | pgsql-hackers |
On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote: > > > > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail > > occasionally. Just rarely enough that I never got around to looking into it > > for real. > > > ... > > > > We see t2 added to the publication: > > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] > LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2 > > > > And that *then* "t" was synchronized: > > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical > replication table synchronization worker for subscription "testsub", table "t" has > finished > > > > and then that the refresh was issued: > > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] > LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION > > > > And we see a walsender starting and the query to get the new tables being > executed: > > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: > statement: SELECT DISTINCT t.schemaname, t.tablename > > , t.attnames > > FROM pg_catalog.pg_publication_tables t > > WHERE t.pubname IN ('testpub') > > > > > > And that's it, the rest of the time is just polling. > > > > > > Perhaps wait_for_subscription_sync() should dump the set of rel states to > make > > something like this more debuggable? > > > > > > The fact that the synchronization for t finished just before the refresh makes > > me wonder if a wakeup or a cache invalidation got lost? > > > > From the LOGs, the only thing one could draw is lost invalidation > because the nap time of the apply worker is 1s, so it should process > invalidation during the time we are polling. Also, the rel should be > added to pg_subscription_rel because the test is still polling for > rels to be in 'ready' or 'done' state. > > I think we can do three things to debug (a) as you suggest dump the > rel state in wait_for_subscription_sync; (b) add some DEBUG log in > invalidate_syncing_table_states() to ensure that invalidation has been > processed; (c) print rel states and relids from table_states_not_ready > in process_syncing_tables_for_apply() to see if t2 has ever appeared > in that list. > There was a similar buildfarm failure on francolin recently[1]. I think the problem is that after REFRESH PUBLICATION, the table sync worker for the new table test_mismatching_types was not started. So, the test timed out while waiting for an ERROR message that should have been reported by the table sync worker. -- regress_log_014_binary: timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.plline 269. 014_binary_subscriber.log: 2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; 2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. -- I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested. Please see the attached patch. I will try to add some debug logs in code later. [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09 Regards, Shi Yu
Attachment
pgsql-hackers by date: