Thread: Test failures of 100_bugs.pl
Hi,

cfbot, the buildfarm and locally I have seen 100_bugs.pl fail occasionally. Just rarely enough that I never got around to looking into it for real.

Just now there was another failure on master:
https://cirrus-ci.com/task/5279589287591936

[01:00:49.441] ok 1 - index predicates do not cause crash
[01:00:49.441] ok 2 - update to temporary table without replica identity with FOR ALL TABLES publication
[01:00:49.441] ok 3 - update to unlogged table without replica identity with FOR ALL TABLES publication
[01:00:49.441] # test failed
[01:00:49.441] --- stderr ---
[01:00:49.441] # poll_query_until timed out executing this query:
[01:00:49.441] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
[01:00:49.441] # expecting this output:
[01:00:49.441] # t
[01:00:49.441] # last actual query output:
[01:00:49.441] # f
[01:00:49.441] # with stderr:
[01:00:49.441] # Tests were run but no plan was declared and done_testing() was not seen.
[01:00:49.441] # Looks like your test exited with 29 just after 3.
[01:00:49.441]
[01:00:49.441] (test program exited with status code 29)

the regress log:
https://api.cirrus-ci.com/v1/artifact/task/5279589287591936/testrun/build-32/testrun/subscription/100_bugs/log/regress_log_100_bugs

and the twoways node's log:
https://api.cirrus-ci.com/v1/artifact/task/5279589287591936/testrun/build-32/testrun/subscription/100_bugs/log/100_bugs_twoways.log

We see t2 added to the publication:
2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2

And that *then* "t" was synchronized:
2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished

and then that the refresh was issued:
2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION

And we see a walsender starting and the query to get the new tables being executed:
2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename, t.attnames
	FROM pg_catalog.pg_publication_tables t
	WHERE t.pubname IN ('testpub')

And that's it; the rest of the time is just polling.

Perhaps wait_for_subscription_sync() should dump the set of rel states to make something like this more debuggable?

The fact that the synchronization for t finished just before the refresh makes me wonder if a wakeup or a cache invalidation got lost?

Greetings,

Andres Freund
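For reference on how the failure above is reported: poll_query_until() in the TAP framework simply re-runs the SQL until its output matches the expected string or a timeout expires, then reports the last actual output. A minimal Python sketch of that loop (run_query is a stand-in for executing the SQL via psql; the timeout and interval values here are illustrative, not the framework's exact defaults):

```python
import time

def poll_query_until(run_query, expected, timeout=180.0, interval=0.1):
    """Re-run a query until its output matches `expected` or we time out.

    Loosely mimics the TAP framework's poll_query_until: on timeout the
    last actual output is reported, which is why the failure above ends
    with 'last actual query output: f'.
    """
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = run_query()
        if last == expected:
            return True
        time.sleep(interval)
    print(f"poll_query_until timed out; expected {expected!r}, last output {last!r}")
    return False

# Simulated subscriber whose pg_subscription_rel never drains, as in the
# failing run: the query keeps returning 'f' until the timeout fires.
stuck = poll_query_until(lambda: "f", "t", timeout=0.3)
```

The key property is that a stuck table state produces no information beyond "f"; the query result collapses every relation's srsubstate into one boolean, which is what motivates the dump-the-rel-states suggestion above.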
On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
>
> cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> occasionally. Just rarely enough that I never got around to looking into it
> for real.
>
...
>
> We see t2 added to the publication:
> 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
>
> And that *then* "t" was synchronized:
> 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished
>
> and then that the refresh was issued:
> 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
>
> And we see a walsender starting and the query to get the new tables being executed:
> 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename
> , t.attnames
> FROM pg_catalog.pg_publication_tables t
> WHERE t.pubname IN ('testpub')
>
>
> And that's it, the rest of the time is just polling.
>
>
> Perhaps wait_for_subscription_sync() should dump the set of rel states to make
> something like this more debuggable?
>
>
> The fact that the synchronization for t finished just before the refresh makes
> me wonder if a wakeup or a cache invalidation got lost?
>

From the LOGs, the only thing one could draw is a lost invalidation, because the nap time of the apply worker is 1s, so it should process the invalidation during the time we are polling. Also, the rel should be added to pg_subscription_rel because the test is still polling for rels to be in 'ready' or 'done' state.
I think we can do three things to debug:
(a) as you suggest, dump the rel state in wait_for_subscription_sync;
(b) add some DEBUG log in invalidate_syncing_table_states() to ensure that the invalidation has been processed;
(c) print rel states and relids from table_states_not_ready in process_syncing_tables_for_apply() to see if t2 has ever appeared in that list.

--
With Regards,
Amit Kapila.
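To illustrate what suggestion (a) would buy us: on timeout the helper would report each relation's srsubstate instead of the single boolean. A rough Python sketch of that report logic (the rel_states dict is a made-up snapshot for illustration; the real data would come from querying pg_subscription_rel):

```python
# Hypothetical snapshot of pg_subscription_rel on the stuck subscriber.
# srsubstate codes: 'i' = init, 'd' = data copy, 's' = syncdone, 'r' = ready.
rel_states = {"t": "r", "t2": "d"}

def dump_not_ready(states):
    """Return the relations poll_query_until is still waiting on, i.e.
    everything whose state is not in ('r', 's'); in the failing run this
    would immediately show which table got stuck and in which state."""
    return {rel: st for rel, st in states.items() if st not in ("r", "s")}

print(dump_not_ready(rel_states))
```

With a dump like this, a failure log would distinguish "t2 never entered pg_subscription_rel at all" from "t2 is there but stuck in data copy", which the current boolean poll cannot.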
On Tue, Jan 24, 2023 7:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG: statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG: logical replication table synchronization worker for subscription "testsub", table "t" has finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG: statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG: statement: SELECT DISTINCT t.schemaname, t.tablename
> > , t.attnames
> > FROM pg_catalog.pg_publication_tables t
> > WHERE t.pubname IN ('testpub')
> >
> >
> > And that's it, the rest of the time is just polling.
> >
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to make
> > something like this more debuggable?
> >
> >
> > The fact that the synchronization for t finished just before the refresh makes
> > me wonder if a wakeup or a cache invalidation got lost?
> >
>
> From the LOGs, the only thing one could draw is lost invalidation
> because the nap time of the apply worker is 1s, so it should process
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel because the test is still polling for
> rels to be in 'ready' or 'done' state.
>
> I think we can do three things to debug (a) as you suggest dump the
> rel state in wait_for_subscription_sync; (b) add some DEBUG log in
> invalidate_syncing_table_states() to ensure that invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.
>

There was a similar buildfarm failure on francolin recently[1]. I think the problem is that after REFRESH PUBLICATION, the table sync worker for the new table test_mismatching_types was not started. So, the test timed out while waiting for an ERROR message that should have been reported by the table sync worker.

--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
--

I wrote a patch to dump the rel state in wait_for_subscription_sync() as suggested. Please see the attached patch. I will try to add some debug logs in the code later.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2023-04-16%2018%3A17%3A09

Regards,
Shi Yu
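A side note on the "(?^:...)" in the francolin log: that is just how Perl stringifies a compiled qr// regex, so the pattern the test was waiting for is "ERROR: ( [A-Z0-9]+:)? incorrect binary data format". The same check in Python, run against a fabricated sample of the line the sync worker would have logged had it started (no such line appeared in the failing run, hence the timeout):

```python
import re

# The pattern from 014_binary.pl's log match: an ERROR line, optionally
# carrying a SQLSTATE code (when log_error_verbosity adds one), about
# binary input. Note PostgreSQL pads "ERROR:" with two spaces, which the
# optional group's leading space accounts for.
pattern = re.compile(r"ERROR: ( [A-Z0-9]+:)? incorrect binary data format")

# Fabricated example of the expected subscriber log line:
sample = "2023-04-16 18:18:40.000 UTC [999999] ERROR:  incorrect binary data format"
```

So the timeout really does mean "the sync worker never got far enough to emit its ERROR", not that the message format changed.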
On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:
>
> I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
> Please see the attached patch.
> I will try to add some debug logs in code later.
>

Please see the attached v2 patch.

I added some debug logs when invalidating syncing table states and updating the table_states_not_ready list. I also adjusted the message level in the tests which failed before.

Regards,
Shi Yu
On Monday, April 24, 2023 5:50 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:
>
> On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu) <shiy.fnst@fujitsu.com> wrote:
> >
> > I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
> > Please see the attached patch.
> > I will try to add some debug logs in code later.
> >
>
> Please see the attached v2 patch.
>
> I added some debug logs when invalidating syncing table states and updating
> table_states_not_ready list. I also adjusted the message level in the tests which
> failed before.

Just a reference: I think a similar issue has been analyzed in another thread[1], and the reason seems clear that the table state cache invalidation got lost due to a race condition. The fix is also being discussed there.

[1] https://www.postgresql.org/message-id/flat/711a6afe-edb7-1211-cc27-1bef8239eec7%40gmail.com

Best Regards,
Hou zj
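The race referred to above can be modelled in a few lines. The following is a deliberately simplified Python model, not PostgreSQL's actual C code: a cache rebuild that reads the table states and then unconditionally marks the cache valid will clobber an invalidation that arrives mid-rebuild. The Events only exist to force the unlucky interleaving deterministically:

```python
import threading

table_states_valid = False
not_ready = []
catalog = ["t"]                     # relations known to the sync machinery

snapshot_taken = threading.Event()  # rebuild has read the catalog
refresh_done = threading.Event()    # publication refresh has happened

def fetch_table_states():
    """Racy cache rebuild, loosely modelled on the reported bug:
    read the states, then unconditionally mark the cache valid."""
    global table_states_valid, not_ready
    snapshot = list(catalog)        # read catalog state
    snapshot_taken.set()
    refresh_done.wait()             # invalidation arrives *during* the rebuild
    not_ready = snapshot
    table_states_valid = True       # clobbers the invalidation just received

def invalidate_syncing_table_states():
    """Simplified invalidation callback: just flags the cache as stale."""
    global table_states_valid
    table_states_valid = False

worker = threading.Thread(target=fetch_table_states)
worker.start()
snapshot_taken.wait()
catalog.append("t2")                # ALTER SUBSCRIPTION ... REFRESH PUBLICATION
invalidate_syncing_table_states()
refresh_done.set()
worker.join()

# The cache claims to be valid, yet t2 is missing from it, so no sync
# worker is ever started for t2 and the test polls until it times out:
print(table_states_valid, not_ready)
```

This matches the observed symptoms: the refresh completed, the invalidation fired, and yet t2 never appeared in the apply worker's not-ready list.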