bailing out in tap tests nearly always a bad idea - Mailing list pgsql-hackers

From Andres Freund
Subject bailing out in tap tests nearly always a bad idea
Date
Msg-id 20220213232249.7sevhlioapydla37@alap3.anarazel.de
Whole thread Raw
Responses Re: bailing out in tap tests nearly always a bad idea  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: bailing out in tap tests nearly always a bad idea  (Peter Eisentraut <peter.eisentraut@enterprisedb.com>)
List pgsql-hackers
Hi,

A recent cfbot failure reminded me of a topic I wanted to raise: The use of
BAIL in tap tests.

When a tap test exits with BAIL it doesn't just consider that test failed, but
it also stops running further tap tests in the same group:
https://metacpan.org/pod/Test::More
"Indicates to the harness that things are going so badly all testing should
terminate. This includes the running of any additional test scripts."

There may be a few cases where stopping testing makes sense, but nearly all of
the uses of bail in our tests look like they just want to fail the the current
test, rather than other tests in the same prove invocation.


Besides aborting other tests, it's also just hard to parse the output:
https://cirrus-ci.com/task/6243385840238592?logs=test_world#L2329
...
[16:28:12.178] [16:27:23] t/024_archive_recovery.pl ............ ok     3440 ms ( 0.00 usr  0.00 sys +  1.43 cusr  0.64
csys=  2.07 CPU)
 
[16:28:12.178] [16:27:26] t/025_stuck_on_old_timeline.pl ....... ok     3024 ms ( 0.00 usr  0.00 sys +  1.21 cusr  0.61
csys=  1.82 CPU)
 
[16:28:12.178] [16:27:30] t/026_overwrite_contrecord.pl ........ ok     3693 ms ( 0.01 usr  0.00 sys +  1.05 cusr  0.42
csys=  1.48 CPU)
 
[16:28:12.178] Bailout called.  Further testing stopped:  command
"/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1
 
[16:28:12.178] FAILED--Further testing stopped: command
"/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1
 
[16:28:12.178] make[2]: *** [Makefile:27: check] Error 255
[16:28:12.178] make[1]: *** [Makefile:48: check-recovery-recurse] Error 2

What failed was 027_stream_regress.pl, due to a trivial regression test output
change:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/guc.out
/tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/guc.out    2022-02-13 16:20:53.931949802 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out    2022-02-13 16:28:00.481307866 +0000
@@ -850,7 +850,8 @@
            name
 ---------------------------
  default_statistics_target
-(1 row)
+ enable_indexskipscan
+(2 rows)

 -- Runtime-computed GUCs should be part of the preset category.
 SELECT name FROM tab_settings_flags


But there's pretty much no way to know from the error output that this failure
was in 027_stream_regress.pl (just an example, this is widespread).


For some reason prove's output is much worse when the output is redirected to
a file. When testing with prove going to a terminal, bail at least prints the
test name.

interactive:
t/000_fail.pl ........................ Bailout called.  Further testing stopped:  I am bailing
FAILED--Further testing stopped: I am bailing

file:
Bailout called.  Further testing stopped:  I am bailing
FAILED--Further testing stopped: I am bailing

the latter is just about undebuggable, particularly when using
PROVE_FLAGS='-j...'.


Even just a 'die' is better than bail, output wise (and better due to not
preventing other tests from running):

interactive:
t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok

file:
t/000_fail.pl ........................
Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok


Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we
should consider doing a global replace of the use of bailing. Best with a
central function signalling fatal error, rather than individual uses of die
or such.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [PATCH] Avoid open and lock the table Extendend Statistics (src/backend/commands/statscmds.c)
Next
From: Tom Lane
Date:
Subject: Re: [PATCH] Avoid open and lock the table Extendend Statistics (src/backend/commands/statscmds.c)