Thread: bailing out in tap tests nearly always a bad idea
Hi, A recent cfbot failure reminded me of a topic I wanted to raise: The use of BAIL in tap tests. When a tap test exits with BAIL it doesn't just consider that test failed, but it also stops running further tap tests in the same group: https://metacpan.org/pod/Test::More "Indicates to the harness that things are going so badly all testing should terminate. This includes the running of any additional test scripts." There may be a few cases where stopping testing makes sense, but nearly all of the uses of bail in our tests look like they just want to fail the the current test, rather than other tests in the same prove invocation. Besides aborting other tests, it's also just hard to parse the output: https://cirrus-ci.com/task/6243385840238592?logs=test_world#L2329 ... [16:28:12.178] [16:27:23] t/024_archive_recovery.pl ............ ok 3440 ms ( 0.00 usr 0.00 sys + 1.43 cusr 0.64 csys= 2.07 CPU) [16:28:12.178] [16:27:26] t/025_stuck_on_old_timeline.pl ....... ok 3024 ms ( 0.00 usr 0.00 sys + 1.21 cusr 0.61 csys= 1.82 CPU) [16:28:12.178] [16:27:30] t/026_overwrite_contrecord.pl ........ ok 3693 ms ( 0.01 usr 0.00 sys + 1.05 cusr 0.42 csys= 1.48 CPU) [16:28:12.178] Bailout called. Further testing stopped: command "/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708 --schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress --outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1 [16:28:12.178] FAILED--Further testing stopped: command "/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress --dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708 --schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress --outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1 [16:28:12.178] make[2]: *** [Makefile:27: check] Error 255 [16:28:12.178] make[1]: *** [Makefile:48: check-recovery-recurse] Error 2 What failed was 027_stream_regress.pl, due to a trivial regression test output change: diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/guc.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out --- /tmp/cirrus-ci-build/src/test/regress/expected/guc.out 2022-02-13 16:20:53.931949802 +0000 +++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out 2022-02-13 16:28:00.481307866 +0000 @@ -850,7 +850,8 @@ name --------------------------- default_statistics_target -(1 row) + enable_indexskipscan +(2 rows) -- Runtime-computed GUCs should be part of the preset category. SELECT name FROM tab_settings_flags But there's pretty much no way to know from the error output that this failure was in 027_stream_regress.pl (just an example, this is widespread). For some reason prove's output is much worse when the output is redirected to a file. When testing with prove going to a terminal, bail at least prints the test name. interactive: t/000_fail.pl ........................ Bailout called. Further testing stopped: I am bailing FAILED--Further testing stopped: I am bailing file: Bailout called. Further testing stopped: I am bailing FAILED--Further testing stopped: I am bailing the latter is just about undebuggable, particularly when using PROVE_FLAGS='-j...'. Even just a 'die' is better than bail, output wise (and better due to not preventing other tests from running): interactive: t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900) No subtests run t/001_stream_rep.pl .................. ok file: t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900) No subtests run t/001_stream_rep.pl .................. ok Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we should consider doing a global replace of the use of bailing. Best with a central function signalling fatal error, rather than individual uses of die or such. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we > should consider doing a global replace of the use of bailing. +1 > Best with a > central function signalling fatal error, rather than individual uses of die > or such. Huh, doesn't Test::More already provide a sane way to do this? If not, why isn't die() good enough? (I don't think you can realistically expect to prohibit die() anywhere in the TAP tests.) regards, tom lane
Hi, On 2022-02-13 18:32:59 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Best with a > > central function signalling fatal error, rather than individual uses of die > > or such. > > Huh, doesn't Test::More already provide a sane way to do this? I looked, and didn't see anything. But I'm not a perl person, so I might just have missed something. > If not, why isn't die() good enough? (I don't think you can > realistically expect to prohibit die() anywhere in the TAP tests.) The output of dying isn't great either: t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900) No subtests run it'd be nicer if that that showed the actual reason for failing, rather than the unhelpful "Dubious, test returned" stuff. But it's still better than BAIL_OUT. So I thought that putting the failure handling in a central routine would allow us to make the exit nicer in a central place / at a later stage... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Hi, > > On 2022-02-13 18:32:59 -0500, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >> > Best with a >> > central function signalling fatal error, rather than individual uses of die >> > or such. >> >> Huh, doesn't Test::More already provide a sane way to do this? > > I looked, and didn't see anything. But I'm not a perl person, so I might just > have missed something. > >> If not, why isn't die() good enough? (I don't think you can >> realistically expect to prohibit die() anywhere in the TAP tests.) die()-ing is the correct way to abort a single test script. > The output of dying isn't great either: > > t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900) > No subtests run > > it'd be nicer if that that showed the actual reason for failing, rather than > the unhelpful "Dubious, test returned" stuff. It does output the die() message, but it's on stderr, while the harness output is on stdout, so redirections might interfere. $ echo 'use Test::More; die "aieee";' > t/die-immediately.t $ echo 'use Test::More; pass "yay"; die "aieee";' > t/die-after-success.t $ prove --quiet t/die-after-success.t .. 1/? aieee at t/die-after-success.t line 1. # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 255 just after 1. t/die-after-success.t .. Dubious, test returned 255 (wstat 65280, 0xff00) All 1 subtests passed t/die-immediately.t .... aieee at t/die-immediately.t line 1. t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00) No subtests run > Greetings, > > Andres Freund - ilmari
Hi, On 2022-02-14 16:08:42 +0000, Dagfinn Ilmari Mannsåker wrote: > die()-ing is the correct way to abort a single test script. There's really no way to nice way to abort without the "Dubious, test returned 255 (wstat 65280, 0xff00)" stuff? Even just getting rid of the "Tests were run but no plan was declared and done_testing() was not seen." noise would be helpful. So I think using a fatal error routine that forced a failure to be recognized via ok(0, 'fatal error') and then does done_testing() would be better... > t/die-immediately.t .... aieee at t/die-immediately.t line 1. > t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00) > No subtests run Hm. Looks different when a test is including our test helpers. t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) No subtests run t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2. t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00) No subtests run So I think we broke something... I think it's the log file redirection stuff in INIT. Greetings, Andres Freund
On Sun, Feb 13, 2022 at 6:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Huh, doesn't Test::More already provide a sane way to do this? > If not, why isn't die() good enough? (I don't think you can > realistically expect to prohibit die() anywhere in the TAP tests.) +1 for die. There's very little reason to use BAIL_OUT. If we agree that we need a wrapper for die() for some reason, I can live with that, too, but die is simple and familiar to perl programmers. -- Robert Haas EDB: http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes: > Even just getting rid of the "Tests were run but no plan was declared and > done_testing() was not seen." noise would be helpful. So I think using a fatal > error routine that forced a failure to be recognized via ok(0, 'fatal error') > and then does done_testing() would be better... Maybe we could do something in an END block provided by Utils.pm? I still think that insisting that people avoid die() is going to be annoying. regards, tom lane
On 2/14/22 12:58, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> Even just getting rid of the "Tests were run but no plan was declared and >> done_testing() was not seen." noise would be helpful. So I think using a fatal >> error routine that forced a failure to be recognized via ok(0, 'fatal error') >> and then does done_testing() would be better... > Maybe we could do something in an END block provided by Utils.pm? > I still think that insisting that people avoid die() is going to > be annoying. > > See references to $SIG{__DIE__} in `perldoc-f die` and `perldoc perlvar` for how to set up a handler for die() cleanup. Maybe we could set up a handler that calls done_testing(). ? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 14.02.22 00:22, Andres Freund wrote: > Does anybody want to defend / explain the use of BAIL_OUT? I think this was my mistake when this was originally added. I agree it should be changed.
Hi, On 2022-02-14 09:46:20 -0800, Andres Freund wrote: > > t/die-immediately.t .... aieee at t/die-immediately.t line 1. > > t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00) > > No subtests run > > Hm. Looks different when a test is including our test helpers. > > t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) > No subtests run > t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2. > t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00) > No subtests run > > So I think we broke something... I think it's the log file redirection stuff > in INIT. Any chance somebody with more perl knowledge could look into this? Clearly our effort to copy all the output the original file descriptors isn't successful here. $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';" $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';" blorb at -e line 1. (note that the former will create a tmp_check in your current directory) Greetings, Andres Freund
Hi, On 2022-02-22 09:28:37 -0800, Andres Freund wrote: > On 2022-02-14 09:46:20 -0800, Andres Freund wrote: > > > t/die-immediately.t .... aieee at t/die-immediately.t line 1. > > > t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00) > > > No subtests run > > > > Hm. Looks different when a test is including our test helpers. > > > > t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) > > No subtests run > > t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2. > > t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00) > > No subtests run > > > > So I think we broke something... I think it's the log file redirection stuff > > in INIT. > > Any chance somebody with more perl knowledge could look into this? Clearly our > effort to copy all the output the original file descriptors isn't successful > here. > > $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';" > $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';" > blorb at -e line 1. So the problem is that die just outputs to stderr, but we've decided to have stderr connected to the output going to tap for some reason. I guess that prevents us from messing up the tap output, but it's still weird. Because of that redirection, die printing to stderr isn't visible to tap. Seems we can use black magic to "fix" that... Putting the following into INIT{} seems to do the trick: # Because of the above redirection the tap output wouldn't contain # information about tests failing due to die etc. Fix that by also # printing the failure to the original stderr. $SIG{__DIE__} = sub { # Don't redirect if it's a syntax error ($^S is undefined) or in an # eval block ($^S == 1). if (defined $^S and $^S == 0) { print $orig_stderr "$_[0]\n"; #fail("died: $_[0]"); #done_testing(); } }; $ cat /tmp/die_pg_utils.pl use PostgreSQL::Test::Utils; use Test::More; ok(1, "foo"); die 'blorb'; done_testing(); With the print we get something like: $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl ok 1 - foo blorb at /tmp/die_pg_utils.pl line 5. # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 25 just after 1. With the fail() and done_testing() we get $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl ok 1 - foo not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5. # # Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5. # ' # at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 235. 1..2 # Looks like your test exited with 25 just after 2. The latter output doesn't confuse with stuff about plans and exit codes. But contains redundant messages (whatever) and it doesn't seem particularly clean to hijack a "test number". Greetings, Andres Freund
On 2/22/22 13:19, Andres Freund wrote: > Hi, > > On 2022-02-22 09:28:37 -0800, Andres Freund wrote: >> On 2022-02-14 09:46:20 -0800, Andres Freund wrote: >>>> t/die-immediately.t .... aieee at t/die-immediately.t line 1. >>>> t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00) >>>> No subtests run >>> Hm. Looks different when a test is including our test helpers. >>> >>> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) >>> No subtests run >>> t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2. >>> t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00) >>> No subtests run >>> >>> So I think we broke something... I think it's the log file redirection stuff >>> in INIT. >> Any chance somebody with more perl knowledge could look into this? Clearly our >> effort to copy all the output the original file descriptors isn't successful >> here. >> >> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';" >> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';" >> blorb at -e line 1. > So the problem is that die just outputs to stderr, but we've decided to have > stderr connected to the output going to tap for some reason. I guess that > prevents us from messing up the tap output, but it's still weird. Because of > that redirection, die printing to stderr isn't visible to tap. > > Seems we can use black magic to "fix" that... Putting the following into > INIT{} seems to do the trick: > > # Because of the above redirection the tap output wouldn't contain > # information about tests failing due to die etc. Fix that by also > # printing the failure to the original stderr. > $SIG{__DIE__} = sub > { > # Don't redirect if it's a syntax error ($^S is undefined) or in an > # eval block ($^S == 1). > if (defined $^S and $^S == 0) > { > print $orig_stderr "$_[0]\n"; > #fail("died: $_[0]"); > #done_testing(); > } > }; > > > $ cat /tmp/die_pg_utils.pl > use PostgreSQL::Test::Utils; > use Test::More; > > ok(1, "foo"); > die 'blorb'; > done_testing(); > > With the print we get something like: > > $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl > ok 1 - foo > blorb at /tmp/die_pg_utils.pl line 5. > > # Tests were run but no plan was declared and done_testing() was not seen. > # Looks like your test exited with 25 just after 1. > > With the fail() and done_testing() we get > > $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl > ok 1 - foo > not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5. > # > # Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5. > # ' > # at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 235. > 1..2 > # Looks like your test exited with 25 just after 2. > > > The latter output doesn't confuse with stuff about plans and exit codes. But > contains redundant messages (whatever) and it doesn't seem particularly clean > to hijack a "test number". > I'll be surprised if we can't come up with something cleaner than that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote: > I'll be surprised if we can't come up with something cleaner than that. Suggestions?
On 2/22/22 15:54, Andres Freund wrote: > On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote: >> I'll be surprised if we can't come up with something cleaner than that. > Suggestions? If we just have the sig handler actions as: diag("died: $_[0]"); done_testing(); we get: ok 1 - foo # died: blorb at tst/tst.pl line 5. 1..1 # Looks like your test exited with 25 just after 1. Would that work? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2022-02-23 08:43:38 -0500, Andrew Dunstan wrote: > > On 2/22/22 15:54, Andres Freund wrote: > > On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote: > >> I'll be surprised if we can't come up with something cleaner than that. > > Suggestions? > > > If we just have the sig handler actions as: > > diag("died: $_[0]"); > done_testing(); > > we get: > > ok 1 - foo > # died: blorb at tst/tst.pl line 5. > 1..1 > # Looks like your test exited with 25 just after 1. > > > Would that work? Well, the if condition I had is needed, afaics. Otherwise we break eval() and / or risk crashing on syntax errors. If you're just talking about diag() vs ok(0, ...), I'm good with that. Greetings, Andres Freund
On 2/23/22 11:40, Andres Freund wrote: > Hi, > > On 2022-02-23 08:43:38 -0500, Andrew Dunstan wrote: >> On 2/22/22 15:54, Andres Freund wrote: >>> On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote: >>>> I'll be surprised if we can't come up with something cleaner than that. >>> Suggestions? >> >> If we just have the sig handler actions as: >> >> diag("died: $_[0]"); >> done_testing(); >> >> we get: >> >> ok 1 - foo >> # died: blorb at tst/tst.pl line 5. >> 1..1 >> # Looks like your test exited with 25 just after 1. >> >> >> Would that work? > Well, the if condition I had is needed, afaics. Otherwise we break eval() and > / or risk crashing on syntax errors. If you're just talking about diag() vs > ok(0, ...), I'm good with that. Yes, sorry, I meant this should be the contents of the condition block. I agree completely that it's necessary. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com