Thread: bailing out in tap tests nearly always a bad idea

bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

A recent cfbot failure reminded me of a topic I wanted to raise: The use of
BAIL in tap tests.

When a tap test exits with BAIL it doesn't just consider that test failed, but
it also stops running further tap tests in the same group:
https://metacpan.org/pod/Test::More
"Indicates to the harness that things are going so badly all testing should
terminate. This includes the running of any additional test scripts."

There may be a few cases where stopping testing makes sense, but nearly all of
the uses of bail in our tests look like they just want to fail the the current
test, rather than other tests in the same prove invocation.


Besides aborting other tests, it's also just hard to parse the output:
https://cirrus-ci.com/task/6243385840238592?logs=test_world#L2329
...
[16:28:12.178] [16:27:23] t/024_archive_recovery.pl ............ ok     3440 ms ( 0.00 usr  0.00 sys +  1.43 cusr  0.64
csys=  2.07 CPU)
 
[16:28:12.178] [16:27:26] t/025_stuck_on_old_timeline.pl ....... ok     3024 ms ( 0.00 usr  0.00 sys +  1.21 cusr  0.61
csys=  1.82 CPU)
 
[16:28:12.178] [16:27:30] t/026_overwrite_contrecord.pl ........ ok     3693 ms ( 0.01 usr  0.00 sys +  1.05 cusr  0.42
csys=  1.48 CPU)
 
[16:28:12.178] Bailout called.  Further testing stopped:  command
"/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1
 
[16:28:12.178] FAILED--Further testing stopped: command
"/tmp/cirrus-ci-build/src/test/recovery/../../../src/test/regress/pg_regress
--dlpath="/tmp/cirrus-ci-build/src/test/regress"--bindir= --host=/tmp/gFHAMHHTJ_ --port=52708
--schedule=../regress/parallel_schedule--max-concurrent-tests=20 --inputdir=../regress
--outputdir="/tmp/cirrus-ci-build/src/test/recovery/tmp_check""exited with value 1
 
[16:28:12.178] make[2]: *** [Makefile:27: check] Error 255
[16:28:12.178] make[1]: *** [Makefile:48: check-recovery-recurse] Error 2

What failed was 027_stream_regress.pl, due to a trivial regression test output
change:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/guc.out
/tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/guc.out    2022-02-13 16:20:53.931949802 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/guc.out    2022-02-13 16:28:00.481307866 +0000
@@ -850,7 +850,8 @@
            name
 ---------------------------
  default_statistics_target
-(1 row)
+ enable_indexskipscan
+(2 rows)

 -- Runtime-computed GUCs should be part of the preset category.
 SELECT name FROM tab_settings_flags


But there's pretty much no way to know from the error output that this failure
was in 027_stream_regress.pl (just an example, this is widespread).


For some reason prove's output is much worse when the output is redirected to
a file. When testing with prove going to a terminal, bail at least prints the
test name.

interactive:
t/000_fail.pl ........................ Bailout called.  Further testing stopped:  I am bailing
FAILED--Further testing stopped: I am bailing

file:
Bailout called.  Further testing stopped:  I am bailing
FAILED--Further testing stopped: I am bailing

the latter is just about undebuggable, particularly when using
PROVE_FLAGS='-j...'.


Even just a 'die' is better than bail, output wise (and better due to not
preventing other tests from running):

interactive:
t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok

file:
t/000_fail.pl ........................
Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/001_stream_rep.pl .................. ok


Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we
should consider doing a global replace of the use of bailing. Best with a
central function signalling fatal error, rather than individual uses of die
or such.

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Does anybody want to defend / explain the use of BAIL_OUT? If not, I think we
> should consider doing a global replace of the use of bailing.

+1

> Best with a
> central function signalling fatal error, rather than individual uses of die
> or such.

Huh, doesn't Test::More already provide a sane way to do this?
If not, why isn't die() good enough?  (I don't think you can
realistically expect to prohibit die() anywhere in the TAP tests.)

            regards, tom lane



Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

On 2022-02-13 18:32:59 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Best with a
> > central function signalling fatal error, rather than individual uses of die
> > or such.
>
> Huh, doesn't Test::More already provide a sane way to do this?

I looked, and didn't see anything. But I'm not a perl person, so I might just
have missed something.


> If not, why isn't die() good enough?  (I don't think you can
> realistically expect to prohibit die() anywhere in the TAP tests.)

The output of dying isn't great either:

t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run

it'd be nicer if that that showed the actual reason for failing, rather than
the unhelpful "Dubious, test returned" stuff. But it's still better than
BAIL_OUT. So I thought that putting the failure handling in a central routine
would allow us to make the exit nicer in a central place / at a later stage...

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Dagfinn Ilmari Mannsåker
Date:
Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2022-02-13 18:32:59 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > Best with a
>> > central function signalling fatal error, rather than individual uses of die
>> > or such.
>>
>> Huh, doesn't Test::More already provide a sane way to do this?
>
> I looked, and didn't see anything. But I'm not a perl person, so I might just
> have missed something.
>
>> If not, why isn't die() good enough?  (I don't think you can
>> realistically expect to prohibit die() anywhere in the TAP tests.)

die()-ing is the correct way to abort a single test script.

> The output of dying isn't great either:
>
> t/000_fail.pl ........................ Dubious, test returned 25 (wstat 6400, 0x1900)
> No subtests run
>
> it'd be nicer if that that showed the actual reason for failing, rather than
> the unhelpful "Dubious, test returned" stuff.

It does output the die() message, but it's on stderr, while the harness
output is on stdout, so redirections might interfere.

$ echo 'use Test::More; die "aieee";' > t/die-immediately.t
$ echo 'use Test::More; pass "yay"; die "aieee";' > t/die-after-success.t
$ prove --quiet 
t/die-after-success.t .. 1/? aieee at t/die-after-success.t line 1.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 1.
t/die-after-success.t .. Dubious, test returned 255 (wstat 65280, 0xff00)
All 1 subtests passed
t/die-immediately.t .... aieee at t/die-immediately.t line 1.
t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run
  
> Greetings,
>
> Andres Freund

- ilmari



Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

On 2022-02-14 16:08:42 +0000, Dagfinn Ilmari Mannsåker wrote:
> die()-ing is the correct way to abort a single test script.

There's really no way to nice way to abort without the "Dubious, test returned
255 (wstat 65280, 0xff00)" stuff?

Even just getting rid of the "Tests were run but no plan was declared and
done_testing() was not seen." noise would be helpful. So I think using a fatal
error routine that forced a failure to be recognized via ok(0, 'fatal error')
and then does done_testing() would be better...


> t/die-immediately.t .... aieee at t/die-immediately.t line 1.
> t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00)
> No subtests run

Hm. Looks different when a test is including our test helpers.

t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2.
t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run

So I think we broke something... I think it's the log file redirection stuff
in INIT.

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Robert Haas
Date:
On Sun, Feb 13, 2022 at 6:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Huh, doesn't Test::More already provide a sane way to do this?
> If not, why isn't die() good enough?  (I don't think you can
> realistically expect to prohibit die() anywhere in the TAP tests.)

+1 for die. There's very little reason to use BAIL_OUT. If we agree
that we need a wrapper for die() for some reason, I can live with
that, too, but die is simple and familiar to perl programmers.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: bailing out in tap tests nearly always a bad idea

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Even just getting rid of the "Tests were run but no plan was declared and
> done_testing() was not seen." noise would be helpful. So I think using a fatal
> error routine that forced a failure to be recognized via ok(0, 'fatal error')
> and then does done_testing() would be better...

Maybe we could do something in an END block provided by Utils.pm?
I still think that insisting that people avoid die() is going to
be annoying.

            regards, tom lane



Re: bailing out in tap tests nearly always a bad idea

From
Andrew Dunstan
Date:
On 2/14/22 12:58, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Even just getting rid of the "Tests were run but no plan was declared and
>> done_testing() was not seen." noise would be helpful. So I think using a fatal
>> error routine that forced a failure to be recognized via ok(0, 'fatal error')
>> and then does done_testing() would be better...
> Maybe we could do something in an END block provided by Utils.pm?
> I still think that insisting that people avoid die() is going to
> be annoying.
>
>             


See references to $SIG{__DIE__} in `perldoc-f die` and `perldoc perlvar`
for how to set up a handler for die() cleanup. Maybe we could set up a
handler that calls done_testing(). ?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: bailing out in tap tests nearly always a bad idea

From
Peter Eisentraut
Date:
On 14.02.22 00:22, Andres Freund wrote:
> Does anybody want to defend / explain the use of BAIL_OUT?

I think this was my mistake when this was originally added.  I agree it 
should be changed.



Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > t/die-immediately.t .... aieee at t/die-immediately.t line 1.
> > t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00)
> > No subtests run
> 
> Hm. Looks different when a test is including our test helpers.
> 
> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> No subtests run
> t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2.
> t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00)
> No subtests run
> 
> So I think we broke something... I think it's the log file redirection stuff
> in INIT.

Any chance somebody with more perl knowledge could look into this? Clearly our
effort to copy all the output the original file descriptors isn't successful
here.

$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';"
$ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';"
blorb at -e line 1.

(note that the former will create a tmp_check in your current directory)

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

On 2022-02-22 09:28:37 -0800, Andres Freund wrote:
> On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
> > > t/die-immediately.t .... aieee at t/die-immediately.t line 1.
> > > t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00)
> > > No subtests run
> >
> > Hm. Looks different when a test is including our test helpers.
> >
> > t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> > No subtests run
> > t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2.
> > t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00)
> > No subtests run
> >
> > So I think we broke something... I think it's the log file redirection stuff
> > in INIT.
>
> Any chance somebody with more perl knowledge could look into this? Clearly our
> effort to copy all the output the original file descriptors isn't successful
> here.
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';"
> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';"
> blorb at -e line 1.

So the problem is that die just outputs to stderr, but we've decided to have
stderr connected to the output going to tap for some reason. I guess that
prevents us from messing up the tap output, but it's still weird. Because of
that redirection, die printing to stderr isn't visible to tap.

Seems we can use black magic to "fix" that... Putting the following into
INIT{} seems to do the trick:

    # Because of the above redirection the tap output wouldn't contain
    # information about tests failing due to die etc. Fix that by also
    # printing the failure to the original stderr.
    $SIG{__DIE__} = sub
    {
        # Don't redirect if it's a syntax error ($^S is undefined) or in an
        # eval block ($^S == 1).
        if (defined $^S and $^S == 0)
        {
            print $orig_stderr "$_[0]\n";
            #fail("died: $_[0]");
            #done_testing();
        }
    };


$ cat /tmp/die_pg_utils.pl
use PostgreSQL::Test::Utils;
use Test::More;

ok(1, "foo");
die 'blorb';
done_testing();

With the print we get something like:

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
blorb at /tmp/die_pg_utils.pl line 5.

# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 25 just after 1.

With the fail() and done_testing() we get

$ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
ok 1 - foo
not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5.
#
#   Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5.
# '
#   at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 235.
1..2
# Looks like your test exited with 25 just after 2.


The latter output doesn't confuse with stuff about plans and exit codes. But
contains redundant messages (whatever) and it doesn't seem particularly clean
to hijack a "test number".

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Andrew Dunstan
Date:
On 2/22/22 13:19, Andres Freund wrote:
> Hi,
>
> On 2022-02-22 09:28:37 -0800, Andres Freund wrote:
>> On 2022-02-14 09:46:20 -0800, Andres Freund wrote:
>>>> t/die-immediately.t .... aieee at t/die-immediately.t line 1.
>>>> t/die-immediately.t .... Dubious, test returned 255 (wstat 65280, 0xff00)
>>>> No subtests run
>>> Hm. Looks different when a test is including our test helpers.
>>>
>>> t/000_die_pg_test_utils.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
>>> No subtests run
>>> t/000_die_test_more.pl ...... error at t/000_die_test_more.pl line 2.
>>> t/000_die_test_more.pl ...... Dubious, test returned 255 (wstat 65280, 0xff00)
>>> No subtests run
>>>
>>> So I think we broke something... I think it's the log file redirection stuff
>>> in INIT.
>> Any chance somebody with more perl knowledge could look into this? Clearly our
>> effort to copy all the output the original file descriptors isn't successful
>> here.
>>
>> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use PostgreSQL::Test::Utils; die 'blorb';"
>> $ perl -I /home/andres/src/postgresql/src/test/perl/ -e "use Test::More; die 'blorb';"
>> blorb at -e line 1.
> So the problem is that die just outputs to stderr, but we've decided to have
> stderr connected to the output going to tap for some reason. I guess that
> prevents us from messing up the tap output, but it's still weird. Because of
> that redirection, die printing to stderr isn't visible to tap.
>
> Seems we can use black magic to "fix" that... Putting the following into
> INIT{} seems to do the trick:
>
>     # Because of the above redirection the tap output wouldn't contain
>     # information about tests failing due to die etc. Fix that by also
>     # printing the failure to the original stderr.
>     $SIG{__DIE__} = sub
>     {
>         # Don't redirect if it's a syntax error ($^S is undefined) or in an
>         # eval block ($^S == 1).
>         if (defined $^S and $^S == 0)
>         {
>             print $orig_stderr "$_[0]\n";
>             #fail("died: $_[0]");
>             #done_testing();
>         }
>     };
>
>
> $ cat /tmp/die_pg_utils.pl
> use PostgreSQL::Test::Utils;
> use Test::More;
>
> ok(1, "foo");
> die 'blorb';
> done_testing();
>
> With the print we get something like:
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
> ok 1 - foo
> blorb at /tmp/die_pg_utils.pl line 5.
>
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 25 just after 1.
>
> With the fail() and done_testing() we get
>
> $ perl -I /home/andres/src/postgresql/src/test/perl/ /tmp/die_pg_utils.pl
> ok 1 - foo
> not ok 2 - died: blorb at /tmp/die_pg_utils.pl line 5.
> #
> #   Failed test 'died: blorb at /tmp/die_pg_utils.pl line 5.
> # '
> #   at /home/andres/src/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 235.
> 1..2
> # Looks like your test exited with 25 just after 2.
>
>
> The latter output doesn't confuse with stuff about plans and exit codes. But
> contains redundant messages (whatever) and it doesn't seem particularly clean
> to hijack a "test number".
>



I'll be surprised if we can't come up with something cleaner than that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
> I'll be surprised if we can't come up with something cleaner than that.

Suggestions?



Re: bailing out in tap tests nearly always a bad idea

From
Andrew Dunstan
Date:
On 2/22/22 15:54, Andres Freund wrote:
> On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
>> I'll be surprised if we can't come up with something cleaner than that.
> Suggestions?


If we just have the sig handler actions as:

    diag("died: $_[0]");
    done_testing();   

we get:

    ok 1 - foo
    # died: blorb at tst/tst.pl line 5.
    1..1
    # Looks like your test exited with 25 just after 1.


Would that work?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: bailing out in tap tests nearly always a bad idea

From
Andres Freund
Date:
Hi,

On 2022-02-23 08:43:38 -0500, Andrew Dunstan wrote:
> 
> On 2/22/22 15:54, Andres Freund wrote:
> > On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
> >> I'll be surprised if we can't come up with something cleaner than that.
> > Suggestions?
> 
> 
> If we just have the sig handler actions as:
> 
>     diag("died: $_[0]");
>     done_testing();   
> 
> we get:
> 
>     ok 1 - foo
>     # died: blorb at tst/tst.pl line 5.
>     1..1
>     # Looks like your test exited with 25 just after 1.
> 
> 
> Would that work?

Well, the if condition I had is needed, afaics. Otherwise we break eval() and
/ or risk crashing on syntax errors.  If you're just talking about diag() vs
ok(0, ...), I'm good with that.

Greetings,

Andres Freund



Re: bailing out in tap tests nearly always a bad idea

From
Andrew Dunstan
Date:
On 2/23/22 11:40, Andres Freund wrote:
> Hi,
>
> On 2022-02-23 08:43:38 -0500, Andrew Dunstan wrote:
>> On 2/22/22 15:54, Andres Freund wrote:
>>> On 2022-02-22 15:10:30 -0500, Andrew Dunstan wrote:
>>>> I'll be surprised if we can't come up with something cleaner than that.
>>> Suggestions?
>>
>> If we just have the sig handler actions as:
>>
>>     diag("died: $_[0]");
>>     done_testing();   
>>
>> we get:
>>
>>     ok 1 - foo
>>     # died: blorb at tst/tst.pl line 5.
>>     1..1
>>     # Looks like your test exited with 25 just after 1.
>>
>>
>> Would that work?
> Well, the if condition I had is needed, afaics. Otherwise we break eval() and
> / or risk crashing on syntax errors.  If you're just talking about diag() vs
> ok(0, ...), I'm good with that.



Yes, sorry, I meant this should be the contents of the condition block.
I agree completely that it's necessary.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com