Thread: [HACKERS] A note about debugging TAP failures
I've whined before about how developer-unfriendly the TAP test infrastructure is. One concrete problem is that if there is a failure, there is absolutely no way to get any information beyond what is in the logs, because the test installation's data directory is unconditionally blown away at run end. I wanted to look at the core file from an assertion failure, but of course that wasn't there anymore. (I realize I could've reconfigured my kernel to drop the core file somewhere else, but what if I'd needed to look at the data directory per se?) I got around this for the immediate need with this expedient hack: diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index ae8d178..8f2f09c 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -138,7 +138,7 @@ sub tempdir return File::Temp::tempdir( $prefix . '_XXXX', DIR => $tmp_check, - CLEANUP => 1); + CLEANUP => 0);}sub tempdir_short but that still wasn't exactly friendly, since at the end of the run I had a bunch of randomly-named data directories and had to guess which one corresponded to the failing test script. I think we need to fix TestLib and/or PostgresNode so that there's a way to make TAP tests not auto-clean their data directories at end of run, without having to resort to editing the script like this. It'd also be helpful if the data directory pathname included the test script's name. regards, tom lane
> I think we need to fix TestLib and/or PostgresNode so that there's a way > to make TAP tests not auto-clean their data directories at end of run, > without having to resort to editing the script like this. It'd also be > helpful if the data directory pathname included the test script's name. Rely on an environment variable, say "PG_TAP_NO_CLEANUP"? Easy to test from a perl script? -- Fabien.
Hi, On 2017-04-22 13:51:30 -0400, Tom Lane wrote: > I think we need to fix TestLib and/or PostgresNode so that there's a way > to make TAP tests not auto-clean their data directories at end of run, > without having to resort to editing the script like this. I think leaving the directory around in case of failures would be a pretty bare minimum. It's sometimes also useful to keep the remains if everything was apparently successfull, but it's far less important imo. To me it sounds like we shouldn't use a randomly generated temporary name, but rather a static mapping based on the test's path or such, to create the relevant temp directory. Before a start a test could then clean that up, that way the amount of remaining cruft would be bound by the number of tests. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-04-22 13:51:30 -0400, Tom Lane wrote: >> I think we need to fix TestLib and/or PostgresNode so that there's a way >> to make TAP tests not auto-clean their data directories at end of run, >> without having to resort to editing the script like this. > I think leaving the directory around in case of failures would be a > pretty bare minimum. It's sometimes also useful to keep the remains if > everything was apparently successfull, but it's far less important imo. In the particular case I was interested in here, the test script thought everything was successful :-(. I'm working on fixing that little problem, but I do not believe that the TAP scripts are so bulletproof that there will never be a need to override their judgment. > To me it sounds like we shouldn't use a randomly generated temporary > name, but rather a static mapping based on the test's path or such, to > create the relevant temp directory. Before a start a test could then > clean that up, that way the amount of remaining cruft would be bound by > the number of tests. Yeah, that's a reasonable design. There doesn't seem to be a good reason to randomize the test directory names. regards, tom lane
On 2017-04-22 16:22:59 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-22 13:51:30 -0400, Tom Lane wrote: > >> I think we need to fix TestLib and/or PostgresNode so that there's a way > >> to make TAP tests not auto-clean their data directories at end of run, > >> without having to resort to editing the script like this. > > > I think leaving the directory around in case of failures would be a > > pretty bare minimum. It's sometimes also useful to keep the remains if > > everything was apparently successfull, but it's far less important imo. > > In the particular case I was interested in here, the test script thought > everything was successful :-(. I'm working on fixing that little problem, > but I do not believe that the TAP scripts are so bulletproof that there > will never be a need to override their judgment. Agreed. If paths are reproducible and cleaned up on next run it's also much less of an issue to just leave them around till the next run. Which we imo also should do for non-failing tmp_check directories. > > To me it sounds like we shouldn't use a randomly generated temporary > > name, but rather a static mapping based on the test's path or such, to > > create the relevant temp directory. Before a start a test could then > > clean that up, that way the amount of remaining cruft would be bound by > > the number of tests. > > Yeah, that's a reasonable design. There doesn't seem to be a good reason > to randomize the test directory names. Just as long as they're unique across the tree, to allow for parallelism.. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-04-22 16:22:59 -0400, Tom Lane wrote: >> In the particular case I was interested in here, the test script thought >> everything was successful :-(. I'm working on fixing that little problem, >> but I do not believe that the TAP scripts are so bulletproof that there >> will never be a need to override their judgment. > Agreed. If paths are reproducible and cleaned up on next run it's also > much less of an issue to just leave them around till the next run. > Which we imo also should do for non-failing tmp_check directories. Mm, not convinced. I think that the delete-on-success behavior was intentional, to limit the amount of disk space a buildfarm member would consume during a run. I don't mind that being the default, as long as there's a way to override it manually. regards, tom lane
On 2017-04-22 17:25:49 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-04-22 16:22:59 -0400, Tom Lane wrote: > >> In the particular case I was interested in here, the test script thought > >> everything was successful :-(. I'm working on fixing that little problem, > >> but I do not believe that the TAP scripts are so bulletproof that there > >> will never be a need to override their judgment. > > > Agreed. If paths are reproducible and cleaned up on next run it's also > > much less of an issue to just leave them around till the next run. > > Which we imo also should do for non-failing tmp_check directories. > > Mm, not convinced. I think that the delete-on-success behavior was > intentional, to limit the amount of disk space a buildfarm member would > consume during a run. I don't mind that being the default, as long as > there's a way to override it manually. I don't think there is, currently? I've so far resorted to just adding some random "foobar" to one of the input files..., or just using installcheck. - Andres
> On 22 Apr 2017, at 23:25, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andres Freund <andres@anarazel.de> writes: >> On 2017-04-22 16:22:59 -0400, Tom Lane wrote: >>> In the particular case I was interested in here, the test script thought >>> everything was successful :-(. I'm working on fixing that little problem, >>> but I do not believe that the TAP scripts are so bulletproof that there >>> will never be a need to override their judgment. > >> Agreed. If paths are reproducible and cleaned up on next run it's also >> much less of an issue to just leave them around till the next run. >> Which we imo also should do for non-failing tmp_check directories. > > Mm, not convinced. I think that the delete-on-success behavior was > intentional, to limit the amount of disk space a buildfarm member would > consume during a run. I don't mind that being the default, as long as > there's a way to override it manually. Since we have the name of the running testscript, can’t we just add that to the tempdir to make the name more descriptive? With the attached patch I get tmp_check/001_pgbench_data_main_ItEm when running tests in src/bin/pgbench for example which gives a decent clue to what was executed. To allow for retain-on-success, checking for an environment variable in the cleanup phase seems the simplest approach. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017-04-22 23:59:11 +0200, Daniel Gustafsson wrote: > Since we have the name of the running testscript, can’t we just add that to the > tempdir to make the name more descriptive? With the attached patch I get > tmp_check/001_pgbench_data_main_ItEm when running tests in src/bin/pgbench for > example which gives a decent clue to what was executed. To allow for > retain-on-success, checking for an environment variable in the cleanup phase > seems the simplest approach. Because it means we'd still, by default, have to delete on failure. Otherwise you'd end up with an endless number of such partial directories. There's really no reason to do delete a failed directory ever, since the space usage is obviously bound. As there appears to be no benefit in the randomness of these directories, why work on retaining it? - Andres
> On 23 Apr 2017, at 00:06, Andres Freund <andres@anarazel.de> wrote: > > On 2017-04-22 23:59:11 +0200, Daniel Gustafsson wrote: >> Since we have the name of the running testscript, can’t we just add that to the >> tempdir to make the name more descriptive? With the attached patch I get >> tmp_check/001_pgbench_data_main_ItEm when running tests in src/bin/pgbench for >> example which gives a decent clue to what was executed. To allow for >> retain-on-success, checking for an environment variable in the cleanup phase >> seems the simplest approach. > > Because it means we'd still, by default, have to delete on > failure. Otherwise you'd end up with an endless number of such partial > directories. There's really no reason to do delete a failed directory > ever, since the space usage is obviously bound. > > As there appears to be no benefit in the randomness of these > directories, why work on retaining it? I’ve never managed a buildfarm animal so there might be something there I’m missing, but other than that I can’t see much reason (running concurrent make check's in the same directory doesn’t seem useful). If one wants to keep a directory around it’s easy enough to manually rename it. Skipping the tempdir and instead using ${testname}_data_${name} without a random suffix, we can achieve this with something along the lines of the attached PoC. It works as now (retain of failure, remove on success unless overridden) but that logic can easily be turned around if we want that. If it’s of interest I can pursue this after some sleep (tomorrow has become today at this point). cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sun, Apr 23, 2017 at 7:48 AM, Daniel Gustafsson <daniel@yesql.se> wrote: > Skipping the tempdir and instead using ${testname}_data_${name} without a > random suffix, we can achieve this with something along the lines of the > attached PoC. It works as now (retain of failure, remove on success unless > overridden) but that logic can easily be turned around if we want that. If > it’s of interest I can pursue this after some sleep (tomorrow has become today > at this point). Yes, something like that may make sense as well for readability. Keeping folders in case of failures is something that I have been advocating in favor of for some time, but this never got into the tree :( There is a patch from Horiguchi-san that allows actually such a thing, have a look at patch 0006 on this email: https://www.postgresql.org/message-id/CAMsr+YEc2Ek=qJFJ2JeV7Spz69pOduoUJXBowZMppVYdeieRdA@mail.gmail.com The same concept rebased gives the attached. I didn't double-check the compatibility with past versions of Test:More though.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sat, Apr 22, 2017 at 02:05:13PM -0700, Andres Freund wrote: > On 2017-04-22 16:22:59 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > On 2017-04-22 13:51:30 -0400, Tom Lane wrote: > > >> I think we need to fix TestLib and/or PostgresNode so that there's a way > > >> to make TAP tests not auto-clean their data directories at end of run, > > >> without having to resort to editing the script like this. > > > > > I think leaving the directory around in case of failures would be a > > > pretty bare minimum. It's sometimes also useful to keep the remains if > > > everything was apparently successfull, but it's far less important imo. > > > > In the particular case I was interested in here, the test script thought > > everything was successful :-(. I'm working on fixing that little problem, > > but I do not believe that the TAP scripts are so bulletproof that there > > will never be a need to override their judgment. > > Agreed. If paths are reproducible and cleaned up on next run it's also > much less of an issue to just leave them around till the next run. > Which we imo also should do for non-failing tmp_check directories. In general yes, but I think it should be checked what the overall storage requirements will be if one set of all data directories is kept around. I was surprised the src/bin/pg_basebackup/t TAP tests took up several hundred megabytes (IIRC) when I ran them, so if other tests are similar, it might make a few animals run out of diskspace as soon as this is deployed without a heads-up to the operators. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
On 23 Apr. 2017 10:32, "Michael Paquier" <michael.paquier@gmail.com> wrote:
On Sun, Apr 23, 2017 at 7:48 AM, Daniel Gustafsson <daniel@yesql.se> wrote:Yes, something like that may make sense as well for readability.
> Skipping the tempdir and instead using ${testname}_data_${name} without a
> random suffix, we can achieve this with something along the lines of the
> attached PoC. It works as now (retain of failure, remove on success unless
> overridden) but that logic can easily be turned around if we want that. If
> it’s of interest I can pursue this after some sleep (tomorrow has become today
> at this point).
Keeping folders in case of failures is something that I have been
advocating in favor of for some time, but this never got into the tree
:(
Huh? We do keep test temp datadirs etc in case of failure. Just not on success.
Our definition of failure there sucks a bit though. We keep the datadirs if any test fails in a script. If the script its self crashes we still blow away the datadirs which is kind of counterintuitive.
I'd like to change the __DIE__ sig handler to only delete on clean script exit code, tap reporting success, and if some env bar like PG_TESTS_NOCLEAN is undefined. The later could also be used in pg_regress etc.
There is a patch from Horiguchi-san that allows actually such a thing,
have a look at patch 0006 on this email:
https://www.postgresql.org/message-id/CAMsr+YEc2Ek= qJFJ2JeV7Spz69pOduoUJXBowZMppV YdeieRdA@mail.gmail.com
The same concept rebased gives the attached. I didn't double-check the
compatibility with past versions of Test:More though..
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Banck <michael.banck@credativ.de> writes: > On Sat, Apr 22, 2017 at 02:05:13PM -0700, Andres Freund wrote: >> Agreed. If paths are reproducible and cleaned up on next run it's also >> much less of an issue to just leave them around till the next run. >> Which we imo also should do for non-failing tmp_check directories. > In general yes, but I think it should be checked what the overall > storage requirements will be if one set of all data directories is kept > around. > I was surprised the src/bin/pg_basebackup/t TAP tests took up several > hundred megabytes (IIRC) when I ran them, so if other tests are similar, > it might make a few animals run out of diskspace as soon as this is > deployed without a heads-up to the operators. src/test/recovery/ alone eats 2.6GB if one sets CLEANUP => 0 as I did upthread. So I think leaving them there by default is a nonstarter. It might be okay to leave failed tests' dirs in place, but even there, I'd personally prefer a manual control knob. regards, tom lane
On Sun, Apr 23, 2017 at 10:05 PM, Craig Ringer <craig.ringer@2ndquadrant.com> wrote: > On 23 Apr. 2017 10:32, "Michael Paquier" <michael.paquier@gmail.com> wrote: > On Sun, Apr 23, 2017 at 7:48 AM, Daniel Gustafsson <daniel@yesql.se> wrote: >> Skipping the tempdir and instead using ${testname}_data_${name} without a >> random suffix, we can achieve this with something along the lines of the >> attached PoC. It works as now (retain of failure, remove on success >> unless >> overridden) but that logic can easily be turned around if we want that. >> If >> it’s of interest I can pursue this after some sleep (tomorrow has become >> today >> at this point). > > Yes, something like that may make sense as well for readability. > > Keeping folders in case of failures is something that I have been > advocating in favor of for some time, but this never got into the tree > :( > > Huh? We do keep test temp datadirs etc in case of failure. Just not on > success. Yes, you are right. I thought this was not the case. Happy to be wrong. > Our definition of failure there sucks a bit though. We keep the datadirs if > any test fails in a script. If the script its self crashes we still blow > away the datadirs which is kind of counterintuitive. Yes, I agree that it would make sense to keep them around in this case as well, having the data folder may help in debugging tests in some cases. > I'd like to change the __DIE__ sig handler to only delete on clean script > exit code, tap reporting success, and if some env bar like PG_TESTS_NOCLEAN > is undefined. The later could also be used in pg_regress etc. This looks like a sensible plan. -- Michael
On 2017-04-23 11:31:05 +0900, Michael Paquier wrote: > Keeping folders in case of failures is something that I have been > advocating in favor of for some time, but this never got into the tree > :( I don't think it'd be ok to do so unless you the randomness of dirnames is changed as you'd just accumulate cruft forever without manual intervention. So that's not too surprising imo... Greetings, Andres Freund
> On 23 Apr 2017, at 15:05, Craig Ringer <craig.ringer@2ndquadrant.com> wrote: > > On 23 Apr. 2017 10:32, "Michael Paquier" <michael.paquier@gmail.com <mailto:michael.paquier@gmail.com>> wrote: > On Sun, Apr 23, 2017 at 7:48 AM, Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote: > > Skipping the tempdir and instead using ${testname}_data_${name} without a > > random suffix, we can achieve this with something along the lines of the > > attached PoC. It works as now (retain of failure, remove on success unless > > overridden) but that logic can easily be turned around if we want that. If > > it’s of interest I can pursue this after some sleep (tomorrow has become today > > at this point). > > Yes, something like that may make sense as well for readability. > > Keeping folders in case of failures is something that I have been > advocating in favor of for some time, but this never got into the tree > :( > > Huh? We do keep test temp datadirs etc in case of failure. Just not on success. > > Our definition of failure there sucks a bit though. We keep the datadirs if any test fails in a script. If the script itsself crashes we still blow away the datadirs which is kind of counterintuitive. > > I'd like to change the __DIE__ sig handler to only delete on clean script exit code, tap reporting success, and if someenv bar like PG_TESTS_NOCLEAN is undefined. The later could also be used in pg_regress etc. That sounds like a good idea, even though END might be enough when not using tempdir() for the datadirs since die() should set a non-zero $? afaik (still doesn’t hurt to capture with __DIE__ though). I extended my previous test with this, and other comments in this thread: it produces non-random datadirs, retains them on die|exit|test-failure or if PG_TESTS_NOCLEAN is set. Theres also a check-clean target for cleaning out retained datadirs. Given that the datadirs do occupy quite some space, perhaps a PG_TESTS_DOCLEAN (or similar) would be good as well to always blow away the datadirs regardless of test status? I’m np Perl expert though so there might be better/cleaner ways to achieve this, in testing it seems to work though. rmtree() is supported at least since Perl 5.6 from what I can see. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 24 April 2017 at 20:01, Daniel Gustafsson <daniel@yesql.se> wrote: > I’m np Perl expert though so there might be better/cleaner ways to achieve > this, in testing it seems to work though. rmtree() is supported at least since > Perl 5.6 from what I can see. I'd rather just have the 'make' target nuke the relevant tmp_check dir before rerunning the tests. Right now it just deletes the logs. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Remote DBA, Training &Services
> On 24 Apr 2017, at 17:19, Craig Ringer <craig.ringer@2ndquadrant.com> wrote: > > On 24 April 2017 at 20:01, Daniel Gustafsson <daniel@yesql.se> wrote: > >> I’m np Perl expert though so there might be better/cleaner ways to achieve >> this, in testing it seems to work though. rmtree() is supported at least since >> Perl 5.6 from what I can see. > > I'd rather just have the 'make' target nuke the relevant tmp_check dir > before rerunning the tests. Right now it just deletes the logs. Makes sense, assuming a “clean slate” to run on seems a reasonable assumption for the test and it makes for simpler code in PostgresNode. Updated the patch which now have static datadir names; retains on PG_TESTS_NOCLEAN or any kind of test failure; and allows for cleaning datadirs without having to use make clean (seems the wrong size hammer for that nail). cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 4/25/17 11:00, Daniel Gustafsson wrote: > Makes sense, assuming a “clean slate” to run on seems a reasonable assumption > for the test and it makes for simpler code in PostgresNode. Updated the patch > which now have static datadir names; retains on PG_TESTS_NOCLEAN or any kind of > test failure; and allows for cleaning datadirs without having to use make clean > (seems the wrong size hammer for that nail). Committed. I had to make some changes in the pg_rewind tests. Those make data directories with conflicting names, which did not work anymore because the names are no longer random. Other than that this is pretty nice. Thanks! -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 05 Sep 2017, at 18:37, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > > On 4/25/17 11:00, Daniel Gustafsson wrote: >> Makes sense, assuming a “clean slate” to run on seems a reasonable assumption >> for the test and it makes for simpler code in PostgresNode. Updated the patch >> which now have static datadir names; retains on PG_TESTS_NOCLEAN or any kind of >> test failure; and allows for cleaning datadirs without having to use make clean >> (seems the wrong size hammer for that nail). > > Committed. I had to make some changes in the pg_rewind tests. Those > make data directories with conflicting names, which did not work anymore > because the names are no longer random. Makes sense, +1 on those changes. > Other than that this is pretty nice. Thanks! Thanks for the commit! cheers ./daniel