Thread: Finding cause of test fails on the cfbot site

Finding cause of test fails on the cfbot site

From

Peter Smith

Date:

17 February 2021, 09:49:26

Hi -

I saw that one of our commitfest entries (32/2914) is recently
reporting a fail on the cfbot site [1]. I thought this was all ok a
few days ago.

We can see the test log indicating what was the test that failed [2]
Test Summary Report
-------------------
t/002_twophase_streaming.pl (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 2 tests but ran 1.
Files=2, Tests=3,  4 wallclock secs ( 0.03 usr  0.00 sys +  1.59 cusr
0.81 csys =  2.44 CPU)
Result: FAIL
gmake[2]: *** [../../src/makefiles/pgxs.mk:440: check] Error 1
gmake[1]: *** [Makefile:94: check-test_decoding-recurse] Error 2
gmake: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2
*** Error code 2


Is there any other detailed information available anywhere, e.g.
logs?, which might help us work out what was the cause of the test
failure?

Thankyou.

---
[1] http://cfbot.cputube.org/
[2] https://api.cirrus-ci.com/v1/task/5352561114873856/logs/test.log

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Finding cause of test fails on the cfbot site

From

Tom Lane

Date:

17 February 2021, 16:06:51

Peter Smith <smithpb2250@gmail.com> writes:
> I saw that one of our commitfest entries (32/2914) is recently
> reporting a fail on the cfbot site [1]. I thought this was all ok a
> few days ago.
> ...
> Is there any other detailed information available anywhere, e.g.
> logs?, which might help us work out what was the cause of the test
> failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

            regards, tom lane

Re: Finding cause of test fails on the cfbot site

From

Andrew Dunstan

Date:

17 February 2021, 20:18:02

On 2/17/21 11:06 AM, Tom Lane wrote:
> Peter Smith <smithpb2250@gmail.com> writes:
>> I saw that one of our commitfest entries (32/2914) is recently
>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>> few days ago.
>> ...
>> Is there any other detailed information available anywhere, e.g.
>> logs?, which might help us work out what was the cause of the test
>> failure?
> AFAIK the cfbot doesn't capture anything beyond the session typescript.
> However, this doesn't look that hard to reproduce locally ... have you
> tried, using similar configure options to what that cfbot run did?
> Once you did reproduce it, there'd be logs under
> contrib/test_decoding/tmp_check/.
>
>             



yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Finding cause of test fails on the cfbot site

From

Thomas Munro

Date:

17 February 2021, 20:42:04

On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 2/17/21 11:06 AM, Tom Lane wrote:
> > Peter Smith <smithpb2250@gmail.com> writes:
> >> I saw that one of our commitfest entries (32/2914) is recently
> >> reporting a fail on the cfbot site [1]. I thought this was all ok a
> >> few days ago.
> >> ...
> >> Is there any other detailed information available anywhere, e.g.
> >> logs?, which might help us work out what was the cause of the test
> >> failure?
> > AFAIK the cfbot doesn't capture anything beyond the session typescript.
> > However, this doesn't look that hard to reproduce locally ... have you
> > tried, using similar configure options to what that cfbot run did?
> > Once you did reproduce it, there'd be logs under
> > contrib/test_decoding/tmp_check/.
>
> yeah. The cfbot runs check-world which makes it difficult for it to know
> which log files to show when there's an error. That's a major part of
> the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps if the Makefile had an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly.  Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong.  It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts though, so
we'd need to figure out how to manage retention carefully.

For what it's worth, I tried to reproduce this particular on a couple
of systems, many times, with no luck.  It doesn't look like a freak CI
failure (there have been a few random terminations I can't explain
recently, but they look different, I think there was a Google Compute
Engine outage that might explain that), and it failed in exactly the
same way on Linux and FreeBSD.  I tried  locally on FreeBSD, on top of
commit a975ff4980d60f8cbd8d8cbcff70182ea53e787a (which is what the
last cfbot run did), because it conflicts with a recent change so it
doesn't apply on the tip of master right now.

Re: Finding cause of test fails on the cfbot site

From

Thomas Munro

Date:

17 February 2021, 22:54:36

On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... (there have been a few random terminations I can't explain
> recently, but they look different, I think there was a Google Compute
> Engine outage that might explain that), ...

There's also occasionally a failure like this[1], on FreeBSD only:

t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
Bailout called. Further testing stopped: system pg_basebackup failed
FAILED--Further testing stopped: system pg_basebackup failed

I have a sneaking suspicion this is a real problem on master, but more
logs will be needed to guess more about that...

[1] https://cirrus-ci.com/task/5046982650626048?command=test#L733

Re: Finding cause of test fails on the cfbot site

From

Peter Smith

Date:

17 February 2021, 23:48:23

On Thu, Feb 18, 2021 at 9:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... (there have been a few random terminations I can't explain
> > recently, but they look different, I think there was a Google Compute
> > Engine outage that might explain that), ...
>
> There's also occasionally a failure like this[1], on FreeBSD only:
>
> t/001_stream_rep.pl .................. ok
> t/002_archiving.pl ................... ok
> t/003_recovery_targets.pl ............ ok
> Bailout called. Further testing stopped: system pg_basebackup failed
> FAILED--Further testing stopped: system pg_basebackup failed
>
> I have a sneaking suspicion this is a real problem on master, but more
> logs will be needed to guess more about that...
>
> [1] https://cirrus-ci.com/task/5046982650626048?command=test#L733

Thanks for all the effort spent into looking at this.

Meanwhile, since you pointed out the patch is not applying on the HEAD
tip I can at least address that.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Finding cause of test fails on the cfbot site

From

Andrew Dunstan

Date:

18 February 2021, 16:01:03

On 2/17/21 3:42 PM, Thomas Munro wrote:
> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>> Peter Smith <smithpb2250@gmail.com> writes:
>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>> few days ago.
>>>> ...
>>>> Is there any other detailed information available anywhere, e.g.
>>>> logs?, which might help us work out what was the cause of the test
>>>> failure?
>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>> However, this doesn't look that hard to reproduce locally ... have you
>>> tried, using similar configure options to what that cfbot run did?
>>> Once you did reproduce it, there'd be logs under
>>> contrib/test_decoding/tmp_check/.
>> yeah. The cfbot runs check-world which makes it difficult for it to know
>> which log files to show when there's an error. That's a major part of
>> the reason the buildfarm runs a much finer grained set of steps.
> Yeah, it's hard to make it print out just the right logs without
> dumping so much stuff that it's hard to see the wood for the trees;
> perhaps if the Makefile had an option to dump relevant stuff for the
> specific tests that failed, or perhaps the buildfarm is already better
> at that and cfbot should just use the buildfarm client directly.  Hmm.
> Another idea would be to figure out how to make a tarball of all log
> files that you can download for inspection with better tools at home
> when things go wrong.  It would rapidly blow through the 1GB limit for
> stored "artefacts" on open source/community Cirrus accounts though, so
> we'd need to figure out how to manage retention carefully.


I did some thinking about this. How about if we have the make files and
the msvc build system create a well known file with the location(s) to
search for log files if there's an error. Each bit of testing could
overwrite this file before starting testing, and then tools like cfbot
would know where to look for files to report? To keep things clean for
other users the file would only be created if, say,
PG_NEED_ERROR_LOG_LOCATIONS is set. The well known location would be
something like "$(top_builddir)/error_log_locations.txt", and individual
Makefiles would have entries something like:,


    override ERROR_LOG_LOCATIONS =
    $(top_builddir)/contrib/test_decoding/tmp_check/log


If this seems like a good idea I can go and try to make that happen.


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Finding cause of test fails on the cfbot site

From

Andrew Dunstan

Date:

19 February 2021, 14:54:34

On 2/18/21 11:01 AM, Andrew Dunstan wrote:
> On 2/17/21 3:42 PM, Thomas Munro wrote:
>> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>>> Peter Smith <smithpb2250@gmail.com> writes:
>>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>>> few days ago.
>>>>> ...
>>>>> Is there any other detailed information available anywhere, e.g.
>>>>> logs?, which might help us work out what was the cause of the test
>>>>> failure?
>>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>>> However, this doesn't look that hard to reproduce locally ... have you
>>>> tried, using similar configure options to what that cfbot run did?
>>>> Once you did reproduce it, there'd be logs under
>>>> contrib/test_decoding/tmp_check/.
>>> yeah. The cfbot runs check-world which makes it difficult for it to know
>>> which log files to show when there's an error. That's a major part of
>>> the reason the buildfarm runs a much finer grained set of steps.
>> Yeah, it's hard to make it print out just the right logs without
>> dumping so much stuff that it's hard to see the wood for the trees;
>> perhaps if the Makefile had an option to dump relevant stuff for the
>> specific tests that failed, or perhaps the buildfarm is already better
>> at that and cfbot should just use the buildfarm client directly.  Hmm.
>> Another idea would be to figure out how to make a tarball of all log
>> files that you can download for inspection with better tools at home
>> when things go wrong.  It would rapidly blow through the 1GB limit for
>> stored "artefacts" on open source/community Cirrus accounts though, so
>> we'd need to figure out how to manage retention carefully.
>
> I did some thinking about this. How about if we have the make files and
> the msvc build system create a well known file with the location(s) to
> search for log files if there's an error. Each bit of testing could
> overwrite this file before starting testing, and then tools like cfbot
> would know where to look for files to report? To keep things clean for
> other users the file would only be created if, say,
> PG_NEED_ERROR_LOG_LOCATIONS is set. The well known location would be
> something like "$(top_builddir)/error_log_locations.txt", and individual
> Makefiles would have entries something like:,
>
>
>     override ERROR_LOG_LOCATIONS =
>     $(top_builddir)/contrib/test_decoding/tmp_check/log
>
>
> If this seems like a good idea I can go and try to make that happen.
>
>

here's a very small and simple (and possibly naive)  POC patch that
demonstrates this and seems to do the right thing.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

error_log_locations_POC.patch