Thread: Inadequate failure reporting from poll_query_until

Inadequate failure reporting from poll_query_until

From

Tom Lane

Date:

15 October 2018, 23:19:16

I just spent some time puzzling over why we've seen so many recoveryCheck
failures in the buildfarm recently --- excluding failures with known
causes, such as the pg_verify_checksums business, that's basically all
we've seen for the last week.  I am suspicious that there's some
timing-dependent problems in WAL streaming startup and shutdown, but
can't prove anything yet.  However, one case catches my eye:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2018-10-15%2016%3A01%3A59

This is the sum total of the visible logged information:

    #
    not ok 2 - asterisk in synchronous_standby_names

    #   Failed test 'asterisk in synchronous_standby_names'
    #   at t/007_sync_rep.pl line 26.

The failing test is evidently this one:

    test_sync_state(
            $node_master, qq(standby1|1|sync
    standby2|1|potential
    standby3|1|potential),
            'asterisk in synchronous_standby_names',
            '*');

and when you drill down, you find that PostgresNode.pm's poll_query_until
must have timed out, and that blank line there is as much as it felt like
telling us about it.  Evidently what we were getting back from the server
did not match the expected output, but good luck guessing what it was.

I think we ought to fix poll_query_until so that upon failure, you get
a much more complete and intelligible report, along the lines of

poll_query_until timed out executing this query:
$query
expecting this output:
$expected
last actual query output:
$stdout
with stderr:
$stderr

Objections?  Thoughts about better formatting?

            regards, tom lane

Re: Inadequate failure reporting from poll_query_until

From

Andrew Dunstan

Date:

15 October 2018, 23:23:29


On 10/15/2018 04:19 PM, Tom Lane wrote:
> I just spent some time puzzling over why we've seen so many recoveryCheck
> failures in the buildfarm recently --- excluding failures with known
> causes, such as the pg_verify_checksums business, that's basically all
> we've seen for the last week.  I am suspicious that there's some
> timing-dependent problems in WAL streaming startup and shutdown, but
> can't prove anything yet.  However, one case catches my eye:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2018-10-15%2016%3A01%3A59
>
> This is the sum total of the visible logged information:
>
>      #
>      not ok 2 - asterisk in synchronous_standby_names
>      
>      #   Failed test 'asterisk in synchronous_standby_names'
>      #   at t/007_sync_rep.pl line 26.
>
> The failing test is evidently this one:
>
>      test_sync_state(
>              $node_master, qq(standby1|1|sync
>      standby2|1|potential
>      standby3|1|potential),
>              'asterisk in synchronous_standby_names',
>              '*');
>
> and when you drill down, you find that PostgresNode.pm's poll_query_until
> must have timed out, and that blank line there is as much as it felt like
> telling us about it.  Evidently what we were getting back from the server
> did not match the expected output, but good luck guessing what it was.
>
> I think we ought to fix poll_query_until so that upon failure, you get
> a much more complete and intelligible report, along the lines of
>
> poll_query_until timed out executing this query:
> $query
> expecting this output:
> $expected
> last actual query output:
> $stdout
> with stderr:
> $stderr
>
> Objections?  Thoughts about better formatting?


WFM.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services