Re: Improve error reporting in 027_stream_regress test - Mailing list pgsql-hackers

From Nazir Bilal Yavuz
Subject Re: Improve error reporting in 027_stream_regress test
Date
Msg-id CAN55FZ3oQmX0NLPg=VK7qT08+hP5gYo-rAYP9WZR1rCMSovdjw@mail.gmail.com
Whole thread Raw
In response to Re: Improve error reporting in 027_stream_regress test  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Improve error reporting in 027_stream_regress test
List pgsql-hackers
Hi,

Thanks for looking into this!

On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 01, 2025 at 10:57:11AM +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:
> >> One thing I don't yet like is that I think we should report if the primary is
> >> alive *before* reporting the diff and skipping reporting it if the primary
> >> crashed. It's not interesting to report a long diff if the server crashed IMO.
>
> Right.  That's just more noise one needs to dig into to find out
> what's wrong.
>
> > I agree with you. So, the current logic is:
> >
> > If primary is not alive: Do not report anything.
>
> Okay.
>
> > If only primary is alive: Report the entire diff file.
>
> Hmm.  Is that actually useful as we know that the standby has been
> stalen down when running the test?  Even if we report something, we
> could always trim the output, like the standby case.

I guess you are right. Also, I tried killing standby while the
027_stream_regress test was running and the test did not fail; instead
it continued until timeout. If that is always the case, then it makes
sense to report if and only if both primary and standby are alive.

> > If both primary and standby are alive: Report entire diff file and add
> > head+tail of diff to the failure message.
>
> +=item $node->is_alive()
> +
> +Check if the node is alive.
> +
> +=cut
> +
> +sub is_alive {
> +    my ($self) = @_;
> +
> +    my $host = $self->host;
> +    my $port = $self->port;
> +    my $null = File::Spec->devnull;
> +
> +    my $cmd = "pg_isready -h $host -p $port";
> +        return !system("$cmd >$null 2>&1");
> +}
>
> It's strange to re-add a dependency to system() while we have wrappers
> around it, like system_log() in Utils.pm.

system_log() logs which command is run. I actually do not want to log
the command because it needs to be run after the regression tests and
before the 'regression tests pass'. If I log the command it looks like
this:

# All 228 tests passed.
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
(21.939s) ok 2 - regression tests pass
(0.000s) ok 3 - primary is alive after the regression tests
(0.000s) ok 4 - standby is alive after the regression tests

I think it looks like these pg_isready commands run randomly.

> You may want to add a "local %ENV = $self->_get_env();"  to make sure
> that we use a command related to the build of the node initialized.

You are right, I will apply this.

> +sub regression_log_helper
> +{
> +       my ($diff_file, $lines_to_dump) = @_;
> +       my @lines;
> +
> +       open my $fh, '<', $diff_file or die "couldn't open file: $diff_file\n";
> +
> +       # Read all lines to process them below
> +       while (my $line = <$fh>)
> +       {
> +               push @lines, $line;
> +       }
> +       close $fh;
>
> We could use that as a routine of its own in Utils.pm?  It's not
> really something only for diff files, more something that we can use
> to trim the contents of a file, which could be the tail of a server
> log file as well..  It's always difficult to measure what a "good"
> number of lines is, but perhaps we could use an environment variable
> to make that configurable rather than enforce a policy of 50 lines
> because we think it's good enough?

I think this is a good idea. How about a function called file_trim()
with three arguments: the file name, a mode ('head' or 'tail'), and
the number of lines to trim from that end. This approach may require
reading the file more than once, but it is easy to use.

> The fact that we expect the primary and the standby to be alive once
> the tests are run and the addition of two tests related to them seems
> unrelated to the diff report improvements, so I'd suggest to do both
> things separately.

Correct, I will separate them.

--
Regards,
Nazir Bilal Yavuz
Microsoft



pgsql-hackers by date:

Previous
From: Nazir Bilal Yavuz
Date:
Subject: Re: Explicitly enable meson features in CI
Next
From: Álvaro Herrera
Date:
Subject: Re: Fix lwlock.c and wait_event_names.txt discrepancy