Thread: recovery test failures on hoverfly

recovery test failures on hoverfly

From

Amit Kapila

Date:

11 June 2021, 12:08:34

I noticed that we are getting random failures [1][2][3] in the
recovery test on hoverfly. The failures are in 022_crash_temp_files
and 013_crash_restart. Both the tests failed due to same reason:

ack Broken pipe: write( 13, 'SELECT 1' ) at
/home/nm/src/build/IPC-Run-0.94/lib/IPC/Run/IO.pm line 558.

It seems the error happens in both the tests when after issuing a
KILL, we are trying to reconnect. Can we do anything for this?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-11%2006%3A59%3A59
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-06%2007%3A09%3A53
[3] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-05%2008%3A40%3A49

-- 
With Regards,
Amit Kapila.

Re: recovery test failures on hoverfly

From

Michael Paquier

Date:

11 June 2021, 12:20:07

On Fri, Jun 11, 2021 at 05:38:34PM +0530, Amit Kapila wrote:
> It seems the error happens in both the tests when after issuing a
> KILL, we are trying to reconnect. Can we do anything for this?

This is the same problem as c757a3da and 6d41dd0, where we write a
query to a pipe but the kill, causing a failure, makes the test fail
with a SIGPIPE in IPC::Run as a query is sent down to a pipe.

I think that using SELECT 1 to test if the server has been restarted
is a bit crazy.  I would suggest to use instead a loop based on
pg_isready.
--
Michael

Attachment

signature.asc

Re: recovery test failures on hoverfly

From

Tom Lane

Date:

11 June 2021, 14:37:21

Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jun 11, 2021 at 05:38:34PM +0530, Amit Kapila wrote:
>> It seems the error happens in both the tests when after issuing a
>> KILL, we are trying to reconnect. Can we do anything for this?

> This is the same problem as c757a3da and 6d41dd0, where we write a
> query to a pipe but the kill, causing a failure, makes the test fail
> with a SIGPIPE in IPC::Run as a query is sent down to a pipe.

Indeed.

> I think that using SELECT 1 to test if the server has been restarted
> is a bit crazy.  I would suggest to use instead a loop based on
> pg_isready.

The precedent of the previous fixes would seem to suggest seeing if
we can replace 'SELECT 1' with "undef".  Not sure if that'll work
without annoying changes to poll_query_until, though.

            regards, tom lane

Re: recovery test failures on hoverfly

From

Tom Lane

Date:

11 June 2021, 22:28:41

I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> This is the same problem as c757a3da and 6d41dd0, where we write a
>> query to a pipe but the kill, causing a failure, makes the test fail
>> with a SIGPIPE in IPC::Run as a query is sent down to a pipe.

> The precedent of the previous fixes would seem to suggest seeing if
> we can replace 'SELECT 1' with "undef".  Not sure if that'll work
> without annoying changes to poll_query_until, though.

I noticed that elver failed this same way today, so that got me
annoyed enough to pursue a fix.  Using "undef" as poll_query_until's
input almost works, except it turns out that it fails to notice psql
connection failures in that case!  It is *only* looking at psql's
stdout, not at either stderr or the exit status, which seems seriously
bogus in its own right; not least because poll_query_until's own
documentation claims it will continue waiting after an error, which
is exactly what it's not doing.  So I propose the attached.

(I first tried to make it check $result == 0, but it seems there are a
lot of cases where psql returns status 1 in these tests.  That seems
pretty bogus too, but probably beta is no time to change that
behavior.)

            regards, tom lane

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 45d1636128..2027cbf43d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2140,8 +2140,10 @@ sub poll_query_until

         $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
         chomp($stdout);
+        $stderr =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
+        chomp($stderr);

-        if ($stdout eq $expected)
+        if ($stdout eq $expected && $stderr eq '')
         {
             return 1;
         }
@@ -2154,8 +2156,6 @@ sub poll_query_until

     # The query result didn't change in 180 seconds. Give up. Print the
     # output from the last attempt, hopefully that's useful for debugging.
-    $stderr =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
-    chomp($stderr);
     diag qq(poll_query_until timed out executing this query:
 $query
 expecting this output:
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index e1c36abe97..868a50b33d 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -136,12 +136,8 @@ ok( pump_until(
 $monitor->finish;

 # Wait till server restarts
-is( $node->poll_query_until(
-        'postgres',
-        'SELECT $$restarted after sigquit$$;',
-        'restarted after sigquit'),
-    "1",
-    "reconnected after SIGQUIT");
+is($node->poll_query_until('postgres', undef, ''),
+    "1", "reconnected after SIGQUIT");


 # restart psql processes, now that the crash cycle finished
@@ -216,7 +212,7 @@ ok( pump_until(
 $monitor->finish;

 # Wait till server restarts
-is($node->poll_query_until('postgres', 'SELECT 1', '1'),
+is($node->poll_query_until('postgres', undef, ''),
     "1", "reconnected after SIGKILL");

 # Make sure the committed rows survived, in-progress ones not
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index b912f4b232..157ddba8cf 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -139,7 +139,7 @@ $killme->finish;
 $killme2->finish;

 # Wait till server restarts
-$node->poll_query_until('postgres', 'SELECT 1', '1');
+$node->poll_query_until('postgres', undef, '');

 # Check for temporary files
 is( $node->safe_psql(
@@ -228,7 +228,7 @@ $killme->finish;
 $killme2->finish;

 # Wait till server restarts
-$node->poll_query_until('postgres', 'SELECT 1', '1');
+$node->poll_query_until('postgres', undef, '');

 # Check for temporary files -- should be there
 is( $node->safe_psql(

Re: recovery test failures on hoverfly

From

Tom Lane

Date:

12 June 2021, 19:15:11

>> Michael Paquier <michael@paquier.xyz> writes:
>>> This is the same problem as c757a3da and 6d41dd0, where we write a
>>> query to a pipe but the kill, causing a failure, makes the test fail
>>> with a SIGPIPE in IPC::Run as a query is sent down to a pipe.

After checking the git logs, I realized that this failure is actually
new since 11e9caff8: before that, poll_query_until passed the query
on the command line not stdin, so it wasn't vulnerable to SIGPIPE.
So that explains why we only recently started to see this.

The fix I proposed seems to work fine in all branches, so I went
ahead and pushed it.

            regards, tom lane

Re: recovery test failures on hoverfly

From

Andrew Dunstan

Date:

12 June 2021, 21:19:38

On 6/12/21 3:15 PM, Tom Lane wrote:
>>> Michael Paquier <michael@paquier.xyz> writes:
>>>> This is the same problem as c757a3da and 6d41dd0, where we write a
>>>> query to a pipe but the kill, causing a failure, makes the test fail
>>>> with a SIGPIPE in IPC::Run as a query is sent down to a pipe.
> After checking the git logs, I realized that this failure is actually
> new since 11e9caff8: before that, poll_query_until passed the query
> on the command line not stdin, so it wasn't vulnerable to SIGPIPE.
> So that explains why we only recently started to see this.
>
> The fix I proposed seems to work fine in all branches, so I went
> ahead and pushed it.
>
>             


I'm a bit dubious about this. It doesn't seem more robust to insist that
we pass undef in certain cases. If passing the SQL via stdin is fragile,
as we also found to be the case with passing it via the command line,
perhaps we should try passing it via a tmp file. Then there would
presumably be no SIGPIPE.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: recovery test failures on hoverfly

From

Tom Lane

Date:

12 June 2021, 21:28:19

Andrew Dunstan <andrew@dunslane.net> writes:
> I'm a bit dubious about this. It doesn't seem more robust to insist that
> we pass undef in certain cases.

True, it'd be nicer if that didn't matter; mainly because people
will get it wrong in future.

> If passing the SQL via stdin is fragile,
> as we also found to be the case with passing it via the command line,
> perhaps we should try passing it via a tmp file. Then there would
> presumably be no SIGPIPE.

Seems kind of inefficient.  Maybe writing and reading a file would
be a negligible cost compared to everything else involved, but
I'm not sure.

Another angle is that the SIGPIPE complaints aren't necessarily
a bad thing: if psql doesn't read what we send, it's good to
know about that.  IMO the real problem is that the errors are
so darn nonrepeatable.  I wonder if there is a way to make them
more reproducible?

            regards, tom lane

Re: recovery test failures on hoverfly

From

Andrew Dunstan

Date:

12 June 2021, 21:50:46

On 6/12/21 5:28 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I'm a bit dubious about this. It doesn't seem more robust to insist that
>> we pass undef in certain cases.
> True, it'd be nicer if that didn't matter; mainly because people
> will get it wrong in future.


Right, that's what I'm worried about.


>
>> If passing the SQL via stdin is fragile,
>> as we also found to be the case with passing it via the command line,
>> perhaps we should try passing it via a tmp file. Then there would
>> presumably be no SIGPIPE.
> Seems kind of inefficient.  Maybe writing and reading a file would
> be a negligible cost compared to everything else involved, but
> I'm not sure.


Well, in poll_query_until we would of course set up the file outside the
loop. I suspect the cost would in fact be negligible.


Note, too that the psql and safe_psql methods also pass the query via stdin.


>
> Another angle is that the SIGPIPE complaints aren't necessarily
> a bad thing: if psql doesn't read what we send, it's good to
> know about that.  IMO the real problem is that the errors are
> so darn nonrepeatable.  I wonder if there is a way to make them
> more reproducible?
>
>             


I don't know.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: recovery test failures on hoverfly

From

Tom Lane

Date:

12 June 2021, 21:57:48

Andrew Dunstan <andrew@dunslane.net> writes:
> Note, too that the psql and safe_psql methods also pass the query via stdin.

Yeah.  We need all of these to act the same, IMO.  Recall that
the previous patches that introduced the undef hack were changing
callers of those routines, not poll_query_until.

            regards, tom lane