Thread: recovery test failures on hoverfly
I noticed that we are getting random failures [1][2][3] in the recovery test on hoverfly. The failures are in 022_crash_temp_files and 013_crash_restart. Both the tests failed due to same reason: ack Broken pipe: write( 13, 'SELECT 1' ) at /home/nm/src/build/IPC-Run-0.94/lib/IPC/Run/IO.pm line 558. It seems the error happens in both the tests when after issuing a KILL, we are trying to reconnect. Can we do anything for this? [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-11%2006%3A59%3A59 [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-06%2007%3A09%3A53 [3] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-06-05%2008%3A40%3A49 -- With Regards, Amit Kapila.
On Fri, Jun 11, 2021 at 05:38:34PM +0530, Amit Kapila wrote: > It seems the error happens in both the tests when after issuing a > KILL, we are trying to reconnect. Can we do anything for this? This is the same problem as c757a3da and 6d41dd0, where we write a query to a pipe but the kill, causing a failure, makes the test fail with a SIGPIPE in IPC::Run as a query is sent down to a pipe. I think that using SELECT 1 to test if the server has been restarted is a bit crazy. I would suggest to use instead a loop based on pg_isready. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Jun 11, 2021 at 05:38:34PM +0530, Amit Kapila wrote: >> It seems the error happens in both the tests when after issuing a >> KILL, we are trying to reconnect. Can we do anything for this? > This is the same problem as c757a3da and 6d41dd0, where we write a > query to a pipe but the kill, causing a failure, makes the test fail > with a SIGPIPE in IPC::Run as a query is sent down to a pipe. Indeed. > I think that using SELECT 1 to test if the server has been restarted > is a bit crazy. I would suggest to use instead a loop based on > pg_isready. The precedent of the previous fixes would seem to suggest seeing if we can replace 'SELECT 1' with "undef". Not sure if that'll work without annoying changes to poll_query_until, though. regards, tom lane
I wrote: > Michael Paquier <michael@paquier.xyz> writes: >> This is the same problem as c757a3da and 6d41dd0, where we write a >> query to a pipe but the kill, causing a failure, makes the test fail >> with a SIGPIPE in IPC::Run as a query is sent down to a pipe. > The precedent of the previous fixes would seem to suggest seeing if > we can replace 'SELECT 1' with "undef". Not sure if that'll work > without annoying changes to poll_query_until, though. I noticed that elver failed this same way today, so that got me annoyed enough to pursue a fix. Using "undef" as poll_query_until's input almost works, except it turns out that it fails to notice psql connection failures in that case! It is *only* looking at psql's stdout, not at either stderr or the exit status, which seems seriously bogus in its own right; not least because poll_query_until's own documentation claims it will continue waiting after an error, which is exactly what it's not doing. So I propose the attached. (I first tried to make it check $result == 0, but it seems there are a lot of cases where psql returns status 1 in these tests. That seems pretty bogus too, but probably beta is no time to change that behavior.) regards, tom lane diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 45d1636128..2027cbf43d 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -2140,8 +2140,10 @@ sub poll_query_until $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; chomp($stdout); + $stderr =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; + chomp($stderr); - if ($stdout eq $expected) + if ($stdout eq $expected && $stderr eq '') { return 1; } @@ -2154,8 +2156,6 @@ sub poll_query_until # The query result didn't change in 180 seconds. Give up. Print the # output from the last attempt, hopefully that's useful for debugging. - $stderr =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; - chomp($stderr); diag qq(poll_query_until timed out executing this query: $query expecting this output: diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl index e1c36abe97..868a50b33d 100644 --- a/src/test/recovery/t/013_crash_restart.pl +++ b/src/test/recovery/t/013_crash_restart.pl @@ -136,12 +136,8 @@ ok( pump_until( $monitor->finish; # Wait till server restarts -is( $node->poll_query_until( - 'postgres', - 'SELECT $$restarted after sigquit$$;', - 'restarted after sigquit'), - "1", - "reconnected after SIGQUIT"); +is($node->poll_query_until('postgres', undef, ''), + "1", "reconnected after SIGQUIT"); # restart psql processes, now that the crash cycle finished @@ -216,7 +212,7 @@ ok( pump_until( $monitor->finish; # Wait till server restarts -is($node->poll_query_until('postgres', 'SELECT 1', '1'), +is($node->poll_query_until('postgres', undef, ''), "1", "reconnected after SIGKILL"); # Make sure the committed rows survived, in-progress ones not diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl index b912f4b232..157ddba8cf 100644 --- a/src/test/recovery/t/022_crash_temp_files.pl +++ b/src/test/recovery/t/022_crash_temp_files.pl @@ -139,7 +139,7 @@ $killme->finish; $killme2->finish; # Wait till server restarts -$node->poll_query_until('postgres', 'SELECT 1', '1'); +$node->poll_query_until('postgres', undef, ''); # Check for temporary files is( $node->safe_psql( @@ -228,7 +228,7 @@ $killme->finish; $killme2->finish; # Wait till server restarts -$node->poll_query_until('postgres', 'SELECT 1', '1'); +$node->poll_query_until('postgres', undef, ''); # Check for temporary files -- should be there is( $node->safe_psql(
>> Michael Paquier <michael@paquier.xyz> writes: >>> This is the same problem as c757a3da and 6d41dd0, where we write a >>> query to a pipe but the kill, causing a failure, makes the test fail >>> with a SIGPIPE in IPC::Run as a query is sent down to a pipe. After checking the git logs, I realized that this failure is actually new since 11e9caff8: before that, poll_query_until passed the query on the command line not stdin, so it wasn't vulnerable to SIGPIPE. So that explains why we only recently started to see this. The fix I proposed seems to work fine in all branches, so I went ahead and pushed it. regards, tom lane
On 6/12/21 3:15 PM, Tom Lane wrote: >>> Michael Paquier <michael@paquier.xyz> writes: >>>> This is the same problem as c757a3da and 6d41dd0, where we write a >>>> query to a pipe but the kill, causing a failure, makes the test fail >>>> with a SIGPIPE in IPC::Run as a query is sent down to a pipe. > After checking the git logs, I realized that this failure is actually > new since 11e9caff8: before that, poll_query_until passed the query > on the command line not stdin, so it wasn't vulnerable to SIGPIPE. > So that explains why we only recently started to see this. > > The fix I proposed seems to work fine in all branches, so I went > ahead and pushed it. > > I'm a bit dubious about this. It doesn't seem more robust to insist that we pass undef in certain cases. If passing the SQL via stdin is fragile, as we also found to be the case with passing it via the command line, perhaps we should try passing it via a tmp file. Then there would presumably be no SIGPIPE. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I'm a bit dubious about this. It doesn't seem more robust to insist that > we pass undef in certain cases. True, it'd be nicer if that didn't matter; mainly because people will get it wrong in future. > If passing the SQL via stdin is fragile, > as we also found to be the case with passing it via the command line, > perhaps we should try passing it via a tmp file. Then there would > presumably be no SIGPIPE. Seems kind of inefficient. Maybe writing and reading a file would be a negligible cost compared to everything else involved, but I'm not sure. Another angle is that the SIGPIPE complaints aren't necessarily a bad thing: if psql doesn't read what we send, it's good to know about that. IMO the real problem is that the errors are so darn nonrepeatable. I wonder if there is a way to make them more reproducible? regards, tom lane
On 6/12/21 5:28 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I'm a bit dubious about this. It doesn't seem more robust to insist that >> we pass undef in certain cases. > True, it'd be nicer if that didn't matter; mainly because people > will get it wrong in future. Right, that's what I'm worried about. > >> If passing the SQL via stdin is fragile, >> as we also found to be the case with passing it via the command line, >> perhaps we should try passing it via a tmp file. Then there would >> presumably be no SIGPIPE. > Seems kind of inefficient. Maybe writing and reading a file would > be a negligible cost compared to everything else involved, but > I'm not sure. Well, in poll_query_until we would of course set up the file outside the loop. I suspect the cost would in fact be negligible. Note, too that the psql and safe_psql methods also pass the query via stdin. > > Another angle is that the SIGPIPE complaints aren't necessarily > a bad thing: if psql doesn't read what we send, it's good to > know about that. IMO the real problem is that the errors are > so darn nonrepeatable. I wonder if there is a way to make them > more reproducible? > > I don't know. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > Note, too that the psql and safe_psql methods also pass the query via stdin. Yeah. We need all of these to act the same, IMO. Recall that the previous patches that introduced the undef hack were changing callers of those routines, not poll_query_until. regards, tom lane