On Thu, Jun 18, 2015 at 3:52 PM, Michael Paquier wrote:
> I think that it would be useful as well to improve the buildfarm
> output. Thoughts?
After running the tests 6 or 7 times in a row on a Pi, I was able to
trigger the problem, and I think that I have found its origin. First,
the error was triggered by the tests of pg_rewind:
t/002_databases.pl ...
1..4
Bailout called. Further testing stopped: run pg_ctl failed: 256
Bail out! run pg_ctl failed: 256
FAILED--Further testing stopped: run pg_ctl failed: 256
Makefile:51: recipe for target 'check' failed
make[1]: *** [check] Error 255
Looking at the logs obtained thanks to the previous patch, I could see
the following (logs attached for tests 1 and 2):
$ tail -n5 regress_log/regress_log_002_databases
waiting for server to start........ stopped waiting
pg_ctl: could not start server
Examine the log output.
LOG: received immediate shutdown request
LOG: received immediate shutdown request
pg_ctl should be able to start the server and should not fail here.
This is confirmed by the fact that the first test has not stopped the
servers. On a clean run, the immediate shutdown request is received
and completed:
waiting for server to shut down....LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
done
But in the case of the failure this does not happen:
LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
LOG: received immediate shutdown request
Note that the "done" is missing here.
Now if we look at RewindTest.pm, there is the following code:
    if ($test_master_datadir)
    {
        system
          "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
    }
    if ($test_standby_datadir)
    {
        system
          "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
    }
I think that the problem is triggered because we are missing a -w
switch here, meaning that we do not wait for confirmation that the
server has stopped. Apparently, if the stop is slow enough, the next
server cannot start because its port is still held by the server that
is shutting down.
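To make the intent concrete, here is a minimal shell sketch of a stop
call that passes -w explicitly. It is not the attached patch: the
stop_cluster helper and the PG_CTL override are names made up for
illustration, and the other switches are copied from the RewindTest.pm
snippet above.

```shell
#!/bin/sh
# Hypothetical helper, not part of RewindTest.pm or the attached patch.
# PG_CTL may be overridden for illustration; it defaults to pg_ctl in PATH.
PG_CTL=${PG_CTL:-pg_ctl}

stop_cluster() {
    datadir=$1
    # Nothing to do when no data directory was set up.
    [ -n "$datadir" ] || return 0
    # -w asks pg_ctl to wait for the shutdown to complete before
    # returning; the other switches match the existing test code.
    "$PG_CTL" -D "$datadir" -s -w -m immediate stop 2> /dev/null
}
```

Calling stop_cluster for the master and then for the standby data
directory would mirror the two blocks above, with each call returning
only once pg_ctl reports that the stop is done.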
Note as well that the last pg_ctl stop command in
pg_ctl/t/002_status.pl does not use -w either, so we have the same
problem there.
Attached is a patch fixing those problems and improving the log
facility, which really helped me track down these issues. The simplest
fix, though, would be to just add the missing -w switch to the tests
of pg_rewind and pg_ctl.
It would be good to get that fixed, as I would then be able to
re-enable the TAP tests on hamster. I have run the tests a dozen more
times with this patch, and I could not trigger the failure anymore.
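For anyone wanting to reproduce this kind of intermittent failure, the
repeated runs amount to a loop along these lines. This is only a
sketch: the repeat_until_failure helper name is made up here, and the
command passed to it is whatever runs the test suite (the failure
above came from the 'check' make target).

```shell
#!/bin/sh
# Hypothetical helper: run a command up to N times and stop at the
# first failure, reporting which iteration failed.
repeat_until_failure() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@"; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $runs runs"
    return 0
}
```

Usage would be something like "repeat_until_failure 12 make check",
run from the directory of the tests in question.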
Regards,
--
Michael