Thread: The real reason why TAP testing isn't ready for prime time
Buildfarm member hamster has failed a pretty significant fraction of its recent runs in the BinInstallCheck step:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD

Since other critters aren't equally distressed, it seems likely that this is just an out-of-disk-space type of problem. But maybe it's trying to tell us that there's a genuine platform-specific bug there. In any case, I challenge anyone to figure out what's happening from the information available from the buildfarm logs.

I don't know whether this is just that the buildfarm script isn't collecting data that it should be. But my experiences with the TAP test scripts haven't been very positive. When they fail, it takes a lot of digging to find out why. Basically, that entire mechanism sucks as far as debuggability is concerned.

I think there is a good argument for turning this off in the buildfarm until there is a better way of identifying and solving problems. It is not helping us that hamster is red half the time for undiscoverable reasons. That just conditions people to ignore it, and it may well be masking real problems that the machine could be finding if it weren't failing at this step.

regards, tom lane
On Mon, Jun 15, 2015 at 3:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm member hamster has failed a pretty significant fraction of
> its recent runs in the BinInstallCheck step:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
>
> Since other critters aren't equally distressed, it seems likely that
> this is just an out-of-disk-space type of problem. But maybe it's
> trying to tell us that there's a genuine platform-specific bug there.
> In any case, I challenge anyone to figure out what's happening from
> the information available from the buildfarm logs.
>
> I don't know whether this is just that the buildfarm script isn't
> collecting data that it should be. But my experiences with the
> TAP test scripts haven't been very positive. When they fail, it
> takes a lot of digging to find out why. Basically, that entire
> mechanism sucks as far as debuggability is concerned.

Indeed. I think that one step in the right direction would be to replace all the calls to system and system_or_bail with a wrapper routine that calls IPC::Run, able to catch the logs and store them in each test's base path. The same applies to the pg_rewind tests.

> I think there is a good argument for turning this off in the buildfarm
> until there is a better way of identifying and solving problems. It is
> not helping us that hamster is red half the time for undiscoverable
> reasons. That just conditions people to ignore it, and it may well be
> masking real problems that the machine could be finding if it weren't
> failing at this step.

hamster is legendarily slow and has a slow disk, which improves the chances of catching race conditions, and it is the only slow buildfarm machine enabling the TAP tests (by comparison, dangomushi has never failed the TAP tests), hence I would prefer to think that the problem is not specific to ArchLinux ARM.
In this case the failure seems to be related to the timing with which the test servers stop and start, even when the -w switch is used with pg_ctl, particularly since PGPORT is set to the same value for all servers... Still, for the time being I don't mind disabling them, and I just did so now. I will try to investigate further on the machine itself.
--
Michael
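The wrapper routine Michael proposes above, replacing calls to system and system_or_bail with something that captures all command output into a per-test log file, could be sketched roughly like this in core Perl. (This is only a sketch of the idea; the routine name, log path, and use of shell redirection instead of IPC::Run are assumptions made here for illustration.)

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Sketch of a logging wrapper for test commands: instead of silencing a
# command or sending its output to /dev/null, append both stdout and stderr
# to a per-test log file that can be inspected after a buildfarm failure.
# (Hypothetical names; the real wrapper discussed here is built on IPC::Run.)
sub run_logged
{
	my ($log_path, @cmd) = @_;

	# Let the shell handle the redirection; true is returned on exit status 0.
	my $status = system("@cmd >> $log_path 2>&1");
	return $status == 0;
}

# Example usage: the command's output lands in the log, not on the terminal.
make_path('tmp_check/log');
run_logged('tmp_check/log/regress_log_example', 'echo', 'server starting')
    or die "command failed";
```

The point is simply that nothing a test runs should be able to fail invisibly: every child process's stdout and stderr ends up somewhere a buildfarm client can collect.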
On Mon, Jun 15, 2015 at 8:26 AM, Michael Paquier wrote:
> hamster is legendarily slow and has a slow disk, which improves the
> chances of catching race conditions, and it is the only slow buildfarm
> machine enabling the TAP tests (by comparison, dangomushi has never
> failed the TAP tests), hence I would prefer to think that the problem
> is not specific to ArchLinux ARM. In this case the failure seems to be
> related to the timing with which the test servers stop and start, even
> when the -w switch is used with pg_ctl, particularly since PGPORT is
> set to the same value for all servers... Still, for the time being I
> don't mind disabling them, and I just did so now. I will try to
> investigate further on the machine itself.

So, I have been making some progress on this thing, and wrote the attached patch, which improves the way the TAP tests log their output using IPC::Run::run. This simply creates a log file as regress_log/test_name for each test run, capturing all the command outputs. For example, on current HEAD some pg_ctl commands are simply silenced or have their output redirected to /dev/null; with this patch, all their output is redirected to the log file. This is also compatible with Windows, and the logging of the pg_rewind tests is unified with the new structure of TestLib.pm.

I am going to apply that to hamster and see what is causing the error reported. I think that it would be useful as well to improve the buildfarm output. Thoughts?

Regards,
--
Michael
Attachment
On Thu, Jun 18, 2015 at 3:52 PM, Michael Paquier wrote:
> I think that it would be useful as well to improve the buildfarm
> output. Thoughts?

And after running the tests more or less 6~7 times in a row on a Pi, I have been able to trigger the problem, and I think that I have found its origin. First, the error has been triggered by the tests of pg_rewind:

t/002_databases.pl ... 1..4
Bailout called.  Further testing stopped:  run pg_ctl failed: 256
Bail out!  run pg_ctl failed: 256
FAILED--Further testing stopped: run pg_ctl failed: 256
Makefile:51: recipe for target 'check' failed
make[1]: *** [check] Error 255

And by looking at the logs obtained thanks to the previous patch, I could see the following (log attached for tests 1 and 2):

$ tail -n5 regress_log/regress_log_002_databases
waiting for server to start........ stopped waiting
pg_ctl: could not start server
Examine the log output.
LOG: received immediate shutdown request
LOG: received immediate shutdown request

pg_ctl should be able to start the server and should not fail here. This is confirmed by the fact that the first test has not stopped the servers. On a clean run, the immediate shutdown request is received and completed:

waiting for server to shut down....LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
done

But in the case of the failure this does not happen:

LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
LOG: received immediate shutdown request

See, the "done" is not there.
Now if we look at RewindTest.pm, there is the following code:

    if ($test_master_datadir)
    {
        system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
    }
    if ($test_standby_datadir)
    {
        system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
    }

And I think that the problem is triggered because we are missing a -w switch here, meaning that we do not wait for confirmation that the server has stopped; and visibly, if the stop is slow enough, the next server cannot start because the port is still taken by the server currently stopping. Note as well that the last pg_ctl stop command in pg_ctl/t/002_status.pl does not use -w, so we have the same problem there.

Attached is a patch fixing those problems and improving the logging facility, as it really helped me out with those issues. The simplest fix would be to just add the missing -w switch to the tests of pg_rewind and pg_ctl, though. It would be good to get that fixed; then I would be able to re-enable the TAP tests on hamster. I have run the tests a dozen times again with this patch, and I could not trigger the failure anymore.

Regards,
--
Michael
Attachment
Michael Paquier <michael.paquier@gmail.com> writes:
> Attached is a patch fixing those problems and improving the log
> facility as it really helped me out with those issues. The simplest
> fix would be to include the -w switch missing in the tests of
> pg_rewind and pg_ctl though.

I agree with adding the -w switch and will go do that. As far as the rest of this patch goes, it seems like it could be made less invasive if the logs got dumped into a subdirectory of tmp_check rather than adding another top-level directory that has to be cleaned?

regards, tom lane
Michael Paquier <michael.paquier@gmail.com> writes:
> Now if we look at RewindTest.pm, there is the following code:
>
>     if ($test_master_datadir)
>     {
>         system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
>     }
>     if ($test_standby_datadir)
>     {
>         system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
>     }
>
> And I think that the problem is triggered because we are missing a -w
> switch here, meaning that we do not wait until the confirmation that
> the server has stopped, and visibly if stop is slow enough the next
> server to use cannot start because the port is already taken by the
> server currently stopping.

After I woke up a bit more, I remembered that -w is already the default for "pg_ctl stop", so your diagnosis here is incorrect.

I suspect that the real problem is the arbitrary decision to use -m immediate. The postmaster would ordinarily wait for its children to die, but on a slow machine we could perhaps reach the end of that 5-second timeout, whereupon the postmaster would SIGKILL its children *and exit immediately*. I'm not sure how instantaneous SIGKILL is, but it seems possible that we could end up trying to start the new postmaster before all the children of the old one are dead. If the shmem interlock is working properly, that ought to fail.

I wonder whether it's such a good idea for the postmaster to give up waiting before all children are gone (postmaster.c:1722 in HEAD).

regards, tom lane
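Tom's point about SIGKILL not being instantaneous follows from the signal API itself: kill() returns as soon as the signal is queued, not once the target has died, so a parent that exits immediately after SIGKILLing its children has no guarantee they are gone. A minimal Perl illustration of that window (this is only a standalone demonstration of the race, not postmaster code):

```perl
use strict;
use warnings;

# kill() returns as soon as SIGKILL is *queued*, not once the target has
# actually exited; only waitpid() tells us the child is really gone.  This
# is the window in which a parent that exits right after SIGKILLing its
# children can race against whoever starts a replacement process.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0)
{
	sleep 60;    # child just lingers until it is killed
	exit 0;
}

kill('KILL', $pid) or die "kill failed: $!";
# The child may still exist at this instant; reap it to be sure it is dead.
my $reaped = waitpid($pid, 0);
print "reaped child $reaped\n";
```

Dropping the waitpid() here is the moral equivalent of the postmaster exiting before its children are gone: the parent has no idea whether the SIGKILL has actually taken effect yet.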
On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether it's such a good idea for the postmaster to give
> up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wonder whether it's such a good idea for the postmaster to give
> > up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
> I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after stopping it. I don't see the benefit in not waiting - sure, the postmaster exits faster, but postgres hasn't shut down at that point...
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether it's such a good idea for the postmaster to give
>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
>> I doubt it.
>
> Seconded. It's pretty bad to possibly not be able to start again after
> stopping it. I don't see the benefit in not waiting - sure, the
> postmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix hamster's issue though.

regards, tom lane
Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
> >> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> I wonder whether it's such a good idea for the postmaster to give
> >>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
> >> I doubt it.
>
> > Seconded. It's pretty bad to possibly not be able to start again after
> > stopping it. I don't see the benefit in not waiting - sure, the
> > postmaster exits faster, but postgres hasn't shut down at that point...
>
> Yeah. I'll see about fixing this. Hard to be sure if it will fix
> hamster's issue though.

We discussed this when that patch got in (82233ce7ea42d6b). The reason for not waiting, it was argued, is that the most likely reason for those processes not to have already gone away by the time we send SIGKILL is that they are stuck somewhere in the kernel, and so we might not be able to actually get them to go away with the SIGKILL. As I recall, that was the actual problem that MauMau was trying to get fixed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> for not waiting, it was argued, is that the most likely reason for those
> processes not to have already gone away by the time we send SIGKILL was
> that they are stuck somewhere in the kernel, and so we might not be able
> to actually get them to go away with the SIGKILL. As I recall, that was
> the actual problem that MauMau was trying to get fixed.

How does exiting before they're dead help? They're still going to be attached to shared memory and thus prevent a restart. I don't think hiding the problem by exiting the postmaster helps at all.

I don't think it's our job to protect against processes stuck in the kernel. If that happens, something has gone seriously wrong, and we shouldn't just continue without making that visible.
Alvaro Herrera wrote:
> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> for not waiting, it was argued, is that the most likely reason for those
> processes not to have already gone away by the time we send SIGKILL was
> that they are stuck somewhere in the kernel, and so we might not be able
> to actually get them to go away with the SIGKILL. As I recall, that was
> the actual problem that MauMau was trying to get fixed.

Hm, I reviewed the discussion, and actually I'm wrong: I came up with the unkillable-process argument on my own. MauMau's problem was a process that deadlocked because it tried to malloc() on its way out. I still think that unkillable processes are an issue, but perhaps it's sane to have the postmaster wait for some time instead of just going away immediately.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
>> We discussed this when that patch got in (82233ce7ea42d6b). The reason
>> for not waiting, it was argued, is that the most likely reason for those
>> processes not to have already gone away by the time we send SIGKILL was
>> that they are stuck somewhere in the kernel, and so we might not be able
>> to actually get them to go away with the SIGKILL. As I recall, that was
>> the actual problem that MauMau was trying to get fixed.
>
> How does exiting before they're dead help? They're still going to be
> attached to shared memory and thus prevent a restart. I don't think
> hiding the problem by exiting the postmaster helps at all.

My thought exactly.

regards, tom lane
Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
> >> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> >> for not waiting, it was argued, is that the most likely reason for those
> >> processes not to have already gone away by the time we send SIGKILL was
> >> that they are stuck somewhere in the kernel, and so we might not be able
> >> to actually get them to go away with the SIGKILL. As I recall, that was
> >> the actual problem that MauMau was trying to get fixed.
>
> > How does exiting before they're dead help? They're still going to be
> > attached to shared memory and thus prevent a restart. I don't think
> > hiding the problem by exiting the postmaster helps at all.
>
> My thought exactly.

I guess you have a point there.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 20, 2015 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Now if we look at RewindTest.pm, there is the following code:
>>
>>     if ($test_master_datadir)
>>     {
>>         system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
>>     }
>>     if ($test_standby_datadir)
>>     {
>>         system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
>>     }
>>
>> And I think that the problem is triggered because we are missing a -w
>> switch here, meaning that we do not wait until the confirmation that
>> the server has stopped, and visibly if stop is slow enough the next
>> server to use cannot start because the port is already taken by the
>> server currently stopping.
>
> After I woke up a bit more, I remembered that -w is already the default
> for "pg_ctl stop", so your diagnosis here is incorrect.

Ah, right. I forgot that. Perhaps I just got lucky in my runs.

> I suspect that the real problem is the arbitrary decision to use -m
> immediate. The postmaster would ordinarily wait for its children to
> die, but on a slow machine we could perhaps reach the end of that
> 5-second timeout, whereupon the postmaster would SIGKILL its children
> *and exit immediately*. I'm not sure how instantaneous SIGKILL is,
> but it seems possible that we could end up trying to start the new
> postmaster before all the children of the old one are dead. If the
> shmem interlock is working properly that ought to fail.
>
> I wonder whether it's such a good idea for the postmaster to give
> up waiting before all children are gone (postmaster.c:1722 in HEAD).

I don't think so either.
--
Michael
On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I wonder whether it's such a good idea for the postmaster to give
>>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
>>> I doubt it.
>
>> Seconded. It's pretty bad to possibly not be able to start again after
>> stopping it. I don't see the benefit in not waiting - sure, the
>> postmaster exits faster, but postgres hasn't shut down at that point...
>
> Yeah. I'll see about fixing this. Hard to be sure if it will fix
> hamster's issue though.

I have re-enabled the TAP tests on hamster. What happens in the next couple of days will be interesting to see.
--
Michael
On Fri, Jun 19, 2015 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Attached is a patch fixing those problems and improving the log
>> facility as it really helped me out with those issues. The simplest
>> fix would be to include the -w switch missing in the tests of
>> pg_rewind and pg_ctl though.
>
> I agree with adding the -w switch and will go do that. As far as the
> rest of this patch goes, it seems like it could be made less invasive
> if the logs got dumped into a subdirectory of tmp_check rather than
> adding another top-level directory that has to be cleaned?

tmp_check has remained unused since dcae5fac (am I missing something, perhaps?), but I guess we could use it again to save the logs, and have pg_rewind do the same.
--
Michael
On Sat, Jun 20, 2015 at 7:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> As far as the
>> rest of this patch goes, it seems like it could be made less invasive
>> if the logs got dumped into a subdirectory of tmp_check rather than
>> adding another top-level directory that has to be cleaned?
>
> tmp_check remains unused since dcae5fac (missing something perhaps?),
> but I guess we could use it again to save the logs in it, and have
> pg_rewind do the same.

OK, so here is an updated patch. All the logs are saved in tmp_check/log, and each log file is named after the file that launched the test, to facilitate its identification. I noticed at the same time that some commands were called in silent mode; I removed that, and their output is now written to the log files as well.

Regards,
--
Michael
Attachment
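The naming scheme Michael describes — one log per test script under tmp_check/log, named after the script that produced it — might look something like this (a sketch only; the helper name, the regress_log_ prefix, and the directory layout are assumptions for illustration):

```perl
use strict;
use warnings;
use File::Basename;

# Build a log file path from a test script's own file name, so each log in
# tmp_check/log is trivially matched to the test that produced it.
# (Sketch only; prefix and directory layout are assumptions.)
sub log_path_for
{
	my ($script) = @_;
	my $test_name = basename($script, '.pl');    # strip directory and .pl
	return "tmp_check/log/regress_log_$test_name";
}

# A test script would typically call log_path_for($0).
print log_path_for('t/002_databases.pl'), "\n";
# prints "tmp_check/log/regress_log_002_databases"
```

Deriving the name from the script itself means the mapping from a red buildfarm step to the relevant log needs no extra bookkeeping.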
On Sat, Jun 20, 2015 at 6:53 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>>>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> I wonder whether it's such a good idea for the postmaster to give
>>>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>>
>>>> I doubt it.
>>
>>> Seconded. It's pretty bad to possibly not be able to start again after
>>> stopping it. I don't see the benefit in not waiting - sure, the
>>> postmaster exits faster, but postgres hasn't shut down at that point...
>>
>> Yeah. I'll see about fixing this. Hard to be sure if it will fix
>> hamster's issue though.
>
> I have re-enabled the TAP tests on hamster. What happens in the next
> couple of days will be interesting to see.

And we get a failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2015-06-20%2017%3A59%3A01

I am not sure why buildfarm runs make it more easily reproducible; one of the reasons may be that the perl scripts run in the background.
--
Michael
On Sun, Jun 21, 2015 at 7:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> And we get a failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2015-06-20%2017%3A59%3A01
> I am not sure why buildfarm runs make it more easily reproducible;
> one of the reasons may be that the perl scripts run in the background.

hamster has not complained for a couple of weeks now, and the issue used to be reproducible every 4~6 days:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
Hence let's consider the issue resolved.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> hamster has not complained for a couple of weeks now, and the issue
> was reproducible every 4~6 days:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
> Hence let's consider the issue as resolved.

Nah, I'm afraid not. We are definitely still seeing a lot of unexplained pg_ctl failures; as a recent example:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-07-29%2018%3A00%3A19

Now maybe that's not the same thing as hamster's issue, but there's still something funny going on. That's why I've been bitching about the lack of debug information. I hope that we can make progress on understanding this once Andrew pushes out the new buildfarm release (especially since some of his machines are particularly prone to this type of failure, so I trust he'll update those critters promptly).

regards, tom lane
On 07/29/2015 11:17 PM, Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> hamster has not complained for a couple of weeks now, and the issue
>> was reproducible every 4~6 days:
>> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
>> Hence let's consider the issue as resolved.
>
> Nah, I'm afraid not. We are definitely still seeing a lot of unexplained
> pg_ctl failures; as a recent example,
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-07-29%2018%3A00%3A19
>
> Now maybe that's not the same thing as hamster's issue, but there's still
> something funny going on. That's why I've been bitching about the lack of
> debug information. I hope that we can make progress on understanding this
> once Andrew pushes out the new buildfarm release (especially since some of
> his machines are particularly prone to this type of failure, so I trust
> he'll update those critters promptly).

I want to deal with running the tests on MSVC first. Meanwhile I'll drop the latest run_build.pl on axolotl. Anyone who wants to be on the bleeding edge can grab it from
<https://raw.githubusercontent.com/PGBuildFarm/client-code/master/run_build.pl>

cheers

andrew