Thread: Intermittent stats test failures on buildfarm

From
Tom Lane
Date:
I just spent a tedious hour digging through the buildfarm results
to see what I could learn about the intermittent failures we're seeing
in the stats regression test, such as here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=ferret&dt=2005-05-29%2018:25:09
This is seen in both Check and InstallCheck steps.  A variant pathology
is seen here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=gerbil&dt=2005-07-22%2007:58:01
Notice that only the heap stats columns are wrong in this case, not the
index stats.  I think that this variant behavior may have been fixed by
this patch:

2005-07-23 20:33  tgl

    * src/backend/postmaster/pgstat.c: Fix some failures to initialize
    table entries induced by recent autovacuum integration.  Not clear
    this explains recent stats problems, but it's definitely wrong.

but it's not certain since nobody traced through the code to exhibit
why those uninitialized table entries would have led to this particular
visible symptom.  But with no occurrences of that behavior since the
patch went in, I suspect it's fixed.

What we are left with turns out to be multiple occurrences of the first
pathology on exactly three buildfarm members:
    ferret        Cygwin
    kudu          Solaris 9, x86
    dragonfly     Solaris 9, x86

There are no occurrences of the failure on the native-Windows machines,
nor on buzzard (Solaris 10, SPARC), nor on gerbil (Solaris 9, SPARC)
(though gerbil has one old occurrence of the second pathology, so maybe
that observation should be taken with a grain of salt).  And none
whatever on any other buildfarm member.

The same three machines are showing the failure in the 8.0 branch, too,
so it's not a recently-introduced issue.

And one thing more: kudu and dragonfly are actually the same machine,
same OS, different compilers.

So what to make of this?  Dunno, but it is clearly a very
platform-specific behavior.  Anyone see a connection between Cygwin
and Solaris?
        regards, tom lane


Re: Intermittent stats test failures on buildfarm

From
Kris Jurka
Date:

On Tue, 30 Aug 2005, Tom Lane wrote:

> What we are left with turns out to be multiple occurrences of the first
> pathology on exactly three buildfarm members:
>
>     ferret        Cygwin
>     kudu        Solaris 9, x86
>     dragonfly    Solaris 9, x86
>
> So what to make of this?  Dunno, but it is clearly a very
> platform-specific behavior.  Anyone see a connection between Cygwin
> and Solaris?
>

One thing to note about kudu and dragonfly is that they are running under 
vmware.  This, combined with cygwin's reputation, makes me suspect that 
the connection is that they are both struggling under load.  However,
canary (NetBSD 1.6 x86) is set up in the same fashion and has shown no such
failures.

I'm also in the process of moving, so I put this machine in a box last 
night and it won't be up and running for a week or two.  I do have very 
similar copies of the OS image running on other machines if you'd like me 
to test something specific.

Kris Jurka


Re: Intermittent stats test failures on buildfarm

From
Tom Lane
Date:
Kris Jurka <books@ejurka.com> writes:
> On Tue, 30 Aug 2005, Tom Lane wrote:
>> What we are left with turns out to be multiple occurrences of the first
>> pathology on exactly three buildfarm members:
>> 
>> ferret        Cygwin
>> kudu        Solaris 9, x86
>> dragonfly    Solaris 9, x86
>> 
>> So what to make of this?  Dunno, but it is clearly a very
>> platform-specific behavior.  Anyone see a connection between Cygwin
>> and Solaris?

> One thing to note about kudu and dragonfly is that they are running under 
> vmware.  This, combined with cygwin's reputation, makes me suspect that 
> the connection is that they are both struggling under load.  Although 
> canary (NetBSD 1.6 x86) is setup in the same fashion and has shown no such 
> failures.

Hmm.  One pretty obvious explanation of the failure is simply that the
machine is so loaded that the stats collector doesn't get to run for a
few seconds.  I had dismissed this idea because I figured the buildfarm
machine owners would schedule the tests to run at relatively low-load
times of day ... but maybe that's not true on these two machines?

We could try increasing the delay in the stats test, say from two
seconds to five.  If it is just a matter of load, that should result
in a very large drop in the frequency of the failure.
        regards, tom lane
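
The proposal above is to lengthen a fixed delay.  A different technique,
not the one suggested in the thread, is to poll for the expected state
until a deadline passes, so that a lightly loaded machine finishes quickly
and a heavily loaded one gets the full allowance.  The following is only a
minimal sketch of that idea; wait_until, its parameters, and the predicate
are hypothetical names for illustration and are not part of the regression
test.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: `predicate` stands in for a check such as
    "have the expected stats counters shown up yet?".
    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep briefly between checks instead of busy-waiting.
        time.sleep(interval)
```

With a five-second timeout, the worst case matches the longer fixed delay,
while the common case returns as soon as the condition holds.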


Re: Intermittent stats test failures on buildfarm

From
"Rocco Altier"
Date:
Also, kookaburra (AIX) has a problem with the stats test as well.

What is most puzzling to me is that it only happens with cc (not gcc).
And I can only get it to happen when the buildfarm runs from a cron job.
If I run it interactively, or run the build script from the command line,
the stats collector runs fine.

The environments under cron and on the command line are not significantly
different, so I am at a bit of a loss as to the reason why.

Any thoughts?
-rocco
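
One way to check how far the cron and command-line environments really
differ is to dump both (for example with `env` redirected to a file in each
context) and compare them key by key.  The sketch below assumes simple
NAME=value lines, one per variable; parse_env and diff_env are hypothetical
helper names, and values containing newlines are not handled.

```python
def parse_env(text):
    """Parse NAME=value lines (as printed by `env`) into a dict.

    Assumes one variable per line; multi-line values are not supported.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' sign
            env[name] = value
    return env


def diff_env(a, b):
    """Return (names only in a, names only in b, names whose values differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed
```

Variables such as PATH or locale settings are common places for the two
contexts to diverge, though nothing in this thread establishes that they
are the cause here.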


Re: Intermittent stats test failures on buildfarm

From
Tom Lane
Date:
"Rocco Altier" <RoccoA@Routescape.com> writes:
> Also, kookaburra (AIX) has a problem with the stats test as well.

kookaburra's problem is entirely different, not intermittent in the
least.  The error diff shows that stats collection is off, and its
postmaster log says

LOG:  could not bind socket for statistics collector: Permission denied
LOG:  disabling statistics collector for lack of working socket

I have no idea what's causing that --- the only reason I know of for
EACCES from bind() is trying to bind to a privileged port number, and
one hopes we're not doing that.
        regards, tom lane
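
To make the bind() failure mode concrete, here is a small sketch that tries
to bind a UDP socket and reports the symbolic errno on failure.  It assumes,
for illustration only, that the socket in question is a UDP socket on the
loopback address with a kernel-chosen port (port 0); try_bind_udp is a
hypothetical helper and is not taken from pgstat.c.

```python
import errno
import socket


def try_bind_udp(host, port):
    """Try to bind a UDP socket to (host, port).

    Returns (bound_port, None) on success, or (None, errno_name) on
    failure, e.g. (None, "EACCES").
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
        # With port 0 the kernel picks a free port; report which one.
        return sock.getsockname()[1], None
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
```

Calling try_bind_udp("127.0.0.1", 0) normally succeeds for any user.  On
Unix-like systems, an unprivileged process asking for a port below 1024
typically gets EACCES instead, which is the case described above.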


Re: Intermittent stats test failures on buildfarm

From
Andrew Dunstan
Date:

Tom Lane wrote:

>"Rocco Altier" <RoccoA@Routescape.com> writes:
>
>>Also, kookaburra (AIX) has a problem with the stats test as well.
>
>kookaburra's problem is entirely different, not intermittent in the
>least.  The error diff shows that stats collection is off, and its
>postmaster log says
>
>LOG:  could not bind socket for statistics collector: Permission denied
>LOG:  disabling statistics collector for lack of working socket
>
>I have no idea what's causing that --- the only reason I know of for
>EACCES from bind() is trying to bind to a privileged port number, and
>one hopes we're not doing that.

The other thing that's rather odd is that it's failing at the
installcheck stage, which means it just passed this same test moments
before at the check stage. Installcheck failures in buildfarm should
always be regarded suspiciously.

cheers

andrew