Thread: Intermittent stats test failures on buildfarm
I just spent a tedious hour digging through the buildfarm results to see what I could learn about the intermittent failures we're seeing in the stats regression test, such as here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=ferret&dt=2005-05-29%2018:25:09
This is seen in both Check and InstallCheck steps. A variant pathology is seen here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=gerbil&dt=2005-07-22%2007:58:01
Notice that only the heap stats columns are wrong in this case, not the index stats. I think that this variant behavior may have been fixed by this patch:

2005-07-23 20:33  tgl

    * src/backend/postmaster/pgstat.c: Fix some failures to initialize
    table entries induced by recent autovacuum integration. Not clear
    this explains recent stats problems, but it's definitely wrong.

but it's not certain since nobody traced through the code to exhibit why those uninitialized table entries would have led to this particular visible symptom. But with no occurrences of that behavior since the patch went in, I suspect it's fixed.

What we are left with turns out to be multiple occurrences of the first pathology on exactly three buildfarm members:

    ferret      Cygwin
    kudu        Solaris 9, x86
    dragonfly   Solaris 9, x86

There are no occurrences of the failure on the native-Windows machines, nor on buzzard (Solaris 10, SPARC), nor on gerbil (Solaris 9, SPARC) (though gerbil has one old occurrence of the second pathology, so maybe that observation should be taken with a grain of salt). And none whatever on any other buildfarm member.

The same three machines are showing the failure in the 8.0 branch, too, so it's not a recently-introduced issue.

And one thing more: kudu and dragonfly are actually the same machine, same OS, different compilers.

So what to make of this? Dunno, but it is clearly a very platform-specific behavior. Anyone see a connection between Cygwin and Solaris?

regards, tom lane
On Tue, 30 Aug 2005, Tom Lane wrote:

> What we are left with turns out to be multiple occurrences of the first
> pathology on exactly three buildfarm members:
>
> ferret      Cygwin
> kudu        Solaris 9, x86
> dragonfly   Solaris 9, x86
>
> So what to make of this? Dunno, but it is clearly a very
> platform-specific behavior. Anyone see a connection between Cygwin
> and Solaris?

One thing to note about kudu and dragonfly is that they are running under vmware. This, combined with cygwin's reputation, makes me suspect that the connection is that they are both struggling under load. That said, canary (NetBSD 1.6 x86) is set up in the same fashion and has shown no such failures.

I'm also in the process of moving, so I put this machine in a box last night and it won't be up and running for a week or two. I do have very similar copies of the OS image running on other machines if you'd like me to test something specific.

Kris Jurka
Kris Jurka <books@ejurka.com> writes:
> On Tue, 30 Aug 2005, Tom Lane wrote:
>> What we are left with turns out to be multiple occurrences of the first
>> pathology on exactly three buildfarm members:
>>
>> ferret      Cygwin
>> kudu        Solaris 9, x86
>> dragonfly   Solaris 9, x86
>>
>> So what to make of this? Dunno, but it is clearly a very
>> platform-specific behavior. Anyone see a connection between Cygwin
>> and Solaris?

> One thing to note about kudu and dragonfly is that they are running under
> vmware. This, combined with cygwin's reputation, makes me suspect that
> the connection is that they are both struggling under load. Although
> canary (NetBSD 1.6 x86) is setup in the same fashion and has shown no such
> failures.

Hmm. One pretty obvious explanation of the failure is simply that the machine is so loaded that the stats collector doesn't get to run for a few seconds. I had dismissed this idea because I figured the buildfarm machine owners would schedule the tests to run at relatively low-load times of day ... but maybe that's not true on these two machines?

We could try increasing the delay in the stats test, say from two seconds to five. If it is just a matter of load, that should result in a very large drop in the frequency of the failure.

regards, tom lane
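The load hypothesis can be pictured with a small model: the test sleeps for a fixed delay and then looks at the stats once, so it fails whenever the collector's lag exceeds that delay. The following is only an illustrative Python sketch, not the regression test itself; `make_collector` and `check_after_fixed_delay` are hypothetical names, and the lag is simulated with a timer rather than a real stats collector.

```python
import time


def make_collector(lag):
    """Hypothetical stand-in for a loaded stats collector.

    Returns a function that reports True once `lag` seconds have
    passed since this call, i.e. once the counts would be visible.
    """
    visible_at = time.monotonic() + lag
    return lambda: time.monotonic() >= visible_at


def check_after_fixed_delay(stats_visible, delay):
    """Sleep for a fixed delay, then look once (assumed test behavior)."""
    time.sleep(delay)
    return stats_visible()


if __name__ == "__main__":
    # Scaled-down timings: a lag of 0.3 s against delays of 0.2 s and 0.5 s
    # plays the role of a ~3 s lag against the 2 s and 5 s delays discussed.
    for delay in (0.2, 0.5):
        ok = check_after_fixed_delay(make_collector(0.3), delay)
        print(f"delay={delay}: stats visible = {ok}")
```

If the failures really are load-induced lag of a few seconds, a longer delay moves most runs from the first case to the second, which is why the failure frequency should drop sharply.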
Also, kookaburra (AIX) has a problem with the stats test as well.

What is most puzzling to me is that it only happens with cc (not gcc), and I can only get it to happen when the buildfarm runs as a cron job. If I run it interactively, or run the build script from the command line, the stats collector runs fine.

The environments under cron and on the command line are not significantly different, so I am at a bit of a loss as to the reason why.

Any thoughts?

    -rocco

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Tuesday, August 30, 2005 12:31 AM
> To: pgsql-hackers@postgreSQL.org
> Subject: [HACKERS] Intermittent stats test failures on buildfarm
"Rocco Altier" <RoccoA@Routescape.com> writes:
> Also, kookaburra (AIX) has a problem with the stats test as well.

kookaburra's problem is entirely different, not intermittent in the least. The error diff shows that stats collection is off, and its postmaster log says

LOG:  could not bind socket for statistics collector: Permission denied
LOG:  disabling statistics collector for lack of working socket

I have no idea what's causing that --- the only reason I know of for EACCES from bind() is trying to bind to a privileged port number, and one hopes we're not doing that.

regards, tom lane
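To check whether bind() itself is being refused in the cron environment, one could run a small probe from both cron and the command line and compare results. This is a minimal Python sketch under stated assumptions: it binds a UDP socket on the loopback address with a kernel-chosen port, which may not match exactly what the collector does on AIX, and `try_bind_udp` is a hypothetical helper name.

```python
import errno
import socket


def try_bind_udp(host, port):
    """Try to bind a UDP socket to (host, port).

    Returns None on success, otherwise the symbolic errno name
    (for example 'EACCES') reported by the failed bind().
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    # Port 0 asks the kernel for any free unprivileged port.
    print("loopback, port 0:", try_bind_udp("127.0.0.1", 0))
    # A port below 1024 is privileged on most Unix systems; the result
    # depends on the user and on system configuration.
    print("loopback, port 80:", try_bind_udp("127.0.0.1", 80))
```

If the port-0 bind succeeds interactively but fails under cron, that would point at the cron environment rather than at the port number being requested.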
Tom Lane wrote:

>"Rocco Altier" <RoccoA@Routescape.com> writes:
>
>>Also, kookaburra (AIX) has a problem with the stats test as well.
>
>kookaburra's problem is entirely different, not intermittent in the
>least. The error diff shows that stats collection is off, and its
>postmaster log says
>
>LOG:  could not bind socket for statistics collector: Permission denied
>LOG:  disabling statistics collector for lack of working socket
>
>I have no idea what's causing that --- the only reason I know of for
>EACCES from bind() is trying to bind to a privileged port number, and
>one hopes we're not doing that.

The other thing that's rather odd is that it's failing at the installcheck stage, which means it had just passed this same test moments before at the check stage. Installcheck failures in buildfarm should always be regarded suspiciously.

cheers

andrew