Thread: Gerbil build farm failure

Gerbil build farm failure

From
Bruce Momjian
Date:
I worked with Jim Nasby and we found this is the line that is failing on
Gerbil in the build farm during initdb: tqual.c, line 844 in 8.0.X

    if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)

This particular line was last modified in 2002.  However, this was a
file that was changed as part of the VACUUM tuple chain commit:

    revision 1.81.4.2
    date: 2005/08/25 19:45:01;  author: tgl;  state: Exp;  lines: +7 -4
    Back-patch fixes for problems with VACUUM destroying t_ctid chains too soon,
    and with insufficient paranoia in code that follows t_ctid links.
    This patch covers the 8.0 branch.

and the date of the commit to 8.0.X corresponds to the date that
failures started to happen:

    http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Gerbil build farm failure

From
"Jim C. Nasby"
Date:
On Tue, Sep 20, 2005 at 01:17:10PM -0400, Bruce Momjian wrote:
> I worked with Jim Nasby and we found this is the line that is failing on
> Gerbil in the build farm during initdb: tqual.c, line 844 in 8.0.X
> 
>     if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
> 
> This particular line was last modified in 2002.  However, this was a
> file that was changed as part of the VACUUM tuple chain commit:
> 
>     revision 1.81.4.2
>     date: 2005/08/25 19:45:01;  author: tgl;  state: Exp;  lines: +7 -4
>     Back-patch fixes for problems with VACUUM destroying t_ctid chains too soon,
>     and with insufficient paranoia in code that follows t_ctid links.
>     This patch covers the 8.0 branch.
> 
> and the date of the commit to 8.0.X corresponds to the date that
> failures started to happen:
> 
>     http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

BTW, I want to point out for others that when initdb dumps core, trying
to get a stack trace out of the initdb binary will probably be useless,
because initdb is just calling other binaries. In this case we had
success with the postgres binary. Had I known this, I would have had this
stack trace available a couple of weeks ago. :(

http://lnk.nu/developer.postgresql.org/3zx.c is the annotated version of
tqual. As Bruce mentioned, the line referenced in the core file probably
isn't the culprit. http://lnk.nu/pgbuildfarm.org/3zz.pl has the list of
files that changed to break gerbil.

Here's the output from gdb:
#0  HeapTupleSatisfiesSnapshot (tuple=0xfe28fc78, snapshot=0xd7, buffer=295) at tqual.c:844
844     tqual.c: No such file or directory.
        in tqual.c
(gdb) bt
#0  HeapTupleSatisfiesSnapshot (tuple=0xfe28fc78, snapshot=0xd7, buffer=295) at tqual.c:844
#1  0x0004bdd0 in heap_update ()
#2  0x000ec4b0 in ExecutorRun (queryDesc=0x0, direction=-4198192, count=16) at execMain.c:1592
(gdb)

I'm in the process of trying to get this machine moved someplace where I
could give a developer ssh access. That should hopefully happen by the
end of the week.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Gerbil build farm failure

From
Bruce Momjian
Date:
Now that we have backtrace, does anyone have a clue about the cause/fix?


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Gerbil build farm failure

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Now that we have backtrace, does anyone have a clue about the cause/fix?

The backtrace suggests a garbage snapshot value, but doesn't provide
nearly enough info to guess where it's coming from.  I'm waiting for the
promised ssh access...
        regards, tom lane
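
[Illustrative note: one way to read the "garbage snapshot" remark is that
the backtrace shows snapshot=0xd7, while the failing line reads
snapshot->curcid and so has to dereference that pointer. The sketch below
only illustrates that reasoning. SnapshotSketch, cmin_not_before_curcid and
pointer_value_is_implausible are hypothetical names, not PostgreSQL code,
and the 4096-byte threshold is an assumption about typical platforms.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t CommandId;

/* Hypothetical stand-in: the real snapshot struct has more fields. */
typedef struct SnapshotSketch
{
    CommandId   curcid;     /* the field read by the failing comparison */
} SnapshotSketch;

/*
 * Same shape as the comparison at tqual.c line 844.  Reading
 * snapshot->curcid dereferences the pointer, so a garbage pointer
 * value would fault on this line even though the line is correct.
 */
bool
cmin_not_before_curcid(CommandId tuple_cmin, const SnapshotSketch *snapshot)
{
    assert(snapshot != NULL);
    return tuple_cmin >= snapshot->curcid;
}

/*
 * Heuristic only (an assumption, not a PostgreSQL function): treat
 * addresses below 4096 as implausible for a real heap or stack object.
 */
bool
pointer_value_is_implausible(uintptr_t addr)
{
    return addr < 4096;
}
```

[With this heuristic the reported snapshot value 0xd7 is flagged as
implausible, while the tuple value 0xfe28fc78 is not. Argument values
printed by gdb can themselves be unreliable, so this is a hint about where
to look rather than a diagnosis.]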


Re: Gerbil build farm failure

From
"Jim C. Nasby"
Date:
On Fri, Sep 23, 2005 at 12:56:33AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Fire lit under IT dept. Their initial plan was everything outbound but
> > SSH would be cut-off, which I nixed, but would that suffice in the short
> > term if it means getting the box on the net faster?
> 
> AFAICS, an ssh connection to an unprivileged account should be enough.
> I just need to be able to duplicate your build environment.

Ok, if that greases the wheels I'll have them do that. Hopefully they
can get it done tomorrow.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Gerbil build farm failure

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Fire lit under IT dept. Their initial plan was everything outbound but
> SSH would be cut-off, which I nixed, but would that suffice in the short
> term if it means getting the box on the net faster?

AFAICS, an ssh connection to an unprivileged account should be enough.
I just need to be able to duplicate your build environment.
        regards, tom lane


Re: Gerbil build farm failure

From
"Jim C. Nasby"
Date:
On Thu, Sep 22, 2005 at 08:03:43PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Now that we have backtrace, does anyone have a clue about the cause/fix?
> 
> The backtrace suggests a garbage snapshot value, but doesn't provide
> nearly enough info to guess where it's coming from.  I'm waiting for the
> promised ssh access...

Fire lit under IT dept. Their initial plan was everything outbound but
SSH would be cut-off, which I nixed, but would that suffice in the short
term if it means getting the box on the net faster?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Gerbil build farm failure

From
Michael Fuhr
Date:
Gerbil's looking better lately:

http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

-- 
Michael Fuhr


Re: Gerbil build farm failure

From
Tom Lane
Date:
Michael Fuhr <mike@fuhr.org> writes:
> Gerbil's looking better lately:
> http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

Yeah.  We've been poking at it off-list, and it seems that the problem
was a local build failure due to not having a clean copy of the
repository (ye olde junk-in-the-supposedly-clean-vpath-tree problem).
        regards, tom lane


Re: Gerbil build farm failure

From
"Jim C. Nasby"
Date:
On Mon, Sep 26, 2005 at 06:58:16PM -0400, Tom Lane wrote:
> Michael Fuhr <mike@fuhr.org> writes:
> > Gerbil's looking better lately:
> > http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE
> 
> Yeah.  We've been poking at it off-list, and it seems that the problem
> was a local build failure due to not having a clean copy of the
> repository (ye olde junk-in-the-supposedly-clean-vpath-tree problem).

Well, just to be clear, I first logged into that box after the problem
started. It's possible that someone else had mucked with the install,
but unlikely. I suspect that there was a real build issue of some kind
to start with. Since it's working now I guess it doesn't matter, but I'd
still suspect code from back when the problem started.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461