Thread: Gerbil build farm failure
I worked with Jim Nasby and we found this is the line that is failing on Gerbil in the build farm during initdb: tqual.c, line 844 in 8.0.X

    if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)

This particular line was last modified in 2002. However, this was a file that was changed as part of the VACUUM tuple chain commit:

    revision 1.81.4.2
    date: 2005/08/25 19:45:01; author: tgl; state: Exp; lines: +7 -4
    Back-patch fixes for problems with VACUUM destroying t_ctid chains too soon,
    and with insufficient paranoia in code that follows t_ctid links.
    This patch covers the 8.0 branch.

and the date of the commit to 8.0.X corresponds to the date that failures started to happen:

    http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Tue, Sep 20, 2005 at 01:17:10PM -0400, Bruce Momjian wrote:
> I worked with Jim Nasby and we found this is the line that is failing on
> Gerbil in the build farm during initdb: tqual.c, line 844 in 8.0.X
>
>     if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
>
> This particular line was last modified in 2002. However, this was a
> file that was changed as part of the VACUUM tuple chain commit:
>
>     revision 1.81.4.2
>     date: 2005/08/25 19:45:01; author: tgl; state: Exp; lines: +7 -4
>     Back-patch fixes for problems with VACUUM destroying t_ctid chains too soon,
>     and with insufficient paranoia in code that follows t_ctid links.
>     This patch covers the 8.0 branch.
>
> and the date of the commit to 8.0.X corresponds to the date that
> failures started to happen:
>
>     http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

BTW, I want to point out for others that when initdb dumps core, trying to get a stack trace out of the initdb binary will probably be useless, because initdb is just calling other binaries. In this case we had success with the postgres binary. Had I known this I would have had this stack trace available a couple of weeks ago. :(

http://lnk.nu/developer.postgresql.org/3zx.c is the annotated version of tqual.c. As Bruce mentioned, the line referenced in the core file probably isn't the culprit. http://lnk.nu/pgbuildfarm.org/3zz.pl has the list of files that changed to break gerbil.

Here's the output from gdb:

#0  HeapTupleSatisfiesSnapshot (tuple=0xfe28fc78, snapshot=0xd7, buffer=295) at tqual.c:844
844     tqual.c: No such file or directory.
        in tqual.c
(gdb) bt
#0  HeapTupleSatisfiesSnapshot (tuple=0xfe28fc78, snapshot=0xd7, buffer=295) at tqual.c:844
#1  0x0004bdd0 in heap_update ()
#2  0x000ec4b0 in ExecutorRun (queryDesc=0x0, direction=-4198192, count=16) at execMain.c:1592
(gdb)

I'm in the process of trying to get this machine moved someplace where I could give a developer ssh access. That should hopefully happen by the end of the week.
-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
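For anyone hitting the same thing, the point above about pointing gdb at the postgres binary rather than at initdb can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names and any paths passed in are hypothetical, gdb is assumed to be on PATH, and the core file is assumed to come from the postgres executable of the same build. It relies only on gdb's standard `-batch` and `-ex` options.

```python
import shutil
import subprocess


def backtrace_command(postgres_binary, core_file):
    """Build a gdb command line that prints a backtrace and exits.

    The binary handed to gdb must be the one that actually crashed.
    During initdb that is the postgres backend, not initdb itself.
    """
    return ["gdb", "-batch", "-ex", "bt", postgres_binary, core_file]


def print_backtrace(postgres_binary, core_file):
    """Run gdb against the core file and return everything it printed."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found on PATH")
    result = subprocess.run(
        backtrace_command(postgres_binary, core_file),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,
        check=False,  # gdb may exit non-zero; the output is still useful
    )
    return result.stdout + result.stderr
```

Calling `print_backtrace()` with the path to the installed postgres executable and the path to the core file (both placeholders here) should give output of the same shape as the `bt` session above. Frames without line numbers, like `heap_update ()` in that trace, usually mean the file was built without debug symbols.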
Now that we have backtrace, does anyone have a clue about the cause/fix?

--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Now that we have backtrace, does anyone have a clue about the cause/fix?

The backtrace suggests a garbage snapshot value, but doesn't provide nearly enough info to guess where it's coming from. I'm waiting for the promised ssh access...

regards, tom lane
On Thu, Sep 22, 2005 at 08:03:43PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Now that we have backtrace, does anyone have a clue about the cause/fix?
>
> The backtrace suggests a garbage snapshot value, but doesn't provide
> nearly enough info to guess where it's coming from. I'm waiting for the
> promised ssh access...

Fire lit under IT dept. Their initial plan was everything outbound but SSH would be cut-off, which I nixed, but would that suffice in the short term if it means getting the box on the net faster?

--
Jim C. Nasby

"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Fire lit under IT dept. Their initial plan was everything outbound but
> SSH would be cut-off, which I nixed, but would that suffice in the short
> term if it means getting the box on the net faster?

AFAICS, an ssh connection to an unprivileged account should be enough. I just need to be able to duplicate your build environment.

regards, tom lane

On Fri, Sep 23, 2005 at 12:56:33AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Fire lit under IT dept. Their initial plan was everything outbound but
> > SSH would be cut-off, which I nixed, but would that suffice in the short
> > term if it means getting the box on the net faster?
>
> AFAICS, an ssh connection to an unprivileged account should be enough.
> I just need to be able to duplicate your build environment.

Ok, if that greases the wheels I'll have them do that. Hopefully they can get it done tomorrow.

--
Jim C. Nasby
Gerbil's looking better lately:

http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

--
Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes:
> Gerbil's looking better lately:
> http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE

Yeah. We've been poking at it off-list, and it seems that the problem was a local build failure due to not having a clean copy of the repository (ye olde junk-in-the-supposedly-clean-vpath-tree problem).

regards, tom lane
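As a rough illustration of the junk-in-the-vpath-tree problem: in a VPATH build the source directory is supposed to hold only pristine sources, and leftover build products there can be picked up in place of freshly built ones. The Python sketch below walks a source tree and lists files that look like build leftovers. The function name and the pattern list are assumptions made for illustration, not PostgreSQL's own list of generated files, so treat a hit as a hint to investigate rather than proof of a problem.

```python
import fnmatch
import os

# Assumed patterns for files that a configure or make run leaves behind.
# Illustrative only; adjust for the tree being checked.
STALE_PATTERNS = ("*.o", "*.a", "*.so", "config.status", "config.log")


def find_stale_build_files(source_dir, patterns=STALE_PATTERNS):
    """Return sorted paths (relative to source_dir) matching any pattern."""
    stale = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if any(fnmatch.fnmatch(name, pat) for pat in patterns):
                full_path = os.path.join(root, name)
                stale.append(os.path.relpath(full_path, source_dir))
    return sorted(stale)
```

If `find_stale_build_files()` returns anything for the checkout a build farm member builds from, running `make distclean` in that source directory, or starting again from a fresh checkout, is the usual way to get back to a clean tree.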
On Mon, Sep 26, 2005 at 06:58:16PM -0400, Tom Lane wrote:
> Michael Fuhr <mike@fuhr.org> writes:
> > Gerbil's looking better lately:
> > http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=gerbil&br=REL8_0_STABLE
>
> Yeah. We've been poking at it off-list, and it seems that the problem
> was a local build failure due to not having a clean copy of the
> repository (ye olde junk-in-the-supposedly-clean-vpath-tree problem).

Well, just to be clear, I first logged into that box after the problem started. It's possible that someone else had mucked with the install, but unlikely. I suspect that there was a real build issue of some kind to start with. Since it's working now I guess it doesn't matter, but I'd still suspect code from back when the problem started.

--
Jim C. Nasby