Thread: Problems with recent CVS versions and Solaris.
Hi all, I regularly do a "cvs update" and compile and test PostgreSQL. Recently, since about 1 week, I've had a nasty problem. Doing an "initdb" seems to suck up all available memory and almost kills the system, before dying itself with a SEGV. The problem postgres process is:- /usr/local/pgsql/bin/postgres -boot -x -C -F -D/usr/local/pgsql/data -d template1 The system becomes VERY unresponsive when this postgres process starts running, so it is difficult to attach to with gdb. I'm stuck for a clue as to how to debug this. Is anyone else seeing this problem recently? Is it just a Solaris problem? (Solaris 2.6 on SPARCstation 5) Is it just me? :-( Help, Keith.
Keith Parks <emkxp01@mtcc.demon.co.uk> writes: > Recently, since about 1 week, I've had a nasty problem. > Doing an "initdb" seems to suck up all available memory and almost > kills the system, before dying itself with a SEGV. Hmm --- no such problem noted here, and I've been doing lots of initdbs... It must be somewhat platform-specific. See if you can get a coredump and backtrace. regards, tom lane
Oops, mailed it to myself instead of the list! It's been a long day...
------------- Begin Forwarded Message -------------
Date: Thu, 1 Jun 2000 23:31:01 +0100 (BST)
From: Keith Parks <emkxp01@mtcc.demon.co.uk>
Subject: Re: [HACKERS] Problems with recent CVS versions and Solaris.
To: emkxp01@mtcc.demon.co.uk
I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's excellent example from the archives, also attached. I'm not sure whether the stack frame shown is corrupt, it seems to just loop over and over again. (I got fed up after 400+ frames) The final few frames show us asking for more memory, the point at which things seem to go out of control.
#0 0xef5d33b8 in _brk_unlocked ()
#1 0xef5ce2f8 in _sbrk_unlocked ()
#2 0xef5ce26c in sbrk ()
#3 0xef585bb0 in _morecore ()
#4 0xef58549c in _malloc_unlocked ()
#5 0xef5852b4 in malloc ()
#6 0x139198 in AllocSetAlloc (set=0x1bea10, size=4032) at aset.c:285
#7 0x139ea8 in GlobalMemoryAlloc (this=0x1bea08, size=4008) at mcxt.c:419
#8 0x1399ec in MemoryContextAlloc (context=0x1bea08, size=4008) at mcxt.c:224
#9 0x12c700 in InitSysCache (relname=0x180f40 "pg_proc", iname=0x180f08 "pg_proc_oid_index", id=18, nkeys=1, key=0x19a2f0, iScanfuncP=0x6e1c8 <ProcedureOidIndexScan>) at catcache.c:705
#10 0x1312d8 in SearchSysCacheTuple (cacheId=18, key1=184, key2=0, key3=0, key4=0) at syscache.c:509
Is this any help? I'm no expert in gdb, but I can follow instructions. ;-) Thanks, Keith.
------------- End Forwarded Message -------------
Keith Parks <emkxp01@mtcc.demon.co.uk> writes: > I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's > excellent example from the archives, also attached. > I'm not sure whether the stack frame shown is corrupt, it seems to just > loop over and over again. (I got fed up after 400+ frames) What we've got here is the syscache trying to set up for a search of cache 18, which I believe is the pg_proc-indexed-on-OID cache. For that it needs the OID comparison function, "oideq" (OID 184). It's asking the funcmgr for oideq ... and funcmgr is turning around and asking the syscache for the pg_proc entry with OID 184. Ooops. I thought there was an interlock in there to report a useful message if a syscache got called recursively like this. Have to look at why it's not working. However, I guess your real problem is that the funcmgr is failing to find proc OID 184 in its own table of built-in functions. The reason this isn't a recursion under normal circumstances is that the comparison functions the syscaches need are all supposed to be hardwired into fmgr. My bet is that there is something snafu'd in your generation of fmgrtab.c from pg_proc.h via Gen_fmgrtab.sh, such that your table of builtin functions is either empty or corrupt. Before wasting any additional time on it I'd recommend a make distclean, cvs update, configure and rebuild from scratch to see if the problem persists. I changed the Gen_fmgrtab.sh setup last week as part of the first round of fmgr checkins, and I wouldn't be surprised to find that you've just gotten burnt by out-of-sync files or some such (eg, a local file that needs to be rebuilt but is timestamped a bit newer than the cvs-supplied files it depends on). If you still see the problem with a virgin build, take a look at src/backend/utils/Gen_fmgrtab.sh and its output src/backend/utils/fmgrtab.c to see if you can figure out what's wrong. Could be that I introduced some kind of portability problem into Gen_fmgrtab.sh ... 
regards, tom lane
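Tom's suspicion that the table of built-in functions is "either empty or corrupt" can be checked mechanically before rebuilding. The following is only a sketch: count_builtin_rows is a hypothetical helper, and the row layout it matches (a line starting with an opening brace, a numeric OID and a quoted name) is assumed from the printf format in the script output that Keith quotes in this thread, not taken from the actual fmgrtab.c.

```shell
# Sketch: count the real entries in a generated fmgrtab.c-style file.
# Assumption: each entry sits on its own line and starts with
#   { <numeric oid>, "<name>", ...
# The dummy terminator row uses NULL instead of a quoted name,
# so it is not counted.
count_builtin_rows() {
    # grep -c exits non-zero when the count is 0; the count is still
    # printed, so the exit status is deliberately ignored here.
    grep -c -E '^[[:space:]]*\{ [0-9]+, "' "$1" || true
}
```

A result of 0 for src/backend/utils/fmgrtab.c would be consistent with the empty-table theory, since fmgr would then have to fall back to the syscache for every function, including oideq.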
Tom, You ain't arf clever. Running Gen_fmgrtab.sh with a "set -x" shows:-
const FmgrBuiltin fmgr_builtins[] = {
+ awk { printf (" { %d, \"%s\", %d, %s, %s, %s },\n"), \
$1, $(NF-1), $9, \
($8 == "t") ? "true" : "false",\
($4 == "11") ? "true" : "false", \
$(NF-1) } fmgr.raw
awk: syntax error near line 3
awk: illegal statement near line 3
+ cat
/* dummy entry is easier than getting rid of comma after last real one */
{ 0, NULL, 0, false, false, (PGFunction)NULL }
};
/* Note fmgr_nbuiltins excludes the dummy entry */
const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)) - 1;
Looks like the problem is that Solaris's awk is "old" awk. If I change the awk to nawk I get valid output. I'm just about to start the clean build process with this change. Once it's started I'm off to bed. Will check in the morning. Thanks for your trouble, we just need a "portable" fix now. Thanks, Keith.
Keith Parks <emkxp01@mtcc.demon.co.uk> writes:
> Running Gen_fmgrtab.sh with a "set -x" shows:-
> const FmgrBuiltin fmgr_builtins[] = {
> + awk { printf (" { %d, \"%s\", %d, %s, %s, %s },\n"), \
> $1, $(NF-1), $9, \
> ($8 == "t") ? "true" : "false", \
> ($4 == "11") ? "true" : "false", \
> $(NF-1) } fmgr.raw
> awk: syntax error near line 3
> awk: illegal statement near line 3
Ugh. I think that the former version of the script didn't use conditional expressions (a ? b : c). Perhaps old versions of awk don't have those? If so we can probably work around it... regards, tom lane
> Ugh. I think that the former version of the script didn't use > conditional expressions (a ? b : c). Perhaps old versions of > awk don't have those? Indeed, the GNU awk manual says so very clearly :-( Keith, I've committed a new version of Gen_fmgrtab.sh.in; would you check that it works on your copy of awk? regards, tom lane
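For readers wondering what such a workaround might look like: the conditional expressions can be replaced by plain if/else assignments, which older awks are generally expected to accept. This is a sketch only and not the version that was committed; gen_builtin_rows and the variables col8 and col4 are hypothetical names, and the field positions are simply copied from the awk command quoted above.

```shell
# Sketch: emit fmgr_builtins-style rows without "a ? b : c".
# Field positions ($1, $4, $8, $9, $(NF-1)) follow the command quoted
# in the thread; the function and variable names are made up.
gen_builtin_rows() {
    awk '{
        # Plain if/else instead of a conditional expression.
        if ($8 == "t")
            col8 = "true"
        else
            col8 = "false"
        if ($4 == "11")
            col4 = "true"
        else
            col4 = "false"
        printf "  { %d, \"%s\", %d, %s, %s, %s },\n", $1, $(NF-1), $9, col8, col4, $(NF-1)
    }' "$@"
}
```

It reads the named files, or standard input when none are given, e.g. `gen_builtin_rows fmgr.raw`.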
Thanks Tom, That's fixed it. It's a shame when you have to "dumb-down" your AWK programming to suit the lowest common standard :-( Thanks again, Keith.
Tom Lane writes: > Ugh. I think that the former version of the script didn't use > conditional expressions (a ? b : c). Perhaps old versions of > awk don't have those? If so we can probably work around it... While you're at it, you should use AC_PROG_AWK to potentially find the most modern and fastest awk on the system. Also, it seems that script has really little to no checking of exit statuses. A segfault during initdb is a really obscure place to find out about awk syntax errors.
--
Peter Eisentraut, Sernanders väg 10:115, 75262 Uppsala, Sweden
peter_e@gmx.net
http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes: > While you're at it, you should use AC_PROG_AWK to potentially find the > most modern and fastest awk on the system. Also, it seems that script has > really little to no checking of exit statuses. True. Wanna fix it? I'm not planning to touch it again soon... regards, tom lane
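Peter's two suggestions could be combined roughly as follows. This is a hedged sketch, not code from Gen_fmgrtab.sh: run_awk_step is a hypothetical helper, and AWK is assumed to be the variable that configure's AC_PROG_AWK would substitute, falling back to plain "awk" when unset.

```shell
# Sketch: run one awk step and stop with a clear message on failure,
# instead of leaving a truncated output file for the compiler to pick up.
# AWK is assumed to come from configure (AC_PROG_AWK); default to "awk".
AWK=${AWK:-awk}

run_awk_step() {
    # usage: run_awk_step 'awk program' input-file output-file
    prog=$1
    src=$2
    dst=$3
    # Write to a temporary name first so a failed run never leaves
    # a half-written file under the real output name.
    if ! "$AWK" "$prog" "$src" > "$dst.tmp"; then
        echo "awk failed while generating $dst" >&2
        rm -f "$dst.tmp"
        return 1
    fi
    mv "$dst.tmp" "$dst"
}
```

With a check like this, a syntax error in the awk program would abort the build at the generation step rather than surfacing much later as a crash during initdb.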