Thread: Problems with recent CVS versions and Solaris.

Problems with recent CVS versions and Solaris.

From
Keith Parks
Date:
Hi all,

I regularly do a "cvs update" and compile and test PostgreSQL.

Recently, since about 1 week, I've had a nasty problem.

Doing an "initdb" seems to suck up all available memory and almost
kills the system, before dying itself with a SEGV.

The problem postgres process is:-

  /usr/local/pgsql/bin/postgres -boot -x -C -F -D/usr/local/pgsql/data -d template1

The system becomes VERY unresponsive when this postgres process
starts running, so it is difficult to attach to with gdb.

I'm stuck for a clue as to how to debug this.

Is anyone else seeing this problem recently?

Is it just a Solaris problem?
(Solaris 2.6 on SPARCstation 5)

Is it just me? :-(

Help,

Keith.



Re: Problems with recent CVS versions and Solaris.

From
Tom Lane
Date:
Keith Parks <emkxp01@mtcc.demon.co.uk> writes:
> Recently, since about 1 week, I've had a nasty problem.
> Doing an "initdb" seems to suck up all available memory and almost
> kills the system, before dying itself with a SEGV.

Hmm --- no such problem noted here, and I've been doing lots of initdbs...

It must be somewhat platform-specific.  See if you can get a coredump
and backtrace.
        regards, tom lane


Re: Problems with recent CVS versions and Solaris.

From
Keith Parks
Date:
Oops, mailed it to myself instead of the list!

It's been a long day...


------------- Begin Forwarded Message -------------

Date: Thu, 1 Jun 2000 23:31:01 +0100 (BST)
From: Keith Parks <emkxp01@mtcc.demon.co.uk>
Subject: Re: [HACKERS] Problems with recent CVS versions and Solaris.
To: emkxp01@mtcc.demon.co.uk
MIME-Version: 1.0

I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's
excellent example from the archives, also attached.

I'm not sure whether the stack frame shown is corrupt, it seems to just
loop over and over again. (I got fed up after 400+ frames)

The final few frames show us asking for more memory, the point at
which things seem to go out of control.

#0  0xef5d33b8 in _brk_unlocked ()
#1  0xef5ce2f8 in _sbrk_unlocked ()
#2  0xef5ce26c in sbrk ()
#3  0xef585bb0 in _morecore ()
#4  0xef58549c in _malloc_unlocked ()
#5  0xef5852b4 in malloc ()
#6  0x139198 in AllocSetAlloc (set=0x1bea10, size=4032) at aset.c:285
#7  0x139ea8 in GlobalMemoryAlloc (this=0x1bea08, size=4008) at mcxt.c:419
#8  0x1399ec in MemoryContextAlloc (context=0x1bea08, size=4008) at mcxt.c:224
#9  0x12c700 in InitSysCache (relname=0x180f40 "pg_proc", iname=0x180f08 "pg_proc_oid_index", id=18, nkeys=1,
    key=0x19a2f0, iScanfuncP=0x6e1c8 <ProcedureOidIndexScan>) at catcache.c:705
#10 0x1312d8 in SearchSysCacheTuple (cacheId=18, key1=184, key2=0, key3=0, key4=0) at syscache.c:509

Is this any help?

I'm no expert in gdb, but I can follow instructions. ;-)

Thanks,
Keith.



------------- End Forwarded Message -------------


Re: Problems with recent CVS versions and Solaris.

From
Tom Lane
Date:
Keith Parks <emkxp01@mtcc.demon.co.uk> writes:
> I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's
> excellent example from the archives, also attached.

> I'm not sure whether the stack frame shown is corrupt, it seems to just
> loop over and over again. (I got fed up after 400+ frames)

What we've got here is the syscache trying to set up for a search of
cache 18, which I believe is the pg_proc-indexed-on-OID cache.
For that it needs the OID comparison function, "oideq" (OID 184).
It's asking the funcmgr for oideq ... and funcmgr is turning around
and asking the syscache for the pg_proc entry with OID 184.  Ooops.

I thought there was an interlock in there to report a useful message if
a syscache got called recursively like this.  Have to look at why it's
not working.  However, I guess your real problem is that the funcmgr is
failing to find proc OID 184 in its own table of built-in functions.
The reason this isn't a recursion under normal circumstances is that the
comparison functions the syscaches need are all supposed to be hardwired
into fmgr.

My bet is that there is something snafu'd in your generation of
fmgrtab.c from pg_proc.h via Gen_fmgrtab.sh, such that your table of
builtin functions is either empty or corrupt.

Before wasting any additional time on it I'd recommend a make distclean,
cvs update, configure and rebuild from scratch to see if the problem
persists.  I changed the Gen_fmgrtab.sh setup last week as part of the
first round of fmgr checkins, and I wouldn't be surprised to find that
you've just gotten burnt by out-of-sync files or some such (eg, a local
file that needs to be rebuilt but is timestamped a bit newer than the
cvs-supplied files it depends on).

If you still see the problem with a virgin build, take a look at
src/backend/utils/Gen_fmgrtab.sh and its output
src/backend/utils/fmgrtab.c to see if you can figure out what's
wrong.  Could be that I introduced some kind of portability problem
into Gen_fmgrtab.sh ...
        regards, tom lane
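
One quick way to check whether the table of built-in functions came out empty is to count the entry lines in the generated file. This is only a sketch: `count_builtins` is a hypothetical helper, and the line shape it matches (`{ OID, "name", ...`) is assumed from the printf format used in the generator script's awk call, not taken from a real fmgrtab.c.

```shell
# Hypothetical helper: count lines that look like a real table entry,
# e.g.   { 184, "oideq", ... },
# The dummy terminator entry has NULL instead of a quoted name, so it
# is not counted.
count_builtins() {
    grep -c '^ *{ [0-9][0-9]*, "' "$1"
}

# Usage (path as given in the thread):
#   count_builtins src/backend/utils/fmgrtab.c
```

A count of 0 would suggest that the awk step produced no entries at all, which fits the "empty table" theory.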


Re: Problems with recent CVS versions and Solaris.

From
Keith Parks
Date:
Tom,

You ain't arf clever.

Running Gen_fmgrtab.sh with a "set -x" shows:-

const FmgrBuiltin fmgr_builtins[] = {
+ awk { printf ("  { %d, \"%s\", %d, %s, %s, %s },\n"), \
        $1, $(NF-1), $9, \
        ($8 == "t") ? "true" : "false", \
        ($4 == "11") ? "true" : "false", \
        $(NF-1) } fmgr.raw
awk: syntax error near line 3
awk: illegal statement near line 3
+ cat
  /* dummy entry is easier than getting rid of comma after last real one */
  { 0, NULL, 0, false, false, (PGFunction)NULL }
};

/* Note fmgr_nbuiltins excludes the dummy entry */
const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)) - 1;

Looks like the problem is that Solaris's awk is "old" awk.

If I change the awk to nawk I get valid output.

I'm just about to start the clean build process with this change.

Once it's started I'm off to bed. Will check in the morning.

Thanks for your trouble, we just need a "portable" fix now.

Thanks,
Keith. 




Re: Problems with recent CVS versions and Solaris.

From
Tom Lane
Date:
Keith Parks <emkxp01@mtcc.demon.co.uk> writes:
> Running Gen_fmgrtab.sh with a "set -x" shows:-

> const FmgrBuiltin fmgr_builtins[] = {
> + awk { printf ("  { %d, \"%s\", %d, %s, %s, %s },\n"), \
>         $1, $(NF-1), $9, \
>         ($8 == "t") ? "true" : "false", \
>         ($4 == "11") ? "true" : "false", \
>         $(NF-1) } fmgr.raw 
> awk: syntax error near line 3
> awk: illegal statement near line 3

Ugh.  I think that the former version of the script didn't use
conditional expressions (a ? b : c).  Perhaps old versions of
awk don't have those?  If so we can probably work around it...
        regards, tom lane
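
One possible shape for such a workaround is to replace each conditional expression with an if/else statement that sets a variable first. The sketch below is not the fix that was actually committed, and it has not been tried against Solaris's old awk. `rows_to_entries` and the two-column input are hypothetical and much simpler than the real fmgr.raw.

```shell
# Hypothetical, simplified input: column 1 is a number, column 2 is "t" or "f".
# The real fmgr.raw layout is not assumed here.
rows_to_entries() {
    awk '{
        # if/else statement in place of the conditional expression (a ? b : c)
        if ($2 == "t") { flag = "true" } else { flag = "false" }
        printf("  { %d, %s },\n", $1, flag)
    }'
}

# Example: two made-up rows, one per line.
printf '184 t\n185 f\n' | rows_to_entries
```

The output lines are slightly longer to produce this way, but the program uses only statements rather than the `?:` operator.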


Re: Problems with recent CVS versions and Solaris.

From
Tom Lane
Date:
> Ugh.  I think that the former version of the script didn't use
> conditional expressions (a ? b : c).  Perhaps old versions of
> awk don't have those?

Indeed, the GNU awk manual says so very clearly :-(

Keith, I've committed a new version of Gen_fmgrtab.sh.in;
would you check that it works on your copy of awk?
        regards, tom lane


Re: Problems with recent CVS versions and Solaris.

From
Keith Parks
Date:
Thanks Tom,

That's fixed it.

It's a shame when you have to "dumb-down" your AWK programming
to suit the lowest common standard :-(

Thanks again,
Keith.





Re: Problems with recent CVS versions and Solaris.

From
Peter Eisentraut
Date:
Tom Lane writes:

> Ugh.  I think that the former version of the script didn't use
> conditional expressions (a ? b : c).  Perhaps old versions of
> awk don't have those?  If so we can probably work around it...

While you're at it, you should use AC_PROG_AWK to potentially find the
most modern and fastest awk on the system. Also, it seems that script has
really little to no checking of exit statuses. A segfault during initdb is
a really obscure place to find out about awk syntax errors.

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
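
For readers unfamiliar with AC_PROG_AWK: it checks for gawk, mawk, nawk and awk, in that order, and uses the first one found. Roughly the same search can be sketched in plain shell as below. `find_awk` is a hypothetical name, and the sketch assumes a POSIX shell that provides `command -v`.

```shell
# Hypothetical helper: print the first awk implementation found on PATH,
# trying the same candidates, in the same order, as AC_PROG_AWK.
find_awk() {
    for candidate in gawk mawk nawk awk; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example: report which one would be used on this machine, if any.
AWK=$(find_awk) && echo "using $AWK"
```

On the Solaris box in this thread, such a search would be expected to pick nawk ahead of the old awk, assuming gawk and mawk are not installed.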



Re: Problems with recent CVS versions and Solaris.

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> While you're at it, you should use AC_PROG_AWK to potentially find the
> most modern and fastest awk on the system. Also, it seems that script has
> really little to no checking of exit statuses.

True.  Wanna fix it?  I'm not planning to touch it again soon...
        regards, tom lane
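
As an illustration of the exit-status point, a generator step can write to a temporary file and keep the result only if awk reports success. This is a minimal sketch, not code from Gen_fmgrtab.sh: `generate`, its arguments and the `.tmp` suffix are all hypothetical, and it assumes the awk in use exits non-zero on a syntax error.

```shell
# Hypothetical helper: run an awk program over stdin and install the
# output as $1 only if awk exited successfully.
generate() {
    out=$1
    prog=$2
    if awk "$prog" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        status=$?
        rm -f "$out.tmp"
        echo "awk failed (exit $status); $out not written" >&2
        return 1
    fi
}

# Usage (hypothetical file names):
#   generate fmgrtab.c '{ print $1 }' < fmgr.raw || exit 1
```

With a check like this, an awk syntax error would stop the build at the generation step instead of surfacing later as a crash in initdb.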