Thread: failed runcheck

failed runcheck

From
Patrick Welche
Date:
First a core dump which can be relieved by:

Index: catalog.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v
retrieving revision 1.34
diff -c -u -r1.34 catalog.c
--- catalog.c    2000/10/16 14:52:02    1.34
+++ catalog.c    2000/10/21 17:07:09
@@ -173,7 +173,7 @@boolIsSystemRelationName(const char *relname){
-    if (relname[0] && relname[1] && relname[2])
+    if (relname && relname[0] && relname[1] && relname[2])        return (relname[0] == 'p' &&
relname[1]== 'g' &&                relname[2] == '_');
 

(symptoms at the end of message)

But now the bit I don't see how to solve: the regression postmaster doesn't
startup because it can't find tmp_check/data/base/1/1259. The only files I
see are 1/{1255,PG_VERSION}. Where does 1259 come from?

Cheers,

Patrick


#0  IsSystemRelationName (relname=0x0) at catalog.c:176
#1  0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2  0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3  0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,    att=0x8173600) at relcache.c:1193
#4  0x8120c12 in RelationCacheInitialize () at relcache.c:1953
#5  0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1", username=0x0)   at postinit.c:329
#6  0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at bootstrap.c:358
#7  0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
#8  0x806367e in ___start ()
(gdb) print *relation
$3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000',  rd_isnailed = 0 '\000', rd_unlinked = 0
'\000',rd_indexfound = 0 '\000',  rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0,  rd_indexlist
=0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId = 0}},  rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128,
rd_istrat= 0x0,  rd_support = 0x0, trigdesc = 0x0}
 



Re: failed runcheck

From
Bruce Momjian
Date:
Applied

> First a core dump which can be relieved by:
> 
> Index: catalog.c
> ===================================================================
> RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v
> retrieving revision 1.34
> diff -c -u -r1.34 catalog.c
> --- catalog.c    2000/10/16 14:52:02    1.34
> +++ catalog.c    2000/10/21 17:07:09
> @@ -173,7 +173,7 @@
>  bool
>  IsSystemRelationName(const char *relname)
>  {
> -    if (relname[0] && relname[1] && relname[2])
> +    if (relname && relname[0] && relname[1] && relname[2])
>          return (relname[0] == 'p' &&
>                  relname[1] == 'g' &&
>                  relname[2] == '_');
> 
> (symptoms at the end of message)
> 
> But now the bit I don't see how to solve: the regression postmaster doesn't
> startup because it can't find tmp_check/data/base/1/1259. The only files I
> see are 1/{1255,PG_VERSION}. Where does 1259 come from?
> 
> Cheers,
> 
> Patrick
> 
> 
> #0  IsSystemRelationName (relname=0x0) at catalog.c:176
> #1  0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
> #2  0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
> #3  0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22, 
>     att=0x8173600) at relcache.c:1193
> #4  0x8120c12 in RelationCacheInitialize () at relcache.c:1953
> #5  0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1", username=0x0)
>     at postinit.c:329
> #6  0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at bootstrap.c:358
> #7  0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
> #8  0x806367e in ___start ()
> (gdb) print *relation
> $3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000', 
>   rd_isnailed = 0 '\000', rd_unlinked = 0 '\000', rd_indexfound = 0 '\000', 
>   rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0, 
>   rd_indexlist = 0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId = 0}}, 
>   rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128, rd_istrat = 0x0, 
>   rd_support = 0x0, trigdesc = 0x0}
> 
> 


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: failed runcheck

From
Tom Lane
Date:
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> [ core dump due to ]
> #0  IsSystemRelationName (relname=0x0) at catalog.c:176
> #1  0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
> #2  0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
> #3  0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22, 
>     att=0x8173600) at relcache.c:1193
> #4  0x8120c12 in RelationCacheInitialize () at relcache.c:1953

and proposes to fix this by having IsSystemRelationName make an
arbitrary decision about whether a NULL input pointer should be
considered to represent a system relation name or not.  I do not
like that, because AFAICS the decision is completely arbitrary.
IsSystemRelationName shouldn't be called with a NULL pointer
in the first place, and it has every right to cause a coredump
if that happens.

It looks to me like the immediate bug here is that
RelationGetPhysicalRelationName is returning a NULL pointer.  Probably
there are some missing or out-of-order steps in relcache initialization?

I've been offline for a couple days due to DSL line failure :-( so
I haven't seen Vadim's latest checkins.  But I'm betting this is a bug
in the changes to use OIDs as physical relnames.  Do we even need
RelationGetPhysicalRelationName anymore, and if so, what does it mean?

Next question: why is RelationInitLockInfo using
RelationGetPhysicalRelationName to get the input data for
IsSharedSystemRelationName --- shouldn't that be a test on logical
relation name?  Or maybe the entire premise of
IsSharedSystemRelationName is bogus now, and we ought to use some other
way to decide if a relation is cross-database or not?
        regards, tom lane


Re: failed runcheck

From
Bruce Momjian
Date:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> > [ core dump due to ]
> > #0  IsSystemRelationName (relname=0x0) at catalog.c:176
> > #1  0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
> > #2  0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
> > #3  0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22, 
> >     att=0x8173600) at relcache.c:1193
> > #4  0x8120c12 in RelationCacheInitialize () at relcache.c:1953
> 
> and proposes to fix this by having IsSystemRelationName make an
> arbitrary decision about whether a NULL input pointer should be
> considered to represent a system relation name or not.  I do not
> like that, because AFAICS the decision is completely arbitrary.
> IsSystemRelationName shouldn't be called with a NULL pointer
> in the first place, and it has every right to cause a coredump
> if that happens.

I have removed the change.  It will dump core again.

> 
> It looks to me like the immediate bug here is that
> RelationGetPhysicalRelationName is returning a NULL pointer.  Probably
> there are some missing or out-of-order steps in relcache initialization?
> 
> I've been offline for a couple days due to DSL line failure :-( so

That is bad.  Mine has been great, but I had my first DSL burp for 3
hours on Friday after 4 months of continuous uptime.

> I haven't seen Vadim's latest checkins.  But I'm betting this is a bug
> in the changes to use OIDs as physical relnames.  Do we even need
> RelationGetPhysicalRelationName anymore, and if so, what does it mean?

Good bet.

> 
> Next question: why is RelationInitLockInfo using
> RelationGetPhysicalRelationName to get the input data for
> IsSharedSystemRelationName --- shouldn't that be a test on logical
> relation name?  Or maybe the entire premise of
> IsSharedSystemRelationName is bogus now, and we ought to use some other
> way to decide if a relation is cross-database or not?

No, because if they create a temp table that masks a system table in the
current session, you want the physical name so it can know if it is a
real system table, or a temp/fake one.

You can ask why would someone try this, but it will work, or do
something while trying.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: failed runcheck

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Next question: why is RelationInitLockInfo using
>> RelationGetPhysicalRelationName to get the input data for
>> IsSharedSystemRelationName --- shouldn't that be a test on logical
>> relation name?  Or maybe the entire premise of
>> IsSharedSystemRelationName is bogus now, and we ought to use some other
>> way to decide if a relation is cross-database or not?

> No, because if they create a temp table that masks a system table in the
> current session, you want the physical name so it can know if it is a
> real system table, or a temp/fake one.

Well, you clearly don't want to be fooled by temp relations.  I was
sorta visualizing a check based on relation OIDs instead of names...
        regards, tom lane


Re: failed runcheck

From
"Vadim Mikheev"
Date:
Did you run make distclean? I've run regtests before committing changes.

Vadim

----- Original Message -----
From: "Patrick Welche" <prlw1@newn.cam.ac.uk>
To: <pgsql-hackers@postgresql.org>
Sent: Saturday, October 21, 2000 10:17 AM
Subject: [HACKERS] failed runcheck


> First a core dump which can be relieved by:
>
> Index: catalog.c
> ===================================================================
> RCS file:
/home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v
> retrieving revision 1.34
> diff -c -u -r1.34 catalog.c
> --- catalog.c 2000/10/16 14:52:02 1.34
> +++ catalog.c 2000/10/21 17:07:09
> @@ -173,7 +173,7 @@
>  bool
>  IsSystemRelationName(const char *relname)
>  {
> - if (relname[0] && relname[1] && relname[2])
> + if (relname && relname[0] && relname[1] && relname[2])
>   return (relname[0] == 'p' &&
>   relname[1] == 'g' &&
>   relname[2] == '_');
>
> (symptoms at the end of message)
>
> But now the bit I don't see how to solve: the regression postmaster
doesn't
> startup because it can't find tmp_check/data/base/1/1259. The only files I
> see are 1/{1255,PG_VERSION}. Where does 1259 come from?
>
> Cheers,
>
> Patrick
>
>
> #0  IsSystemRelationName (relname=0x0) at catalog.c:176
> #1  0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
> #2  0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
> #3  0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
>     att=0x8173600) at relcache.c:1193
> #4  0x8120c12 in RelationCacheInitialize () at relcache.c:1953
> #5  0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1",
username=0x0)
>     at postinit.c:329
> #6  0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at
bootstrap.c:358
> #7  0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
> #8  0x806367e in ___start ()
> (gdb) print *relation
> $3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000',
>   rd_isnailed = 0 '\000', rd_unlinked = 0 '\000', rd_indexfound = 0
'\000',
>   rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0,
>   rd_indexlist = 0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId =
0}},
>   rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128, rd_istrat = 0x0,
>   rd_support = 0x0, trigdesc = 0x0}
>



Re: failed runcheck

From
Patrick Welche
Date:
On Sat, Oct 21, 2000 at 01:48:39PM -0700, Vadim Mikheev wrote:
> Did you run make distclean? I've run regtests before committing changes.

Just made sure - different computer - fresh cvs update/distclean/configure/make
cd src/test/regress
gmake clean
gmake all
gmake runcheck

same coredump

#1  0x807f4be in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
^^^^^^^^^^^
ie. relname[0] at catalog.c:176 is dereferencing a null pointer.

Cheers,

Patrick


Re: failed runcheck

From
Tom Lane
Date:
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> On Sat, Oct 21, 2000 at 01:48:39PM -0700, Vadim Mikheev wrote:
>> Did you run make distclean? I've run regtests before committing changes.

> Just made sure - different computer - fresh cvs update/distclean/configure/make
> same coredump

> #1  0x807f4be in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
>                                              ^^^^^^^^^^^
> ie. relname[0] at catalog.c:176 is dereferencing a null pointer.

Interesting.  Current sources pass regress tests on my machine, same as
for Vadim.  I think you have found a platform-specific bug.

Could you dig into it a little further and try to determine where the
NULL is coming from?
        regards, tom lane


Re: failed runcheck

From
"Ross J. Reedstrom"
Date:
On Sun, Oct 22, 2000 at 01:26:07AM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> Next question: why is RelationInitLockInfo using
> >> RelationGetPhysicalRelationName to get the input data for
> >> IsSharedSystemRelationName --- shouldn't that be a test on logical
> >> relation name?  Or maybe the entire premise of
> >> IsSharedSystemRelationName is bogus now, and we ought to use some other
> >> way to decide if a relation is cross-database or not?
> 
> > No, because if they create a temp table that masks a system table in the
> > current session, you want the physical name so it can know if it is a
> > real system table, or a temp/fake one.
> 
> Well, you clearly don't want to be fooled by temp relations.  I was
> sorta visualizing a check based on relation OIDs instead of names...
> 

Well, when I did a test implementation of OID filenames, lo these many
moons ago, I hacked around this problem by adding the (fixed) shared
system table oids to the static array that is searched for matches by
IsSharedSystemRelationName.

Admittedly, a hack, but it got past all the regresion tests.

Ross
-- 
Open source code is like a natural resource, it's the result of providing
food and sunshine to programmers, and then staying out of their way.
[...] [It] is not going away because it has utility for both the developers 
and users independent of economic motivations.  Jim Flynn, Sunnyvale, Calif.


Re: failed runcheck

From
Tom Lane
Date:
"Ross J. Reedstrom" <reedstrm@rice.edu> writes:
>> Well, you clearly don't want to be fooled by temp relations.  I was
>> sorta visualizing a check based on relation OIDs instead of names...

> Well, when I did a test implementation of OID filenames, lo these many
> moons ago, I hacked around this problem by adding the (fixed) shared
> system table oids to the static array that is searched for matches by
> IsSharedSystemRelationName.
> Admittedly, a hack, but it got past all the regresion tests.

Not a hack at all, IMHO, since all the shared system rels have
nailed-down OIDs.  It's pure historical artifact that
IsSharedSystemRelationName wasn't IsSharedSystemRelationOID in the
first place.

We probably want to be thinking about merging the "shared system
relation" concept together with the "tablespace" concept once we start
to implement tablespaces.  But for now, I think testing the OIDs is
fine.
        regards, tom lane


Re: failed runcheck

From
Patrick Welche
Date:
On Mon, Oct 23, 2000 at 10:26:39AM -0400, Tom Lane wrote:
> 
> Could you dig into it a little further and try to determine where the
> NULL is coming from?

All clear now! (I did do another cvs update in the meantime, but either way,
I can't now repeat the previously repeatable core dump)

Cheers,

Patrick