Re: [HACKERS] New regression driver - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] New regression driver
Date
Msg-id 7165.943143009@sss.pgh.pa.us
In response to Re: [HACKERS] New regression driver  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] New regression driver
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:
> wieck@debis.com (Jan Wieck) writes:
>> It is in utils/cache/catcache.c line 996. The comments say that
>> the code should prevent the backend from entering infinite
>> recursion while loading new cache entries.

> I will look at this.  I don't think that the catcaches live in
> shared memory, so the problem is probably not what you suggest.
> The fact that the behavior is different under load may point to a
> real problem, not just an insufficiently clever debugging check.

Indeed, this is a real bug, and commenting out the code that caught
it is not the right fix!

What is happening is that utils/cache/inval.c is trying to initialize some
variables that contain OIDs of system relations.  This means calling
the catcache routines in order to look up relation names in pg_class.
However, if a shared cache inval message arrives from another backend
while that's happening, we recursively invoke inval.c to deal with the
message.  And inval.c sees that its OID variables aren't initialized
yet, so it recursively calls the catcache routines to try to get them
initialized.  Or, if just the first one's been initialized so far,
ValidateHacks() assumes they're all valid, and you can end up at the
elog(FATAL) panic at the bottom of CacheIdInvalidate().  I've got a core
dump which contains a ten-deep recursion between inval.c and syscache.c,
culminating in elog(FATAL) because the eleventh incoming sinval message
was just slow enough to let inval.c's first OID variable get filled in
before it arrived.
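
To make the timing concrete, here is a toy model of the re-entrancy, not the actual backend code.  Every name in it (lookup_oid, init_oids, handle_inval, run_startup) and both OID values are made up for illustration.  It assumes only what is described above: a catalog lookup may process an incoming inval message before it returns, and the inval handler lazily initializes its OID variables while trusting the first one as a validity flag for all of them.

```c
#include <assert.h>

#define INVALID_OID 0u

/* Hypothetical stand-ins for inval.c's lazily initialized OID variables. */
static unsigned first_oid, second_oid;

static int lookups;        /* number of catalog lookups performed so far */
static int burst;          /* a message arrives during each of the first `burst` lookups */
static int late_at;        /* one more message arrives during this lookup (0 = none) */
static int depth;          /* current nesting depth of handle_inval() */
static int max_depth;      /* deepest nesting observed */
static int saw_half_init;  /* a handler ran with only first_oid filled in */

static void handle_inval(void);

/* Toy catalog lookup: may process an incoming message before returning. */
static unsigned lookup_oid(unsigned answer)
{
    lookups++;
    if (lookups <= burst || lookups == late_at)
        handle_inval();            /* re-entrancy happens here */
    return answer;
}

/* Lazy initialization, done through the same lookup path (placeholder values). */
static void init_oids(void)
{
    first_oid = lookup_oid(1001u);
    second_oid = lookup_oid(1002u);
}

/* Toy inval handler: checks only the first variable to decide validity. */
static void handle_inval(void)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;
    if (first_oid == INVALID_OID)
        init_oids();               /* recursive initialization */
    else if (second_oid == INVALID_OID)
        saw_half_init = 1;         /* the real code would reach elog(FATAL) here */
    depth--;
}

/* Simulate backend startup; returns 1 if the half-initialized case was hit. */
static int run_startup(int burst_msgs, int late_msg_at)
{
    first_oid = second_oid = INVALID_OID;
    lookups = depth = max_depth = saw_half_init = 0;
    burst = burst_msgs;
    late_at = late_msg_at;
    init_oids();
    assert(first_oid != INVALID_OID && second_oid != INVALID_OID);
    return saw_half_init;
}
```

In this model run_startup(10, 0) merely nests ten handlers deep and recovers, while run_startup(10, 12) delivers an eleventh message just after the innermost first_oid assignment and trips the half-initialized check.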

In short: we don't deal very robustly with cache invals happening
during backend startup.  Send invals at a new backend with just the
right timing, and it'll choke.

I am not sure if this bug is of long standing or if we introduced it
since 6.5.  It's possible I created it while messing with the relcache
stuff a month or two ago.  But I can easily believe that it's been
there a long time and we never had a way of reproducing the problem
with any reliability before.

I think the fix is to rip out inval.c's attempt to look up system
relation names, and just give it hardwired knowledge of their OIDs.
Even though it sort-of works to do the lookups, it's bad practice for
routines that are potentially called during catcache initialization
to depend on the catcache to be already working.  And there are other
places that already have hardwired knowledge of the system relation
OIDs, so...
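
A minimal sketch of the hardwired approach, assuming nothing about the final patch: the macro names, the OID values, and the function name below are placeholders I made up, not the real catalog constants.  The point is only that the check needs no lookup and no initialization state, so it is safe to call at any time, including recursively during startup.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* Placeholder constants; names and values are illustrative only. */
#define HYPO_CLASS_RELATION_OID      1001u
#define HYPO_ATTRIBUTE_RELATION_OID  1002u

/*
 * Decide whether an inval message concerns one of the tracked system
 * relations.  Uses compile-time constants only, so there is nothing to
 * initialize and nothing that can be observed half-initialized.
 */
static bool is_tracked_system_relation(Oid relid)
{
    switch (relid) {
    case HYPO_CLASS_RELATION_OID:
    case HYPO_ATTRIBUTE_RELATION_OID:
        return true;
    default:
        return false;
    }
}
```

The cost is that these constants have to be kept in sync with the catalog definitions by hand, which is the same tradeoff the other hardwired places already accept.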
        regards, tom lane

