Re: error: could not find pg_class tuple for index 2662 - Mailing list pgsql-hackers

From daveg
Subject Re: error: could not find pg_class tuple for index 2662
Msg-id 20110803115731.GA14353@sonic.net
In response to Re: error: could not find pg_class tuple for index 2662  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: error: could not find pg_class tuple for index 2662
List pgsql-hackers
On Mon, Aug 01, 2011 at 01:23:49PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> >> I think we need to start adding some instrumentation so we can get a
> >> better handle on what's going on in your database.  If I were to send
> >> you a source-code patch for the server that adds some more logging
> >> printout when this happens, would you be willing/able to run a patched
> >> build on your machine?
> 
> > Yes we can run an instrumented server so long as the instrumentation does
> > not interfere with normal operation. However, scheduling downtime to switch
> > binaries is difficult, and generally needs to happen on a weekend, but
> > sometimes can be expedited. I'll look into that.
> 
> OK, attached is a patch against 9.0 branch that will re-scan pg_class
> after a failure of this sort occurs, and log what it sees in the tuple
> header fields for each tuple for the target index.  This should give us
> some useful information.  It might be worthwhile for you to also log the
> results of
> 
> select relname,pg_relation_filenode(oid) from pg_class
> where relname like 'pg_class%';
> 
> in your script that does VACUUM FULL, just before and after each time it
> vacuums pg_class.  That will help in interpreting the relfilenodes in
> the log output.

We have installed the patch and have encountered the error as usual.
However, there is no additional output from the patch. I'm speculating
that the pg_class scan in ScanPgRelationDetailed() somehow fails to
return any tuples.


I have also been trying to trace it further by reading the code, but have not
got any solid hypothesis yet. In the absence of any debugging output I've
been trying to deduce the call tree leading to the original failure. So far
it looks like this:

RelationReloadIndexInfo(Relation)   // Relation is 2662 and !rd_isvalid
   pg_class_tuple = ScanPgRelation(2662, indexOK=false) // returns NULL
       pg_class_desc = heap_open(1259, ACC_SHARE)
           r = relation_open(1259, ACC_SHARE) // locks oid, ensures RelationIsValid(r)
               r = RelationIdGetRelation(1259)
                   r = RelationIdCacheLookup(1259)  // assume success
                   if !rd_isvalid:
                       RelationClearRelation(r,true)
                           RelationInitPhysicalAddr(r) // r is pg_class relcache

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

