error: could not find pg_class tuple for index 2662 - Mailing list pgsql-hackers

From daveg
Subject error: could not find pg_class tuple for index 2662
Msg-id 20110728002853.GA15578@sonic.net
List pgsql-hackers
My client has been seeing regular instances of the following sort of problem:

...
03:06:09.453 exec_simple_query, postgres.c:900
03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
03:06:12.042 INSERT INTO zzz_k(k) SELECT
...
03:06:12.045 00000: statement: ABORT
03:06:12.045 exec_simple_query, postgres.c:900
03:06:12.045 00000: duration: 0.100 ms
03:06:12.045 exec_simple_query, postgres.c:1128
03:06:12.046 00000: statement: INSERT INTO temp_807 VALUES(...)
03:06:12.046 exec_simple_query, postgres.c:900
03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
03:06:12.046 INSERT INTO temp_807
     VALUES (...)
03:06:12.096 08P01: unexpected EOF on client connection
03:06:12.096 SocketBackend, postgres.c:348
03:06:12.096 XX000: could not find pg_class tuple for index 2662
03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
03:06:12.121 log_disconnections, postgres.c:4339
 


The above happens regularly (but not completely predictably), coinciding
with a daily cronjob that checks the catalogs for bloat and does vacuum full
and/or reindex as needed. Since some of the applications make very heavy
use of temp tables, this usually means pg_class and pg_index get a vacuum
full and a reindex.

Sometimes queries fail because they cannot open a table's containing
file. On investigation, the file is absent from both the catalogs and the
filesystem, so I don't know what table it refers to:
20:41:19.063  ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.063  STATEMENT:  insert into r_ar__30
                      select aid, mid, pid, sum(wdata) as wdata,
...
--
20:41:19.430  ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.430  STATEMENT: SELECT nextval('j_id_seq')
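
For anyone trying to trace such a path back to a relation, the sketch below splits it into its components. It assumes the usual on-disk layout for relations in a non-default tablespace (tablespace OID, then a PG_<version>_<catalog version> directory, then database OID, then relfilenode, optionally followed by a segment number). The function name and dictionary keys are made up for illustration, and fork suffixes such as _fsm or _vm are not handled.

```python
import re

# Assumed layout:
#   pg_tblspc/<tablespace OID>/PG_<version>_<catalog version>/<database OID>/<relfilenode>[.<segment>]
TBLSPC_PATH = re.compile(
    r"^pg_tblspc/(\d+)/PG_([\d.]+)_(\d+)/(\d+)/(\d+)(?:\.(\d+))?$"
)


def split_relation_path(path):
    """Return the components of a tablespace-relative file path, or None if it does not match."""
    m = TBLSPC_PATH.match(path)
    if m is None:
        return None
    tablespace, version, catversion, database, relfilenode, segment = m.groups()
    return {
        "tablespace_oid": int(tablespace),
        "version": version,
        "catalog_version": int(catversion),
        "database_oid": int(database),
        "relfilenode": int(relfilenode),
        # Segment 0 has no numeric suffix on disk.
        "segment": int(segment) if segment else 0,
    }


if __name__ == "__main__":
    print(split_relation_path("pg_tblspc/16401/PG_9.0_201008051/16413/1049145092"))
```

The relfilenode component is the value to look for in pg_class.relfilenode, in the database whose OID is the database component.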
 


Finally, I have seen several instances of failure to read data by
vacuum full itself:
03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
03:05:45.699 exec_simple_query, postgres.c:900
03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
03:05:46.142 mdread, md.c:656
03:05:46.142 vacuum full pg_catalog.pg_index;
 

This occurs on PostgreSQL 9.0.4 on 32-core, 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8-core, 64GB, slower CPUs) running 9.0.4,
so it may be timing related.

This seems possibly related to the issues in:
 Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02038.php

 Broken HOT chains in system catalogs
 http://archives.postgresql.org/pgsql-hackers/2011-04/msg00777.php

As far as I can tell from the logs I have, once a session sees one of these
errors, any subsequent query will hit it again until the session exits.
However, it does not seem to harm other sessions or leave any persistent
damage (crossing fingers and hoping here).
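
To check how often the error repeats in a larger log, something like the following sketch could tally the "could not find pg_class tuple" errors per index OID. It assumes each log entry starts with a timestamp followed by a five-character SQLSTATE and a colon, as in the excerpts above. The function and pattern names are illustrative only. Since the excerpts show no session identifier, it counts per index rather than per session.

```python
import re
from collections import Counter

# Assumed entry layout, taken from the excerpts:
#   03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
LOG_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+([0-9A-Z]{5}):\s+(.*)$")
MISSING_TUPLE = re.compile(r"could not find pg_class tuple for index (\d+)")


def tally_missing_index_errors(lines):
    """Count 'could not find pg_class tuple' errors, keyed by index OID."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # source-location lines, statement echoes, elisions
        _timestamp, sqlstate, message = m.groups()
        if sqlstate != "XX000":
            continue  # only internal errors are of interest here
        hit = MISSING_TUPLE.search(message)
        if hit:
            counts[int(hit.group(1))] += 1
    return counts


if __name__ == "__main__":
    import sys
    for oid, n in tally_missing_index_errors(sys.stdin).items():
        print(f"index {oid}: {n} occurrence(s)")
```

Run as a script, it reads log text from standard input and prints one count per index OID.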

I'm ready to do any testing/investigation/instrumented builds etc that may be
helpful in resolving this.

Regards

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

