Re: DISCARD TEMP results in "ERROR: cache lookup failed for type 0" - Mailing list pgsql-bugs

From Tom Lane
Subject Re: DISCARD TEMP results in "ERROR: cache lookup failed for type 0"
Date
Msg-id 4709.1562365346@sss.pgh.pa.us
Whole thread Raw
In response to DISCARD TEMP results in "ERROR: cache lookup failed for type 0"  (Manuel Rigger <rigger.manuel@gmail.com>)
Responses Re: DISCARD TEMP results in "ERROR: cache lookup failed for type 0"
List pgsql-bugs
Manuel Rigger <rigger.manuel@gmail.com> writes:
> CREATE TEMP TABLE t0(c0 INT GENERATED ALWAYS AS IDENTITY) PARTITION BY
> HASH((t0.c0));
> VACUUM FULL;
> DISCARD TEMP; -- unexpected: ERROR: cache lookup failed for type 0

Blech.  The proximate cause is fairly obvious from poking around with a
debugger:

#0  errfinish (dummy=0) at elog.c:414
#1  0x00000000008a2694 in elog_finish (elevel=<value optimized out>,
    fmt=<value optimized out>) at elog.c:1376
#2  0x00000000008898ae in get_typlenbyvalalign (typid=0, typlen=0x2d42118,
    typbyval=0x2d42138, typalign=0x2d42158 "") at lsyscache.c:2057
#3  0x000000000088c2e0 in RelationBuildPartitionKey (relation=0x7f727c583a80)
    at partcache.c:233
    get_typlenbyvalalign(key->parttypid[i], ...);
#4  0x00000000008957a7 in RelationBuildDesc (
    targetRelId=<value optimized out>, insertIt=true) at relcache.c:1189
#5  0x0000000000897ce6 in RelationIdGetRelation (relationId=53072)
    at relcache.c:1956
#6  0x00000000004ca6b0 in relation_open (relationId=53072,
    lockmode=<value optimized out>) at heapam.c:1135
#7  0x0000000000535c89 in heap_drop_with_catalog (relid=53072) at heap.c:1828
#8  0x0000000000533074 in doDeletion (object=0x2d4dfe4, depRel=0x7ffd1d23b3d8,
    flags=<value optimized out>) at dependency.c:1134
#9  deleteOneObject (object=0x2d4dfe4, depRel=0x7ffd1d23b3d8,
    flags=<value optimized out>) at dependency.c:1036
#10 0x000000000053310f in deleteObjectsInList (targetObjects=0x2c367b8,
    depRel=0x7ffd1d23b3d8, flags=29) at dependency.c:260
#11 0x0000000000533328 in performDeletion (object=0x7ffd1d23b410,
    behavior=DROP_CASCADE, flags=29) at dependency.c:341
#12 0x000000000053d283 in RemoveTempRelations () at namespace.c:4140
    ...
(gdb) f 3
(gdb) p *relation->rd_att
$2 = {natts = 1, tdtypeid = 53074, tdtypmod = -1, tdhasoid = false,
  tdrefcount = 1, constr = 0x0, attrs = 0x7f727c6f68e0}
(gdb) p *relation->rd_att->attrs
$3 = {attrelid = 53072, attname = {
    data = "........pg.dropped.1........", '\000' <repeats 35 times>},
  atttypid = 0, attstattarget = 0, attlen = 4, attnum = 1, attndims = 0,
  attcacheoff = 0, atttypmod = -1, attbyval = true, attstorage = 112 'p',
  attalign = 105 'i', attnotnull = false, atthasdef = false,
  atthasmissing = false, attidentity = 97 'a', attisdropped = true,
  attislocal = true, attinhcount = 0, attcollation = 0}

So we're getting the failure because, when we come to delete the whole
temp table, its partitioning column has already been dropped, causing
RelationBuildPartitionKey (and hence any relation-opening operation)
to fail.  That means that order-of-operations in dependency.c is at
the root of the issue.

The relevant pg_depend entries are (your OIDs will vary):

# select objid, pg_describe_object(classid,objid,objsubid) as obj,
 pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype from
 pg_depend where ...;

 objid |        obj         |          ref          | deptype
-------+--------------------+-----------------------+---------
 53071 | type t0_c0_seq     | sequence t0_c0_seq    | i
 53070 | sequence t0_c0_seq | schema pg_temp_3      | n
 53074 | type t0            | table t0              | i
 53073 | type t0[]          | type t0               | i
 53072 | table t0           | schema pg_temp_3      | n
 53070 | sequence t0_c0_seq | column c0 of table t0 | i
(6 rows)

deleteObjectsInList's target list is

(gdb) p *targetObjects
$1 = {refs = 0x2d87450, extras = 0x2d2c8f8, numrefs = 7, maxrefs = 32}
(gdb) p targetObjects->refs[0]
$2 = {classId = 1247, objectId = 53071, objectSubId = 0}
(gdb) p targetObjects->refs[1]
$3 = {classId = 1259, objectId = 53070, objectSubId = 0}
(gdb) p targetObjects->refs[2]
$4 = {classId = 1259, objectId = 53072, objectSubId = 1}
(gdb) p targetObjects->refs[3]
$5 = {classId = 1247, objectId = 53073, objectSubId = 0}
(gdb) p targetObjects->refs[4]
$6 = {classId = 1247, objectId = 53074, objectSubId = 0}
(gdb) p targetObjects->refs[5]
$7 = {classId = 1259, objectId = 53072, objectSubId = 0}
(gdb) p targetObjects->refs[6]
$8 = {classId = 2615, objectId = 16982, objectSubId = 0}

So what's evidently happening is that it lands on the sequence first
and follows that to column c0, making a deletion target entry (refs[2])
for that column, and only afterwards finds the whole table and makes a
deletion target entry for the table (refs[5]).  The behavior is dependent
on the order of entries in pg_depend, so you might or might not see it
(and that's why the VACUUM FULL contributes; it's changing the entry
order).

The example doesn't reproduce in v12/HEAD, but I fear that that is only a
cosmetic side-effect of commit f1ad067fc (Sort the dependent objects
before recursing in findDependentObjects()).  [ pokes at it... ]
Yup, you can break it in HEAD too, if you arrange for the sequence
and the table to have different relative OIDs:

regression=# create temp table t0(c0 int not null) partition by range(c0);
CREATE TABLE
regression=# alter table t0 alter column c0 add generated always as identity;
ALTER TABLE
regression=# select objid, pg_describe_object(classid,objid,objsubid) as obj,
pg_describe_object(refclassid,refobjid,refobjsubid)as ref, deptype from pg_depend where ... 
 objid |        obj         |          ref          | deptype
-------+--------------------+-----------------------+---------
 43530 | type t0            | table t0              | i
 43529 | type t0[]          | type t0               | i
 43528 | table t0           | schema pg_temp_3      | n
 43534 | type t0_c0_seq     | sequence t0_c0_seq    | i
 43533 | sequence t0_c0_seq | schema pg_temp_3      | n
 43533 | sequence t0_c0_seq | column c0 of table t0 | i
(6 rows)

regression=# discard temp;
psql: ERROR:  cache lookup failed for type 0

ISTM the real fix here probably involves having some explicit
dependencies between the table and its partitioning columns,
rather than relying on this check:

regression=# alter table t0 drop column c0;
psql: ERROR:  cannot drop column named in partition key
regression=# \errverbose
psql: error: ERROR:  42P16: cannot drop column named in partition key
LOCATION:  ATExecDropColumn, tablecmds.c:7041

which is evidently not even in the right place to guard this with
any security.  With an explicit dependency, we could force a drop
of a partitioning column to be turned into a drop of the whole table.
Without one, well, you can break it really easily:

regression=# create domain d1 as int;
CREATE DOMAIN
regression=# create table t1 (f1 d1) partition by range(f1);
CREATE TABLE
regression=# drop domain d1 cascade;
psql: NOTICE:  drop cascades to column f1 of table t1
DROP DOMAIN
regression=# \d t1
psql: ERROR:  cache lookup failed for type 0

However, we can only apply a fix like that in HEAD and maybe v12;
it seems too late to be adding such dependencies in v10/v11.
Not sure what to do about those branches.

            regards, tom lane



pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: BUG #15896: pg_upgrade from 10-or-earlier: TRAP: FailedAssertion(»!(metad->btm_version >= 3)«
Next
From: Manuel Rigger
Date:
Subject: Re: VACUUM FULL results in deadlock