Re: SP-GiST confusion: indexed column's type vs. index column type - Mailing list pgsql-hackers

From Tom Lane
Subject Re: SP-GiST confusion: indexed column's type vs. index column type
Date
Msg-id 3928777.1617405863@sss.pgh.pa.us
Whole thread Raw
In response to Re: SP-GiST confusion: indexed column's type vs. index column type  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: SP-GiST confusion: indexed column's type vs. index column type
List pgsql-hackers
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Apr 2, 2021 at 9:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I propose changing things so that
>> (A) attType really is the input (heap) data type.
>> (B) We enforce that leafType agrees with the opclass opckeytype,
>> ensuring the index tupdesc can be used for leaf tuples.
>> (C) The tupdesc passed back for an index-only scan reports the
>> input (heap) data type.
>> 
>> This might be too much change for the back branches.  Given the
>> lack of complaints to date, I think we can just fix it in HEAD.

> +1 to fixing it on HEAD only.

Here's a draft patch for that, in case anyone wants to look it
over.

The confusion went even deeper than I thought, as some of the code
mistakenly thought that reconstructed "leafValue" values were of the
leaf data type rather than the input attribute type.  (Which is not
too surprising, given that that's such a misleading name, but the
docs are clear and correct on the point.)

Also, both the code and docs thought that the "reconstructedValue"
datums that are passed down the tree during a search should be of
the leaf data type.  This seems to me to be arrant nonsense.
As an example, if you made an opclass that indexes 1-D arrays
by labeling each inner node with successive array elements,
right down to the leaf which is the last array element, it will
absolutely not work for the reconstructedValues to be of the
leaf type --- they have to be of the array type.  (As I said
in commit 1ebdec8c0, this'd be a fairly poorly-chosen opclass
design, but it seems like it ought to physically work.)

Given the amount of confusion here, I don't have a lot of confidence
that an opclass that wants to reconstruct values while having
leafType different from input type will work even with this patch.
I'm strongly tempted to make a src/test/modules module that
implements exactly the silly design given above, just so we have
some coverage for this scenario.

            regards, tom lane

diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index ea88ae45e5..ba66deb003 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -330,19 +330,25 @@ typedef struct spgConfigOut
      </para>

      <para>
-      <structfield>leafType</structfield> is typically the same as
-      <structfield>attType</structfield>.  For the reasons of backward
-      compatibility, method <function>config</function> can
-      leave <structfield>leafType</structfield> uninitialized; that would
-      give the same effect as setting <structfield>leafType</structfield> equal
-      to <structfield>attType</structfield>.  When <structfield>attType</structfield>
-      and <structfield>leafType</structfield> are different, then optional
+      <structfield>leafType</structfield> must match the index storage type
+      defined by the operator class's <structfield>opckeytype</structfield>
+      catalog entry.  For reasons of backward compatibility,
+      method <function>config</function> can
+      leave <structfield>leafType</structfield> uninitialized (zero);
+      that is interpreted as meaning <structfield>opckeytype</structfield>.
+      (Recall that <structfield>opckeytype</structfield> can in turn be zero,
+      implying the storage type is the same as the operator class's input
+      type.)  In what follows, <structfield>leafType</structfield> should be
+      understood as referring to the fully resolved leaf data type.
+     </para>
+
+     <para>
+      When <structfield>attType</structfield>
+      and <structfield>leafType</structfield> are different, the optional
       method <function>compress</function> must be provided.
       Method <function>compress</function> is responsible
       for transformation of datums to be indexed from <structfield>attType</structfield>
       to <structfield>leafType</structfield>.
-      Note: both consistent functions will get <structfield>scankeys</structfield>
-      unchanged, without transformation using <function>compress</function>.
      </para>
      </listitem>
     </varlistentry>
@@ -677,8 +683,7 @@ typedef struct spgInnerConsistentOut
        <structfield>reconstructedValue</structfield> is the value reconstructed for the
        parent tuple; it is <literal>(Datum) 0</literal> at the root level or if the
        <function>inner_consistent</function> function did not provide a value at the
-       parent level. <structfield>reconstructedValue</structfield> is always of
-       <structname>spgConfigOut</structname>.<structfield>leafType</structfield> type.
+       parent level.
        <structfield>traversalValue</structfield> is a pointer to any traverse data
        passed down from the previous call of <function>inner_consistent</function>
        on the parent index tuple, or NULL at the root level.
@@ -713,9 +718,14 @@ typedef struct spgInnerConsistentOut
        necessarily so, so an array is used.)
        If value reconstruction is needed, set
        <structfield>reconstructedValues</structfield> to an array of the values
-       of <structname>spgConfigOut</structname>.<structfield>leafType</structfield> type
        reconstructed for each child node to be visited; otherwise, leave
        <structfield>reconstructedValues</structfield> as NULL.
+       The reconstructed values are assumed to be of the indexed column's
+       type, that is <structname>spgConfigIn</structname>.<structfield>attType</structfield>.
+       (However, since the core system will do nothing with them except
+       possibly copy them, it is sufficient for them to have the
+       same <literal>typlen</literal> and <literal>typbyval</literal>
+       properties as <structfield>attType</structfield>.)
        If ordered search is performed, set <structfield>distances</structfield>
        to an array of distance values according to <structfield>orderbys</structfield>
        array (nodes with lowest distances will be processed first).  Leave it
@@ -797,8 +807,7 @@ typedef struct spgLeafConsistentOut
        <structfield>reconstructedValue</structfield> is the value reconstructed for the
        parent tuple; it is <literal>(Datum) 0</literal> at the root level or if the
        <function>inner_consistent</function> function did not provide a value at the
-       parent level. <structfield>reconstructedValue</structfield> is always of
-       <structname>spgConfigOut</structname>.<structfield>leafType</structfield> type.
+       parent level.
        <structfield>traversalValue</structfield> is a pointer to any traverse data
        passed down from the previous call of <function>inner_consistent</function>
        on the parent index tuple, or NULL at the root level.
@@ -816,8 +825,8 @@ typedef struct spgLeafConsistentOut
        The function must return <literal>true</literal> if the leaf tuple matches the
        query, or <literal>false</literal> if not.  In the <literal>true</literal> case,
        if <structfield>returnData</structfield> is <literal>true</literal> then
-       <structfield>leafValue</structfield> must be set to the value of
-       <structname>spgConfigIn</structname>.<structfield>attType</structfield> type
+       <structfield>leafValue</structfield> must be set to the value (of type
+       <structname>spgConfigIn</structname>.<structfield>attType</structfield>)
        originally supplied to be indexed for this leaf tuple.  Also,
        <structfield>recheck</structfield> may be set to <literal>true</literal> if the match
        is uncertain and so the operator(s) must be re-applied to the actual
@@ -834,7 +843,7 @@ typedef struct spgLeafConsistentOut
    </variablelist>

  <para>
-  The optional user-defined method are:
+  The optional user-defined methods are:
  </para>

  <variablelist>
@@ -842,15 +851,22 @@ typedef struct spgLeafConsistentOut
      <term><function>Datum compress(Datum in)</function></term>
      <listitem>
       <para>
-       Converts the data item into a format suitable for physical storage in
-       a leaf tuple of index page.  It accepts
+       Converts a data item into a format suitable for physical storage in
+       a leaf tuple of the index.  It accepts a value of type
        <structname>spgConfigIn</structname>.<structfield>attType</structfield>
-       value and returns
-       <structname>spgConfigOut</structname>.<structfield>leafType</structfield>
-       value.  Output value should not be toasted.
+       and returns a value of type
+       <structname>spgConfigOut</structname>.<structfield>leafType</structfield>.
+       The output value should not be toasted.
+      </para>
+
+      <para>
+       Note: the <function>compress</function> method is only applied to
+       values to be stored.  The consistent methods receive query scankeys
+       unchanged, without transformation using <function>compress</function>.
       </para>
      </listitem>
     </varlistentry>
+
     <varlistentry>
      <term><function>options</function></term>
      <listitem>
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 20e67c3f7d..0c30d6d0e8 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -81,7 +81,7 @@ pairingheap_SpGistSearchItem_cmp(const pairingheap_node *a,
 static void
 spgFreeSearchItem(SpGistScanOpaque so, SpGistSearchItem *item)
 {
-    if (!so->state.attLeafType.attbyval &&
+    if (!so->state.attType.attbyval &&
         DatumGetPointer(item->value) != NULL)
         pfree(DatumGetPointer(item->value));

@@ -296,6 +296,7 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
 {
     IndexScanDesc scan;
     SpGistScanOpaque so;
+    TupleDesc    outTupDesc;
     int            i;

     scan = RelationGetIndexScan(rel, keysz, orderbysz);
@@ -314,8 +315,21 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
                                              "SP-GiST traversal-value context",
                                              ALLOCSET_DEFAULT_SIZES);

-    /* Set up indexTupDesc and xs_hitupdesc in case it's an index-only scan */
-    so->indexTupDesc = scan->xs_hitupdesc = RelationGetDescr(rel);
+    /*
+     * Set up indexTupDesc and xs_hitupdesc in case it's an index-only scan.
+     * (It's rather annoying to do this work when it might be wasted, but for
+     * most opclasses we can re-use the index reldesc instead of making one.)
+     */
+    if (so->state.attType.type ==
+        TupleDescAttr(RelationGetDescr(rel), 0)->atttypid)
+        outTupDesc = RelationGetDescr(rel);
+    else
+    {
+        outTupDesc = CreateTemplateTupleDesc(1);
+        TupleDescInitEntry(outTupDesc, 1, NULL,
+                           so->state.attType.type, -1, 0);
+    }
+    so->indexTupDesc = scan->xs_hitupdesc = outTupDesc;

     /* Allocate various arrays needed for order-by scans */
     if (scan->numberOfOrderBys > 0)
@@ -447,9 +461,10 @@ spgNewHeapItem(SpGistScanOpaque so, int level, ItemPointer heapPtr,
     item->level = level;
     item->heapPtr = *heapPtr;
     /* copy value to queue cxt out of tmp cxt */
+    /* caution: "leafValue" is of type attType not leafType */
     item->value = isnull ? (Datum) 0 :
-        datumCopy(leafValue, so->state.attLeafType.attbyval,
-                  so->state.attLeafType.attlen);
+        datumCopy(leafValue, so->state.attType.attbyval,
+                  so->state.attType.attlen);
     item->traversalValue = NULL;
     item->isLeaf = true;
     item->recheck = recheck;
@@ -589,10 +604,11 @@ spgMakeInnerItem(SpGistScanOpaque so,
         : parentItem->level;

     /* Must copy value out of temp context */
+    /* (we assume reconstructedValues are of type attType) */
     item->value = out->reconstructedValues
         ? datumCopy(out->reconstructedValues[i],
-                    so->state.attLeafType.attbyval,
-                    so->state.attLeafType.attlen)
+                    so->state.attType.attbyval,
+                    so->state.attType.attlen)
         : (Datum) 0;

     /*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 949352ada7..084a3f78fc 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -23,6 +23,7 @@
 #include "access/xact.h"
 #include "catalog/pg_amop.h"
 #include "commands/vacuum.h"
+#include "nodes/nodeFuncs.h"
 #include "storage/bufmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
@@ -89,6 +90,66 @@ spghandler(PG_FUNCTION_ARGS)
     PG_RETURN_POINTER(amroutine);
 }

+/*
+ * GetIndexInputType
+ *        Determine the nominal input data type for an index column
+ *
+ * We define the "nominal" input type as the associated opclass's opcintype,
+ * or if that is a polymorphic type, the base type of the heap column or
+ * expression that is the index's input.  The reason for preferring the
+ * opcintype is that non-polymorphic opclasses probably don't want to hear
+ * about binary-compatible input types.  For instance, if a text opclass
+ * is being used with a varchar heap column, we want to report "text" not
+ * "varchar".  Likewise, opclasses don't want to hear about domain types,
+ * so if we do consult the actual input type, we make sure to flatten domains.
+ *
+ * At some point maybe this should go somewhere else, but it's not clear
+ * if any other index AMs have a use for it.
+ */
+static Oid
+GetIndexInputType(Relation index, AttrNumber indexcol)
+{
+    Oid            opcintype;
+    AttrNumber    heapcol;
+    List       *indexprs;
+    ListCell   *indexpr_item;
+
+    Assert(index->rd_index != NULL);
+    Assert(indexcol > 0 && indexcol <= index->rd_index->indnkeyatts);
+    opcintype = index->rd_opcintype[indexcol - 1];
+    if (!IsPolymorphicType(opcintype))
+        return opcintype;
+    heapcol = index->rd_index->indkey.values[indexcol - 1];
+    if (heapcol != 0)            /* Simple index column? */
+        return getBaseType(get_atttype(index->rd_index->indrelid, heapcol));
+
+    /*
+     * If the index expressions are already cached, skip calling
+     * RelationGetIndexExpressions, as it will make a copy which is overkill.
+     * We're not going to modify the trees, and we're not going to do anything
+     * that would invalidate the relcache entry before we're done.
+     */
+    if (index->rd_indexprs)
+        indexprs = index->rd_indexprs;
+    else
+        indexprs = RelationGetIndexExpressions(index);
+    indexpr_item = list_head(indexprs);
+    for (int i = 1; i <= index->rd_index->indnkeyatts; i++)
+    {
+        if (index->rd_index->indkey.values[i - 1] == 0)
+        {
+            /* expression column */
+            if (indexpr_item == NULL)
+                elog(ERROR, "wrong number of index expressions");
+            if (i == indexcol)
+                return getBaseType(exprType((Node *) lfirst(indexpr_item)));
+            indexpr_item = lnext(indexprs, indexpr_item);
+        }
+    }
+    elog(ERROR, "wrong number of index expressions");
+    return InvalidOid;            /* keep compiler quiet */
+}
+
 /* Fill in a SpGistTypeDesc struct with info about the specified data type */
 static void
 fillTypeDesc(SpGistTypeDesc *desc, Oid type)
@@ -109,6 +170,7 @@ spgGetCache(Relation index)
     if (index->rd_amcache == NULL)
     {
         Oid            atttype;
+        Oid            indexatttype;
         spgConfigIn in;
         FmgrInfo   *procinfo;
         Buffer        metabuffer;
@@ -121,11 +183,11 @@ spgGetCache(Relation index)
         Assert(index->rd_att->natts == 1);

         /*
-         * Get the actual data type of the indexed column from the index
-         * tupdesc.  We pass this to the opclass config function so that
+         * Get the actual (well, nominal) data type of the column being
+         * indexed.  We pass this to the opclass config function so that
          * polymorphic opclasses are possible.
          */
-        atttype = TupleDescAttr(index->rd_att, 0)->atttypid;
+        atttype = GetIndexInputType(index, 1);

         /* Call the config function to get config info for the opclass */
         in.attType = atttype;
@@ -136,18 +198,30 @@ spgGetCache(Relation index)
                           PointerGetDatum(&in),
                           PointerGetDatum(&cache->config));

+        /*
+         * The leafType had better agree with the actual index column type,
+         * which index.c will have derived from the opclass's opcintype.
+         */
+        indexatttype = TupleDescAttr(RelationGetDescr(index), 0)->atttypid;
+        if (OidIsValid(cache->config.leafType) &&
+            cache->config.leafType != indexatttype)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("SP-GiST leaf data type %s does not match declared type %s",
+                            format_type_be(cache->config.leafType),
+                            format_type_be(indexatttype))));
+
         /* Get the information we need about each relevant datatype */
         fillTypeDesc(&cache->attType, atttype);

-        if (OidIsValid(cache->config.leafType) &&
-            cache->config.leafType != atttype)
+        if (indexatttype != atttype)
         {
             if (!OidIsValid(index_getprocid(index, 1, SPGIST_COMPRESS_PROC)))
                 ereport(ERROR,
                         (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                          errmsg("compress method must be defined when leaf type is different from input type")));

-            fillTypeDesc(&cache->attLeafType, cache->config.leafType);
+            fillTypeDesc(&cache->attLeafType, indexatttype);
         }
         else
         {
diff --git a/src/backend/access/spgist/spgvalidate.c b/src/backend/access/spgist/spgvalidate.c
index 8bc3889a4d..b2f1ffa30f 100644
--- a/src/backend/access/spgist/spgvalidate.c
+++ b/src/backend/access/spgist/spgvalidate.c
@@ -43,6 +43,7 @@ spgvalidate(Oid opclassoid)
     Form_pg_opclass classform;
     Oid            opfamilyoid;
     Oid            opcintype;
+    Oid            opckeytype;
     char       *opclassname;
     HeapTuple    familytup;
     Form_pg_opfamily familyform;
@@ -57,6 +58,7 @@ spgvalidate(Oid opclassoid)
     spgConfigOut configOut;
     Oid            configOutLefttype = InvalidOid;
     Oid            configOutRighttype = InvalidOid;
+    Oid            configOutLeafType = InvalidOid;

     /* Fetch opclass information */
     classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
@@ -66,6 +68,7 @@ spgvalidate(Oid opclassoid)

     opfamilyoid = classform->opcfamily;
     opcintype = classform->opcintype;
+    opckeytype = classform->opckeytype;
     opclassname = NameStr(classform->opcname);

     /* Fetch opfamily information */
@@ -118,13 +121,20 @@ spgvalidate(Oid opclassoid)
                 configOutLefttype = procform->amproclefttype;
                 configOutRighttype = procform->amprocrighttype;

+                /* Identify actual leaf datum type */
+                if (OidIsValid(configOut.leafType))
+                    configOutLeafType = configOut.leafType;
+                else if (OidIsValid(opckeytype))
+                    configOutLeafType = opckeytype;
+                else
+                    configOutLeafType = procform->amproclefttype;
+
                 /*
                  * When leaf and attribute types are the same, compress
                  * function is not required and we set corresponding bit in
                  * functionset for later group consistency check.
                  */
-                if (!OidIsValid(configOut.leafType) ||
-                    configOut.leafType == configIn.attType)
+                if (configOutLeafType == configIn.attType)
                 {
                     foreach(lc, grouplist)
                     {
@@ -156,7 +166,7 @@ spgvalidate(Oid opclassoid)
                     ok = false;
                 else
                     ok = check_amproc_signature(procform->amproc,
-                                                configOut.leafType, true,
+                                                configOutLeafType, true,
                                                 1, 1, procform->amproclefttype);
                 break;
             case SPGIST_OPTIONS_PROC:
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index b6f95421d8..eda93e6d01 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -151,8 +151,8 @@ typedef struct SpGistState
 typedef struct SpGistSearchItem
 {
     pairingheap_node phNode;    /* pairing heap node */
-    Datum        value;            /* value reconstructed from parent or
-                                 * leafValue if heaptuple */
+    Datum        value;            /* value reconstructed from parent, or
+                                 * leafValue at leaf level */
     void       *traversalValue; /* opclass-specific traverse value */
     int            level;            /* level of items on this page */
     ItemPointerData heapPtr;    /* heap info, if heap tuple */
@@ -208,7 +208,7 @@ typedef struct SpGistScanOpaqueData

     /* These fields are only used in amgettuple scans: */
     bool        want_itup;        /* are we reconstructing tuples? */
-    TupleDesc    indexTupDesc;    /* if so, tuple descriptor for them */
+    TupleDesc    indexTupDesc;    /* if so, descriptor for reconstructed tuples */
     int            nPtrs;            /* number of TIDs found on current page */
     int            iPtr;            /* index for scanning through same */
     ItemPointerData heapPtrs[MaxIndexTuplesPerPage];    /* TIDs from cur page */

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: libpq debug log
Next
From: Bruce Momjian
Date:
Subject: Re: Have I found an interval arithmetic bug?