Re: SP-GiST bug. - Mailing list pgsql-hackers

From Tom Lane
Subject Re: SP-GiST bug.
Date
Msg-id 26213.1402007953@sss.pgh.pa.us
Whole thread Raw
In response to Re: SP-GiST bug.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> I think what we have to do is use a different dummy value for the node
> label of a pushed-down allTheSame tuple than we do for end-of-string.
> This requires widening the node labels from uint8 to (at least) int16.
> However, I think that's essentially free because pass-by-value node
> labels are stored as Datums anyway.  In fact, it might even be upward
> compatible, at least for cases that aren't broken today.

Here's a draft patch along these lines.  AFAICT it is fully upward
compatible with existing indexes, although in cases where the current code
creates a zero node label for strings ending at that level, this code will
choose to push new strings into a new child node with label -1 instead, so
that there's some very small added inefficiency.  I don't think we need to
tell people to reindex if we use this fix, though.  I went with new
dummy labels of -1 and -2 partially so that an existing zero label would
not be a hazard for future insertions, and partly with the thought that if
we ever allow embedded \0 in strings, this coding would be able to handle
it with minimal adjustment.

Note that this fix slightly reduces the maximum safe length of a prefix
string (SPGIST_MAX_PREFIX_LENGTH).  In an existing index, there might be
prefixes longer than that.  The code will still work, but it's
theoretically possible that addition of nodes to such a tuple could result
in "SPGiST inner tuple size exceeds maximum" errors, if you managed to
populate all possible next-byte values at that tuple.  I think the odds of
this being a problem in the field are negligible (and if it did happen,
reindexing would fix the problem).

            regards, tom lane

diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index 5b7a5a0..b278916 100644
*** a/src/backend/access/spgist/spgtextproc.c
--- b/src/backend/access/spgist/spgtextproc.c
***************
*** 3,8 ****
--- 3,34 ----
   * spgtextproc.c
   *      implementation of radix tree (compressed trie) over text
   *
+  * In a text_ops SPGiST index, inner tuples can have a prefix which is the
+  * common prefix of all strings indexed under that tuple.  The node labels
+  * represent the next byte of the string(s) after the prefix.  Assuming we
+  * always use the longest possible prefix, we will get more than one node
+  * label unless the prefix length is restricted by SPGIST_MAX_PREFIX_LENGTH.
+  *
+  * To reconstruct the indexed string for any index entry, concatenate the
+  * inner-tuple prefixes and node labels starting at the root and working
+  * down to the leaf entry, then append the datum in the leaf entry.
+  * (While descending the tree, "level" is the number of bytes reconstructed
+  * so far.)
+  *
+  * However, there are two special cases for node labels: -1 indicates that
+  * there are no more bytes after the prefix-so-far, and -2 indicates that we
+  * had to split an existing prefix-less allTheSame tuple (in such a case we
+  * have to create a node label that doesn't correspond to any string byte).
+  * In either case, the node label does not contribute anything to the
+  * reconstructed string.
+  *
+  * Previously, we used a node label of zero for both special cases, but
+  * this was problematic because one can't tell whether a string ending at
+  * the current level can be pushed down into such a child node.  For
+  * backwards compatibility, we still support such node labels for reading;
+  * but no new entries will ever be pushed down into a zero-labeled child.
+  * No new entries ever get pushed into a -2-labeled child, either.
+  *
   *
   * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
***************
*** 24,51 ****

  /*
   * In the worst case, an inner tuple in a text radix tree could have as many
!  * as 256 nodes (one for each possible byte value).  Each node can take 16
!  * bytes on MAXALIGN=8 machines.  The inner tuple must fit on an index page
!  * of size BLCKSZ.  Rather than assuming we know the exact amount of overhead
   * imposed by page headers, tuple headers, etc, we leave 100 bytes for that
   * (the actual overhead should be no more than 56 bytes at this writing, so
   * there is slop in this number).  So we can safely create prefixes up to
!  * BLCKSZ - 256 * 16 - 100 bytes long.  Unfortunately, because 256 * 16 is
!  * already 4K, there is no safe prefix length when BLCKSZ is less than 8K;
   * it is always possible to get "SPGiST inner tuple size exceeds maximum"
   * if there are too many distinct next-byte values at a given place in the
   * tree.  Since use of nonstandard block sizes appears to be negligible in
   * the field, we just live with that fact for now, choosing a max prefix
   * size of 32 bytes when BLCKSZ is configured smaller than default.
   */
! #define SPGIST_MAX_PREFIX_LENGTH    Max((int) (BLCKSZ - 256 * 16 - 100), 32)

  /* Struct for sorting values in picksplit */
  typedef struct spgNodePtr
  {
      Datum        d;
      int            i;
!     uint8        c;
  } spgNodePtr;


--- 50,78 ----

  /*
   * In the worst case, an inner tuple in a text radix tree could have as many
!  * as 258 nodes (one for each possible byte value, plus the two special
!  * cases).  Each node can take 16 bytes on MAXALIGN=8 machines.
!  * The inner tuple must fit on an index page of size BLCKSZ.
!  * Rather than assuming we know the exact amount of overhead
   * imposed by page headers, tuple headers, etc, we leave 100 bytes for that
   * (the actual overhead should be no more than 56 bytes at this writing, so
   * there is slop in this number).  So we can safely create prefixes up to
!  * BLCKSZ - 258 * 16 - 100 bytes long.  Unfortunately, because 258 * 16 is
!  * over 4K, there is no safe prefix length when BLCKSZ is less than 8K;
   * it is always possible to get "SPGiST inner tuple size exceeds maximum"
   * if there are too many distinct next-byte values at a given place in the
   * tree.  Since use of nonstandard block sizes appears to be negligible in
   * the field, we just live with that fact for now, choosing a max prefix
   * size of 32 bytes when BLCKSZ is configured smaller than default.
   */
! #define SPGIST_MAX_PREFIX_LENGTH    Max((int) (BLCKSZ - 258 * 16 - 100), 32)

  /* Struct for sorting values in picksplit */
  typedef struct spgNodePtr
  {
      Datum        d;
      int            i;
!     int16        c;
  } spgNodePtr;


*************** spg_text_config(PG_FUNCTION_ARGS)
*** 56,62 ****
      spgConfigOut *cfg = (spgConfigOut *) PG_GETARG_POINTER(1);

      cfg->prefixType = TEXTOID;
!     cfg->labelType = CHAROID;
      cfg->canReturnData = true;
      cfg->longValuesOK = true;    /* suffixing will shorten long values */
      PG_RETURN_VOID();
--- 83,89 ----
      spgConfigOut *cfg = (spgConfigOut *) PG_GETARG_POINTER(1);

      cfg->prefixType = TEXTOID;
!     cfg->labelType = INT2OID;
      cfg->canReturnData = true;
      cfg->longValuesOK = true;    /* suffixing will shorten long values */
      PG_RETURN_VOID();
*************** commonPrefix(const char *a, const char *
*** 107,118 ****
  }

  /*
!  * Binary search an array of uint8 datums for a match to c
   *
   * On success, *i gets the match location; on failure, it gets where to insert
   */
  static bool
! searchChar(Datum *nodeLabels, int nNodes, uint8 c, int *i)
  {
      int            StopLow = 0,
                  StopHigh = nNodes;
--- 134,145 ----
  }

  /*
!  * Binary search an array of int16 datums for a match to c
   *
   * On success, *i gets the match location; on failure, it gets where to insert
   */
  static bool
! searchChar(Datum *nodeLabels, int nNodes, int16 c, int *i)
  {
      int            StopLow = 0,
                  StopHigh = nNodes;
*************** searchChar(Datum *nodeLabels, int nNodes
*** 120,126 ****
      while (StopLow < StopHigh)
      {
          int            StopMiddle = (StopLow + StopHigh) >> 1;
!         uint8        middle = DatumGetUInt8(nodeLabels[StopMiddle]);

          if (c < middle)
              StopHigh = StopMiddle;
--- 147,153 ----
      while (StopLow < StopHigh)
      {
          int            StopMiddle = (StopLow + StopHigh) >> 1;
!         int16        middle = DatumGetInt16(nodeLabels[StopMiddle]);

          if (c < middle)
              StopHigh = StopMiddle;
*************** spg_text_choose(PG_FUNCTION_ARGS)
*** 145,160 ****
      text       *inText = DatumGetTextPP(in->datum);
      char       *inStr = VARDATA_ANY(inText);
      int            inSize = VARSIZE_ANY_EXHDR(inText);
!     uint8        nodeChar = '\0';
!     int            i = 0;
      int            commonLen = 0;

      /* Check for prefix match, set nodeChar to first byte after prefix */
      if (in->hasPrefix)
      {
          text       *prefixText = DatumGetTextPP(in->prefixDatum);
!         char       *prefixStr = VARDATA_ANY(prefixText);
!         int            prefixSize = VARSIZE_ANY_EXHDR(prefixText);

          commonLen = commonPrefix(inStr + in->level,
                                   prefixStr,
--- 172,190 ----
      text       *inText = DatumGetTextPP(in->datum);
      char       *inStr = VARDATA_ANY(inText);
      int            inSize = VARSIZE_ANY_EXHDR(inText);
!     char       *prefixStr = NULL;
!     int            prefixSize = 0;
      int            commonLen = 0;
+     int16        nodeChar = 0;
+     int            i = 0;

      /* Check for prefix match, set nodeChar to first byte after prefix */
      if (in->hasPrefix)
      {
          text       *prefixText = DatumGetTextPP(in->prefixDatum);
!
!         prefixStr = VARDATA_ANY(prefixText);
!         prefixSize = VARSIZE_ANY_EXHDR(prefixText);

          commonLen = commonPrefix(inStr + in->level,
                                   prefixStr,
*************** spg_text_choose(PG_FUNCTION_ARGS)
*** 164,172 ****
          if (commonLen == prefixSize)
          {
              if (inSize - in->level > commonLen)
!                 nodeChar = *(uint8 *) (inStr + in->level + commonLen);
              else
!                 nodeChar = '\0';
          }
          else
          {
--- 194,202 ----
          if (commonLen == prefixSize)
          {
              if (inSize - in->level > commonLen)
!                 nodeChar = *(unsigned char *) (inStr + in->level + commonLen);
              else
!                 nodeChar = -1;
          }
          else
          {
*************** spg_text_choose(PG_FUNCTION_ARGS)
*** 184,190 ****
                      formTextDatum(prefixStr, commonLen);
              }
              out->result.splitTuple.nodeLabel =
!                 UInt8GetDatum(*(prefixStr + commonLen));

              if (prefixSize - commonLen == 1)
              {
--- 214,220 ----
                      formTextDatum(prefixStr, commonLen);
              }
              out->result.splitTuple.nodeLabel =
!                 Int16GetDatum(*(unsigned char *) (prefixStr + commonLen));

              if (prefixSize - commonLen == 1)
              {
*************** spg_text_choose(PG_FUNCTION_ARGS)
*** 203,213 ****
      }
      else if (inSize > in->level)
      {
!         nodeChar = *(uint8 *) (inStr + in->level);
      }
      else
      {
!         nodeChar = '\0';
      }

      /* Look up nodeChar in the node label array */
--- 233,243 ----
      }
      else if (inSize > in->level)
      {
!         nodeChar = *(unsigned char *) (inStr + in->level);
      }
      else
      {
!         nodeChar = -1;
      }

      /* Look up nodeChar in the node label array */
*************** spg_text_choose(PG_FUNCTION_ARGS)
*** 233,254 ****
      else if (in->allTheSame)
      {
          /*
!          * Can't use AddNode action, so split the tuple.  The upper tuple has
!          * the same prefix as before and uses an empty node label for the
!          * lower tuple.  The lower tuple has no prefix and the same node
!          * labels as the original tuple.
           */
          out->resultType = spgSplitTuple;
!         out->result.splitTuple.prefixHasPrefix = in->hasPrefix;
!         out->result.splitTuple.prefixPrefixDatum = in->prefixDatum;
!         out->result.splitTuple.nodeLabel = UInt8GetDatum('\0');
          out->result.splitTuple.postfixHasPrefix = false;
      }
      else
      {
          /* Add a node for the not-previously-seen nodeChar value */
          out->resultType = spgAddNode;
!         out->result.addNode.nodeLabel = UInt8GetDatum(nodeChar);
          out->result.addNode.nodeN = i;
      }

--- 263,303 ----
      else if (in->allTheSame)
      {
          /*
!          * Can't use AddNode action, so split the tuple.  There are two cases:
!          * If the existing tuple has a prefix, the new upper tuple has a
!          * prefix one byte shorter (possibly empty) and uses the last byte of
!          * the old prefix as node label for the lower tuple.  If the existing
!          * tuple has no prefix, the new upper tuple doesn't either, and it
!          * uses the dummy node label -2 for the lower tuple.  In either case,
!          * the new lower tuple has no prefix and the same node labels as the
!          * original tuple.
           */
          out->resultType = spgSplitTuple;
!         if (prefixSize > 0)
!         {
!             if (prefixSize > 1)
!             {
!                 out->result.splitTuple.prefixHasPrefix = true;
!                 out->result.splitTuple.prefixPrefixDatum =
!                     formTextDatum(prefixStr, prefixSize - 1);
!             }
!             else
!                 out->result.splitTuple.prefixHasPrefix = false;
!             out->result.splitTuple.nodeLabel =
!                 Int16GetDatum(*(unsigned char *) (prefixStr + prefixSize - 1));
!         }
!         else
!         {
!             out->result.splitTuple.prefixHasPrefix = false;
!             out->result.splitTuple.nodeLabel = Int16GetDatum(-2);
!         }
          out->result.splitTuple.postfixHasPrefix = false;
      }
      else
      {
          /* Add a node for the not-previously-seen nodeChar value */
          out->resultType = spgAddNode;
!         out->result.addNode.nodeLabel = Int16GetDatum(nodeChar);
          out->result.addNode.nodeN = i;
      }

*************** cmpNodePtr(const void *a, const void *b)
*** 262,273 ****
      const spgNodePtr *aa = (const spgNodePtr *) a;
      const spgNodePtr *bb = (const spgNodePtr *) b;

!     if (aa->c == bb->c)
!         return 0;
!     else if (aa->c > bb->c)
!         return 1;
!     else
!         return -1;
  }

  Datum
--- 311,317 ----
      const spgNodePtr *aa = (const spgNodePtr *) a;
      const spgNodePtr *bb = (const spgNodePtr *) b;

!     return aa->c - bb->c;
  }

  Datum
*************** spg_text_picksplit(PG_FUNCTION_ARGS)
*** 319,333 ****
          text       *texti = DatumGetTextPP(in->datums[i]);

          if (commonLen < VARSIZE_ANY_EXHDR(texti))
!             nodes[i].c = *(uint8 *) (VARDATA_ANY(texti) + commonLen);
          else
!             nodes[i].c = '\0';    /* use \0 if string is all common */
          nodes[i].i = i;
          nodes[i].d = in->datums[i];
      }

      /*
!      * Sort by label bytes so that we can group the values into nodes.  This
       * also ensures that the nodes are ordered by label value, allowing the
       * use of binary search in searchChar.
       */
--- 363,377 ----
          text       *texti = DatumGetTextPP(in->datums[i]);

          if (commonLen < VARSIZE_ANY_EXHDR(texti))
!             nodes[i].c = *(unsigned char *) (VARDATA_ANY(texti) + commonLen);
          else
!             nodes[i].c = -1;    /* use -1 if string is all common */
          nodes[i].i = i;
          nodes[i].d = in->datums[i];
      }

      /*
!      * Sort by label values so that we can group the values into nodes.  This
       * also ensures that the nodes are ordered by label value, allowing the
       * use of binary search in searchChar.
       */
*************** spg_text_picksplit(PG_FUNCTION_ARGS)
*** 346,352 ****

          if (i == 0 || nodes[i].c != nodes[i - 1].c)
          {
!             out->nodeLabels[out->nNodes] = UInt8GetDatum(nodes[i].c);
              out->nNodes++;
          }

--- 390,396 ----

          if (i == 0 || nodes[i].c != nodes[i - 1].c)
          {
!             out->nodeLabels[out->nNodes] = Int16GetDatum(nodes[i].c);
              out->nNodes++;
          }

*************** spg_text_inner_consistent(PG_FUNCTION_AR
*** 377,385 ****

      /*
       * Reconstruct values represented at this tuple, including parent data,
!      * prefix of this tuple if any, and the node label if any.  in->level
!      * should be the length of the previously reconstructed value, and the
!      * number of bytes added here is prefixSize or prefixSize + 1.
       *
       * Note: we assume that in->reconstructedValue isn't toasted and doesn't
       * have a short varlena header.  This is okay because it must have been
--- 421,429 ----

      /*
       * Reconstruct values represented at this tuple, including parent data,
!      * prefix of this tuple if any, and the node label if it's non-dummy.
!      * in->level should be the length of the previously reconstructed value,
!      * and the number of bytes added here is prefixSize or prefixSize + 1.
       *
       * Note: we assume that in->reconstructedValue isn't toasted and doesn't
       * have a short varlena header.  This is okay because it must have been
*************** spg_text_inner_consistent(PG_FUNCTION_AR
*** 422,438 ****

      for (i = 0; i < in->nNodes; i++)
      {
!         uint8        nodeChar = DatumGetUInt8(in->nodeLabels[i]);
          int            thisLen;
          bool        res = true;
          int            j;

!         /* If nodeChar is zero, don't include it in data */
!         if (nodeChar == '\0')
              thisLen = maxReconstrLen - 1;
          else
          {
!             ((char *) VARDATA(reconstrText))[maxReconstrLen - 1] = nodeChar;
              thisLen = maxReconstrLen;
          }

--- 466,482 ----

      for (i = 0; i < in->nNodes; i++)
      {
!         int16        nodeChar = DatumGetInt16(in->nodeLabels[i]);
          int            thisLen;
          bool        res = true;
          int            j;

!         /* If nodeChar is a dummy value, don't include it in data */
!         if (nodeChar <= 0)
              thisLen = maxReconstrLen - 1;
          else
          {
!             ((unsigned char *) VARDATA(reconstrText))[maxReconstrLen - 1] = nodeChar;
              thisLen = maxReconstrLen;
          }

*************** spg_text_inner_consistent(PG_FUNCTION_AR
*** 447,453 ****
               * If it's a collation-aware operator, but the collation is C, we
               * can treat it as non-collation-aware.  With non-C collation we
               * need to traverse whole tree :-( so there's no point in making
!              * any check here.
               */
              if (strategy > 10)
              {
--- 491,499 ----
               * If it's a collation-aware operator, but the collation is C, we
               * can treat it as non-collation-aware.  With non-C collation we
               * need to traverse whole tree :-( so there's no point in making
!              * any check here.  (Note also that our reconstructed value may
!              * well end with a partial multibyte character, so that applying
!              * any encoding-sensitive test to it would be risky anyhow.)
               */
              if (strategy > 10)
              {

pgsql-hackers by date:

Previous
From: Linos
Date:
Subject: Re: performance regression in 9.2/9.3
Next
From: Michael Paquier
Date:
Subject: Re: slotname vs slot_name