Re: distinct estimate of a hard-coded VALUES list - Mailing list pgsql-hackers

From Tom Lane
Subject Re: distinct estimate of a hard-coded VALUES list
Date
Msg-id 10078.1471955305@sss.pgh.pa.us
Whole thread Raw
In response to Re: distinct estimate of a hard-coded VALUES list  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] distinct estimate of a hard-coded VALUES list  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 08/22/2016 07:42 PM, Alvaro Herrera wrote:
>> Also, if we patch it this way and somebody has a slow query because of a
>> lot of duplicate values, it's easy to solve the problem by
>> de-duplicating.  But with the current code, people that have the
>> opposite problem has no way to work around it.

> I certainly agree it's better when a smart user can fix his query plan
> by deduplicating the values than when we end up generating a poor plan
> due to assuming some users are somewhat dumb.

Well, that seems to be the majority opinion, so attached is a patch
to do it.  Do we want to sneak this into 9.6, or wait?

> I wonder how expensive would it be to actually count the number of
> distinct values - there certainly are complex data types where the
> comparisons are fairly expensive, but I would not expect those to be
> used in explicit VALUES lists.

You'd have to sort the values before de-duping, and deal with VALUES
expressions that aren't simple literals.  And you'd have to do it a
lot --- by my count, get_variable_numdistinct is invoked six times
on the Var representing the VALUES output in Jeff's example query.
Maybe you could cache the results somewhere, but that adds even
more complication.  Given the lack of prior complaints about this,
it's hard to believe it's worth the trouble.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 56943f2..43e6220 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** get_variable_numdistinct(VariableStatDat
*** 4768,4773 ****
--- 4768,4784 ----
           */
          stadistinct = 2.0;
      }
+     else if (vardata->rel && vardata->rel->rtekind == RTE_VALUES)
+     {
+         /*
+          * If the Var represents a column of a VALUES RTE, assume it's unique.
+          * This could of course be very wrong, but it should tend to be true
+          * in well-written queries.  We could consider examining the RTE
+          * contents to get some real statistics; but that only works if the
+          * expressions are constants, and it would be pretty expensive anyway.
+          */
+         stadistinct = -1.0;        /* unique (and all non null) */
+     }
      else
      {
          /*
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index fcfb0d4..25715c9 100644
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct PlannerInfo
*** 399,405 ****
   *
   *        relid - RTE index (this is redundant with the relids field, but
   *                is provided for convenience of access)
!  *        rtekind - distinguishes plain relation, subquery, or function RTE
   *        min_attr, max_attr - range of valid AttrNumbers for rel
   *        attr_needed - array of bitmapsets indicating the highest joinrel
   *                in which each attribute is needed; if bit 0 is set then
--- 399,405 ----
   *
   *        relid - RTE index (this is redundant with the relids field, but
   *                is provided for convenience of access)
!  *        rtekind - copy of RTE's rtekind field
   *        min_attr, max_attr - range of valid AttrNumbers for rel
   *        attr_needed - array of bitmapsets indicating the highest joinrel
   *                in which each attribute is needed; if bit 0 is set then
*************** typedef struct RelOptInfo
*** 512,518 ****
      /* information about a base rel (not set for join rels!) */
      Index        relid;
      Oid            reltablespace;    /* containing tablespace */
!     RTEKind        rtekind;        /* RELATION, SUBQUERY, or FUNCTION */
      AttrNumber    min_attr;        /* smallest attrno of rel (often <0) */
      AttrNumber    max_attr;        /* largest attrno of rel */
      Relids       *attr_needed;    /* array indexed [min_attr .. max_attr] */
--- 512,518 ----
      /* information about a base rel (not set for join rels!) */
      Index        relid;
      Oid            reltablespace;    /* containing tablespace */
!     RTEKind        rtekind;        /* RELATION, SUBQUERY, FUNCTION, etc */
      AttrNumber    min_attr;        /* smallest attrno of rel (often <0) */
      AttrNumber    max_attr;        /* largest attrno of rel */
      Relids       *attr_needed;    /* array indexed [min_attr .. max_attr] */

pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: Logical decoding of sequence advances, part II
Next
From: Tom Lane
Date:
Subject: Re: Block level parallel vacuum WIP