Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Date
Msg-id 26313.1329351899@sss.pgh.pa.us
Whole thread Raw
In response to Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
[ getting back to this after assorted distractions ]

Andres Freund <andres@anarazel.de> writes:
> On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>>>> Looks pretty bogus to me.  You're essentially assuming that the side of
>>>> the join without statistics is unique, which is a mighty dubious
>>>> assumption.

>>> It sure is a bit dubious. But assuming that a semijoin that has max of n
>>> rows on the inner side results in half of the outer sides rows (>> n) is
>>> pretty bogus as well.

> In the current example we have an estimate for the distinctness of the LHS. I
> don't see how guesstimating the RHS number of tuples in a semijoin to
> vardata2->rel->rows will be worse than just assuming a selectivity of 0.5 for
> the whole thing.

The real problem is that the estimate we want to use involves the ratio
nd2/nd1.  If either of those numbers is mere fantasy, so is the ratio.
It doesn't really help to say "well, we can upper-bound this number
here", because sometimes a too large result is as bad as too small.

>> One thing I've considered but not done anything about is that in a lot
>> of practical cases for this, the aggregation or grouping properties of
>> the sub-select would provide adequate reason for assuming its output is
>> more or less unique, so that taking ndistinct equal to number of rows
>> actually is sane.  But it would need a bit of thought about what
>> properties we want to treat as justifying such an assumption, and then
>> some code to see if the join key is a Var coming out of such a
>> sub-select.  (Actually, what such a patch would probably look like is
>> modifying examine_simple_variable to not just punt when it finds the
>> Var came from an aggregating subquery.)

> Yes, having that would be great, but be a bit more invasive than I like to
> think right now. This thing is actually a problem for people in the field...

I'm not sure it's that bad.  Attached is a simple patch to notice that
the subquery is "SELECT DISTINCT foo" and if so assume the result is
unique.  This takes care of at least the first of your original
examples.  I'm not sure whether fixing this single case is enough to get
excited about, but it's a step forward anyway.

Your second example, involving a WITH, would properly be handled by
teaching examine_simple_variable to drill down into CTEs.  I lacked the
round tuits to make it do so initially, and still do, but if you'd like
to have a go at it ...

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6d78068476e520f7dd2da6c0c8d48d93e0649768..5a4fa77848b5e44abd913e536f0f83cc06bb85fb 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 110,115 ****
--- 110,116 ----
  #include "optimizer/predtest.h"
  #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
+ #include "parser/parse_clause.h"
  #include "parser/parse_coerce.h"
  #include "parser/parsetree.h"
  #include "utils/builtins.h"
*************** examine_simple_variable(PlannerInfo *roo
*** 4357,4375 ****
      {
          /*
           * Plain subquery (not one that was converted to an appendrel).
-          *
-          * Punt if subquery uses set operations, GROUP BY, or DISTINCT --- any
-          * of these will mash underlying columns' stats beyond recognition.
-          * (Set ops are particularly nasty; if we forged ahead, we would
-          * return stats relevant to only the leftmost subselect...)
           */
          Query       *subquery = rte->subquery;
          RelOptInfo *rel;
          TargetEntry *ste;

          if (subquery->setOperations ||
!             subquery->groupClause ||
!             subquery->distinctClause)
              return;

          /*
--- 4358,4378 ----
      {
          /*
           * Plain subquery (not one that was converted to an appendrel).
           */
          Query       *subquery = rte->subquery;
          RelOptInfo *rel;
          TargetEntry *ste;

+         /*
+          * Punt if subquery uses set operations or GROUP BY, as these will
+          * mash underlying columns' stats beyond recognition.  (Set ops are
+          * particularly nasty; if we forged ahead, we would return stats
+          * relevant to only the leftmost subselect...)  DISTINCT is also
+          * problematic, but we check that later because there is a possibility
+          * of learning something even with it.
+          */
          if (subquery->setOperations ||
!             subquery->groupClause)
              return;

          /*
*************** examine_simple_variable(PlannerInfo *roo
*** 4415,4420 ****
--- 4418,4437 ----
                   rte->eref->aliasname, var->varattno);
          var = (Var *) ste->expr;

+         /*
+          * If subquery uses DISTINCT, we can't make use of any stats for the
+          * variable ... but, if it's the only DISTINCT column, we are entitled
+          * to consider it unique.  We do the test this way so that it works
+          * for cases involving DISTINCT ON.
+          */
+         if (subquery->distinctClause)
+         {
+             if (list_length(subquery->distinctClause) == 1 &&
+                 targetIsInSortList(ste, InvalidOid, subquery->distinctClause))
+                 vardata->isunique = true;
+             return;
+         }
+
          /* Can only handle a simple Var of subquery's query level */
          if (var && IsA(var, Var) &&
              var->varlevelsup == 0)

pgsql-bugs by date:

Previous
From: Félix GERZAGUET
Date:
Subject: Re: BUG #6452: psql: can't change client encoding from the command line
Next
From: yannick godbout
Date:
Subject: probleme d'installation