Re: limit in subquery causes poor selectivity estimation - Mailing list pgsql-hackers

From Tom Lane
Subject Re: limit in subquery causes poor selectivity estimation
Date
Msg-id 19020.1314981919@sss.pgh.pa.us
Whole thread Raw
In response to Re: limit in subquery causes poor selectivity estimation  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: limit in subquery causes poor selectivity estimation
Re: limit in subquery causes poor selectivity estimation
List pgsql-hackers
Peter Eisentraut <peter_e@gmx.net> writes:
> On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?

> It turns out that this is a regression introduced in 8.4.8;

Well, the fact that examine_variable punts on subqueries is certainly
not a "regression introduced in 8.4.8"; it's always been like that.

I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides.  Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.

I'm not sure what we could do instead.  Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around.  It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.

On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch.  I have a draft patch that I've been playing with
(attached).  The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT.  ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values).  But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do.  Comments?

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ef29374fcccae23cb663c04470f12c22321a0e2c..a703f4a910c0e5f521f09cf9564a05c73cf803b8 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 95,100 ****
--- 95,101 ----
  #include "access/sysattr.h"
  #include "catalog/index.h"
  #include "catalog/pg_collation.h"
+ #include "catalog/pg_inherits_fn.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_statistic.h"
  #include "catalog/pg_type.h"
*************** static double convert_one_bytea_to_scala
*** 168,173 ****
--- 169,176 ----
                              int rangelo, int rangehi);
  static char *convert_string_datum(Datum value, Oid typid);
  static double convert_timevalue_to_scalar(Datum value, Oid typid);
+ static void examine_variable_recurse(Query *subquery, AttrNumber resno,
+                          VariableStatData *vardata);
  static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
                     Oid sortop, Datum *min, Datum *max);
  static bool get_actual_variable_range(PlannerInfo *root,
*************** examine_variable(PlannerInfo *root, Node
*** 4176,4196 ****
          }
          else if (rte->rtekind == RTE_RELATION)
          {
              vardata->statsTuple = SearchSysCache3(STATRELATTINH,
                                                  ObjectIdGetDatum(rte->relid),
                                                  Int16GetDatum(var->varattno),
                                                    BoolGetDatum(rte->inh));
              vardata->freefunc = ReleaseSysCache;
          }
          else
          {
              /*
!              * XXX This means the Var comes from a JOIN or sub-SELECT. Later
!              * add code to dig down into the join etc and see if we can trace
!              * the variable to something with stats.  (But beware of
!              * sub-SELECTs with DISTINCT/GROUP BY/etc.    Perhaps there are no
!              * cases where this would really be useful, because we'd have
!              * flattened the subselect if it is??)
               */
          }

--- 4179,4205 ----
          }
          else if (rte->rtekind == RTE_RELATION)
          {
+             /* plain table, so look up the column in pg_statistic */
              vardata->statsTuple = SearchSysCache3(STATRELATTINH,
                                                  ObjectIdGetDatum(rte->relid),
                                                  Int16GetDatum(var->varattno),
                                                    BoolGetDatum(rte->inh));
              vardata->freefunc = ReleaseSysCache;
          }
+         else if (rte->rtekind == RTE_SUBQUERY)
+         {
+             /* recurse into subquery */
+             examine_variable_recurse(rte->subquery, var->varattno,
+                                      vardata);
+         }
          else
          {
              /*
!              * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE.
!              * (We won't see RTE_JOIN here because join alias Vars have
!              * already been flattened.)  There's not much we can do with
!              * function outputs, but maybe someday try to be smarter about
!              * VALUES and/or CTEs.
               */
          }

*************** examine_variable(PlannerInfo *root, Node
*** 4335,4340 ****
--- 4344,4435 ----
  }

  /*
+  * examine_variable_recurse
+  *        Handle a Var referring to a subquery result for examine_variable
+  *
+  * If the Var ultimately refers to a column of a table, we can fill in the
+  * statsTuple; none of the other fields of vardata need to be changed.
+  * (In particular, we don't try to set isunique, since even if the underlying
+  * column is unique, the subquery may have joined to other tables in a way
+  * that creates duplicates.)
+  *
+  * XXX For the moment, we ignore the possibility that the subquery may change
+  * the column's statistics so much that we'd be better off not consulting
+  * pg_statistic at all.  Possibly we should give up if the subquery uses
+  * GROUP BY or DISTINCT, for instance.
+  *
+  * The subquery we are looking at has not been through planner preprocessing,
+  * so some things have to be done differently here than in examine_variable.
+  * (Perhaps we should rearrange things so that the sub-root data structure
+  * is kept around after we plan a subquery?)
+  */
+ static void
+ examine_variable_recurse(Query *subquery, AttrNumber resno,
+                          VariableStatData *vardata)
+ {
+     TargetEntry *ste;
+     Var           *var;
+     RangeTblEntry *rte;
+
+     /* Get the subquery output expression referenced by the upper Var */
+     ste = get_tle_by_resno(subquery->targetList, resno);
+     if (ste == NULL || ste->resjunk)
+         elog(ERROR, "subquery does not have attribute %d", resno);
+     var = (Var *) ste->expr;
+
+ restart:
+     /* Can only handle a simple Var of subquery's query level */
+     if (var && IsA(var, Var) &&
+         var->varlevelsup == 0)
+     {
+         /* OK, fetch the RTE referenced by the Var */
+         rte = rt_fetch(var->varno, subquery->rtable);
+
+         /* XXX Can't call get_relation_stats_hook for lack of subroot */
+
+         if (rte->rtekind == RTE_RELATION)
+         {
+             /*
+              * We can't just rely on rte->inh here, because in parser output
+              * that only signifies the absence of ONLY.  We have to check
+              * relhassubclass as well.  For speed, we don't actually examine
+              * pg_inherits, so we might get fooled by a table that once had
+              * children but no longer does.  But that's okay, because ANALYZE
+              * will still produce inheritance-tree statistics in such cases,
+              * so we will still find an applicable pg_statistic entry.
+              */
+             bool    inh;
+
+             inh = rte->inh && has_subclass(rte->relid);
+             vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+                                                   ObjectIdGetDatum(rte->relid),
+                                                   Int16GetDatum(var->varattno),
+                                                   BoolGetDatum(inh));
+             vardata->freefunc = ReleaseSysCache;
+         }
+         else if (rte->rtekind == RTE_SUBQUERY)
+         {
+             /* recurse some more ... */
+             examine_variable_recurse(rte->subquery, var->varattno,
+                                      vardata);
+         }
+         else if (rte->rtekind == RTE_JOIN)
+         {
+             /*
+              * Since join alias variables haven't been flattened in the
+              * subquery, we have to be prepared to dereference them.
+              */
+             if (var->varattno == InvalidAttrNumber)
+                 return;            /* punt on whole-row reference */
+             Assert(var->varattno > 0);
+             var = (Var *) list_nth(rte->joinaliasvars, var->varattno - 1);
+             goto restart;
+         }
+         /* As in examine_variable, do nothing for other RTE kinds */
+     }
+ }
+
+ /*
   * get_variable_numdistinct
   *      Estimate the number of distinct values of a variable.
   *

pgsql-hackers by date:

Previous
From: Kohei Kaigai
Date:
Subject: Re: [v9.1] sepgsql - userspace access vector cache
Next
From: Alvaro Herrera
Date:
Subject: Re: [COMMITTERS] pgsql: Remove "fmgr.h" include in cube contrib --- caused crash on a Ge