Thread: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund

Hi all,

On Monday, RhodiumToad/Andrew Gierth and I tried to help raptelan (CCed)
debug some rather strange plans he was getting. After trawling through some
unrelated stuff we diagnosed that the problem was some rather strange
estimates.

I managed to extract a reproducible, simple testcase. Our analysis was that a
recently backported change causes the strange estimates:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f5d2fe3029b181fe773a02f1d4b34624c357634

The problem lies in eqjoinsel_semi's behaviour if it doesn't find MCVs and
cannot rely on the ndistinct estimate of a lower node. If either side of a
semijoin doesn't have a sensible estimate, it just assumes a selectivity of
0.5, which will often overestimate.
That change is pretty bad because - as seen in the example below - it leads to
absurd rowcounts.
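
For reference, the no-MCV path in question boils down to roughly the
following (a paraphrase, not the verbatim source; the standalone helper and
its name are just for illustration):

#include <stdbool.h>

/*
 * Sketch (not verbatim source) of eqjoinsel_semi's no-MCV path.  nd1/nd2
 * are the ndistinct estimates for the outer/inner join keys;
 * isdefault1/isdefault2 flag whether those are real estimates or merely
 * the hard-wired default.
 */
static double
semi_join_selectivity(double nd1, bool isdefault1,
                      double nd2, bool isdefault2,
                      double nullfrac1)
{
    double      selec;

    if (!isdefault1 && !isdefault2)
    {
        /* Trustworthy estimates: roughly nd2/nd1 of the non-null outer
         * values should find a match, capped at all of them. */
        if (nd1 <= nd2 || nd2 < 0)
            selec = 1.0 - nullfrac1;
        else
            selec = (nd2 / nd1) * (1.0 - nullfrac1);
    }
    else
    {
        /* No reliable ndistinct: blindly assume that half of the
         * non-null outer rows have a match. */
        selec = 0.5 * (1.0 - nullfrac1);
    }
    return selec;
}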

The approach I quickly tried was to use the underlying relation's row count
as a substitute ndistinct estimate. For those examples that seems to work
rather well. Very likely it's sensible, though, to check whether those values
actually make sense before really using them ;)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index da638f8..7117978 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2471,15 +2471,22 @@ eqjoinsel_semi(Oid operator,
                 */
                double          nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

-               if (!isdefault1 && !isdefault2)
+               if (isdefault1 && isdefault2)
                {
+                       selec = 0.5 * (1.0 - nullfrac1);
+               }
+               else {
+                       if (isdefault1)
+                               nd1 = Max(nd1, vardata1->rel->rows * (1.0 - nullfrac1));
+
+                       if (isdefault2)
+                               nd2 = Max(nd2, vardata2->rel->rows);
+
                        if (nd1 <= nd2 || nd2 < 0)
                                selec = 1.0 - nullfrac1;
                        else
                                selec = (nd2 / nd1) * (1.0 - nullfrac1);
                }
-               else
-                       selec = 0.5 * (1.0 - nullfrac1);
        }

        if (have_mcvs1)


What's your opinion on this?

Andres

Testcase:

/*

test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT DISTINCT id FROM b WHERE id < 5000);
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Hash Join  (cost=315.13..2782.56 rows=50000 width=9)
   Hash Cond: (a.id = b.id)
   ->  Seq Scan on a  (cost=0.00..1540.00 rows=100000 width=9)
   ->  Hash  (cost=249.59..249.59 rows=5243 width=4)
         ->  Unique  (cost=0.00..197.16 rows=5243 width=4)
               ->  Index Only Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=4)
                     Index Cond: (id < 5000)
(7 rows)

Time: 4.016 ms
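
(Note: the rows=50000 estimate on the Hash Join above is exactly 0.5 * the
100000 rows of a, i.e. the blind 0.5 fallback at work.)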

test_raptelan=# EXPLAIN WITH foo AS (SELECT * FROM b WHERE id < 5000) SELECT * FROM a WHERE a.id IN (SELECT id FROM foo);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=302.02..486.74 rows=50000 width=9)
   CTE foo
     ->  Index Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=10)
           Index Cond: (id < 5000)
   ->  HashAggregate  (cost=117.97..119.97 rows=200 width=4)
         ->  CTE Scan on foo  (cost=0.00..104.86 rows=5243 width=4)
   ->  Index Scan using a_pkey on a  (cost=0.00..0.90 rows=1 width=9)
         Index Cond: (id = foo.id)
(8 rows)

Time: 2.636 ms

test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT id FROM b WHERE id < 5000);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Hash Semi Join  (cost=249.59..2184.02 rows=5243 width=9)
   Hash Cond: (a.id = b.id)
   ->  Seq Scan on a  (cost=0.00..1540.00 rows=100000 width=9)
   ->  Hash  (cost=184.06..184.06 rows=5243 width=4)
         ->  Index Only Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=4)
               Index Cond: (id < 5000)
(6 rows)

Time: 2.459 ms

*/
Andres Freund <andres@anarazel.de> writes:
> What's your opinion on this?

Looks pretty bogus to me.  You're essentially assuming that the side of
the join without statistics is unique, which is a mighty dubious
assumption.  (In cases where we *know* it's unique, something like this
could be reasonable, but I believe get_variable_numdistinct already
accounts for such cases.)  The reason for the reversion to pre-8.4
behavior was that with the other behavior, we might sometimes make
extremely optimistic estimates (ie, conclude that the join result is
very small) on the basis of, really, nothing at all.  AFAICS this
proposal just reintroduces unwarranted assumptions, and therefore will
probably produce as many worse results as better ones.

Also, why the asymmetry in null handling?  And why did you only touch
one of the two code paths in eqjoinsel_semi?  They have both got this
issue of how to estimate with inadequate stats.

            regards, tom lane

Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund

On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:

> Also, why the asymmetry in null handling?  And why did you only touch
> one of the two code paths in eqjoinsel_semi?  They have both got this
> issue of how to estimate with inadequate stats.
This patch was purely a proof of concept, sorry if that wasn't clear. I mostly
wanted to point out that real plans regressed with this change. Digging around
a bit I could find more examples where it caused real pain, but also cases
where the new behaviour was advantageous.
Unfortunately the pastebins where raptelan provided plans have expired by
now... Perhaps he can provide them again?

If we agree on a way to handle the stats, I am happy to produce a patch that
actually tries to cover all the cases.

> > What's your opinion on this?
> Looks pretty bogus to me.  You're essentially assuming that the side of
> the join without statistics is unique, which is a mighty dubious
> assumption.
It sure is a bit dubious. But assuming that a semijoin with a maximum of n
rows on the inner side results in half of the outer side's rows (>> n) is
pretty bogus as well. Using the assumption of uniqueness for the outer side
seems sensible if it's only used as an upper limit (except in an antijoin ...).

Yes, my "patch" didn't even start to do this ;)

SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation); is not
exactly a fringe case, so I find it problematic that its estimates regressed
quite a bit.


> (In cases where we *know* it's unique, something like this
> could be reasonable, but I believe get_variable_numdistinct already
> accounts for such cases.)
Only that we infer uniqueness from very few things, unless I'm missing
something...

Andres

Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund

On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>  (In cases where we know it's unique, something like this
> could be reasonable, but I believe get_variable_numdistinct already
> accounts for such cases.)
One of those cases which looks relatively easy is that CTEs currently work as
a kind of 'statistics barrier' here. Specifically, I wonder why this:

test_raptelan=# EXPLAIN WITH foo AS (SELECT * FROM b WHERE id < 5000) SELECT * FROM a WHERE a.id IN (SELECT id FROM foo);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=302.02..1876.30 rows=2550000 width=11)
   CTE foo
     ->  Index Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=10)
           Index Cond: (id < 5000)
   ->  HashAggregate  (cost=117.97..119.97 rows=200 width=4)
         ->  CTE Scan on foo  (cost=0.00..104.86 rows=5243 width=4)
   ->  Index Scan using a_pkey on a  (cost=0.00..7.85 rows=1 width=11)
         Index Cond: (id = foo.id)


plans differently than

test_raptelan=# EXPLAIN SELECT * FROM a WHERE a.id IN (SELECT id FROM b WHERE id < 5000 OFFSET 0);
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=560.41..17426.03 rows=5243 width=11)
   Merge Cond: (a.id = b.id)
   ->  Index Scan using a_pkey on a  (cost=0.00..160013.81 rows=5100000 width=11)
   ->  Sort  (cost=560.40..573.51 rows=5243 width=4)
         Sort Key: b.id
         ->  Limit  (cost=0.00..184.06 rows=5243 width=4)
               ->  Index Only Scan using b_pkey on b  (cost=0.00..184.06 rows=5243 width=4)
                     Index Cond: (id < 5000)


Couldn't the CTE pass a vardata from inside to the outside?

Andres
Andres Freund <andres@anarazel.de> writes:
> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>> Looks pretty bogus to me.  You're essentially assuming that the side of
>> the join without statistics is unique, which is a mighty dubious
>> assumption.

> It sure is a bit dubious. But assuming that a semijoin with a maximum of n
> rows on the inner side results in half of the outer side's rows (>> n) is
> pretty bogus as well.

How so?  There is no reason to think that the number of LHS rows with a
match is limited by the number of RHS rows.  If we knew that the LHS
join key was unique, then yes that'd be sensible, but we don't know
that.

> SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> is not exactly a fringe case, so I find it problematic that its
> estimates regressed quite a bit.

Agreed, and that's why I don't want to put in a patch that carries the
risk of regressing even more.  I'm happy to do something that's got some
amount of theory behind it, but if we're just guessing, we can't afford
to guess a very high or low selectivity.

One thing I've considered but not done anything about is that in a lot
of practical cases for this, the aggregation or grouping properties of
the sub-select would provide adequate reason for assuming its output is
more or less unique, so that taking ndistinct equal to number of rows
actually is sane.  But it would need a bit of thought about what
properties we want to treat as justifying such an assumption, and then
some code to see if the join key is a Var coming out of such a
sub-select.  (Actually, what such a patch would probably look like is
modifying examine_simple_variable to not just punt when it finds the
Var came from an aggregating subquery.)

            regards, tom lane

Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Andres Freund

On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me.  You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> >
> > It sure is a bit dubious. But assuming that a semijoin with a maximum of n
> > rows on the inner side results in half of the outer side's rows (>> n) is
> > pretty bogus as well.
>
> How so?  There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows.  If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we have an estimate for the distinctness of the LHS. I
don't see how guesstimating the RHS's number of distinct tuples in a semijoin
as vardata2->rel->rows would be worse than just assuming a selectivity of 0.5
for the whole thing. Possibly it would make sense to additionally clamp the
selectivity to an upper limit of 0.5 in that case.
I guess what I am aiming at is that overestimating the number of distinct
tuples in the *RHS* won't lead to underestimating the number of result tuples.
By clamping the selectivity to 0.5 in that case we can be sure not to
overestimate more than we currently do.
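
In code, the idea would look roughly like this (hypothetical and untested,
just to make the proposal concrete; fmin stands in for whatever selfuncs.c
would actually use):

#include <math.h>
#include <stdbool.h>

/*
 * Hypothetical, untested variant of the proof of concept above:
 * substitute the RHS relation's row count when its ndistinct is merely a
 * default guess, then clamp so we never overestimate more than the
 * current flat 0.5 does.
 */
static double
semi_join_selectivity_clamped(double nd1, bool isdefault1,
                              double nd2, bool isdefault2,
                              double rhs_rows, double nullfrac1)
{
    double      selec;

    if (isdefault2)
        nd2 = rhs_rows;         /* treat the inner side as unique: an
                                 * upper bound on its distinctness */

    if (nd1 <= nd2 || nd2 < 0)
        selec = 1.0 - nullfrac1;
    else
        selec = (nd2 / nd1) * (1.0 - nullfrac1);

    /* If we had to guess at all, never exceed the blind 0.5 estimate. */
    if (isdefault1 || isdefault2)
        selec = fmin(selec, 0.5 * (1.0 - nullfrac1));

    return selec;
}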

> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic that its
> > estimates regressed quite a bit.
>
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more.  I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how an estimate can regress much more than:
 Nested Loop  (cost=0.00..1859.02 rows=2550000 width=11) (actual time=0.182..11.236 rows=199 loops=1)

;)

> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane.  But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select.  (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
Yes, having that would be great, but it would be a bit more invasive than I'd
like to think about right now. This thing is actually a problem for people in
the field...

Andres

Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct

From: Casey Allen Shobe

On Wed, 11 Jan 2012 19:40:34 -0500, Andres wrote:
> Unfortunately the pastebins where raptelan provided plans have expired
> by now... Perhaps he can provide them again?

Sure. The original curiosity I noticed was that adjusting the block size of
results returned by the CTE had widely different effects: 5000 seemed to be
some sort of "magic number", while either 6000 or 4000 worked poorly.
Originally, our design used blocks of 100,000.

I then noticed a regression with the 5,000 block size.  The difference
between a prototype query (which was pretty fast, 5-10s) and what was
generated (slow, 5m+) was that the order of columns in one of the join
conditions in the main query was reversed.  ON source.column = joined.column
was slow, while ON joined.column = source.column was fast.  Apparently this
is enough to get a different possible plan to hit the planner first, while
another slow plan with nearly identical estimates hits the planner first in
other cases.

Attached are three files - one shows the fast plan, another the slow
plan, and another with the query in question.  The ON clause where
reversal makes a difference is the td_13 one.  I used a different range
in the CTE for each query, as otherwise the filesystem cache makes the
timings look better; but in both cases the CTE returns exactly 5,000
results.

Yes, the query could be written a lot better, but currently it's
generated this way to conform to an expectation of user-defined custom
where clauses that do not qualify column names.  Breaking that
compatibility and redoing this better is a longer-term plan.

Please let me know if I'm omitting any important details.

Regards,
--
Casey Allen Shobe | Senior Software Engineer/DBA
Message Systems | http://messagesystems.com
casey.shobe@messagesystems.com | 443-656-3311 x248
[ getting back to this after assorted distractions ]

Andres Freund <andres@anarazel.de> writes:
> On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>>>> Looks pretty bogus to me.  You're essentially assuming that the side of
>>>> the join without statistics is unique, which is a mighty dubious
>>>> assumption.

>>> It sure is a bit dubious. But assuming that a semijoin with a maximum of n
>>> rows on the inner side results in half of the outer side's rows (>> n) is
>>> pretty bogus as well.

> In the current example we have an estimate for the distinctness of the LHS. I
> don't see how guesstimating the RHS's number of distinct tuples in a semijoin
> as vardata2->rel->rows would be worse than just assuming a selectivity of 0.5
> for the whole thing.

The real problem is that the estimate we want to use involves the ratio
nd2/nd1.  If either of those numbers is mere fantasy, so is the ratio.
It doesn't really help to say "well, we can upper-bound this number
here", because sometimes a too large result is as bad as too small.

>> One thing I've considered but not done anything about is that in a lot
>> of practical cases for this, the aggregation or grouping properties of
>> the sub-select would provide adequate reason for assuming its output is
>> more or less unique, so that taking ndistinct equal to number of rows
>> actually is sane.  But it would need a bit of thought about what
>> properties we want to treat as justifying such an assumption, and then
>> some code to see if the join key is a Var coming out of such a
>> sub-select.  (Actually, what such a patch would probably look like is
>> modifying examine_simple_variable to not just punt when it finds the
>> Var came from an aggregating subquery.)

> Yes, having that would be great, but it would be a bit more invasive than
> I'd like to think about right now. This thing is actually a problem for
> people in the field...

I'm not sure it's that bad.  Attached is a simple patch to notice that
the subquery is "SELECT DISTINCT foo" and if so assume the result is
unique.  This takes care of at least the first of your original
examples.  I'm not sure whether fixing this single case is enough to get
excited about, but it's a step forward anyway.
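
(Concretely, examine_simple_variable can now mark the output of SELECT
DISTINCT id FROM b WHERE id < 5000 - the first testcase - as unique, since
id is the sole DISTINCT column.)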

Your second example, involving a WITH, would properly be handled by
teaching examine_simple_variable to drill down into CTEs.  I lacked the
round tuits to make it do so initially, and still do, but if you'd like
to have a go at it ...

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6d78068476e520f7dd2da6c0c8d48d93e0649768..5a4fa77848b5e44abd913e536f0f83cc06bb85fb 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 110,115 ****
--- 110,116 ----
  #include "optimizer/predtest.h"
  #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
+ #include "parser/parse_clause.h"
  #include "parser/parse_coerce.h"
  #include "parser/parsetree.h"
  #include "utils/builtins.h"
*************** examine_simple_variable(PlannerInfo *roo
*** 4357,4375 ****
      {
          /*
           * Plain subquery (not one that was converted to an appendrel).
-          *
-          * Punt if subquery uses set operations, GROUP BY, or DISTINCT --- any
-          * of these will mash underlying columns' stats beyond recognition.
-          * (Set ops are particularly nasty; if we forged ahead, we would
-          * return stats relevant to only the leftmost subselect...)
           */
          Query       *subquery = rte->subquery;
          RelOptInfo *rel;
          TargetEntry *ste;

          if (subquery->setOperations ||
!             subquery->groupClause ||
!             subquery->distinctClause)
              return;

          /*
--- 4358,4378 ----
      {
          /*
           * Plain subquery (not one that was converted to an appendrel).
           */
          Query       *subquery = rte->subquery;
          RelOptInfo *rel;
          TargetEntry *ste;

+         /*
+          * Punt if subquery uses set operations or GROUP BY, as these will
+          * mash underlying columns' stats beyond recognition.  (Set ops are
+          * particularly nasty; if we forged ahead, we would return stats
+          * relevant to only the leftmost subselect...)  DISTINCT is also
+          * problematic, but we check that later because there is a possibility
+          * of learning something even with it.
+          */
          if (subquery->setOperations ||
!             subquery->groupClause)
              return;

          /*
*************** examine_simple_variable(PlannerInfo *roo
*** 4415,4420 ****
--- 4418,4437 ----
                   rte->eref->aliasname, var->varattno);
          var = (Var *) ste->expr;

+         /*
+          * If subquery uses DISTINCT, we can't make use of any stats for the
+          * variable ... but, if it's the only DISTINCT column, we are entitled
+          * to consider it unique.  We do the test this way so that it works
+          * for cases involving DISTINCT ON.
+          */
+         if (subquery->distinctClause)
+         {
+             if (list_length(subquery->distinctClause) == 1 &&
+                 targetIsInSortList(ste, InvalidOid, subquery->distinctClause))
+                 vardata->isunique = true;
+             return;
+         }
+
          /* Can only handle a simple Var of subquery's query level */
          if (var && IsA(var, Var) &&
              var->varlevelsup == 0)