Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Date
Msg-id 1378.1326331484@sss.pgh.pa.us
Whole thread Raw
In response to Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct  (Andres Freund <andres@anarazel.de>)
Responses Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
Andres Freund <andres@anarazel.de> writes:
> On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
>> Looks pretty bogus to me.  You're essentially assuming that the side of
>> the join without statistics is unique, which is a mighty dubious
>> assumption.

> It sure is a bit dubious. But assuming that a semijoin that has max of n rows
> on the inner side results in half of the outer sides rows (>> n) is pretty
> bogus as well.

How so?  There is no reason to think that the number of LHS rows with a
match is limited by the number of RHS rows.  If we knew that the LHS
join key was unique, then yes that'd be sensible, but we don't know
that.

> SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> is not exactly a fringe case, so I find it problematic regressing
> quite a bit in the estimates.

Agreed, and that's why I don't want to put in a patch that carries the
risk of regressing even more.  I'm happy to do something that's got some
amount of theory behind it, but if we're just guessing, we can't afford
to guess a very high or low selectivity.

One thing I've considered but not done anything about is that in a lot
of practical cases for this, the aggregation or grouping properties of
the sub-select would provide adequate reason for assuming its output is
more or less unique, so that taking ndistinct equal to number of rows
actually is sane.  But it would need a bit of thought about what
properties we want to treat as justifying such an assumption, and then
some code to see if the join key is a Var coming out of such a
sub-select.  (Actually, what such a patch would probably look like is
modifying examine_simple_variable to not just punt when it finds the
Var came from an aggregating subquery.)

            regards, tom lane

pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Next
From: Andres Freund
Date:
Subject: Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct