Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct - Mailing list pgsql-bugs

From Andres Freund
Subject Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct
Msg-id 201201120300.03698.andres@anarazel.de
In response to Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Botched estimation in eqjoinsel_semi for cases without reliable ndistinct  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me.  You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> >
> > It sure is a bit dubious. But assuming that a semijoin with at most n
> > rows on the inner side results in half of the outer side's rows (>> n) is
> > pretty bogus as well.
>
> How so?  There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows.  If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we do have an estimate of the distinctness of the LHS. I
don't see how guesstimating the RHS number of distinct tuples in a semijoin as
vardata2->rel->rows would be worse than just assuming a selectivity of 0.5 for
the whole thing. Possibly it would make sense to additionally clamp the
selectivity to an upper limit of 0.5 in that case.
What I am aiming at is that overestimating the number of distinct tuples in the
*RHS* cannot lead to underestimating the number of result tuples. By clamping
the selectivity to 0.5 in that case we can be sure not to overestimate any more
than we do currently.
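To make the proposal concrete, here is a minimal sketch (not the actual eqjoinsel_semi code; names and the exact formula are illustrative) of the fallback being suggested: when the RHS has no reliable ndistinct, guess it as the RHS row count and clamp the resulting selectivity at the current 0.5 default.

```python
# Hypothetical sketch of the proposed semijoin selectivity fallback.
# Not PostgreSQL source; the real logic lives in eqjoinsel_semi().

def semijoin_selectivity(nd1, rhs_rows, nd2=None):
    """nd1: distinct values on the LHS join key (known from stats).
    nd2: distinct values on the RHS, or None when stats are unreliable."""
    if nd1 <= 0:
        return 0.5                      # no LHS info either: keep the old default
    if nd2 is None:
        nd2 = rhs_rows                  # proposed guess: treat RHS keys as unique
        sel = min(nd2 / nd1, 1.0)       # fraction of LHS values expected to match
        return min(sel, 0.5)            # clamp: never overestimate past today's 0.5
    return min(nd2 / nd1, 1.0)

# A small RHS now caps the estimate instead of defaulting to half the LHS:
# semijoin_selectivity(1000, 200) -> 0.2, while a huge RHS still gives 0.5.
```

The clamp is what makes the change safe: any overestimate of RHS distinctness can only pull the result back toward the status quo, never past it.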

> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic to regress
> > quite a bit in the estimates.
>
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more.  I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how much more an estimate can regress than:
 Nested Loop  (cost=0.00..1859.02 rows=2550000 width=11) (actual time=0.182..11.236 rows=199 loops=1)

;)
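For concreteness, the size of that miss follows directly from the flat 0.5 default: the estimate scales with the outer relation no matter how few rows the aggregated inner side can possibly yield. A quick back-of-the-envelope check (only the 2550000 and 199 figures come from the plan above; the outer cardinality is inferred from them):

```python
# The plan estimated 2,550,000 result rows; the query produced 199.
# With a flat 0.5 semijoin selectivity, estimate = outer_rows * 0.5,
# so the planner implicitly assumed ~5.1M outer rows.
outer_rows = 5_100_000          # inferred: 2,550,000 / 0.5
estimated = outer_rows * 0.5    # current default when RHS ndistinct is unknown
actual = 199                    # from the EXPLAIN ANALYZE output above
error_factor = estimated / actual   # roughly four orders of magnitude off
```

No guess anchored to the inner side's row count could miss by that much in the overestimating direction, which is the point of the proposed change.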

> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane.  But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select.  (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
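The idea sketched above can be expressed as a tiny heuristic (hypothetical names, not PostgreSQL's actual data structures; the real change would live in examine_simple_variable): if the Var is a grouping key of an aggregated subquery, each output row carries a distinct value, so ndistinct can safely be taken as the subquery's estimated row count.

```python
# Hypothetical sketch of the "aggregated subquery output is unique" heuristic.

def ndistinct_for_subquery_var(var_is_grouping_key, subquery_rows, fallback=None):
    """Return an ndistinct estimate for a Var pulled from a sub-select."""
    if var_is_grouping_key:
        # a GROUP BY / DISTINCT key is unique per output row by construction
        return subquery_rows
    return fallback  # punt, as examine_simple_variable does today
```

This would cover the common SELECT ... IN (SELECT agg ... GROUP BY key) shape without guessing anything for genuinely stats-less inputs.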
Yes, having that would be great, but it would be a bit more invasive than I'd
like to tackle right now. This is actually a problem for people in the field...

Andres
