Re: Weird, bad 0.5% selectivity estimate for a column equal to itself - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Weird, bad 0.5% selectivity estimate for a column equal to itself
Msg-id 51CB7362.5060605@agliodbs.com
In response to Weird, bad 0.5% selectivity estimate for a column equal to itself  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
On 06/25/2013 06:41 PM, Tom Lane wrote:
>> Well, it was more in the form of:
>> tab1.x = COALESCE(tab2.y,tab1.x)
>
> Hm.  I'm not following how you get from there to complaining about not
> being smart about X = X, because that surely ain't the same.

Actually, it was dominated by defaultsel, since tab2.y had a nullfrac of
70%.  It took us a couple days of reducing the bad query plan to figure
out where the bad estimate was coming from.  The real estimate should
have been 0.7 + ( est. tab2.y = tab1.x ), but instead we were getting
0.005 + ( est. tab2.y = tab1.x ), which was throwing the whole query
plan way off ... with an execution time difference of 900X.
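The arithmetic above can be sketched in a few lines. This is purely illustrative (not the planner's actual code path), and the eq_sel value of 0.01 is a hypothetical placeholder for the estimate of tab2.y = tab1.x. The key point: when tab2.y IS NULL, the predicate tab1.x = COALESCE(tab2.y, tab1.x) degenerates to tab1.x = tab1.x, which is always true for non-null x, so the null fraction should feed straight into the selectivity:

```python
DEFAULT_EQ_SEL = 0.005  # PostgreSQL's fallback for an unestimable equality

def coalesce_eq_selectivity(nullfrac, eq_sel):
    """Selectivity of tab1.x = COALESCE(tab2.y, tab1.x), assuming
    tab1.x itself is never NULL: the null branch always matches."""
    return nullfrac + (1.0 - nullfrac) * eq_sel

# What should have happened vs. what the planner effectively did,
# with a hypothetical eq_sel of 0.01 for tab2.y = tab1.x:
better_est  = coalesce_eq_selectivity(0.7, 0.01)      # ~0.703
planner_est = DEFAULT_EQ_SEL + (1.0 - 0.7) * 0.01    # ~0.008
```

A roughly 90x difference in estimated selectivity is easily enough to flip join order and produce the 900X runtime swing described above.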

> It's a totally different case.  In the join case you expect that each
> element of one table will be compared with each element of the other.
> In the single-table case, that's exactly *not* what will happen, and
> I don't see how you get to anything very useful without knowing
> something about the value pairs that actually occur.

Sure you can.  If you assume zero correlation between the columns, then
you can estimate the comparison as one between two randomly drawn
values.  In the simplest approach, you would take the reciprocal of the
product of the two ndistincts, so that for any given shared value a
column with 3 distinct values would match a column with 10 distinct
values about 0.033 of the time.
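As a minimal sketch of that naive estimate (assuming uniform value distributions and zero correlation, which is exactly the simplification being proposed):

```python
def naive_match_prob(ndistinct1, ndistinct2):
    """Probability that two independently drawn values coincide on one
    particular shared value, assuming each column is uniform over its
    distinct values and the columns are uncorrelated."""
    return (1.0 / ndistinct1) * (1.0 / ndistinct2)

naive_match_prob(3, 10)  # ~0.033, the 3-by-10 example above
```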

Now, a real estimator would of course want to use the MCVs and the
histogram to compute a better estimate; obviously our 3x10 example is
going to match 0% of the time if col1 is [1,2,3] and col2 contains
values from 1000 to 1100.  The MCVs would be particularly valuable here:
if the same MCV appears in both columns, we can multiply its frequencies.
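The MCV-overlap idea can be sketched as follows. This is a hypothetical helper, not PostgreSQL's actual eqjoinsel() logic; it treats each column's MCV list as a {value: frequency} dict and sums the products of frequencies for values common to both:

```python
def mcv_overlap_selectivity(mcv1, mcv2):
    """Estimate P(col1 = col2) for independently drawn rows, using only
    the columns' most-common-value lists ({value: frequency} dicts)."""
    return sum(freq * mcv2[val] for val, freq in mcv1.items() if val in mcv2)

# Disjoint value sets correctly estimate to zero, as in the example above:
mcv_overlap_selectivity({1: 0.4, 2: 0.3, 3: 0.3},
                        {1000 + i: 0.01 for i in range(100)})  # 0.0
```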

To me, this seems just like estimating selectivity for a join between
two tables, only simpler.  Of course, a correlation coefficient would
make it even more accurate, but even without one we can arrive at much
better estimates than defaultsel.

> As a concrete
> example, applying the join selectivity logic would certainly give a
> completely wrong answer for X = X, unless there were only one value
> occurring in the column.

Yeah, I think we'll eventually need to special-case that one.  In the
particular case I ran across, though, using column match estimation
would have still yielded a better result than defaultsel.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

