Re: Weird, bad 0.5% selectivity estimate for a column equal to itself - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Weird, bad 0.5% selectivity estimate for a column equal to itself
Msg-id 51CB7362.5060605@agliodbs.com
In response to Weird, bad 0.5% selectivity estimate for a column equal to itself  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
On 06/25/2013 06:41 PM, Tom Lane wrote:
>> Well, it was more in the form of:
>> tab1.x = COALESCE(tab2.y,tab1.x)
>
> Hm.  I'm not following how you get from there to complaining about not
> being smart about X = X, because that surely ain't the same.

Actually, it was dominated by defaultsel, since tab2.y had a nullfrac of
70%.  It took us a couple days of reducing the bad query plan to figure
out where the bad estimate was coming from.  The real estimate should
have been 0.7 + ( est. tab2.y = tab1.x ), but instead we were getting
0.005 + ( est. tab2.y = tab1.x ), which was throwing the whole query
plan way off ... with an execution time difference of 900X.
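The arithmetic above can be sketched in a few lines. This is purely illustrative (not the planner's actual code path), and the eq_sel value of 0.01 is a hypothetical placeholder for the estimate of tab2.y = tab1.x. The key point: when tab2.y IS NULL, the predicate tab1.x = COALESCE(tab2.y, tab1.x) degenerates to tab1.x = tab1.x, which is always true for non-null x, so the null fraction should feed straight into the selectivity:

```python
DEFAULT_EQ_SEL = 0.005  # PostgreSQL's fallback for an unestimable equality

def coalesce_eq_selectivity(nullfrac, eq_sel):
    """Selectivity of tab1.x = COALESCE(tab2.y, tab1.x), assuming
    tab1.x itself is never NULL: the null branch always matches."""
    return nullfrac + (1.0 - nullfrac) * eq_sel

# What should have happened vs. what the planner effectively did,
# with a hypothetical eq_sel of 0.01 for tab2.y = tab1.x:
better_est  = coalesce_eq_selectivity(0.7, 0.01)      # ~0.703
planner_est = DEFAULT_EQ_SEL + (1.0 - 0.7) * 0.01    # ~0.008
```

A roughly 90x difference in estimated selectivity is easily enough to flip join order and produce the 900X runtime swing described above.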

> It's a totally different case.  In the join case you expect that each
> element of one table will be compared with each element of the other.
> In the single-table case, that's exactly *not* what will happen, and
> I don't see how you get to anything very useful without knowing
> something about the value pairs that actually occur.

Sure you can.  If you assume zero correlation between the columns, then
you can estimate the comparison as one between two randomly drawn
values.  In the simplest approach, you would take the reciprocal of the
product of the two ndistincts, so that for any given shared value a
column with 3 distinct values would match a column with 10 distinct
values about 0.033 of the time.
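As a minimal sketch of that naive estimate (assuming uniform value distributions and zero correlation, which is exactly the simplification being proposed):

```python
def naive_match_prob(ndistinct1, ndistinct2):
    """Probability that two independently drawn values coincide on one
    particular shared value, assuming each column is uniform over its
    distinct values and the columns are uncorrelated."""
    return (1.0 / ndistinct1) * (1.0 / ndistinct2)

naive_match_prob(3, 10)  # ~0.033, the 3-by-10 example above
```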

Now, a real estimator would of course want to use the MCVs and the
histogram to compute a better estimate; obviously our 3x10 example is
going to match 0% of the time if col1 is [1,2,3] and col2 contains
values from 1000 to 1100.  The MCVs would be particularly valuable here:
if the same MCV appears in both columns, we can multiply its frequencies.
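The MCV-overlap idea can be sketched as follows. This is a hypothetical helper, not PostgreSQL's actual eqjoinsel() logic; it treats each column's MCV list as a {value: frequency} dict and sums the products of frequencies for values common to both:

```python
def mcv_overlap_selectivity(mcv1, mcv2):
    """Estimate P(col1 = col2) for independently drawn rows, using only
    the columns' most-common-value lists ({value: frequency} dicts)."""
    return sum(freq * mcv2[val] for val, freq in mcv1.items() if val in mcv2)

# Disjoint value sets correctly estimate to zero, as in the example above:
mcv_overlap_selectivity({1: 0.4, 2: 0.3, 3: 0.3},
                        {1000 + i: 0.01 for i in range(100)})  # 0.0
```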

To me, this seems just like estimating selectivity for a join between
two tables, only simpler.  Of course, a correlation coefficient would
make it even more accurate, but even without one we can arrive at much
better estimates than defaultsel.

> As a concrete
> example, applying the join selectivity logic would certainly give a
> completely wrong answer for X = X, unless there were only one value
> occurring in the column.

Yeah, I think we'll eventually need to special-case that one.  In the
particular case I ran across, though, using column match estimation
would have still yielded a better result than defaultsel.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

