Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct - Mailing list pgsql-bugs

From Tom Lane
Subject Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct
Date
Msg-id 23723.1496521636@sss.pgh.pa.us
In response to [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct  (marko@joh.to)
Responses Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct
List pgsql-bugs
marko@joh.to writes:
> Despite the fact that postgres knows that the lookup on q2 will always
> produce exactly 10 rows, it still estimates it at 58.

It doesn't know that.  You've told it there are N/10 distinct values,
but the conclusion that any given outer row matches exactly 10 inner rows
depends on an additional assumption that the distribution is exactly
uniform.  It's not going to think that, because of ANALYZE's
not-exactly-uniform sampling.
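
(For concreteness, here's roughly the test case I've been running against.
The table, column, and index names are taken from the plans in this thread,
but the row count and the exact n_distinct override are my guesses at what
you did:

    -- hypothetical reconstruction of the example table: each value of
    -- "other" appears in exactly 10 adjacent rows
    create table qqq (pk int primary key, other int);
    insert into qqq select i, i / 10 from generate_series(0, 2999999) i;
    create index qqq_other_idx on qqq (other);
    -- declare N/10 distinct values; a negative setting means a fraction
    -- of reltuples rather than an absolute count
    alter table qqq alter column other set (n_distinct = -0.1);
    analyze qqq;
)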

Repeatedly running ANALYZE on your example table with the usual
default_statistics_target of 100, I get results similar to

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
      f      | count 
-------------+-------
 6.66667e-05 |    98
      0.0001 |     2
(2 rows)

which corresponds to a maximum of 3 occurrences of a "most common value",
and a minimum of 2 occurrences, out of the default 30000-row sample.
(Of course, ANALYZE will reject single-occurrence values as not being
MCVs, so we could never see frequencies less than 2/30000 = 6.67e-5.)

If I increase the stats target to 1000, I get results like

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
      f      | count 
-------------+-------
 6.66667e-06 |   935
       1e-05 |    65
(2 rows)

regression=# select f, count(*) from (select unnest(most_common_freqs) f from pg_stats where tablename = 'qqq' and attname = 'other') ss group by 1 order by 1;
      f      | count 
-------------+-------
 6.66667e-06 |   936
       1e-05 |    63
 1.33333e-05 |     1
(3 rows)

(the counts move around a bit, but not very much).  That corresponds to a
minimum of 2 and a maximum of either 3 or 4 duplicates in a 300000-row
sample.
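(Those frequencies map back to raw counts the same way as before:
2/300000 = 6.67e-6, 3/300000 = 1e-5, and 4/300000 = 1.33e-5.)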

Given that we know that the duplicate values of "other" are adjacent,
what this is looking like to me is that there's something fishy about
ANALYZE's choice of sample rows.  It seems like it should not be picking 3
or 4 rows from the same block very often, and for all of those to come
from the same 10-row span (out of 225 rows per page, more or less, in
this table) seems like it ought to be a really low-probability event.
But maybe my math is off.
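
(If you want to sanity-check the rows-per-page figure on your copy of the
table, pg_class has enough for a rough estimate; this assumes the table is
named qqq as above:

    select relpages, reltuples, reltuples / relpages as rows_per_page
    from pg_class where relname = 'qqq';
)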

Another way we could look at this is that if ANALYZE can't find any
entries occurring more than 3 or 4 times in 300000 samples, it ought
to reject them as not being MCVs after all.  The code's existing
credulity about what counts as an MCV dates from back when typical
sample sizes were 3000 rows, not 10 or 100 times that.  Maybe we'd
better rethink that, though I'm not sure what a suitable filtering
rule would be.
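
You can see how little those entries deserve the label by comparing the
claimed MCV frequencies against the real counts.  Something like the query
below works; it assumes "other" is an integer column as in my test table,
since pg_stats' anyarray column has to be cast through text:

    select s.mcv, s.freq,
           (select count(*) from qqq where other = s.mcv) as actual_count
    from (select unnest(most_common_vals::text::int[]) as mcv,
                 unnest(most_common_freqs) as freq
          from pg_stats
          where tablename = 'qqq' and attname = 'other') s
    order by s.freq desc
    limit 10;

Every "MCV" comes back with an actual count of 10, same as every other value.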

Anyway, this is not the fault of the selectivity estimator, it's the
fault of a nonuniform ANALYZE sample that reports some values as MCVs,
with overblown frequency estimates, even though they're not really any
more common than anything else.  Pending some decision about whether
and how to change ANALYZE, it seems like raising the stats target to
1000 mostly solves the problem for your example table:

regression=# explain select * from qqq q1 join qqq q2 using (other) where q1.pk = 10;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Nested Loop  (cost=1.14..17.43 rows=15 width=12)
   ->  Index Scan using qqq_pkey on qqq q1  (cost=0.57..8.59 rows=1 width=8)
         Index Cond: (pk = 10)
   ->  Index Scan using qqq_other_idx on qqq q2  (cost=0.57..8.74 rows=10 width=8)
         Index Cond: (other = q1.other)
(5 rows)

We're still overestimating a bit from thinking that the "MCVs" really occur
more often than other values, but the error is fairly small at this level.
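
(If you'd rather not raise default_statistics_target globally, the
per-column form of the same knob is:

    alter table qqq alter column other set statistics 1000;
    analyze qqq;
)
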
        regards, tom lane


