Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct |
Date | |
Msg-id | 23723.1496521636@sss.pgh.pa.us |
In response to | [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct (marko@joh.to) |
Responses | Re: [BUGS] BUG #14664: Nonsensical join selectivity estimation despite n_distinct |
List | pgsql-bugs |
marko@joh.to writes:
> Despite the fact that postgres knows that the lookup on q2 will always
> produce exactly 10 rows, it still estimates it at 58.

It doesn't know that. You've told it there are N/10 distinct values, but the
conclusion that any given outer row matches exactly 10 inner rows depends on
an additional assumption that the distribution is exactly uniform. It's not
going to think that, because of ANALYZE's not-exactly-uniform sampling.

Repeatedly running ANALYZE on your example table with the usual
default_statistics_target of 100, I get results similar to

regression=# select f, count(*) from
  (select unnest(most_common_freqs) f from pg_stats
   where tablename = 'qqq' and attname = 'other') ss
  group by 1 order by 1;
      f      | count
-------------+-------
 6.66667e-05 |    98
      0.0001 |     2
(2 rows)

which corresponds to a maximum of 3 occurrences of a "most common value", and
a minimum of 2 occurrences, out of the default 30000-row sample. (Of course,
ANALYZE will reject single-occurrence values as not being MCVs, so we could
never see frequencies less than 2/30000 = 6.67e-5.)

If I increase the stats target to 1000, I get results like

regression=# select f, count(*) from
  (select unnest(most_common_freqs) f from pg_stats
   where tablename = 'qqq' and attname = 'other') ss
  group by 1 order by 1;
      f      | count
-------------+-------
 6.66667e-06 |   935
       1e-05 |    65
(2 rows)

regression=# select f, count(*) from
  (select unnest(most_common_freqs) f from pg_stats
   where tablename = 'qqq' and attname = 'other') ss
  group by 1 order by 1;
      f      | count
-------------+-------
 6.66667e-06 |   936
       1e-05 |    63
 1.33333e-05 |     1
(3 rows)

(the counts move around a bit, but not very much). That corresponds to a
minimum of 2 and a maximum of either 3 or 4 duplicates in a 300000-row sample.

Given that we know that the duplicate values of "other" are adjacent, what
this is looking like to me is that there's something fishy about ANALYZE's
choice of sample rows. It seems like it should not be picking 3 or 4 rows from
the same block very often, and for all of those to come from the same 10-row
span (out of 225 rows per page, more or less, in this table) seems like it
ought to be a really low-probability event. But maybe my math is off.

Another way we could look at this is that if ANALYZE can't find any entries
occurring more than 3 or 4 times in 300000 samples, it ought to reject them as
not being MCVs after all. The code's existing credulity about what counts as
an MCV dates from back when typical sample sizes were 3000 rows, not 10 or 100
times that. Maybe we'd better rethink that, though I'm not sure what a
suitable filtering rule would be.

Anyway, this is not the fault of the selectivity estimator, it's the fault of
a nonuniform ANALYZE sample that reports some values as MCVs, with overblown
frequency estimates, even though they're not really any more common than
anything else.
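(For anyone trying to reproduce the numbers above: the thread does not quote the original table definition, but a setup along these lines matches the description -- a table qqq with integer columns pk and other, each value of "other" occurring on exactly 10 adjacent rows, and n_distinct overridden to N/10. The 5-million-row size and the explicit n_distinct override are illustrative assumptions, not taken from the bug report.)

-- Rough sketch of a table of the shape being discussed; the row count and the
-- n_distinct override are assumptions for illustration only.
CREATE TABLE qqq (pk integer PRIMARY KEY, other integer);
INSERT INTO qqq SELECT g, g / 10 FROM generate_series(1, 5000000) g;
CREATE INDEX qqq_other_idx ON qqq (other);
-- A negative n_distinct setting is read as a fraction of the row count,
-- so -0.1 means "N/10 distinct values".
ALTER TABLE qqq ALTER COLUMN other SET (n_distinct = -0.1);
ANALYZE qqq;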
Pending some decision about whether and how to change ANALYZE, it seems like
raising the stats target to 1000 mostly solves the problem for your example
table:

regression=# explain select * from qqq q1 join qqq q2 using (other) where q1.pk = 10;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Nested Loop  (cost=1.14..17.43 rows=15 width=12)
   ->  Index Scan using qqq_pkey on qqq q1  (cost=0.57..8.59 rows=1 width=8)
         Index Cond: (pk = 10)
   ->  Index Scan using qqq_other_idx on qqq q2  (cost=0.57..8.74 rows=10 width=8)
         Index Cond: (other = q1.other)
(5 rows)

We're still overestimating a bit from thinking that the "MCVs" really occur
more often than other values, but the error is fairly small at this level.

			regards, tom lane
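(For completeness: one way to apply the stats-target change to just this column, rather than raising default_statistics_target globally, is the per-column form below. The table and column names follow the example discussed above.)

-- Raise the statistics target for the "other" column only, then re-gather stats.
ALTER TABLE qqq ALTER COLUMN other SET STATISTICS 1000;
ANALYZE qqq;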