On 12/17/20 6:36 PM, Bertrand Guillaumin wrote:
> Hello,
> I think I just made a test that shows that even with MCV on both sides
> the estimated selectivity can be pretty wrong.
> test=# create table test_bug_hash2 as SELECT mod(id,500) as id , case
> when id<=500 then 1 else 2 end parent_id, null::text as attrib from
> (select generate_series(1,1000) as id) alias0;
> SELECT 1000
> test=# update test_bug_hash2 set attrib='BEL' where id=2;
> UPDATE 2
> test=# analyze test_bug_hash2;
> ANALYZE
> test=# explain select * from test_bug_hash2 a, test_bug_hash2 b where
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
> QUERY PLAN
> ------------------------------------------------------------------------------
> Hash Join (cost=17.52..37.56 rows=4 width=24)
> Hash Cond: (a.parent_id = b.id <http://b.id>)
> -> Seq Scan on test_bug_hash2 a (cost=0.00..15.00 rows=1000 width=12)
> -> Hash (cost=17.50..17.50 rows=2 width=12)
> -> Seq Scan on test_bug_hash2 b (cost=0.00..17.50 rows=2
> width=12)
> Filter: (attrib = 'BEL'::text)
>
> test=# select count(*) from test_bug_hash2 a, test_bug_hash2 b where
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
> count
> -------
> 1000
>
> I won't copy paste the pg_stats lines but most_common_vals and
> most_common_freq have values for all three columns.
>
> I'm not a programmer but I've looked into the code of the planner a
> little bit and it seems you try to estimate the selectivity of a join in
> itself without any regards to the filters that can be applied on any
> side of the join ( if I understood correctly).
> I think that it's ultimately where the problem lies, if the filter is
> not too important the selectivity stays more or less the same but with
> filters like the one in this query the selectivity of the join can
> change a lot so in the end you get estimations which can be totally wrong.
>
>
Right. The problem is that the two columns are correlated, thanks to how
you set the attrib value only for id=2, but the join estimation code is
oblivious to that.
Perhaps the multi-column/extended stats might allow us to improve this,
but the code has not been written yet.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company