Re: BUG #16759: Estimation of the planner is wrong for hash join - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16759: Estimation of the planner is wrong for hash join
Date
Msg-id 18e3e4e6-2816-77fa-84c2-b85448fbe6ce@enterprisedb.com
Whole thread Raw
In response to Re: BUG #16759: Estimation of the planner is wrong for hash join  (Bertrand Guillaumin <bertrand.guillaumin@gmail.com>)
List pgsql-bugs

On 12/17/20 6:36 PM, Bertrand Guillaumin wrote:
> Hello,
> I think I just made a test that shows that even with MCV on both sides 
> the estimated selectivity can be pretty wrong.
> test=# create table test_bug_hash2 as SELECT mod(id,500) as id , case 
> when id<=500 then 1 else 2 end parent_id, null::text as attrib from 
> (select generate_series(1,1000) as id) alias0;
> SELECT 1000
> test=# update test_bug_hash2 set attrib='BEL' where id=2;
> UPDATE 2
> test=# analyze test_bug_hash2;
> ANALYZE
> test=# explain select * from test_bug_hash2 a, test_bug_hash2 b where 
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
>                                    QUERY PLAN
> ------------------------------------------------------------------------------
>   Hash Join  (cost=17.52..37.56 rows=4 width=24)
>     Hash Cond: (a.parent_id = b.id <http://b.id>)
>     ->  Seq Scan on test_bug_hash2 a  (cost=0.00..15.00 rows=1000 width=12)
>     ->  Hash  (cost=17.50..17.50 rows=2 width=12)
>           ->  Seq Scan on test_bug_hash2 b  (cost=0.00..17.50 rows=2 
> width=12)
>                 Filter: (attrib = 'BEL'::text)
> 
> test=#  select count(*) from test_bug_hash2 a, test_bug_hash2 b where 
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
>   count
> -------
>    1000
> 
> I won't copy paste the pg_stats lines but most_common_vals and 
> most_common_freq have values for all three columns.
> 
> I'm not a programmer but I've looked into the code of the planner a 
> little bit and it seems you try to estimate the selectivity of a join in 
> itself without any regards to the filters that can be applied on any 
> side of the join ( if I understood correctly).
> I think that it's ultimately where the problem lies, if the filter is 
> not too important the selectivity stays more or less the same but with 
> filters like the one in this query the selectivity of the join can 
> change a lot so in the end you get estimations which can be totally wrong.
> 
> 

Right. The problem is that the two columns are correlated, thanks to how 
you set the attrib value only for id=2, but the join estimation code is 
oblivious to that.

Perhaps the multi-column/extended stats might allow us to improve this, 
but the code has not been written yet.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-bugs by date:

Previous
From: Bertrand Guillaumin
Date:
Subject: Re: BUG #16759: Estimation of the planner is wrong for hash join
Next
From: PG Bug reporting form
Date:
Subject: BUG #16778: Problem: nothing provides libarmadillo.so.9()(64bit) needed by gdal30-libs-3.0.4-2.sles12.x86_64