BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE - Mailing list pgsql-bugs
From:      PG Bug reporting form
Subject:   BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Msg-id:    152395805004.19366.3107109716821067806@wrigleys.postgresql.org
Responses: Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
List:      pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15160
Logged by:          Alexey Ermakov
Email address:      alexey.ermakov@dataegret.com
PostgreSQL version: 10.3
Operating system:   Linux
Description:

Hello,

I'm wondering how the planner estimates the number of rows in this case:

create table test_in (id int primary key);
insert into test_in select id from generate_series(1,1000000) gs(id);
analyze test_in;

explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 200)
select * from test_in where id in (select id from ids);
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.93..139.79 rows=200 width=4) (actual time=0.129..0.400 rows=200 loops=1)
   CTE ids
     ->  Limit  (cost=0.00..2.00 rows=200 width=4) (actual time=0.050..0.066 rows=200 loops=1)
           ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.050..0.057 rows=200 loops=1)
   ->  HashAggregate  (cost=4.50..6.50 rows=200 width=4) (actual time=0.117..0.133 rows=200 loops=1)
         Group Key: ids.id
         ->  CTE Scan on ids  (cost=0.00..4.00 rows=200 width=4) (actual time=0.051..0.086 rows=200 loops=1)
   ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=200)
         Index Cond: (id = ids.id)
         Heap Fetches: 200
 Planning time: 0.128 ms
 Execution time: 0.434 ms

explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 201)
select * from test_in where id in (select id from ids);
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.96..132.78 rows=500000 width=4) (actual time=0.119..0.389 rows=201 loops=1)
   CTE ids
     ->  Limit  (cost=0.00..2.01 rows=201 width=4) (actual time=0.048..0.064 rows=201 loops=1)
           ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.048..0.056 rows=201 loops=1)
   ->  HashAggregate  (cost=4.52..6.52 rows=200 width=4) (actual time=0.113..0.130 rows=201 loops=1)
         Group Key: ids.id
         ->  CTE Scan on ids  (cost=0.00..4.02 rows=201 width=4) (actual time=0.049..0.083 rows=201 loops=1)
   ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=201)
         Index Cond: (id = ids.id)
         Heap Fetches: 201
 Planning time: 0.068 ms
 Execution time: 0.417 ms

Please note that in the first example we got a correct estimate of the total
number of rows (200), but in the second one, where the CTE returned 201 rows
instead of 200, the total number of rows is estimated as 500000 (half of the
table test_in), which is way off and could lead to a non-optimal plan and poor
performance.

I get the same estimate if I replace the IN clause with an equivalent EXISTS
subquery; a normal join estimates the number of rows fine (but it's not
equivalent in the general case, when test_in.id is not unique).

Reproduced in 9.5, 9.6, and 10. Interestingly, in PostgreSQL 10 the threshold
is 200 rows, but in previous versions it's 199.

I suspect the selectivity of 0.5 somehow comes from the
compute_semi_anti_join_factors function in costsize.c, but I'm not sure.

Thanks,
Alexey Ermakov
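[Editorial addition, not part of the original report: a workaround sketch,
assuming the poor estimate stems from the CTE having no statistics, so the
planner falls back to a default ndistinct guess for ids.id. Materializing the
id list into an analyzed temporary table gives the planner real statistics;
the table name ids_tmp is introduced here purely for illustration.]

-- Sketch of a workaround: replace the CTE with an analyzed temp table so
-- the semi-join selectivity is derived from actual statistics.
create temporary table ids_tmp as
    select id from generate_series(1,1000) gs(id) limit 201;
analyze ids_tmp;

explain analyze
select * from test_in where id in (select id from ids_tmp);

With statistics on ids_tmp in place, the row estimate for the semi-join should
track the actual number of distinct ids rather than jumping to half the table.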