BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Msg-id 152395805004.19366.3107109716821067806@wrigleys.postgresql.org
Responses Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15160
Logged by:          Alexey Ermakov
Email address:      alexey.ermakov@dataegret.com
PostgreSQL version: 10.3
Operating system:   Linux
Description:

Hello,


I'm wondering how planner estimates number of rows in that case:

create table test_in (id int primary key);
insert into test_in select id from generate_series(1,1000000) gs(id);
analyze test_in;

explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 200)
select * from test_in where id in (select id from ids);

-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.93..139.79 rows=200 width=4) (actual time=0.129..0.400 rows=200 loops=1)
   CTE ids
     ->  Limit  (cost=0.00..2.00 rows=200 width=4) (actual time=0.050..0.066 rows=200 loops=1)
           ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.050..0.057 rows=200 loops=1)
   ->  HashAggregate  (cost=4.50..6.50 rows=200 width=4) (actual time=0.117..0.133 rows=200 loops=1)
         Group Key: ids.id
         ->  CTE Scan on ids  (cost=0.00..4.00 rows=200 width=4) (actual time=0.051..0.086 rows=200 loops=1)
   ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=200)
         Index Cond: (id = ids.id)
         Heap Fetches: 200
 Planning time: 0.128 ms
 Execution time: 0.434 ms

explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 201)
select * from test_in where id in (select id from ids);

-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.96..132.78 rows=500000 width=4) (actual time=0.119..0.389 rows=201 loops=1)
   CTE ids
     ->  Limit  (cost=0.00..2.01 rows=201 width=4) (actual time=0.048..0.064 rows=201 loops=1)
           ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.048..0.056 rows=201 loops=1)
   ->  HashAggregate  (cost=4.52..6.52 rows=200 width=4) (actual time=0.113..0.130 rows=201 loops=1)
         Group Key: ids.id
         ->  CTE Scan on ids  (cost=0.00..4.02 rows=201 width=4) (actual time=0.049..0.083 rows=201 loops=1)
   ->  Index Only Scan using test_in_pkey on test_in  (cost=0.42..0.66 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=201)
         Index Cond: (id = ids.id)
         Heap Fetches: 201
 Planning time: 0.068 ms
 Execution time: 0.417 ms

Please note that in the first example we got a correct estimate of the total number of rows (200), but in the second one, where the CTE returned 201 rows instead of 200, the total is estimated as 500000 rows (half of table test_in), which is way off and could lead to a non-optimal plan and poor performance.
I get the same estimate if I replace the IN clause with an equivalent EXISTS subquery; a normal join estimates the number of rows fine (but it's not equivalent in the general case, when test_in.id is not unique).
Reproduced in 9.5, 9.6 and 10. Interestingly, in PostgreSQL 10 the threshold is 200 rows, but in previous versions it's 199.
I suspect we somehow end up with a selectivity of 0.5 inside the compute_semi_anti_join_factors function in costsize.c, but I'm not sure.
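For reference, sketches of the two rewrites mentioned above, against the same test_in table and CTE as in the examples (the alias t is added only for readability; plans not included here):

```sql
-- EXISTS form: per the above, this gets the same rows=500000 estimate
-- once the CTE produces more than the threshold number of rows.
explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 201)
select t.* from test_in t
where exists (select 1 from ids where ids.id = t.id);

-- Plain join: estimates the row count fine, but is only equivalent
-- to the IN/EXISTS forms while test_in.id is unique.
explain analyze
with ids as (select id from generate_series(1,1000) gs(id) limit 201)
select t.* from test_in t
join ids on ids.id = t.id;
```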


Thanks,
Alexey Ermakov

