Startup cost of sequential scan - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Startup cost of sequential scan |
Date | |
Msg-id | CAPpHfdvfDAXPXhFQT3Ww=kZ4tpyAsD07_oz8-fh0=p6eckEBKQ@mail.gmail.com Whole thread Raw |
Responses |
Re: Startup cost of sequential scan
|
List | pgsql-hackers |
Hi! Our customer have a bad plan problem, which could be reduced to the following example. create table t1 (id int primary key, k int); create table t2 (id int); insert into t1 (select i, i from generate_series(1,1000000) i); insert into t2 (select 0 from generate_series(1,100)i); insert into t2 values (500000); For the following query the plan is OK despite number of resulting rows is very much overestimated. # explain select * from t1 join t2 x1 on x1.id = t1.k join t2 x2 on x2.id = t1.k join t2 x3 on x3.id = t1.k join t2 x4 on x4.id = t1.k join t2 x5 on x5.id = t1.k join t2 x6 on x6.id = t1.k join t2 x7 on x7.id = t1.k join t2 x8 on x8.id = t1.k join t2 x9 on x9.id = t1.k where t1.id = 500000; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Hash Join (cost=265.98..104071348014760.11 rows=9240411823550620 width=44) Hash Cond: (x1.id = x8.id) -> Hash Join (cost=22.80..25.20 rows=942425865760 width=36) Hash Cond: (x1.id = t1.k) -> Seq Scan on t2 x1 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=22.79..22.79 rows=1 width=32) -> Hash Join (cost=20.39..22.79 rows=1 width=32) Hash Cond: (x7.id = t1.k) -> Seq Scan on t2 x7 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=20.38..20.38 rows=1 width=28) -> Hash Join (cost=17.98..20.38 rows=1 width=28) Hash Cond: (x6.id = t1.k) -> Seq Scan on t2 x6 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=17.96..17.96 rows=1 width=24) -> Hash Join (cost=15.57..17.96 rows=1 width=24) Hash Cond: (x5.id = t1.k) -> Seq Scan on t2 x5 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=15.55..15.55 rows=1 width=20) -> Hash Join (cost=13.15..15.55 rows=1 width=20) Hash Cond: (x4.id = t1.k) -> Seq Scan on t2 x4 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=13.14..13.14 rows=1 width=16) -> Hash Join (cost=10.74..13.14 rows=1 width=16) Hash Cond: (x3.id = t1.k) -> Seq Scan on t2 x3 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=10.73..10.73 rows=1 width=12) -> Hash Join (cost=8.46..10.73 rows=1 width=12) Hash Cond: (x2.id = t1.k) -> Seq Scan on t2 x2 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=8.44..8.44 rows=1 width=8) -> Index Scan using t1_pkey on t1 (cost=0.42..8.44 rows=1 width=8) Index Cond: (id = 500000) -> Hash (cost=118.17..118.17 rows=10001 width=8) -> Hash Join (cost=3.27..118.17 rows=10001 width=8) Hash Cond: (x8.id = x9.id) -> Seq Scan on t2 x8 (cost=0.00..2.01 rows=101 width=4) -> Hash (cost=2.01..2.01 rows=101 width=4) -> Seq Scan on t2 x9 (cost=0.00..2.01 rows=101 width=4) (38 rows) But when you add LIMIT clause, index scan on t1 turns out into sequential scan. # explain select * from t1 join t2 x1 on x1.id = t1.k join t2 x2 on x2.id = t1.k join t2 x3 on x3.id = t1.k join t2 x4 on x4.id = t1.k join t2 x5 on x5.id = t1.k join t2 x6 on x6.id = t1.k join t2 x7 on x7.id = t1.k join t2 x8 on x8.id = t1.k join t2 x9 on x9.id = t1.k where t1.id = 500000 limit 100; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..1.55 rows=100 width=44) -> Nested Loop (cost=0.00..142805795187416.44 rows=9240411823550620 width=44) Join Filter: (x1.id = x9.id) -> Nested Loop (cost=0.00..1427775203576.57 rows=93318825071840 width=40) Join Filter: (x1.id = x8.id) -> Nested Loop (cost=0.00..16947.91 rows=942425865760 width=36) Join Filter: (t1.k = x1.id) -> Nested Loop (cost=0.00..16944.63 rows=1 width=32) Join Filter: (t1.k = x7.id) -> Nested Loop (cost=0.00..16941.36 rows=1 width=28) Join Filter: (t1.k = x6.id) -> Nested Loop (cost=0.00..16938.09 rows=1 width=24) Join Filter: (t1.k = x5.id) -> Nested Loop (cost=0.00..16934.82 rows=1 width=20) Join Filter: (t1.k = x4.id) -> Nested Loop (cost=0.00..16931.54 rows=1 width=16) Join Filter: (t1.k = x3.id) -> Nested Loop (cost=0.00..16928.27 rows=1 width=12) Join Filter: (t1.k = x2.id) -> Seq Scan on t1 (cost=0.00..16925.00 rows=1 width=8) Filter: (id = 500000) -> Seq Scan on t2 x2 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x3 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x4 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x5 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x6 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x7 (cost=0.00..2.01 rows=101 width=4) -> Seq Scan on t2 x1 (cost=0.00..2.01 rows=101 width=4) -> Materialize (cost=0.00..2.51 rows=101 width=4) -> Seq Scan on t2 x8 (cost=0.00..2.01 rows=101 width=4) -> Materialize (cost=0.00..2.51 rows=101 width=4) -> Seq Scan on t2 x9 (cost=0.00..2.01 rows=101 width=4) (32 rows) Obviously, that happens because selectivity for join with t2 is overestimated and multiplied many times. And improvements in this area seems quite hard for me. But I think there is another issue in sequential scan cost. We have zero startup cost for sequential scan. But why? As I get sequential scan should estimate amount of resources we need to spend in order to start returning rows. So, in our example when we expect finding 1 row from the table, in average we have to scan half of table before find this row. Thus, startup_cost should be about half of total cost. Attached POC patch implements that. In the given example sequential scan turns out back to index scan. Any thoughts? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: