Thread: Startup cost of sequential scan
Hi!

Our customer has a bad plan problem, which can be reduced to the following example.

create table t1 (id int primary key, k int);
create table t2 (id int);
insert into t1 (select i, i from generate_series(1,1000000) i);
insert into t2 (select 0 from generate_series(1,100) i);
insert into t2 values (500000);

For the following query the plan is OK, even though the number of resulting rows is greatly overestimated.

# explain select * from t1
    join t2 x1 on x1.id = t1.k
    join t2 x2 on x2.id = t1.k
    join t2 x3 on x3.id = t1.k
    join t2 x4 on x4.id = t1.k
    join t2 x5 on x5.id = t1.k
    join t2 x6 on x6.id = t1.k
    join t2 x7 on x7.id = t1.k
    join t2 x8 on x8.id = t1.k
    join t2 x9 on x9.id = t1.k
    where t1.id = 500000;
                                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=265.98..104071348014760.11 rows=9240411823550620 width=44)
   Hash Cond: (x1.id = x8.id)
   ->  Hash Join (cost=22.80..25.20 rows=942425865760 width=36)
         Hash Cond: (x1.id = t1.k)
         ->  Seq Scan on t2 x1 (cost=0.00..2.01 rows=101 width=4)
         ->  Hash (cost=22.79..22.79 rows=1 width=32)
               ->  Hash Join (cost=20.39..22.79 rows=1 width=32)
                     Hash Cond: (x7.id = t1.k)
                     ->  Seq Scan on t2 x7 (cost=0.00..2.01 rows=101 width=4)
                     ->  Hash (cost=20.38..20.38 rows=1 width=28)
                           ->  Hash Join (cost=17.98..20.38 rows=1 width=28)
                                 Hash Cond: (x6.id = t1.k)
                                 ->  Seq Scan on t2 x6 (cost=0.00..2.01 rows=101 width=4)
                                 ->  Hash (cost=17.96..17.96 rows=1 width=24)
                                       ->  Hash Join (cost=15.57..17.96 rows=1 width=24)
                                             Hash Cond: (x5.id = t1.k)
                                             ->  Seq Scan on t2 x5 (cost=0.00..2.01 rows=101 width=4)
                                             ->  Hash (cost=15.55..15.55 rows=1 width=20)
                                                   ->  Hash Join (cost=13.15..15.55 rows=1 width=20)
                                                         Hash Cond: (x4.id = t1.k)
                                                         ->  Seq Scan on t2 x4 (cost=0.00..2.01 rows=101 width=4)
                                                         ->  Hash (cost=13.14..13.14 rows=1 width=16)
                                                               ->  Hash Join (cost=10.74..13.14 rows=1 width=16)
                                                                     Hash Cond: (x3.id = t1.k)
                                                                     ->  Seq Scan on t2 x3 (cost=0.00..2.01 rows=101 width=4)
                                                                     ->  Hash (cost=10.73..10.73 rows=1 width=12)
                                                                           ->  Hash Join (cost=8.46..10.73 rows=1 width=12)
                                                                                 Hash Cond: (x2.id = t1.k)
                                                                                 ->  Seq Scan on t2 x2 (cost=0.00..2.01 rows=101 width=4)
                                                                                 ->  Hash (cost=8.44..8.44 rows=1 width=8)
                                                                                       ->  Index Scan using t1_pkey on t1 (cost=0.42..8.44 rows=1 width=8)
                                                                                             Index Cond: (id = 500000)
   ->  Hash (cost=118.17..118.17 rows=10001 width=8)
         ->  Hash Join (cost=3.27..118.17 rows=10001 width=8)
               Hash Cond: (x8.id = x9.id)
               ->  Seq Scan on t2 x8 (cost=0.00..2.01 rows=101 width=4)
               ->  Hash (cost=2.01..2.01 rows=101 width=4)
                     ->  Seq Scan on t2 x9 (cost=0.00..2.01 rows=101 width=4)
(38 rows)

But when you add a LIMIT clause, the index scan on t1 turns into a sequential scan.
# explain select * from t1
    join t2 x1 on x1.id = t1.k
    join t2 x2 on x2.id = t1.k
    join t2 x3 on x3.id = t1.k
    join t2 x4 on x4.id = t1.k
    join t2 x5 on x5.id = t1.k
    join t2 x6 on x6.id = t1.k
    join t2 x7 on x7.id = t1.k
    join t2 x8 on x8.id = t1.k
    join t2 x9 on x9.id = t1.k
    where t1.id = 500000 limit 100;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.00..1.55 rows=100 width=44)
   ->  Nested Loop (cost=0.00..142805795187416.44 rows=9240411823550620 width=44)
         Join Filter: (x1.id = x9.id)
         ->  Nested Loop (cost=0.00..1427775203576.57 rows=93318825071840 width=40)
               Join Filter: (x1.id = x8.id)
               ->  Nested Loop (cost=0.00..16947.91 rows=942425865760 width=36)
                     Join Filter: (t1.k = x1.id)
                     ->  Nested Loop (cost=0.00..16944.63 rows=1 width=32)
                           Join Filter: (t1.k = x7.id)
                           ->  Nested Loop (cost=0.00..16941.36 rows=1 width=28)
                                 Join Filter: (t1.k = x6.id)
                                 ->  Nested Loop (cost=0.00..16938.09 rows=1 width=24)
                                       Join Filter: (t1.k = x5.id)
                                       ->  Nested Loop (cost=0.00..16934.82 rows=1 width=20)
                                             Join Filter: (t1.k = x4.id)
                                             ->  Nested Loop (cost=0.00..16931.54 rows=1 width=16)
                                                   Join Filter: (t1.k = x3.id)
                                                   ->  Nested Loop (cost=0.00..16928.27 rows=1 width=12)
                                                         Join Filter: (t1.k = x2.id)
                                                         ->  Seq Scan on t1 (cost=0.00..16925.00 rows=1 width=8)
                                                               Filter: (id = 500000)
                                                         ->  Seq Scan on t2 x2 (cost=0.00..2.01 rows=101 width=4)
                                                   ->  Seq Scan on t2 x3 (cost=0.00..2.01 rows=101 width=4)
                                             ->  Seq Scan on t2 x4 (cost=0.00..2.01 rows=101 width=4)
                                       ->  Seq Scan on t2 x5 (cost=0.00..2.01 rows=101 width=4)
                                 ->  Seq Scan on t2 x6 (cost=0.00..2.01 rows=101 width=4)
                           ->  Seq Scan on t2 x7 (cost=0.00..2.01 rows=101 width=4)
                     ->  Seq Scan on t2 x1 (cost=0.00..2.01 rows=101 width=4)
               ->  Materialize (cost=0.00..2.51 rows=101 width=4)
                     ->  Seq Scan on t2 x8 (cost=0.00..2.01 rows=101 width=4)
         ->  Materialize (cost=0.00..2.51 rows=101 width=4)
               ->  Seq Scan on t2 x9 (cost=0.00..2.01 rows=101 width=4)
(32 rows)

Obviously, that happens because the selectivity of the join with t2 is overestimated and then multiplied many times over, and improvements in this area seem quite hard to me.

But I think there is another issue in sequential scan cost. We have zero startup cost for sequential scan. But why? As I see it, a sequential scan should estimate the amount of resources we need to spend in order to start returning rows. So, in our example, when we expect to find 1 row in the table, on average we have to scan half of the table before we find it. Thus, startup_cost should be about half of the total cost. The attached POC patch implements that. With it, the sequential scan in the given example turns back into an index scan.

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
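(The POC patch itself is only in the attachment and is not quoted in the thread. As a rough sketch of the idea only, and not necessarily what the patch actually does, the change to cost_seqscan() in costsize.c could look something like the following, reusing that function's local variable names:)

    Cost        run_cost = cpu_run_cost + disk_run_cost;

    /*
     * Sketch only (assumption, not the actual patch): with "rows" expected
     * matches spread uniformly over the table, we expect to read about
     * 1/(rows + 1) of it before hitting the first match, so charge that
     * share of the run cost as startup cost instead of leaving startup at
     * zero.  For rows == 1 this makes startup about half of total cost.
     */
    path->startup_cost = startup_cost + run_cost / (path->rows + 1);
    path->total_cost = startup_cost + run_cost;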
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> But I think there is another issue in sequential scan cost. We have
> zero startup cost for sequential scan. But why?

Because it's what the mental model of startup cost says it should be.

Also, I do not think we can change that without a whole lot of unpleasant side effects on cases that work well today. Have you even checked to see how much of the regression tests break with this change?

			regards, tom lane
On Thu, Aug 30, 2018 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > But I think there is another issue in sequential scan cost. We have
> > zero startup cost for sequential scan. But why?
>
> Because it's what the mental model of startup cost says it should be.

Right. So, as I understand it, the model is that we start the sequential scan immediately and then find one row uniformly distributed over the whole table. From this model we conclude that we start getting rows from the sequential scan sooner than from the index scan. And this conclusion doesn't reflect reality. Even if the estimates for the joins with t2 weren't wrong, our plan for the LIMIT query would still be bad, because it would still be faster to get that one row using an index scan rather than a sequential scan.

So, if I understand the mental model correctly, our cost estimate for the LIMIT query is based on the idea that we need to fetch only a *fraction* of a row from t1 in order to get 100 resulting rows. This is obviously wrong, because we're always dealing with whole rows :) But I can't imagine how we could take care of that without redesigning our whole costing model...

> Also, I do not think we can change that without a whole lot of unpleasant
> side effects on cases that work well today. Have you even checked to
> see how much of the regression tests break with this change?

I thought it was too early to discuss this, but yes, I've checked. Surprisingly few tests break: just some reordering of tables in partition_join and select_parallel.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Thu, Aug 30, 2018 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Because it's what the mental model of startup cost says it should be.

> From this model we conclude that we start getting rows
> from the sequential scan sooner than from the index scan. And this
> conclusion doesn't reflect reality.

No, startup cost is not the "time to find the first row". It's overhead paid before you even get to start examining rows.

I'm disinclined to consider fundamental changes to our costing model on the basis of this example. The fact that the rowcount estimates are so far off reality means that you're basically looking at "garbage in, garbage out" for the cost calculations --- and applying a small LIMIT just magnifies that.

It'd be more useful to think first about how to make the selectivity estimates better; after that, we might or might not still think there's a costing issue.

			regards, tom lane
On 30.08.2018 17:58, Tom Lane wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
>> On Thu, Aug 30, 2018 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Because it's what the mental model of startup cost says it should be.
>> From this model we conclude that we start getting rows
>> from the sequential scan sooner than from the index scan. And this
>> conclusion doesn't reflect reality.
> No, startup cost is not the "time to find the first row". It's overhead
> paid before you even get to start examining rows.

But it seems to me that the cost calculation in the LIMIT node contradicts this statement:

    pathnode->path.startup_cost +=
        (subpath->total_cost - subpath->startup_cost)
        * offset_rows / subpath->rows;

> I'm disinclined to consider fundamental changes to our costing model
> on the basis of this example. The fact that the rowcount estimates are
> so far off reality means that you're basically looking at "garbage in,
> garbage out" for the cost calculations --- and applying a small LIMIT
> just magnifies that.
>
> It'd be more useful to think first about how to make the selectivity
> estimates better; after that, we might or might not still think there's
> a costing issue.
>
> regards, tom lane

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Aug 30, 2018 at 6:08 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> On 30.08.2018 17:58, Tom Lane wrote:
> > Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> >> On Thu, Aug 30, 2018 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Because it's what the mental model of startup cost says it should be.
> >> From this model we conclude that we start getting rows
> >> from the sequential scan sooner than from the index scan. And this
> >> conclusion doesn't reflect reality.
> > No, startup cost is not the "time to find the first row". It's overhead
> > paid before you even get to start examining rows.
> But it seems to me that the cost calculation in the LIMIT node contradicts
> this statement:
>
>     pathnode->path.startup_cost +=
>         (subpath->total_cost - subpath->startup_cost)
>         * offset_rows / subpath->rows;

Why does it contradict? It just treats skipping the OFFSET rows as preliminary work done before returning result rows...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>>>>> "Konstantin" == Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes: >> No, startup cost is not the "time to find the first row". It's >> overhead paid before you even get to start examining rows. Konstantin> But it seems to me that calculation of cost in LIMIT node Konstantin> contradicts with this statement: The model (assuming I understand it rightly) is that what we're actually tracking is a startup cost and a per-output-row cost, but for comparison purposes we actually store the rows and the computed total, rather than just the per-row cost: rows startup_cost total_cost = startup_cost + (rows * per_row_cost) So what Limit is doing the for the offset count is recovering the subpath's per_row_cost from (total_cost - startup_cost)/rows, and then scaling that by the number of rows in the offset (which are being discarded), and adding that to the startup cost. So this is saying: the startup cost for OFFSET N is the startup cost of the subplan, plus the cost of fetching N rows from the subplan. (And after fetching those N rows, we still haven't found the first row that we will actually return.) For LIMIT N, we instead replace the old total cost with a new one calculated from the startup cost plus N times the subplan's per-row cost. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> The model (assuming I understand it rightly) is that what we're actually
> tracking is a startup cost and a per-output-row cost, but for comparison
> purposes we actually store the rows and the computed total, rather than
> just the per-row cost:
>     rows
>     startup_cost
>     total_cost = startup_cost + (rows * per_row_cost)

Right. I tend to think of it differently: the cost to retrieve the first K rows out of a total of N is

    startup_cost + (total_cost - startup_cost) * K/N

but that comes out to the same thing as what Andrew said.

			regards, tom lane
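(Worked example, plugging in the numbers from the LIMIT plan upthread: there K = 100, N = 9240411823550620, and the subplan has startup_cost = 0.00 and total_cost = 142805795187416.44, so

    0.00 + (142805795187416.44 - 0.00) * 100 / 9240411823550620 = ~1.55

which matches the 0.00..1.55 shown on the Limit node in that plan.)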
On Thu, Aug 30, 2018 at 5:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > On Thu, Aug 30, 2018 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Because it's what the mental model of startup cost says it should be.
>
> > From this model we conclude that we start getting rows
> > from the sequential scan sooner than from the index scan. And this
> > conclusion doesn't reflect reality.
>
> No, startup cost is not the "time to find the first row". It's overhead
> paid before you even get to start examining rows.
>
> I'm disinclined to consider fundamental changes to our costing model
> on the basis of this example. The fact that the rowcount estimates are
> so far off reality means that you're basically looking at "garbage in,
> garbage out" for the cost calculations --- and applying a small LIMIT
> just magnifies that.
>
> It'd be more useful to think first about how to make the selectivity
> estimates better; after that, we might or might not still think there's
> a costing issue.

I understand that startup cost is not "time to find the first row". But I think this example highlights not one but two issues:
1) Row count estimates for joins are wrong.
2) Rows are assumed to be continuous while in reality they are discrete.

So, if we reverse the assumptions made in the LIMIT clause estimation, we can say that it basically assumes we need to fetch only a fraction of a row from the sequential scan node. And even in a case where we really do fetch 101 rows in each join with t2, this logic would still bring us to the bad plan.

And I'm not proposing that we rush to redesign the planner to fix that now. I just think it's probably something worth discussing.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> I understand that startup cost is not "time to find the first row".
> But I think this example highlights not one but two issues:
> 1) Row count estimates for joins are wrong.

Yup.

> 2) Rows are assumed to be continuous while in reality they are
> discrete.

Where do you get that from? The problem this example is highlighting is mostly just the bad join size estimate, imo. It's not at all clear that the planner would still do the wrong thing if that were fixed. In fact, if I replace the contents of t2 with some other distribution, such as

    insert into t2 (select i from generate_series(1,100) i);

I get a much better join size estimate *and* a sane plan, even in the LIMIT case. So the problem here is just with the join size estimate.

			regards, tom lane
On Thu, Aug 30, 2018 at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
>> But I think there is another issue in sequential scan cost. We have
>> zero startup cost for sequential scan. But why?
>
> Because it's what the mental model of startup cost says it should be.

Whose mental model? I guess the Tom Lane mind is the canonical one for this project, but I'm not sure that it entirely agrees with mine. IIRC, it was previously proposed that we ought to charge random_page_cost for the first block of a sequential scan, because at present the cost of fetching 1 block differs depending on whether we are fetching it from a heap or an index, which seems unprincipled.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Whose mental model? I guess the Tom Lane mind is the canonical one
> for this project, but I'm not sure that it entirely agrees with mine.

Since the fact that we have a notion of startup cost at all is entirely my fault, I don't feel shy about claiming to have the authoritative view of what it means. (Whether that's adequately documented is another question :-()

> IIRC, it was previously proposed that we ought to charge
> random_page_cost for the first block of a sequential scan, because at
> present the cost of fetching 1 block differs depending on whether we
> are fetching it from a heap or an index, which seems unprincipled.

That might well be a sane thing to do ... but it'd still be part of run cost not startup cost.

			regards, tom lane
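(For illustration only, as one reading of that suggestion rather than anything that exists in the tree: such a tweak would live on the run-cost side of cost_seqscan(), along the lines of the sketch below, where spc_random_page_cost is assumed to be fetched alongside spc_seq_page_cost.)

    /* Illustrative sketch only -- not actual PostgreSQL code. */
    disk_run_cost = spc_seq_page_cost * baserel->pages;
    if (baserel->pages > 0)
    {
        /* charge the first block at random_page_cost, but as run cost */
        disk_run_cost += spc_random_page_cost - spc_seq_page_cost;
    }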
On Thu, Aug 30, 2018 at 6:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > I understand that startup cost is not "time to find the first row".
> > But I think this example highlights not one but two issues:
> > 1) Row count estimates for joins are wrong.
>
> Yup.
>
> > 2) Rows are assumed to be continuous while in reality they are
> > discrete.
>
> Where do you get that from?

I made a similar example where the estimates are good. It's quite artificial, because it selects a small LIMIT from an enormous virtual table constructed by cross joins. But it illustrates how our model, which assumes tuples to be continuous, can differ from reality.

create table t1 (id int primary key, k int);
create table t2 (id int);
insert into t1 (select i, i from generate_series(1,1000000) i);
insert into t2 (select 0 from generate_series(1,100) i);
vacuum analyze t1, t2;

By default, the query is processed using a sequential scan of t1.

# explain analyze select * from t1,t2 x1,t2 x2,t2 x3,t2 x4,t2 x5 where t1.id = 500000 limit 100;
                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.00..1.26 rows=100 width=28) (actual time=32.879..32.926 rows=100 loops=1)
   ->  Nested Loop (cost=0.00..126279562.00 rows=10000000000 width=28) (actual time=32.878..32.912 rows=100 loops=1)
         ->  Nested Loop (cost=0.00..1279559.75 rows=100000000 width=24) (actual time=32.873..32.873 rows=1 loops=1)
               ->  Nested Loop (cost=0.00..29557.50 rows=1000000 width=20) (actual time=32.868..32.868 rows=1 loops=1)
                     ->  Nested Loop (cost=0.00..17055.25 rows=10000 width=16) (actual time=32.864..32.864 rows=1 loops=1)
                           ->  Nested Loop (cost=0.00..16928.00 rows=100 width=12) (actual time=32.856..32.856 rows=1 loops=1)
                                 ->  Seq Scan on t1 (cost=0.00..16925.00 rows=1 width=8) (actual time=32.841..32.841 rows=1 loops=1)
                                       Filter: (id = 500000)
                                       Rows Removed by Filter: 501983
                                 ->  Seq Scan on t2 x1 (cost=0.00..2.00 rows=100 width=4) (actual time=0.011..0.011 rows=1 loops=1)
                           ->  Materialize (cost=0.00..2.50 rows=100 width=4) (actual time=0.007..0.007 rows=1 loops=1)
                                 ->  Seq Scan on t2 x2 (cost=0.00..2.00 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                     ->  Materialize (cost=0.00..2.50 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                           ->  Seq Scan on t2 x3 (cost=0.00..2.00 rows=100 width=4) (actual time=0.002..0.003 rows=1 loops=1)
               ->  Materialize (cost=0.00..2.50 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                     ->  Seq Scan on t2 x4 (cost=0.00..2.00 rows=100 width=4) (actual time=0.002..0.003 rows=1 loops=1)
         ->  Materialize (cost=0.00..2.50 rows=100 width=4) (actual time=0.004..0.024 rows=100 loops=1)
               ->  Seq Scan on t2 x5 (cost=0.00..2.00 rows=100 width=4) (actual time=0.003..0.010 rows=100 loops=1)
 Planning Time: 0.372 ms
 Execution Time: 32.987 ms
(20 rows)

But if we disable sequential scans and bitmap scans, we get an index scan, which is much faster.
# set enable_seqscan = off;
# set enable_bitmapscan = off;
# explain analyze select * from t1,t2 x1,t2 x2,t2 x3,t2 x4,t2 x5 where t1.id = 500000 limit 100;
                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=50000000000.43..50000000001.69 rows=100 width=28) (actual time=0.041..0.092 rows=100 loops=1)
   ->  Nested Loop (cost=50000000000.43..50126262645.44 rows=10000000000 width=28) (actual time=0.040..0.079 rows=100 loops=1)
         ->  Nested Loop (cost=40000000000.43..40001262643.19 rows=100000000 width=24) (actual time=0.036..0.036 rows=1 loops=1)
               ->  Nested Loop (cost=30000000000.42..30000012640.94 rows=1000000 width=20) (actual time=0.033..0.033 rows=1 loops=1)
                     ->  Nested Loop (cost=20000000000.42..20000000138.69 rows=10000 width=16) (actual time=0.029..0.029 rows=1 loops=1)
                           ->  Nested Loop (cost=10000000000.42..10000000011.44 rows=100 width=12) (actual time=0.023..0.023 rows=1 loops=1)
                                 ->  Index Scan using t1_pkey on t1 (cost=0.42..8.44 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
                                       Index Cond: (id = 500000)
                                 ->  Seq Scan on t2 x1 (cost=10000000000.00..10000000002.00 rows=100 width=4) (actual time=0.007..0.007 rows=1 loops=1)
                           ->  Materialize (cost=10000000000.00..10000000002.50 rows=100 width=4) (actual time=0.004..0.005 rows=1 loops=1)
                                 ->  Seq Scan on t2 x2 (cost=10000000000.00..10000000002.00 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                     ->  Materialize (cost=10000000000.00..10000000002.50 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                           ->  Seq Scan on t2 x3 (cost=10000000000.00..10000000002.00 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
               ->  Materialize (cost=10000000000.00..10000000002.50 rows=100 width=4) (actual time=0.003..0.003 rows=1 loops=1)
                     ->  Seq Scan on t2 x4 (cost=10000000000.00..10000000002.00 rows=100 width=4) (actual time=0.002..0.003 rows=1 loops=1)
         ->  Materialize (cost=10000000000.00..10000000002.50 rows=100 width=4) (actual time=0.003..0.026 rows=100 loops=1)
               ->  Seq Scan on t2 x5 (cost=10000000000.00..10000000002.00 rows=100 width=4) (actual time=0.003..0.010 rows=100 loops=1)
 Planning Time: 0.274 ms
 Execution Time: 0.150 ms
(19 rows)

From this example, I conclude that there is a distinct issue: we assume rows to be continuous, while in reality they are discrete.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
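(For illustration, applying the formula discussed upthread to the first, default-settings plan in this example: the Limit node is costed at 0.00 + (126279562.00 - 0.00) * 100 / 10000000000 = ~1.26, i.e. the model charges only a tiny fraction of the underlying scan. In reality the Seq Scan on t1 has to read about half a million rows, as the "Rows Removed by Filter: 501983" line shows, before its single matching row, and hence the first result row, can be produced; that whole-row, discrete cost is what the continuous model misses.)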