Thread: BUG #16324: bad cost estimates for generic query plans
The following bug has been logged on the website:

Bug reference:      16324
Logged by:          Todd Cook
Email address:      tcook@blackducksoftware.com
PostgreSQL version: 11.7
Operating system:   CentOS 7.7
Description:

With PG 11.7, we're seeing bad cost estimates for generic query plans
where the cost of a very expensive InitPlan is not included in the total
cost.

test=# select version() ;
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 11.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

The initial custom plan is very good:

test=# prepare s1 as SELECT EXISTS(SELECT 1 FROM audit_event WHERE id > $1
       AND event_name IN ($2,$3,$4,$5,$6,$7,$8,$9,$10)) ;
PREPARE
test=# explain analyze execute s1(316945699, 'CVA', 'CVCC', 'CVIC', 'CVRDC', 'CVR', 'CVSC', 'CVTC', 'CBE', 'VBCLBC') ;
                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=4.60..4.61 rows=1 width=1) (actual time=0.009..0.009 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using audit_event_pkey on audit_event  (cost=0.57..4.60 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
           Index Cond: (id > '316945699'::bigint)
           Filter: (event_name = ANY ('{CVA,CVCC,CVIC,CVRDC,CVR,CVSC,CVTC,CBE,VBCLBC}'::text[]))
 Planning Time: 0.403 ms
 Execution Time: 0.033 ms
(7 rows)

The audit_event table has 82 million rows, and the listed event_names
account for about 15 million of them.  However, 316945699 is the maximum
id value, so the existence check returns false.

Then, after 5 invocations, PG switches to using a cached, generic query
plan that is very slow:

test=# explain analyze execute s1(316945699, 'CVA', 'CVCC', 'CVIC', 'CVRDC', 'CVR', 'CVSC', 'CVTC', 'CBE', 'VBCLBC') ;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.47..0.48 rows=1 width=1) (actual time=28314.960..28314.961 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on audit_event  (cost=0.00..6796877.67 rows=14532272 width=0) (actual time=28314.953..28314.953 rows=0 loops=1)
           Filter: ((id > $1) AND (event_name = ANY (ARRAY[$2, $3, $4, $5, $6, $7, $8, $9, $10])))
           Rows Removed by Filter: 82349547
 Planning Time: 0.377 ms
 Execution Time: 28315.003 ms
(7 rows)

It looks like the total cost of the plan is not including the substantial
cost of the InitPlan.

FWIW, 9.6.17 exhibits the same behavior.
PG Bug reporting form <noreply@postgresql.org> writes:
> With PG 11.7, we're seeing bad cost estimates for generic query plans where
> the cost of a very expensive InitPlan is not included in the total cost.

> test=# prepare s1 as SELECT EXISTS(SELECT 1 FROM audit_event WHERE id > $1
> AND event_name IN ($2,$3,$4,$5,$6,$7,$8,$9,$10)) ;
> PREPARE

> Then, after 5 invocations, PG switches to using a cached, generic query plan
> that is very slow:

>  Result  (cost=0.47..0.48 rows=1 width=1) (actual time=28314.960..28314.961 rows=1 loops=1)
>    InitPlan 1 (returns $0)
>      ->  Seq Scan on audit_event  (cost=0.00..6796877.67 rows=14532272 width=0) (actual time=28314.953..28314.953 rows=0 loops=1)
>            Filter: ((id > $1) AND (event_name = ANY (ARRAY[$2, $3, $4, $5, $6, $7, $8, $9, $10])))
>            Rows Removed by Filter: 82349547
>  Planning Time: 0.377 ms
>  Execution Time: 28315.003 ms
> (7 rows)

> It looks like the total cost of the plan is not including the substantial
> cost of the InitPlan.

It's not "ignoring" the cost.  What it is doing, since this is an EXISTS
subplan, is assuming that it will fetch the first tuple and stop, the same
as if there'd been a LIMIT 1 in the subquery.  Since the estimated number
of rows is (wrongly) very high, that results in a low estimated cost to
obtain the EXISTS result, specifically 6796877.67 / 14532272 or about
0.47.  Then in reality there are *no* tuples in the result, so that the
seqscan has to run to completion to find that out.  Ooops.

So in general this is just an instance of the well-known difficulty of
estimating costs with small LIMIT accurately.  On the other hand, since
we know the context is EXISTS, maybe we could do better?

There's an argument that the user wouldn't be bothering to test EXISTS if
there weren't a chance of a false result, hence we ought to assume that
the subquery might need to run to completion; which would lead to taking
the cost as being the full run cost not the estimated-time-to-first-tuple.
On the other hand that seems like it would discourage use of fast-start
plans for this purpose, which is probably a net loss.  On the third hand,
it looks like we have already settled on the subplan's plan by this point
so maybe that objection is bogus.

If you want to experiment you could try changing this bit in
cost_subplan:

	if (subplan->subLinkType == EXISTS_SUBLINK)
	{
		/* we only need to fetch 1 tuple; clamp to avoid zero divide */
		sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
	}

The comment above that suggests that this logic needs to match
make_subplan, but I'm thinking we would want to intentionally make them
not match, since the other code is what drives picking a fast-start plan
for the subplan.  You could make an argument for charging full run cost,
or maybe just half of that as we do for ALL/ANY cases, depending on
whether you think we should be taking worst-case estimates or not.

			regards, tom lane
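For readers who want to try the experiment Tom describes, here is a
minimal sketch of what the change could look like, assuming only the
cost_subplan branch quoted above; the surrounding code in
src/backend/optimizer/path/costsize.c is not shown and may differ between
versions, and the comments are mine, not PostgreSQL's:

	if (subplan->subLinkType == EXISTS_SUBLINK)
	{
		/*
		 * Stock behavior charges run cost divided by the estimated row
		 * count, i.e. the time to fetch the first tuple.  The worst-case
		 * alternative discussed here instead charges the full run cost,
		 * on the theory that an EXISTS subquery may have to run to
		 * completion to prove a false result.
		 */
		sp_cost.per_tuple += plan_run_cost;

		/*
		 * A middle-ground alternative, analogous to the 50% assumption
		 * used for the ALL/ANY cases, would be:
		 *
		 *		sp_cost.per_tuple += 0.50 * plan_run_cost;
		 */
	}

Note that, per Tom's point about make_subplan, this would only make the
already-chosen fast-start subplan look more expensive to its caller; it
would not by itself change which plan gets picked for the subquery.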
On 3/27/20, 12:36 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> It's not "ignoring" the cost.  What it is doing, since this is an EXISTS
> subplan, is assuming that it will fetch the first tuple and stop, the
> same as if there'd been a LIMIT 1 in the subquery.  Since the estimated
> number of rows is (wrongly) very high, that results in a low estimated
> cost to obtain the EXISTS result, specifically 6796877.67 / 14532272 or
> about 0.47.  Then in reality there are *no* tuples in the result, so
> that the seqscan has to run to completion to find that out.  Ooops.
>
> So in general this is just an instance of the well-known difficulty of
> estimating costs with small LIMIT accurately.  On the other hand, since
> we know the context is EXISTS, maybe we could do better?
>
> There's an argument that the user wouldn't be bothering to test EXISTS
> if there weren't a chance of a false result, hence we ought to assume
> that the subquery might need to run to completion; which would lead to
> taking the cost as being the full run cost not the
> estimated-time-to-first-tuple.  On the other hand that seems like it
> would discourage use of fast-start plans for this purpose, which is
> probably a net loss.  On the third hand, it looks like we have already
> settled on the subplan's plan by this point so maybe that objection is
> bogus.
>
> If you want to experiment you could try changing this bit in
> cost_subplan:
>
>	if (subplan->subLinkType == EXISTS_SUBLINK)
>	{
>		/* we only need to fetch 1 tuple; clamp to avoid zero divide */
>		sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
>	}

Thanks for looking at this.  I will experiment with that and report back.

As you predicted, changing the query to

	SELECT 1 FROM audit_event WHERE id > $1
	AND event_name IN ($2,$3,$4,$5,$6,$7,$8,$9,$10) limit 1

results in the same behavior.

-- todd

> The comment above that suggests that this logic needs to match
> make_subplan, but I'm thinking we would want to intentionally make them
> not match, since the other code is what drives picking a fast-start
> plan for the subplan.  You could make an argument for charging full run
> cost, or maybe just half of that as we do for ALL/ANY cases, depending
> on whether you think we should be taking worst-case estimates or not.
>
> 			regards, tom lane
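For context on the "half of that as we do for ALL/ANY cases" remark: the
neighboring branch of cost_subplan charges 50% of the run cost plus a
per-row operator charge.  From my reading of costsize.c it is roughly the
following (paraphrased, not a verbatim quote; check your version's source
for the exact text):

	else if (subplan->subLinkType == ALL_SUBLINK ||
			 subplan->subLinkType == ANY_SUBLINK)
	{
		/* assume we scan about 50% of the subplan output on average */
		sp_cost.per_tuple += 0.50 * plan_run_cost;
		/* plus a cpu_operator_cost per row examined for the comparisons */
		sp_cost.per_tuple += 0.50 * plan->plan_rows * cpu_operator_cost;
	}

Applying the same 50% assumption to EXISTS would split the difference
between the current first-tuple estimate and the worst-case full-run
charge Tom outlines above.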