Planner making bad choice in alternative subplan decision - Mailing list pgsql-hackers

From: David Rowley
Subject: Planner making bad choice in alternative subplan decision
Msg-id: CAApHDvpbJHwMZ1U-nzU0kBxu0kwMpBvyL+AFWvFAmurypSo1SQ@mail.gmail.com
Responses: Re: Planner making bad choice in alternative subplan decision
List: pgsql-hackers
This follows on from what I mentioned in [1]. I'll reiterate below:

Really the topic is much more broad than what I wrote above, but the example case I have is around subplan costing which results in the alternative subplan code making a bad choice.

So the example case is:

create table t (a int, b int);
insert into t select x,1 from generate_Series(1,10000)x;
create index t_b_idx on t (b);
analyze t;

The query is:

select * from t where exists (select 1 from t t2 where t.a=t2.b) or a < 0;

Here the planner has 3 choices:

1. Use a non-hashed subplan and use t_b_idx to find if there's a match for the EXISTS, or;
2. Use a hashed subplan and just scan all of t and put that in a hash table to probe when scanning the outer query, or;
3. Use a seq-scan on the inner table and fetch 1 row matching t.a=t2.b for each outer row.

#3 is not going to work out well here since t.a has the values from 1 to 10000 and t2.b has just one distinct value, which is 1. For 9999 of the scans we'd have to seq-scan all rows in the entire t2 to find out that no row exists. This would be a terrible plan if the planner went with it.

Unfortunately, that's the exact plan the planner goes with :-(

Let me explain:

The subquery planning that's going on is pretty much planning the following. I'll use PREPARE and a generic plan to simulate how the planner deals with unknown parameter values in the subquery. I'll add a LIMIT 1 since the subquery_planner will be passed a 1.0 tuple_fraction when planning an EXISTS_SUBLINK.

set plan_cache_mode = force_generic_plan;
prepare q1(int) as select b from t where b = $1 limit 1;

postgres=# explain (analyze, costs off, timing off) execute q1(1);
                 QUERY PLAN
---------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Seq Scan on t (actual rows=1 loops=1)
         Filter: (b = $1)
 Planning Time: 0.035 ms
 Execution Time: 0.089 ms
(5 rows)

So here the planner has to plan with an unknown parameter value. var_eq_non_const() comes along and sees the stats say there's only 1 distinct value in the "b" column, so the selectivity is 1.0 / 1.0, aka 1.0. That works out well when I pass the value of 1 to execute, but if I pass in anything else, then:

postgres=# explain (analyze, costs off, timing off) execute q1(0);
                 QUERY PLAN
---------------------------------------------
 Limit (actual rows=0 loops=1)
   ->  Seq Scan on t (actual rows=0 loops=1)
         Filter: (b = $1)
         Rows Removed by Filter: 10000
 Planning Time: 0.017 ms
 Execution Time: 1.850 ms
(6 rows)

We run through the entire table to find nothing. I removed the costs here to trim down the output, but the row estimate is 10000 in both cases.

The reason the alternative subplan code picks the non-hashed plan is that the estimated number of rows for the subplan is 10000, and we apply the scan cost in cost_subplan() as:

    /* we only need to fetch 1 tuple; clamp to avoid zero divide */
    sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);

On my machine the seqscan of the 10000 rows costs 180, so cost_subplan() costs the subplan with 180 / 10000, i.e. 0.018 per outer row, which works out to about 180 over the 10000 outer rows. That's slightly more favourable than the hashed subplan, since that's planned as the seqscan plus the cost of hashing. The qual-less scan in this case costs just 155, but the cost of hashing, coincidentally, takes the startup cost to 180. In the very recently committed fix_alternative_subplan() function, by the time we account for the hash probe cost, the hashed subplan will cost 205. That code ends up choosing the non-hashed plan, aka the dreaded #3 case from above.
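To see which alternative got picked, it's enough to EXPLAIN the problem query itself. A minimal check against the same table as above (the exact plan shape may vary by version, but when the bad choice is made the Filter line shows a plain "SubPlan" rather than a "hashed SubPlan"; note that with ANALYZE this runs the full 7.4 seconds mentioned below):

-- when the non-hashed subplan is picked, expect something like
-- Filter: ((SubPlan 1) OR (a < 0)) rather than (hashed SubPlan 1)
explain (analyze, costs off, timing off)
select * from t where exists (select 1 from t t2 where t.a=t2.b) or a < 0;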
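And the stats that lead var_eq_non_const() to the 1.0 selectivity can be inspected directly; a quick sketch using the standard pg_stats view:

-- n_distinct = 1 for t.b is what makes the generic-plan selectivity 1.0 / 1.0
select n_distinct, most_common_vals
from pg_stats
where tablename = 't' and attname = 'b';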
Fixing it:

I'm not really quite sure which part of the planner to blame for this poor choice. Is it the fact that the planner picked such a risky seqscan path to fetch 1 row from the table? Or is it cost_subplan() for dividing the total cost for the EXISTS_SUBLINK by the row estimate? Certainly if we change that to do the same as ALL_SUBLINK, then that would fix this particular problem, but it might penalise other cases incorrectly.

I'm leaning more towards the fact that the planner picked the seqscan path in the first place. Unfortunately I don't have any great ideas on how to fix that without fudging the costs a bit. Maybe we can make var_eq_non_const() divide down the selectivity by:

ndistinct = Max(get_variable_numdistinct(vardata, &isdefault), DEFAULT_NUM_DISTINCT);

instead of:

ndistinct = get_variable_numdistinct(vardata, &isdefault);

But that's going to make the row estimate in this case 50 rows (10000 / 200), with a subplan that the stats say can only return either 10000 rows or 0 rows. Is that too weird?

FWIW, the execution time for #3 is 7.4 seconds on my machine. Using the hashed subplan (#2) takes 3.7 milliseconds, or 2000 times faster, if you'd prefer that.

Anyone else have any thoughts on this?

David

(Copied in Tom, as he was the one I mentioned this to in [1] and to whom I promised another thread for this.)

[1] https://www.postgresql.org/message-id/CAApHDvqeh8JEqMjpCFTgHD_zu2S03nOVh2srejd+sNLza8M+mg@mail.gmail.com
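For anyone wanting to watch the estimate such a change would move, a sketch reusing the q1 prepared statement from the session above (the 50-row figure assumes DEFAULT_NUM_DISTINCT is 200, as in the current sources):

-- with plan_cache_mode still forcing a generic plan, the Seq Scan under
-- the Limit is estimated at 10000 rows today; clamping ndistinct with
-- Max(..., DEFAULT_NUM_DISTINCT) would bring that estimate down to ~50
explain execute q1(0);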