Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers
From | David Rowley
---|---
Subject | Re: disfavoring unparameterized nested loops
Date |
Msg-id | CAApHDvo2sMPF9m=i+YPPUssfTV1GB=Z8nMVa+9Uq4RZJ8sULeQ@mail.gmail.com
In response to | disfavoring unparameterized nested loops (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: disfavoring unparameterized nested loops; Re: disfavoring unparameterized nested loops; Re: disfavoring unparameterized nested loops; Re: disfavoring unparameterized nested loops
List | pgsql-hackers
On Wed, 16 Jun 2021 at 05:09, Robert Haas <robertmhaas@gmail.com> wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

It's not great that people are having to use such blunt instruments to get the planner to behave. It might not be a terrible idea to provide them with something a bit sharper, as you suggest. The GUC idea is certainly something that could be done without too much effort. There was some talk of doing that in [1].

> Another approach
> would be to introduce a large fuzz factor for such nested loops, e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).

In my experience, the most common reason the planner wrongly chooses non-parameterized nested loops is a row underestimation that says just 1 row will be returned by some set of joins. The problem often comes when some subsequent join is planned and the planner sees that the given join rel produces only one row. The cheapest join method we have for joining 1 row is Nested Loop, so the planner just sticks the 1-row join rel on the outer side, thinking the executor will only need to scan the inner side of the join once. When the outer row count blows up, we end up scanning that inner side many more times. The problem is compounded when this is nested a few joins deep.

Most of the time when I see that happen, it's down to the selectivities of some correlated base quals being multiplied down to a number low enough that we clamp the estimate to 1 row. The other case is similar, but with join quals.
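To make that failure mode concrete, here is a back-of-envelope sketch (Python, with made-up costing constants; the real formulas live in costsize.c and are considerably more involved) of how multiplying correlated selectivities clamps an estimate to 1 row, and how asymmetric the downside is for an unparameterized nested loop versus a hash join when the estimate is wrong:

```python
# Illustrative sketch only: simplified stand-ins for the planner's
# clamped row estimate and join costing; not PostgreSQL code.

def clamp_rows(rows):
    # The planner never estimates fewer than 1 row.
    return max(1, int(rows))

# Three correlated quals, each individually ~1% selective.  Multiplying
# them as if independent gives 0.01**3 = 1e-6 of a 100,000-row rel,
# which clamps to 1:
est_outer = clamp_rows(100_000 * 0.01 ** 3)   # estimated: 1 row
act_outer = 1_000                             # actually returned: 1000 rows

inner_scan_cost = 500.0   # made-up cost of one scan of the inner side

def nestloop_cost(outer_rows):
    # An unparameterized nested loop rescans the inner side per outer row.
    return outer_rows * inner_scan_cost

def hashjoin_cost(outer_rows):
    # A hash join scans the inner side once, plus a small per-row charge.
    return inner_scan_cost + outer_rows * 0.1

print(nestloop_cost(est_outer), hashjoin_cost(est_outer))
# At the 1-row estimate the nested loop wins by a whisker...
print(nestloop_cost(act_outer), hashjoin_cost(act_outer))
# ...but at the actual row count it is hundreds of times more expensive.
```

The point is the asymmetry: the nested loop's best case beats the hash join only marginally, while its worst case loses catastrophically.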
It seems to me that each time we multiply 2 selectivities together, the risk of the resulting selectivity being out increases. The risk is likely lower when we have some extended statistics which allow us to do fewer selectivity multiplications.

For that 1-row case, doing an unparameterized nested loop is only the cheapest join method by a tiny amount. It really wouldn't be much more expensive to just put that single row into a hash table. If that 1 estimated row turns out to be 10 actual rows, then it's likely not too big a deal for the hash join code to accommodate the 9 additional rows.

This makes me think that it's insanely risky for the planner to be picking Nested Loop joins in this case, and that puts me down the path of thinking that this problem should be solved by having the planner take risk as well as cost into account when generating plans. I don't really have any concrete ideas on that, but a basic idea that I have been considering is that a Path has a risk_factor field that is decided somewhere like clauselist_selectivity(). Maybe the risk can go up by 1 each time we multiply an individual selectivity. (As of master, estimate_num_groups() allows the estimation to pass back some further information to the caller. I added that for Result Cache so I could allow the caller to get visibility about when the estimate fell back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() could perhaps get similar treatment, to allow the risk_factor or the number of stats used to be passed back.)

We'd then add a GUC, something like planner_risk_aversion, which could be set to anything from 0.0 to some positive number. During add_path() we could do the cost comparison like:

path1.cost * (1.0 + path1.risk_factor * planner_risk_aversion) < path2.cost * (1.0 + path2.risk_factor * planner_risk_aversion)

That way, if you set planner_risk_aversion to 0.0, then the planner does as it does today, i.e. takes big risks.
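The proposed comparison can be sketched as follows (Python; the names risk_factor and planner_risk_aversion come from the proposal above, but the exact weighting formula is just one possible interpretation, chosen so that an aversion of 0.0 reduces to today's plain cost comparison):

```python
# Hypothetical sketch of a risk-aware add_path() cost comparison;
# not PostgreSQL code, and the weighting formula is an assumption.

def risk_adjusted_cost(cost, risk_factor, planner_risk_aversion):
    # At planner_risk_aversion = 0.0 this degrades to the plain cost,
    # i.e. the planner behaves exactly as it does today.
    return cost * (1.0 + risk_factor * planner_risk_aversion)

def cheaper(path1, path2, aversion):
    return (risk_adjusted_cost(path1["cost"], path1["risk"], aversion)
            < risk_adjusted_cost(path2["cost"], path2["risk"], aversion))

# Nested loop built on 3 multiplied selectivities vs. a hash join whose
# raw cost is a whisker higher but whose estimate is better grounded:
nestloop = {"cost": 500.0, "risk": 3}
hashjoin = {"cost": 500.1, "risk": 0}

print(cheaper(nestloop, hashjoin, 0.0))  # True: today's behaviour
print(cheaper(nestloop, hashjoin, 0.5))  # False: risk tips the choice
```

Even a modest aversion setting flips the decision here precisely because the nested loop's raw-cost advantage was tiny to begin with.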
The problem I have with this idea is that I really don't know how to properly calculate what the risk_factor should be set to. It seems easy at first to set it to something that has the planner avoid these silly 1-row-estimate nested loop mistakes, but I think what we'd set the risk_factor to would become much more important once more and more Path types start using it. So if we did this and just guessed the risk_factor, that might be fine when only 1 of the paths being compared had a non-zero risk_factor, but as soon as both paths have one set, unless they're set to something sensible, we just end up comparing garbage costs to garbage costs.

Another common mistake the planner makes is around WHERE a = <value> ORDER BY b LIMIT n; where there are separate indexes on (a) and (b). Scanning the (b) index is pretty risky: all the "a" values you need might be right at the end of the index. It seems safer to scan the (a) index, as we'd likely have statistics that tell us how many rows exist with <value>. We don't have any stats that tell us where in the (b) index all the rows with a = <value> are.

I don't really think we should solve this by having nodeNestloop.c fall back on hashing when the going gets tough. Overloading our nodes that way is not a sustainable thing to do. I'd rather see the executor throw the plan back at the planner along with some hints about what was wrong with it. We could do that provided we've not sent anything back to the client yet.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8nsm-T0KMvGJz_bskUjQ%3DyGmGUUtUdAcFoEaZ_tuTXjA%40mail.gmail.com