Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: disfavoring unparameterized nested loops |
Date | |
Msg-id | CA+Tgmob3Cpm5dSYwLfT7F7cHMtOfeZv-Kni_hxDxT7vyzUC=nw@mail.gmail.com Whole thread Raw |
In response to | Re: disfavoring unparameterized nested loops (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: disfavoring unparameterized nested loops
|
List | pgsql-hackers |
On Thu, Sep 29, 2022 at 7:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Agreed, but dealing with uncertainty in those numbers is an enormous > task if you want to do it right. "Doing it right", IMV, would start > out by extending all the selectivity estimation functions to include > error bars; then we could have error bars on rowcount estimates and > then costs; then we could start adding policies about avoiding plans > with too large a possible upper-bound cost. Trying to add such > policy with no data to go on is not going to work well. I think that the point of the paper which started the thread that this discussion branched from was essentially that trying to add such a policy with no data to go on worked extremely well in practice, and other database systems are already doing it, and we're hurting ourselves by not doing it. And specifically what they did was to disfavor unparameterized nested loops. And I think that actually makes a lot of sense. If you think through parameterized nested loops, unparameterized nested loops, hash joins, and merge joins, which is basically all the strategies, and you imagine having many more or fewer rows on one side of the join or the other than you thought, unparameterized nested loops are the standout case. It's the only kind of join where the expense grows as the product of the sizes of the two inputs. The other cases tend to be more like O(n+m) rather than O(nm), and even if there are some lg n or lg m terms in there too they don't tend to make a whole lot of difference in practice. So I think what you would find if you did all of this analysis is that, basically, every time you did the cost computation for a parameterized nested loop, a hash join, or a merge join, the error bars would be whatever they were, and then when you did a cost computation for an unparameterized nested loop, the error bars would be way worse. Like, if we assume that the estimates for each side of a hash join are off by 100x, then the cost will be off by roughly 100x if it still fits in work_mem and by several hundred x if it now spills to disk. But if we assume the same thing for an unparameterized nested loop, the cost is now off by 10000x. And that is massively, massively more, so clearly we should avoid the unparameterized nested loop. But it's unnecessary to do this computation at runtime for every separate unparameterized nested loop: we can do it right here, in a generic way, for every such loop. Now, it's true that the actual risk depends on how certain we are of the estimates for the input rels, but that's difficult to quantify and I'm not convinced it really matters at all. Like, if the input is a base relation with no filter condition, then the error should be small, unless the table size changes a lot between planning and execution. If it's the output of a user-defined aggregate, the error could be really, really large. But that has no impact on the *relative* dangers of the unparameterized nested loop vs. some other join method. If join A is between two input rels whose sizes are probably known fairly precisely, and join B is between two input rels whose sizes we might be drastically wrong about, then a hash join is riskier for join B than it is for join A. But that really does not matter. What *does* matter is that an unparameterized nested loop is riskier for join A than a hash join is for join A; and likewise for join B. I think we're kind of just making life complicated for ourselves here by pretending that unparameterized nested loops are part of some general class of uncertainty problems that we need to worry about. In some sense, they are, and David Rowley is right to mention the other one that comes up pretty frequently. But like that list of two is pretty much the whole list. I think we've talked ourselves into believing that this problem is much harder than it really is. Maybe a blanket ban on unparameterized nested loops is too strong (or maybe it's exactly the right thing) but it can't possibly be wrong to think about that case in particular as something we need to solve. It's the only join method that can go quadratic in easy, realistic scenarios -- and it often does. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: