Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: disfavoring unparameterized nested loops
Msg-id: CA+TgmoYXhsviApqHz5DuX7myZBWDqBoEojEmy2n9suPxj0M=Vg@mail.gmail.com
In response to: Re: disfavoring unparameterized nested loops (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On Mon, Jun 21, 2021 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape that you have in mind. To me it feels like a comparison on a unique key ought to use a *parameterized* nested loop. And it's also not clear to me why a nested loop is actually better in a case like this. If the nested loop iterates more than once because there are more rows on the outer side, then you don't want to have something on the inner side that might be expensive, and either an aggregate or an unparameterized search for a unique value is potentially quite expensive. Now if you put a materialize node on top of the inner side, then you don't have to worry about that problem, but how much are you saving at that point vs. just doing a hash join?

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought to do with it if we had one. I don't even understand what units we would use. We measure costs using abstract cost units, but those abstract cost units are intended to be a proxy for runtime. If it's not the case that a plan that runs for longer has a higher cost, then something's wrong with the costing model or the settings.

In the case of risk, the whole thing seems totally subjective. We're talking about the risk that our estimate is bogus, but how do we estimate the risk that we don't know how to estimate? Given quals (x + 0) = x, x = some MCV, and x = some non-MCV, we can say that we're most likely to be wrong about the first one and least likely to be wrong about the second one, but how much more likely? I don't know how you can decide that, even in principle. We can also say that an unparameterized nested loop is more risky than some other plan because it could turn out to be crazy expensive, but is that more or less risky than scanning the index on b as a way to implement SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1? How much more risky, and why?

And then, even supposing we had a risk metric for every path, what exactly would we do with it? Suppose path A is cheaper than path B, but also more risky. Which should we keep? We could keep both, but that seems to be just kicking the can down the road. If plan A is likely to blow up in our face, we should probably just get rid of it, or not even generate it in the first place. Even if we do keep both, at some point we're going to have to make a cost-vs-risk tradeoff, and I don't see how to do that intelligently, because the point is precisely that if the risk is high, the cost number might be totally wrong. If we know that plan A is cheaper than plan B, we should choose plan A. But if all we know is that plan A would be cheaper than plan B if our estimate of the cost were correct, but also that it probably isn't, then what we actually know is nothing. We have no principled basis for deciding anything based on cost unless we're reasonably confident that the cost estimate is pretty good.
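
To make those two shapes concrete (the unparameterized nested loop, and the index-on-b plan for the LIMIT query), here's a minimal sketch; the tables foo and bar and the index foo_b_idx are hypothetical, and which plan actually wins would of course depend on the statistics:

-- Hypothetical tables, just to illustrate the two "risky" plan shapes.
CREATE TABLE foo (a int, b int);
CREATE INDEX foo_b_idx ON foo (b);
CREATE TABLE bar (x int, y int);

-- Shape 1: an unparameterized nested loop.  Suppose the planner has to
-- guess at the selectivity of a qual like (y + 0) = y and decides the
-- outer side produces one row:
SELECT * FROM bar JOIN foo ON bar.x = foo.a WHERE (bar.y + 0) = bar.y;
-- With a one-row outer estimate, a nested loop whose inner side is a
-- plain (unparameterized) seq scan of foo looks about as cheap as a
-- hash join.  If the estimate is off by a few orders of magnitude, that
-- inner scan is repeated once per outer row and the runtime explodes.

-- Shape 2: an ordered index scan under a LIMIT.
SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1;
-- One candidate plan walks foo_b_idx in b order and stops at the first
-- row with a = 1.  Its cost estimate assumes a = 1 rows are spread
-- evenly through the index; if they are rare, or clustered at the high
-- end of b, the scan may read nearly the whole index before it stops.

In both cases the danger is the same: the estimated cost hinges on a row-count guess that may be wildly off.
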
So AFAICT the only principled strategy here is to throw away high-risk paths as early as we possibly can. What am I missing?

The other thing is - the risk of a particular path doesn't matter in an absolute sense, only a relative one. In the particular case I'm on about here, we *know* there's a less-risky alternative. We don't need to quantify the risk to know which of the two options has more. In many other cases, the risk is irreducible, e.g. a default estimate could be totally bogus, but switching paths is of no help in getting rid of it.

--
Robert Haas
EDB: http://www.enterprisedb.com