Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: disfavoring unparameterized nested loops
Msg-id: CA+TgmoYXhsviApqHz5DuX7myZBWDqBoEojEmy2n9suPxj0M=Vg@mail.gmail.com
In response to: Re: disfavoring unparameterized nested loops (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape you
have in mind. To me it feels like a comparison on a unique key ought
to use a *parameterized* nested loop. It's also not clear to me why a
nested loop is actually better in a case like this. If the nested
loop iterates more than once because there are more rows on the outer
side, then you don't want something on the inner side that might be
expensive, and either an aggregate or an unparameterized search for a
unique value is potentially quite expensive. Now, if you put a
materialize node on top of the inner side, you don't have to worry
about that problem, but how much are you saving at that point vs.
just doing a hash join?
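
For concreteness, here's a sketch of the distinction I have in mind
(schema invented for illustration):

    CREATE TABLE t_outer (id int PRIMARY KEY, val int);
    CREATE TABLE t_inner (id int PRIMARY KEY, val int);

    -- Parameterized: the inner unique-index probe takes its value
    -- from the current outer row, so each probe returns at most one
    -- row by construction.
    SELECT * FROM t_outer o JOIN t_inner i ON i.id = o.val;

    -- Unparameterized: the inner side provably returns at most one
    -- row, but it doesn't depend on the outer row. Without a
    -- materialize node it gets re-executed for every outer row; with
    -- one, it's not obvious we're beating a hash join.
    SELECT *
    FROM t_outer o
    JOIN (SELECT max(val) AS mv FROM t_inner) s ON o.val = s.mv;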

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows.  The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought
to do with it if we had one. I don't even understand what units we
would use. We measure costs using abstract cost units, but those
abstract cost units are intended to be a proxy for runtime. If it's
not the case that a plan that runs for longer has a higher cost, then
something's wrong with the costing model or the settings. In the case
of risk, the whole thing seems totally subjective. We're talking about
the risk that our estimate is bogus, but how do we estimate the risk
that we don't know how to estimate? Given quals (x + 0) = x, x = some
MCV, and x = some non-MCV, we can say that we're most likely to be
wrong about the first one and least likely to be wrong about the
second one, but how much more likely? I don't know how you can decide
that, even in principle. We can also say that an unparameterized
nested loop is more risky than some other plan because it could turn
out to be crazy expensive, but is that more or less risky than
scanning the index on b as a way to implement SELECT * FROM foo WHERE
a = 1 ORDER BY b LIMIT 1? How much more risky, and why?
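
Spelling out that last example (schema invented for illustration):

    CREATE TABLE foo (a int, b int);
    CREATE INDEX foo_a ON foo (a);
    CREATE INDEX foo_b ON foo (b);

    -- Plan 1: walk foo_b in b order and return the first row that
    -- has a = 1. Very cheap if rows with a = 1 are common; nearly
    -- unbounded if they're rare or absent.
    -- Plan 2: fetch all a = 1 rows via foo_a, sort by b, take one.
    -- Predictable, but it always pays the whole cost up front.
    SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1;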

And then, even supposing we had a risk metric for every path, what
exactly would we do with it? Suppose path A is cheaper than path B,
but also more risky. Which should we keep? We could keep both, but
that seems to be just kicking the can down the road. If path A is
likely to blow up in our face, we should probably just get rid of it,
or not even generate it in the first place. Even if we do keep both,
at some point we're going to have to make a cost-vs-risk tradeoff,
and I don't see how to do that intelligently, because the point is
precisely that if the risk is high, the cost number might be totally
wrong. If we know that plan A is cheaper than plan B, we should
choose plan A. But if all we know is that plan A would be cheaper
than plan B if our estimate of the cost were correct, and also that
the estimate probably isn't correct, then what we actually know is
nothing. Say plan A is estimated at 100 cost units and plan B at 120:
if A's estimate could easily be off by a factor of a hundred while
B's is trustworthy, the comparison tells us nothing useful. We have
no principled basis for deciding anything based on cost unless we're
reasonably confident that the cost estimate is pretty good. So AFAICT
the only principled strategy here is to throw away high-risk paths as
early as we possibly can. What am I missing?

The other thing is that the risk of a particular path doesn't matter
in an absolute sense, only a relative one. In the particular case I'm
on about here, we *know* there's a less-risky alternative, so we
don't need to quantify the risk to know which of the two options has
more of it. In many other cases, the risk is irreducible: e.g., a
default selectivity estimate could be totally bogus, but switching
paths is of no help in getting rid of it.
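
To illustrate with the join in question, assuming generic tables t1
and t2 (hypothetical plan shapes, not real EXPLAIN output):

    -- The unparameterized nested loop, and the hash join we could
    -- always build instead for the same equijoin:
    --
    --   Nested Loop
    --     Join Filter: (t1.x = t2.y)
    --     ->  Seq Scan on t1
    --     ->  Materialize
    --           ->  Seq Scan on t2
    --
    --   Hash Join
    --     Hash Cond: (t1.x = t2.y)
    --     ->  Seq Scan on t1
    --     ->  Hash
    --           ->  Seq Scan on t2
    SELECT * FROM t1, t2 WHERE t1.x = t2.y;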

-- 
Robert Haas
EDB: http://www.enterprisedb.com


