Thread: disfavoring unparameterized nested loops
From time to time, someone tells me that they've configured enable_nestloop=false in postgresql.conf, which is a pretty bad idea since there are a significant number of cases where such plans are the only reasonable way of executing some query. However, it's no great secret that PostgreSQL's optimizer sometimes produces nested loops that are very, very, very slow, generally because it has grossly underestimated the cardinality of the inner side of the nested loop. The frustration which users experience as a result of such events is understandable.

I read https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf today and found out that the authors of that paper did something a bit more nuanced which, in their experiments, was very successful. It sounds like what they basically did was disable unparameterized nested loops. They argue that such nested loops figure to gain very little as compared with a hash join, but that they might turn out to lose a lot if the cardinality estimation is inaccurate, and they present experimental results to back up those claims.

One observation that the paper makes along the way is that every system they tested is more likely to underestimate the cardinality of joins than to overestimate it, and that this tendency becomes more pronounced as the size of the join planning problem increases. On reflection, I believe this matches my experience, and it makes sense that it should be so, since it occasionally happens that the join selectivity estimate is essentially zero, and a bigger join problem is more likely to have at least one such case. On the other hand, the join selectivity estimate can never be +infinity. Hence, it's more important in general for a database system to be resilient against underestimates than to be resilient against overestimates. Being less willing to choose unparameterized nested loops is one way to move in that direction.

How to do that is not very clear. One very simple thing we could do would be to introduce enable_nestloop=parameterized or enable_parameterized_nestloop=false. That is a pretty blunt instrument but the authors of the paper seem to have done that with positive results, so maybe it's actually not a dumb idea.

Another approach would be to introduce a large fuzz factor for such nested loops, e.g. keep them only if the cost estimate is better than the comparable hash join plan by at least a factor of N (or if no such plan is being generated for some reason). I'm not very confident that this would actually do what we want, though. In the problematic cases, a nested loop is likely to look extremely cheap, so just imagining that the cost might be higher is not very protective.

Perhaps a better approach would be something like: if the estimated number of inner rows is less than 100, then re-estimate the cost of this approach and of the best available hash join on the assumption that there are 100 inner rows. If the nested loop still wins, keep it; if it loses under that assumption, throw it out. I think it's likely that this approach would eliminate a large number of highly risky nested loop plans, probably even with s/100/10/g, without causing many other problems (except perhaps increased planner CPU consumption ... but maybe that's not too bad).

Just to be clear, I do understand that there are cases where no Hash Join is possible, but anything we do here could be made to apply only when a hash join is in fact possible.
We could also think about making the point of comparison the best other plans of any sort rather than a hash join specifically, which feels a little more principled but might actually be worse. When a Nested Loop is a stupid idea, it's stupid precisely because the inner side is big and we could've avoided recomputing it over and over by using a Hash Join instead, not because some Merge Join based plan turns out to be better. I mean, it is possible that some Merge Join plan does turn out to be better, but that's not rage-inducing in the same way. Nobody looks at a complicated join plan that happened to use a Nested Loop and says "obviously, this is inferior to a merge join," or if they do, they're probably full of hot air. But people look at complicated join plans that happen to use a Nested Loop and say "obviously, this is inferior to a hash join" *all the time* and, assuming the inner path is unparameterized, they are very often correct.

Thoughts? I'd be particularly curious to hear about any cases anyone knows about where an unparameterized nested loop and a hash join with batches=1 are both possible and where the unparameterized nested loop is *much* cheaper than the hash join.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
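As a concrete illustration of the "assume at least 100 inner rows and re-cost both plans" idea a little further up, here is a minimal stand-alone sketch. Everything in it -- the cost formulas, the function names, the numbers -- is invented for the example and is not planner code:

#include <stdbool.h>
#include <stdio.h>

/* stand-in cost of an unparameterized nestloop: rescan the inner side
 * once per outer row, paying per-inner-row CPU cost each time */
static double
nestloop_cost(double outer_rows, double inner_rows, double cpu_per_row)
{
    return outer_rows * inner_rows * cpu_per_row;
}

/* stand-in cost of a hash join: build the hash table once, then one
 * cheap probe per outer row */
static double
hashjoin_cost(double outer_rows, double inner_rows, double cpu_per_row)
{
    return inner_rows * cpu_per_row * 2.0 + outer_rows * cpu_per_row;
}

/*
 * Keep the unparameterized nestloop only if it still wins after inflating
 * a small inner-row estimate to a pessimistic floor (100 in the mail,
 * perhaps 10).
 */
static bool
keep_unparam_nestloop(double outer_rows, double est_inner_rows,
                      double cpu_per_row, double floor_rows)
{
    double      assumed = est_inner_rows < floor_rows ? floor_rows
                                                      : est_inner_rows;

    return nestloop_cost(outer_rows, assumed, cpu_per_row) <=
           hashjoin_cost(outer_rows, assumed, cpu_per_row);
}

int
main(void)
{
    /* inner side estimated at 1 row: at that estimate the nestloop looks
     * marginally cheaper, but re-checked at 100 rows it loses badly */
    printf("keep nestloop: %s\n",
           keep_unparam_nestloop(1000.0, 1.0, 0.01, 100.0) ? "yes" : "no");
    return 0;
}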
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

I think that it's probably a good idea as-is. Simple heuristics that are very frequently wrong when considered in a naive way can work very well in practice. This seems to happen when they capture some kind of extreme naturally occurring cost/benefit asymmetry -- especially one with fixed, well understood costs and unlimited benefits (this business with unparameterized nestloop joins is about *avoiding* the inverse asymmetry, but that seems very similar).

My go-to example of such an asymmetry is the rightmost page split heuristic of applying leaf fillfactor regardless of any of the other specifics; we effectively assume that all indexes are on columns with ever-increasing values. Which is obviously wrong.

We're choosing between two alternatives (unparameterized nested loop vs hash join) that are really very similar when things go as expected, but diverge sharply when there is a misestimation -- who wouldn't take the "conservative" choice here?

I guess that there is a hesitation to not introduce heuristics like this because it doesn't fit into some larger framework that captures risk -- it might be seen as an ugly special case. But isn't this already actually kind of special, whether or not we officially think so?

--
Peter Geoghegan
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I guess that there is a hesitation to not introduce heuristics like
> this because it doesn't fit into some larger framework that captures
> risk -- it might be seen as an ugly special case. But isn't this
> already actually kind of special, whether or not we officially think
> so?

Yes, I think it is. Reading the paper really helped me crystallize my thoughts about this, because when I've studied the problems myself, I came, as you postulate here, to the conclusion that there's a lot of stuff the planner does where there is risk and uncertainty, and thus that a general framework would be necessary to deal with it. But the fact that an academic researcher called this problem out as the only one worth treating specially makes me think that perhaps it deserves special handling.

In defense of that approach, note that this is a case where we know both that the Nested Loop is risky and that Hash Join is a similar alternative with probably similar cost. I am not sure there are any other cases where we can say quite so generally both that a certain thing is risky and what we could do instead.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, I think it is. Reading the paper really helped me crystallize my
> thoughts about this, because when I've studied the problems myself, I
> came, as you postulate here, to the conclusion that there's a lot of
> stuff the planner does where there is risk and uncertainty, and thus
> that a general framework would be necessary to deal with it.

It is an example (perhaps the only example in the optimizer) of an oasis of certainty in an ocean of uncertainty. As uncertain as everything is, we seemingly can make strong robust statements about the relative merits of each strategy *in general*, just in this particular instance. It's just not reasonable to make such a reckless choice, no matter what your general risk tolerance is.

Goetz Graefe is interviewed here, and goes into his philosophy on robustness -- it seems really interesting to me:

https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf

> In defense of that approach, note that this is a
> case where we know both that the Nested Loop is risky and that Hash
> Join is a similar alternative with probably similar cost. I am not
> sure there are any other cases where we can say quite so generally
> both that a certain thing is risky and what we could do instead.

I tend to think of a hash join as like a nested loop join with an inner index scan where you build the index yourself, dynamically. That might be why I find it easy to make this mental leap. In theory you could do this by giving the nestloop join runtime smarts -- make it turn into a hash join adaptively. Like Graefe's G-Join design. That way you could do this in a theoretically pure way. I don't think that that's actually necessary just to deal with this case -- it probably really is as simple as it seems. I point this out because perhaps it's useful to have that theoretical anchoring.

--
Peter Geoghegan
On Wed, 16 Jun 2021 at 05:09, Robert Haas <robertmhaas@gmail.com> wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

It's not great that people are having to use such blunt instruments to get the planner to behave. It might not be a terrible idea to provide them with something a bit sharper as you suggest. The GUC idea is certainly something that could be done without too much effort. There was some talk of doing that in [1].

> Another approach
> would be to introduce a large fuzz factor for such nested loops e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).

In my experience, the most common reason that the planner chooses non-parameterized nested loops wrongly is when there's row underestimation that says there's just going to be 1 row returned by some set of joins. The problem often comes when some subsequent join is planned and the planner sees the given join rel only produces one row. The cheapest join method we have to join 1 row is Nested Loop. So the planner just sticks the 1-row join rel on the outer side thinking the executor will only need to scan the inner side of the join once. When the outer row count blows up, then we end up scanning that inner side many more times. The problem is compounded when you nest it a few joins deep.

Most of the time when I see that happen it's down to either the selectivity of some correlated base-quals being multiplied down to a number low enough that we clamp the estimate to be 1 row. The other case is similar, but with join quals.

It seems to me that each time we multiply 2 selectivities together that the risk of the resulting selectivity being out increases. The risk is likely lower when we have some extended statistics which allows us to do fewer selectivity multiplications.

For that 1-row case, doing an unparameterized nested loop is only the cheapest join method by a tiny amount. It really wouldn't be much more expensive to just put that single row into a hash table. If that 1 estimated row turns out to be 10 actual rows then it's likely not too big a deal for the hash join code to accommodate the 9 additional rows. This makes me think that it's insanely risky for the planner to be picking Nested Loop joins in this case.

And that puts me down the path of thinking that this problem should be solved by having the planner take into account risk as well as costs when generating plans. I don't really have any concrete ideas on that, but a basic idea that I have been considering is that a Path has a risk_factor field that is decided somewhere like clauselist_selectivity(). Maybe the risk can go up by 1 each time we multiply an individual selectivity. (As of master, estimate_num_groups() allows the estimation to pass back some further information to the caller. I added that for Result Cache so I could allow the caller to get visibility about when the estimate fell back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get similar treatment to allow the risk_factor or number of nstats_used to be passed back.) We'd then add a GUC, something like planner_risk_adversion, which could be set to anything from 0.0 to some positive number.
During add_path() we could do the cost comparison like: path1.cost * path1.risk_factor * (1.0 + planner_risk_adversion) < path2.cost * path2.risk_factor * (1.0 + planner_risk_adversion). That way, if you set planner_risk_adversion to 0.0, then the planner does as it does today, i.e. takes big risks.

The problem I have with this idea is that I really don't know how to properly calculate what the risk_factor should be set to. It seems easy at first to set it to something that has the planner avoid these silly 1-row estimate nested loop mistakes, but I think what we'd set the risk_factor to would become much more important when more and more Path types start using it. So if we did this and just guessed the risk_factor, that might be fine when only 1 of the paths being compared had a non-zero risk_factor, but as soon as both paths have one set, unless they're set to something sensible, then we just end up comparing garbage costs to garbage costs.

Another common mistake the planner makes is around WHERE a = <value> ORDER BY b LIMIT n; where there are separate indexes on (a) and (b). Scanning the (b) index is pretty risky. All the "a" values you need might be right at the end of the index. It seems safer to scan the (a) index as we'd likely have statistics that tell us how many rows exist with <value>. We don't have any stats that tell us where in the (b) index are all the rows with a = <value>.

I don't really think we should solve this by having nodeNestloop.c fall back on hashing when the going gets tough. Overloading our nodes that way is not a sustainable thing to do. I'd rather see the executor throw the plan back at the planner along with some hints about what was wrong with it. We could do that providing we've not sent anything back to the client yet.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8nsm-T0KMvGJz_bskUjQ%3DyGmGUUtUdAcFoEaZ_tuTXjA%40mail.gmail.com
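To make the add_path() comparison above concrete, here is a stand-alone sketch. The exact formula is one possible reading of the idea -- here the aversion knob scales how strongly risk_factor penalizes a path's cost, so that 0.0 reproduces today's plain cost comparison -- and every name below is a hypothetical stand-in, not an existing planner field or GUC:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-in for the relevant parts of a Path */
typedef struct SketchPath
{
    double      total_cost;
    int         risk_factor;    /* e.g. bumped once per selectivity multiplication */
} SketchPath;

/* hypothetical GUC: 0.0 = behave as today, larger = more risk-averse */
static double planner_risk_aversion = 1.0;

/* effective cost used only when comparing paths against each other */
static double
risk_adjusted_cost(const SketchPath *p)
{
    return p->total_cost * (1.0 + p->risk_factor * planner_risk_aversion);
}

/* true if path1 should be preferred over path2 */
static bool
risk_aware_cheaper(const SketchPath *p1, const SketchPath *p2)
{
    return risk_adjusted_cost(p1) < risk_adjusted_cost(p2);
}

int
main(void)
{
    /* the nestloop looks slightly cheaper but carries more estimation risk */
    SketchPath  nestloop = {.total_cost = 90.0, .risk_factor = 3};
    SketchPath  hashjoin = {.total_cost = 100.0, .risk_factor = 1};

    printf("prefer nestloop: %s\n",
           risk_aware_cheaper(&nestloop, &hashjoin) ? "yes" : "no");
    return 0;
}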
On Tue, Jun 15, 2021 at 5:00 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I don't really think we should solve this by having nodeNestloop.c
> fall back on hashing when the going gets tough. Overloading our nodes
> that way is not a sustainable thing to do. I'd rather see the
> executor throw the plan back at the planner along with some hints
> about what was wrong with it. We could do that providing we've not
> sent anything back to the client yet.

It wasn't a serious suggestion -- it was just a way of framing the issue at hand that I thought might be useful.

If we did have something like that (which FWIW I think makes sense but is hard to do right in a general way) then it might be expected to preemptively refuse to even start down the road of using an unparameterized nestloop join very early, or even before execution time. Such an adaptive join algorithm/node might be expected to have a huge bias against this particular plan shape, that can be reduced to a simple heuristic. But you can have the simple heuristic without needing to build everything else.

Whether or not we throw the plan back at the planner or "really change our minds at execution time" seems like a distinction without a difference. Either way we're changing our minds about the plan based on information that is fundamentally execution time information, not plan time information. Have I missed something?

--
Peter Geoghegan
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan <pg@bowt.ie> wrote:
> Whether or not we throw the plan back at the planner or "really change
> our minds at execution time" seems like a distinction without a
> difference.

What is "really change our minds at execution time"? Is that switching to another plan without consulting the planner? If so, what decides what that new plan should be? The planner is meant to be the expert in that department. The new information might cause the join order to completely change. It might not be as simple as swapping a Nested Loop for a Hash Join.

> Either way we're changing our minds about the plan based
> on information that is fundamentally execution time information, not
> plan time information. Have I missed something?

I don't really see why you think the number of rows that a given join might produce is execution information. It's exactly the sort of information the planner needs to make a good plan. It's just that today we get that information from statistics. Plenty of other DBMSs make decisions from sampling. FWIW, we do already have some minimalist sampling in get_actual_variable_range().

I'm just trying to highlight that I don't think overloading nodes is a good path to go down. It's not a sustainable practice. It's a path towards just having a single node that does everything. If your suggestion was not serious then there's no point in discussing it further.

David
On Tue, Jun 15, 2021 at 6:15 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan <pg@bowt.ie> wrote:
> > Whether or not we throw the plan back at the planner or "really change
> > our minds at execution time" seems like a distinction without a
> > difference.
>
> What is "really change our minds at execution time"? Is that switch
> to another plan without consulting the planner?

I don't know what it means. That was my point -- it all seems like semantics to me.

The strong separation between plan time and execution time isn't necessarily a good thing, at least as far as solving some of the thorniest problems goes. It seems obvious to me that cardinality estimation is the main problem, and that the most promising solutions are all fundamentally about using execution time information to change course. Some problems with planning just can't be solved at plan time -- no model can ever be smart enough. Better to focus on making query execution more robust, perhaps by totally changing the plan when it is clearly wrong. But also by using more techniques that we've traditionally thought of as execution time techniques (e.g. role reversal in hash join). The distinction is blurry to me.

There are no doubt practical software engineering issues with this -- separation of concerns and whatnot. But it seems premature to go into that now.

> The new information might cause the join order to
> completely change. It might not be as simple as swapping a Nested Loop
> for a Hash Join.

I agree that it might not be that simple at all. I think that Robert is saying that this is one case where it really does appear to be that simple, and so we really can expect to benefit from a simple plan-time heuristic that works within the confines of the current model. Why wouldn't we just take that easy win, once the general idea has been validated some more? Why let the perfect be the enemy of the good?

I have perhaps muddied the waters by wading into the more general question of robust execution, the inherent uncertainty with cardinality estimation, and so on. Robert really didn't seem to be talking about that at all (though it is clearly related).

> > Either way we're changing our minds about the plan based
> > on information that is fundamentally execution time information, not
> > plan time information. Have I missed something?
>
> I don't really see why you think the number of rows that a given join
> might produce is execution information.

If we're 100% sure a join will produce at least n rows because we executed it (with the express intention of actually doing real query processing that returns rows to the client), and it already produced n rows, then what else could it be called? Why isn't it that simple?

> It's exactly the sort of
> information the planner needs to make a good plan. It's just that
> today we get that information from statistics. Plenty of other DBMSs
> make decisions from sampling.
> FWIW, we do already have a minimalist
> sampling already in get_actual_variable_range().

I know, but that doesn't seem all that related -- it almost seems like the opposite idea. It isn't the executor balking when it notices that the plan is visibly wrong during execution, in some important way. It's more like the planner using the executor to get information about an index that is well within the scope of what we think of as plan time.
To some degree the distinction gets really blurred due to nodes like hash join, where some important individual decisions are delayed until execution time already. It's really unclear when precisely it stops being that, and starts being more of a case of either partially or wholly replanning. I don't know how to talk about it without it being confusing.

> I'm just trying to highlight that I don't think overloading nodes is a
> good path to go down. It's not a sustainable practice. It's a path
> towards just having a single node that does everything. If your
> suggestion was not serious then there's no point in discussing it
> further.

As I said, it was a way of framing one particular issue that Robert is concerned about.

--
Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row. The other
> case is similar, but with join quals.
>
> It seems to me that each time we multiply 2 selectivities together
> that the risk of the resulting selectivity being out increases. The
> risk is likely lower when we have some extended statistics which
> allows us to do fewer selectivity multiplications.

It seems important to distinguish between risk and uncertainty -- they're rather different things. The short version is that you can model risk but you cannot model uncertainty. This seems like a problem of uncertainty to me.

The example from the paper that Robert cited isn't interesting to me because it hints at a way of managing the uncertainty, exactly. It's interesting because it seems to emphasize the user's exposure to the problem, which is what really matters. Even if it was extremely unlikely that the user would have a problem, the downside of being wrong is still absurdly high, and the upside of being right is low (possibly even indistinguishable from zero). It's just not worth thinking about.

Besides, we all know that selectivity estimates are very often quite wrong without the user ever noticing. It's amazing that database optimizers work as well as they do, really.

--
Peter Geoghegan
>
> The problem I have with this idea is that I really don't know how to
> properly calculate what the risk_factor should be set to. It seems
> easy at first to set it to something that has the planner avoid these
> silly 1-row estimate nested loop mistakes, but I think what we'd set
> the risk_factor to would become much more important when more and more
> Path types start using it. So if we did this and just guessed the
> risk_factor, that might be fine when only 1 of the paths being
> compared had a non-zero risk_factor, but as soon as both paths have
> one set, unless they're set to something sensible, then we just end up
> comparing garbage costs to garbage costs.

Risk factor is the inverse of confidence in the estimate: the lower the confidence, the higher the risk. If we associate confidence with the selectivity estimate, or compute a confidence interval for the estimate instead of a single number, we can associate a risk factor with each estimate. When we combine estimates to calculate new estimates, we also combine their confidences/confidence intervals. If my memory serves well, confidence intervals/confidences are calculated based on the sample size and method used for estimation, so we should be able to compute those during ANALYZE.

I have not come across many papers which leverage this idea. Googling "selectivity estimation confidence interval" does not yield many papers. Although I found [1] to be using a similar idea. So maybe there's no merit in this idea, though theoretically it sounds fine to me.

[1] https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf

--
Best Wishes,
Ashutosh Bapat
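For what it's worth, one standard way to make the "combine confidences when estimates are combined" step concrete is first-order error propagation -- this is textbook statistics, offered here only as an illustration, not something proposed in the message above. If two selectivity estimates with independent errors are multiplied, their relative variances approximately add:

\[
\hat{s} = \hat{s}_1 \hat{s}_2
\qquad\Longrightarrow\qquad
\left(\frac{\sigma_{\hat{s}}}{\hat{s}}\right)^{2} \;\approx\;
\left(\frac{\sigma_{\hat{s}_1}}{\hat{s}_1}\right)^{2} +
\left(\frac{\sigma_{\hat{s}_2}}{\hat{s}_2}\right)^{2}
\]

so every additional multiplication widens the relative uncertainty, which matches the intuition upthread that each extra selectivity multiplication adds risk.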
On Fri, Jun 18, 2021 at 6:20 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> "selectivity estimation confidence interval", does not yield many
> papers. Although I found [1] to be using a similar idea. So may be
> there's not merit in this idea, thought theoretically it sounds fine
> to me.
>
>
> [1] https://pi3.informatik.uni-mannheim.de/~moer/Publications/vldb18_smpl_synop.pdf
Well, that paper's title shows it's a bit too far forward for us, since we don't use samples during plan time (although that's a separate topic worth considering). From the references, however, this one gives some mathematical framing of the problem that led to the thread subject, although I haven't read enough to see if we can get practical advice from it:
On Wed, 16 Jun 2021 at 15:08, Peter Geoghegan <pg@bowt.ie> wrote:
> It seems important to distinguish between risk and uncertainty --
> they're rather different things. The short version is that you can
> model risk but you cannot model uncertainty. This seems like a problem
> of uncertainty to me.

You might be right there. "Uncertainty" or "Certainty" seems more like a value that clauselist_selectivity() would be able to calculate itself. It would just be up to the planner to determine what to do with it.

One thing I thought about is that if the costing model was able to separate out the cost of additional (unexpected) rows, then it would be easier for add_path() to take into account how bad things might go if we underestimate. For example, an unparameterized Nested Loop that estimates the outer Path to have 1 row will incur the cost of an entire additional inner scan if there are 2 rows. With Hash Join the cost is just an additional hashtable lookup, which is dead cheap. I don't know exactly how add_path() would weigh all that up, but it seems to me that I wouldn't take the risk unless I was 100% certain that the Nested Loop's outer Path would only return 1 row exactly. If there was any chance at all it could return more, I'd be picking some other join method.

David
> In my experience, the most common reason that the planner chooses
> non-parameterized nested loops wrongly is when there's row
> underestimation that says there's just going to be 1 row returned by
> some set of joins. The problem often comes when some subsequent join
> is planned and the planner sees the given join rel only produces one
> row. The cheapest join method we have to join 1 row is Nested Loop.
> So the planner just sticks the 1-row join rel on the outer side
> thinking the executor will only need to scan the inner side of the
> join once. When the outer row count blows up, then we end up scanning
> that inner side many more times. The problem is compounded when you
> nest it a few joins deep
>
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row. The other
> case is similar, but with join quals.
If an estimate is lower than 1, that should be a red flag that Something Is Wrong. This is kind of a crazy idea, but what if we threw it back the other way by computing 1 / est, and clamping that result to 2 <= res < 10 (or 100 or something)? The theory is, the more impossibly low it is, the more wrong it is. I'm attracted to the idea of dealing with it as an estimation problem and not needing to know about join types. Might have unintended consequences, though.
Long term, it would be great to calculate something about the distribution of cardinality estimates, so we can model risk in the estimates.
--
John Naylor
EDB: http://www.enterprisedb.com
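A tiny stand-alone sketch of the "reflect an impossibly low estimate back above 1" idea from the message above; the function name and the exact clamp bounds are just placeholders for illustration:

#include <stdio.h>

/* raw_rows is the unclamped row estimate, which may well be below 1 */
static double
reflect_low_estimate(double raw_rows, double max_clamp)
{
    if (raw_rows >= 1.0)
        return raw_rows;        /* nothing suspicious; use it as-is */

    /* the smaller the impossible estimate, the larger the corrected guess */
    double      reflected = 1.0 / raw_rows;

    if (reflected < 2.0)
        reflected = 2.0;
    if (reflected > max_clamp)
        reflected = max_clamp;
    return reflected;
}

int
main(void)
{
    printf("%.1f\n", reflect_low_estimate(0.001, 100.0));  /* 100.0 */
    printf("%.1f\n", reflect_low_estimate(0.7, 100.0));    /* 2.0: 1/0.7 is about 1.4, clamped up */
    printf("%.1f\n", reflect_low_estimate(5.0, 100.0));    /* 5.0: left alone */
    return 0;
}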
> >
> > Most of the time when I see that happen it's down to either the
> > selectivity of some correlated base-quals being multiplied down to a
> > number low enough that we clamp the estimate to be 1 row. The other
> > case is similar, but with join quals.
>
> If an estimate is lower than 1, that should be a red flag that Something Is
> Wrong. This is kind of a crazy idea, but what if we threw it back the other
> way by computing 1 / est , and clamping that result to 2 <= res < 10 (or
> 100 or something)? The theory is, the more impossibly low it is, the more
> wrong it is. I'm attracted to the idea of dealing with it as an estimation
> problem and not needing to know about join types. Might have unintended
> consequences, though.
>
> Long term, it would be great to calculate something about the distribution
> of cardinality estimates, so we can model risk in the estimates.
>

Hi,

Laurenz suggested clamping to 2 in this thread in 2017:
https://www.postgresql.org/message-id/1509611428.3268.5.camel%40cybertec.at

Having been the victim of this problem in the past, I like the risk based approach to this. If the cost of being wrong in the estimate is high, use a merge join instead. In every case that I have encountered, that heuristic would have given the correct performant plan.

Regards,
Ken
On Mon, Jun 21, 2021 at 6:41 AM David Rowley <dgrowleyml@gmail.com> wrote:
> For example, in an unparameterized Nested Loop that estimates the
> outer Path to have 1 row will cost an entire additional inner cost if
> there are 2 rows. With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap. I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.

It seems like everyone agrees that it would be good to do something about this problem, but the question is whether it's best to do something that tries to be general, or whether we should instead do something about this specific case. I favor the latter approach. Risk and uncertainty exist all over the place, but dealing with that in a general way seems difficult, and maybe unnecessary. Addressing the case of unparameterized nested loops specifically seems simpler, because it's easier to reason about what the alternatives are.

Your last sentence here seems right on point to me. Basically, what you argue for there is disabling unparameterized nested loops entirely except when we can prove that the inner side will never generate more than one row. But, that's almost never going to be something that we can prove. If the inner side is coming from a table or sub-join, it can turn out to be big. As far as I can see, the only way that this doesn't happen is if it's something like a subquery that aggregates everything down to one row, or has LIMIT 1, but those are such special cases that I don't even think we should be worrying about them.

So my counter-proposal is: how about if we split match_unsorted_outer() into two functions, one of which generates nested loops only based on parameterized paths and the other of which generates nested loops based only on unparameterized paths, and then rejigger add_paths_to_joinrel() so that we do the latter between the steps that are currently number 5 and 6 and only if we haven't got any other paths yet? If somebody later comes up with logic for proving that the inner side can never have more than 1 row, we can let this be run in those cases as well. In the meantime, if somebody does something like a JOIN b ON a.x < b.x, we will still generate these paths because there's no other approach, or similarly if it's a.x = b.x but for some strange type that doesn't have a hash-joinable or merge-joinable equality operator. But otherwise we omit those paths entirely.

--
Robert Haas
EDB: http://www.enterprisedb.com
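A rough, stand-alone sketch of the control flow being proposed. Every type and helper below is a simplified stand-in (the real add_paths_to_joinrel() works on RelOptInfo/JoinPathExtraData and has more steps); the point is only the ordering and the "no other paths yet" test:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for a join relation accumulating candidate paths */
typedef struct SketchJoinRel
{
    int         npaths;
} SketchJoinRel;

static void
add_path_stub(SketchJoinRel *rel, const char *what)
{
    rel->npaths++;
    printf("added: %s\n", what);
}

/*
 * Simplified add_paths_to_joinrel(): unparameterized nestloops are
 * considered last, and only when nothing else could be generated.
 */
static void
sketch_add_paths_to_joinrel(SketchJoinRel *rel,
                            bool has_mergeable_clause,
                            bool has_hashable_clause,
                            bool has_usable_inner_index)
{
    if (has_mergeable_clause)
        add_path_stub(rel, "merge join");
    if (has_usable_inner_index)
        add_path_stub(rel, "parameterized nestloop");
    if (has_hashable_clause)
        add_path_stub(rel, "hash join");

    /* proposed last-resort step: only when the join has no other path */
    if (rel->npaths == 0)
        add_path_stub(rel, "unparameterized nestloop");
}

int
main(void)
{
    SketchJoinRel eq_join = {0};
    SketchJoinRel ineq_join = {0};

    /* a.x = b.x with a hashable operator: no unparameterized nestloop */
    sketch_add_paths_to_joinrel(&eq_join, true, true, false);

    /* a.x < b.x: nothing else works, so the nestloop is still generated */
    sketch_add_paths_to_joinrel(&ineq_join, false, false, false);
    return 0;
}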
On Mon, Jun 21, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 21, 2021 at 6:41 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > For example, in an unparameterized Nested Loop that estimates the
> > outer Path to have 1 row will cost an entire additional inner cost if
> > there are 2 rows. With Hash Join the cost is just an additional
> > hashtable lookup, which is dead cheap. I don't know exactly how
> > add_path() would weigh all that up, but it seems to me that I wouldn't
> > take the risk unless I was 100% certain that the Nested Loop's outer
> > Path would only return 1 row exactly, if there was any chance at all
> > it could return more, I'd be picking some other join method.
>
> It seems like everyone agrees that it would be good to do something
> about this problem, but the question is whether it's best to do
> something that tries to be general, or whether we should instead do
> something about this specific case. I favor the latter approach.

I agree with your conclusion, but FWIW I am sympathetic to David's view too. I certainly understand why he'd naturally want to define the class of problems that are like this one, to understand what the boundaries are.

The heuristic that has the optimizer flat out avoid an unparameterized nested loop join is justified by the belief that that's fundamentally reckless. Even though we all agree on that much, I don't know when it stops being reckless and starts being "too risky for me, but not fundamentally reckless". I think that that's worth living with, but it isn't very satisfying.

> Risk
> and uncertainty exist all over the place, but dealing with that in a
> general way seems difficult, and maybe unnecessary. Addressing the
> case of unparameterized nest loops specifically seems simpler, because
> it's easier to reason about what the alternatives are. Your last
> sentence here seems right on point to me.

Right. Part of why this is a good idea is that the user is exposed to so many individual risks and uncertainties. We cannot see any one risk as existing in a vacuum. It is not the only risk the user will ever take in the planner -- if it was then it might actually be okay to allow unparameterized nested loop joins.

A bad unparameterized nested loop join plan has, in a sense, unknown and unbounded cost/downside. But it is only very slightly faster than a hash join, by a fixed, well understood amount. With enough "trials" and on a long enough timeline, it will inevitably blow up and cause the application to grind to a halt. It seems like no amount of fixed, bounded benefit from "fast unparameterized nested loop joins" could possibly make up for that. The life of Postgres users would be a lot better if bad plans were at least "survivable events".

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> The heuristic that has the optimizer flat out avoids an
> unparameterized nested loop join is justified by the belief that
> that's fundamentally reckless. Even though we all agree on that much,
> I don't know when it stops being reckless and starts being "too risky
> for me, but not fundamentally reckless". I think that that's worth
> living with, but it isn't very satisfying.

There are certainly cases where the optimizer can prove (in principle; it doesn't do so today) that a plan node will produce at most one row. They're hardly uncommon either: an equality comparison on a unique key, or a subquery with a simple aggregate function, come to mind.

In such cases, not only is this choice not reckless, but it's provably superior to a hash join. So in the end this gets back to the planning risk factor that we keep circling around but nobody quite wants to tackle.

I'd be a lot happier if this proposal were couched around some sort of estimate of the risk of the outer side producing more than the expected number of rows. The arguments so far seem like fairly lame rationalizations for not putting forth the effort to do that.

			regards, tom lane
On Mon, Jun 21, 2021 at 8:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

That sounds like it might be useful in general.

> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join. So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Let's assume for the sake of argument that we really have to have that additional infrastructure to move forward with the idea. (I'm not sure if it's possible in principle to use infrastructure like that for some of the cases that Robert has in mind, but for now I'll assume that it is both possible and a practical necessity.)

Even when I make this working assumption I don't see what it changes at a fundamental level. You've merely come up with a slightly more specific definition of the class of plans that are "reckless". You've only refined the original provisional definition of "reckless" to exclude specific "clearly not reckless" cases (I think). But the definition of "reckless" is no less squishy than what we started out with.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I'm not so sure that it is. The point isn't the risk, even if it could be calculated. The point is the downsides of being wrong are huge and pretty much unbounded, whereas the benefits of being right are tiny and bounded. It almost doesn't matter what the underlying probabilities are.

To be clear I'm not arguing against modelling risk. I'm just not sure that it's useful to think of this problem as truly a problem of risk.

--
Peter Geoghegan
On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> The heuristic that has the optimizer flat out avoids an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
>
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join. So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.
>

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.
>

I agree having such a measure would be helpful, but do you have an idea how it could be implemented?

I've been thinking about this a bit recently and searching for papers talking about this, but it's not clear to me how to calculate the risk (and propagate it through the plan) without making the whole cost evaluation way more complicated / expensive :-(

The "traditional approach" to quantifying risk would be confidence intervals, i.e. for each cardinality estimate "e" we'd determine a range [a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how "robust" the plan choice should be (say 0.9 for "risky" and 0.999 for "safe" plans) and calculate a,b. Or maybe we could calculate where the plan changes, and then we'd see if those "break points" are within the confidence interval. If not, great - we can assume the plan is stable, otherwise we'd need to consider the other plans too, somehow.

But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e" and the boundaries "a", "b"). So which one do we use to calculate cost and pass to upper parts of the plan?

2) The outer relation may be a complex join, so we'd need to combine the confidence intervals for the two input relations, somehow.

3) We'd need to know how to calculate the confidence intervals for most plan nodes, which I'm not sure we know how to do.

So it's not clear to me how to do this, which seems rather problematic because we need to propagate and combine those confidence intervals through the plan. But maybe you have thought about some much simpler approach, addressing this sufficiently well?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Jun 21, 2021 at 8:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd be a lot happier if this proposal were couched around some sort
>> of estimate of the risk of the outer side producing more than the
>> expected number of rows. The arguments so far seem like fairly lame
>> rationalizations for not putting forth the effort to do that.

> I'm not so sure that it is. The point isn't the risk, even if it could
> be calculated. The point is the downsides of being wrong are huge and
> pretty much unbounded, whereas the benefits of being right are tiny
> and bounded. It almost doesn't matter what the underlying
> probabilities are.

You're throwing around a lot of pejorative adjectives there without having bothered to quantify any of them. This sounds less like a sound argument to me than like a witch trial.

Reflecting for a bit on the ancient principle that "the only good numbers in computer science are 0, 1, and N", it seems to me that we could make an effort to classify RelOptInfos as provably empty, provably at most one row, and others. (This would be a property of relations, not individual paths, so it needn't bloat struct Path.) We already understand about provably-empty rels, so this is just generalizing that idea a little. Once we know about that, at least for the really-common cases like unique keys, I'd be okay with a hard rule that we don't consider unparameterized nestloop joins with an outer side that falls in the "N" category. Unless there's no alternative, of course.

Another thought that occurs to me here is that maybe we could get rid of the enable_material knob in favor of forcing (or at least encouraging) materialization when the outer side isn't provably one row.

			regards, tom lane
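A small sketch of how such a three-way classification plus the hard rule might look; the enum and the function below are hypothetical stand-ins, not anything that exists in the planner today:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical "0, 1, N" classification of a relation's row count */
typedef enum RelRowsClass
{
    REL_ROWS_PROVABLY_EMPTY,    /* provably returns no rows */
    REL_ROWS_AT_MOST_ONE,       /* e.g. equality on a unique key, simple aggregate */
    REL_ROWS_MANY               /* everything else: "N" */
} RelRowsClass;

/*
 * The hard rule under discussion: refuse an unparameterized nestloop whose
 * outer side is in the "N" class, unless no alternative join path exists.
 */
static bool
allow_unparameterized_nestloop(RelRowsClass outer_class, bool have_alternative)
{
    if (outer_class != REL_ROWS_MANY)
        return true;
    return !have_alternative;
}

int
main(void)
{
    printf("%d\n", allow_unparameterized_nestloop(REL_ROWS_AT_MOST_ONE, true));  /* 1 */
    printf("%d\n", allow_unparameterized_nestloop(REL_ROWS_MANY, true));         /* 0 */
    printf("%d\n", allow_unparameterized_nestloop(REL_ROWS_MANY, false));        /* 1 */
    return 0;
}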
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I've been thinking about this a bit recently and searching for papers
> talking about this, and but it's not clear to me how to calculate the
> risk (and propagate it through the plan) without making the whole cost
> evaluation way more complicated / expensive :-(

Yeah, a truly complete approach using confidence intervals or the like seems frighteningly complicated.

> But maybe you have thought about some much simpler approach, addressing
> this sufficiently well?

See my nearby response to Peter. The main case that's distressing me is the possibility of forcing a hash join even when the outer side is obviously only one row. If we could avoid that, at least for large values of "obvious", I'd be far more comfortable.

			regards, tom lane
On Mon, Jun 21, 2021 at 9:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm not so sure that it is. The point isn't the risk, even if it could
> > be calculated. The point is the downsides of being wrong are huge and
> > pretty much unbounded, whereas the benefits of being right are tiny
> > and bounded. It almost doesn't matter what the underlying
> > probabilities are.
>
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them. This sounds less like a sound
> argument to me than like a witch trial.

I'm not sure what you mean by pejorative. Isn't what I said about downsides and upsides pretty accurate?

> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others. (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.) We already understand about
> provably-empty rels, so this is just generalizing that idea a little.

It sounds like you're concerned about properly distinguishing between:

1. Cases where the only non-reckless choice is a hash join over an unparameterized nested loop join

2. Cases that look like that at first, that don't really have that quality on closer examination.

This seems like a reasonable concern.

> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.

I thought that you were arguing against the premise itself. It's now clear that you weren't, though.

I don't have an opinion for or against bringing the provably-empty rels stuff into the picture. At least not right now.

--
Peter Geoghegan
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.

Hmm, maybe I need to see an example of the sort of plan shape that you have in mind. To me it feels like a comparison on a unique key ought to use a *parameterized* nested loop. And it's also not clear to me why a nested loop is actually better in a case like this. If the nested loop iterates more than once because there are more rows on the outer side, then you don't want to have something on the inner side that might be expensive, and either an aggregate or an unparameterized search for a unique value are potentially quite expensive. Now if you put a materialize node on top of the inner side, then you don't have to worry about that problem, but how much are you saving at that point vs. just doing a hash join?

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I don't understand how to generate a risk assessment or what we ought to do with it if we had one. I don't even understand what units we would use. We measure costs using abstract cost units, but those abstract cost units are intended to be a proxy for runtime. If it's not the case that a plan that runs for longer has a higher cost, then something's wrong with the costing model or the settings.

In the case of risk, the whole thing seems totally subjective. We're talking about the risk that our estimate is bogus, but how do we estimate the risk that we don't know how to estimate? Given quals (x + 0) = x, x = some MCV, and x = some non-MCV, we can say that we're most likely to be wrong about the first one and least likely to be wrong about the second one, but how much more likely? I don't know how you can decide that, even in principle. We can also say that an unparameterized nested loop is more risky than some other plan because it could turn out to be crazy expensive, but is that more or less risky than scanning the index on b as a way to implement SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1? How much more risky, and why?

And then, even supposing we had a risk metric for every path, what exactly would we do with it? Suppose path A is cheaper than path B, but also more risky. Which should we keep? We could keep both, but that seems to be just kicking the can down the road. If plan B is likely to blow up in our face, we should probably just get rid of it, or not even generate it in the first place. Even if we do keep both, at some point we're going to have to make a cost-vs-risk tradeoff, and I don't see how to do that intelligently, because the point is precisely that if the risk is high, the cost number might be totally wrong. If we know that plan A is cheaper than plan B, we should choose plan A. But if all we know is that plan A would be cheaper than plan B if our estimate of the cost were correct, but also that it probably isn't, then what we actually know is nothing. We have no principled basis for deciding anything based on cost unless we're reasonably confident that the cost estimate is pretty good.
So AFAICT the only principled strategy here is to throw away high-risk paths as early as we possibly can. What am I missing?

The other thing is - the risk of a particular path doesn't matter in an absolute sense, only a relative one. In the particular case I'm on about here, we *know* there's a less-risky alternative. We don't need to quantify the risk to know which of the two options has more. In many other cases, the risk is irreducible, e.g. a default estimate could be totally bogus, but switching paths is of no help in getting rid of it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jun 21, 2021 at 1:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 21, 2021 at 9:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I'm not so sure that it is. The point isn't the risk, even if it could
> > > be calculated. The point is the downsides of being wrong are huge and
> > > pretty much unbounded, whereas the benefits of being right are tiny
> > > and bounded. It almost doesn't matter what the underlying
> > > probabilities are.
> >
> > You're throwing around a lot of pejorative adjectives there without
> > having bothered to quantify any of them. This sounds less like a sound
> > argument to me than like a witch trial.
>
> I'm not sure what you mean by pejorative. Isn't what I said about
> downsides and upsides pretty accurate?

Yeah, I don't see why Peter's characterization deserves to be labelled as pejorative here. A hash join or merge join or parameterized nested loop can turn out to be slower than some other algorithm, but all of those approaches have some component that tends to make the asymptotic cost less than the product of the sizes of the inputs. I don't think that's true in absolutely every case; for example, if a merge join has every row duplicated on both sides, it will have to scan every inner tuple once per outer tuple, just like a nested loop, and the other algorithms also are going to degrade toward O(NM) performance in the face of many duplicates. Also, a hash join can be pretty close to that if it needs a shazillion batches. But in normal cases, any algorithm other than an unparameterized nested loop tends to read each input tuple on each side ONCE, so the cost is more like the sum of the input sizes than the product. And there's nothing pejorative in saying that N + M can be less than N * M by an unbounded amount. That's just the facts.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 21, 2021 at 11:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There are certainly cases where the optimizer can prove (in principle;
>> it doesn't do so today) that a plan node will produce at most one row.
>> They're hardly uncommon either: an equality comparison on a unique
>> key, or a subquery with a simple aggregate function, come to mind.

> Hmm, maybe I need to see an example of the sort of plan shape that you
> have in mind. To me it feels like a comparison on a unique key ought
> to use a *parameterized* nested loop.

The unique-key comparison would be involved in the outer scan in the cases I'm thinking of. As an example,

	select * from t1, t2 where t1.id = constant and t1.x op t2.y;

where I'm not assuming much about the properties of "op". This could be amenable to a plan like

	NestLoop Join
	  Join Filter: t1.x op t2.y
	  -> Index Scan on t1_pkey
	       Index Cond: t1.id = constant
	  -> Seq Scan on t2

and if we can detect that the pkey indexscan produces just one row, this is very possibly the best available plan. Nor do I think this is an unusual situation that we can just ignore.

BTW, it strikes me that there might be an additional consideration here: did parameterization actually help anything? That is, the proposed rule wants to reject the above but allow

	NestLoop Join
	  -> Index Scan on t1_pkey
	       Index Cond: t1.id = constant
	  -> Seq Scan on t2
	       Filter: t1.x op t2.y

even though the latter isn't meaningfully better. It's possible this won't arise because we don't consider parameterized paths except where the parameter is used in an indexqual or the like, but I'm not confident of that. See in particular reparameterize_path and friends before you assert there's no such issue. So we might need to distinguish essential from incidental parameterization, or something like that.

			regards, tom lane
I wrote:
> ... As an example,
>	select * from t1, t2 where t1.id = constant and t1.x op t2.y;
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>	NestLoop Join
>	  Join Filter: t1.x op t2.y
>	  -> Index Scan on t1_pkey
>	       Index Cond: t1.id = constant
>	  -> Seq Scan on t2

Also, to enlarge on that example: if "op" isn't hashable then the original argument is moot. However, it'd still be useful to know if the outer scan is sure to return no more than one row, as that could inform the choice whether to plaster a Materialize node on the inner scan.

			regards, tom lane
On Mon, Jun 21, 2021 at 1:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Hmm, maybe I need to see an example of the sort of plan shape that you
> > have in mind. To me it feels like a comparison on a unique key ought
> > to use a *parameterized* nested loop.
>
> The unique-key comparison would be involved in the outer scan in
> the cases I'm thinking of. As an example,
>
>	select * from t1, t2 where t1.id = constant and t1.x op t2.y;
>
> where I'm not assuming much about the properties of "op".
> This could be amenable to a plan like
>
>	NestLoop Join
>	  Join Filter: t1.x op t2.y
>	  -> Index Scan on t1_pkey
>	       Index Cond: t1.id = constant
>	  -> Seq Scan on t2
>
> and if we can detect that the pkey indexscan produces just one row,
> this is very possibly the best available plan.

Hmm, yeah, I guess that's possible. How much do you think this loses as compared with:

	Hash Join
	  Hash Cond: t1.x op t2.y
	  -> Seq Scan on t2
	  -> Hash
	       -> Index Scan on t1_pkey

(If the operator is not hashable then this plan is impractical, but in such a case the question of preferring the hash join over the nested loop doesn't arise anyway.)

> BTW, it strikes me that there might be an additional consideration
> here: did parameterization actually help anything? That is, the
> proposed rule wants to reject the above but allow
>
>	NestLoop Join
>	  -> Index Scan on t1_pkey
>	       Index Cond: t1.id = constant
>	  -> Seq Scan on t2
>	       Filter: t1.x op t2.y
>
> even though the latter isn't meaningfully better. It's possible
> this won't arise because we don't consider parameterized paths
> except where the parameter is used in an indexqual or the like,
> but I'm not confident of that. See in particular reparameterize_path
> and friends before you assert there's no such issue. So we might
> need to distinguish essential from incidental parameterization,
> or something like that.

Hmm, perhaps. I think it won't happen in the normal cases, but I can't completely rule out the possibility that there are corner cases where it does.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 21, 2021 at 1:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>	NestLoop Join
>>	  Join Filter: t1.x op t2.y
>>	  -> Index Scan on t1_pkey
>>	       Index Cond: t1.id = constant
>>	  -> Seq Scan on t2

> Hmm, yeah, I guess that's possible. How much do you think this loses
> as compared with:

>	Hash Join
>	  Hash Cond: t1.x op t2.y
>	  -> Seq Scan on t2
>	  -> Hash
>	       -> Index Scan on t1_pkey

Hard to say. The hash overhead might or might not pay for itself. If the equality operator proper is expensive and we get to avoid applying it to most t2 rows, then this might be an OK alternative; otherwise not so much.

In any case, the former looks like plans that we generate now, the second not. Do you really want to field a lot of questions about why we suddenly changed to a not-clearly-superior plan?

			regards, tom lane
On Mon, Jun 21, 2021 at 10:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, maybe I need to see an example of the sort of plan shape that you > have in mind. To me it feels like a comparison on a unique key ought > to use a *parameterized* nested loop. And it's also not clear to me > why a nested loop is actually better in a case like this. If the > nested loop iterates more than once because there are more rows on the > outer side, then you don't want to have something on the inner side > that might be expensive, and either an aggregate or an unparameterized > search for a unique value are potentially quite expensive. Now if you > put a materialize node on top of the inner side, then you don't have > to worry about that problem, but how much are you saving at that point > vs. just doing a hash join? I suspected that that was true, but even that doesn't seem like the really important thing. While it may be true that the simple heuristic can't be quite as simple as we'd hoped at first, ISTM that this is ultimately not much of a problem. The basic fact remains: some more or less simple heuristic makes perfect sense, and should be adapted. This conclusion is counterintuitive because it's addressing a very complicated problem with a very simple solution. However, if we lived in a world where things that sound too good to be true always turned out to be false, we'd also live in a world where optimizers were completely impractical and useless. Optimizers have that quality already, whether or not we officially acknowledge it. > I don't understand how to generate a risk assessment or what we ought > to do with it if we had one. I don't even understand what units we > would use. We measure costs using abstract cost units, but those > abstract cost units are intended to be a proxy for runtime. If it's > not the case that a plan that runs for longer has a higher cost, then > something's wrong with the costing model or the settings. In the case > of risk, the whole thing seems totally subjective. We're talking about > the risk that our estimate is bogus, but how do we estimate the risk > that we don't know how to estimate? Clearly we need a risk estimate for our risk estimate! > The other thing is - the risk of a particular path doesn't matter in > an absolute sense, only a relative one. In the particular case I'm on > about here, we *know* there's a less-risky alternative. Exactly! This, a thousand times. This reminds me of how people behave in the real world. In the real world people deal with this without too much difficulty. Everything is situational and based on immediate trade-offs, with plenty of uncertainty at every step. For example, if you think that there is even a tiny chance of a piece of fruit being poisonous, you don't eat the piece of fruit -- better to wait until lunchtime. This is one of the *easiest* decisions I can think of, despite the uncertainty. (Except perhaps if you happen to be in danger of dying of starvation, in which case it might be a very different story. And so on.) -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Jun 21, 2021 at 10:14 AM Robert Haas <robertmhaas@gmail.com> wrote: >> The other thing is - the risk of a particular path doesn't matter in >> an absolute sense, only a relative one. In the particular case I'm on >> about here, we *know* there's a less-risky alternative. > Exactly! This, a thousand times. This is a striking oversimplification. You're ignoring the fact that the plan shape we generate now is in fact *optimal*, and easily proven to be so, in some very common cases. I don't think the people whose use-cases get worse are going to be mollified by the argument that you reduced their risk, when there is provably no risk. Obviously the people whose use-cases are currently hitting the wrong end of the risk will be happy with any change whatever, but those may not be the same people. I'm willing to take some flak if there's not an easy proof that the outer scan is single-row, but I don't think we should just up and change it for cases where there is. regards, tom lane
On Mon, Jun 21, 2021 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > You're ignoring the fact that the plan shape we generate now is in fact > *optimal*, and easily proven to be so, in some very common cases. As I've said I don't reject the idea that there is room for disagreement on the specifics. For example perhaps it'll turn out that only a restricted subset of the cases that Robert originally had in mind will truly turn out to work as well as hoped. But that just seems like a case of Robert refining a very preliminary proposal. I absolutely expect there to be some need to iron out the wrinkles. > I don't > think the people whose use-cases get worse are going to be mollified by > the argument that you reduced their risk, when there is provably no risk. By definition what we're doing here is throwing away slightly cheaper plans when the potential downside is much higher than the potential upside of choosing a reasonable alternative. I don't think that the downside is particularly likely. In fact I believe that it's fairly unlikely in general. This is an imperfect trade-off, at least in theory. I fully own that. > I'm willing to take some flak if there's not an easy proof that the outer > scan is single-row, but I don't think we should just up and change it > for cases where there is. Seems reasonable to me. -- Peter Geoghegan
On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote: > You're throwing around a lot of pejorative adjectives there without > having bothered to quantify any of them. This sounds less like a sound > argument to me than like a witch trial. > > Reflecting for a bit on the ancient principle that "the only good numbers > in computer science are 0, 1, and N", it seems to me that we could make > an effort to classify RelOptInfos as provably empty, provably at most one > row, and others. (This would be a property of relations, not individual > paths, so it needn't bloat struct Path.) We already understand about > provably-empty rels, so this is just generalizing that idea a little. > Once we know about that, at least for the really-common cases like unique > keys, I'd be okay with a hard rule that we don't consider unparameterized > nestloop joins with an outer side that falls in the "N" category. > Unless there's no alternative, of course. > > Another thought that occurs to me here is that maybe we could get rid of > the enable_material knob in favor of forcing (or at least encouraging) > materialization when the outer side isn't provably one row. There were a lot of interesting ideas in this thread and I want to analyze some of them. First, there is the common assumption (not stated) that over-estimating by 5% and underestimating by 5% cause the same harm, which is clearly false. If I go to a restaurant and estimate the bill to be 5% higher or 5% lower, assuming I have sufficient funds, under or over estimating is probably fine. If I am driving and under-estimate the traction of my tires, I am probably fine, but if I over-estimate their traction by 5%, I might crash. Closer to home, Postgres is more tolerant of memory usage under-estimation than over-estimation: https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018 What I hear Robert saying is that unparameterized nested loops are much more sensitive to misestimation than hash joins, and only slightly faster than hash joins when they use accurate row counts, so we might want to avoid them by default. Tom is saying that if we know the outer side will have zero or one row, we should still do unparameterized nested loops because those are not more sensitive to misestimation than hash joins, and slightly faster. If that is accurate, I think the big question is how common are cases where the outer side can't be proven to have zero or one row and nested loops are enough of a win to risk its greater sensitivity to misestimation. If it is uncommon, seems we could just code the optimizer to use hash joins in those cases without a user-visible knob, beyond the knob that already turns off nested loop joins. Peter's comment about having nodes in the executor that adjust to the row counts it finds is interesting, and eventually might be necessary once we are sure we have gone as far as we can without it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian <bruce@momjian.us> wrote: > There were a lot of interesting ideas in this thread and I want to > analyze some of them. First, there is the common assumption (not > stated) that over-estimating by 5% and underestimating by 5% cause the > same harm, which is clearly false. If I go to a restaurant and estimate > the bill to be 5% higher or %5 lower, assuming I have sufficient funds, > under or over estimating is probably fine. If I am driving and > under-estimate the traction of my tires, I am probably fine, but if I > over-estimate their traction by 5%, I might crash. My favorite analogy is the home insurance one: It might make sense to buy home insurance because losing one's home (say through fire) is a loss that usually just cannot be tolerated -- you are literally ruined. We can debate how likely it is to happen, but in the end it's not so unlikely that it can't be ruled out. At the same time I may be completely unwilling to buy insurance for personal electronic devices. I can afford to replace all of them if I truly have to. And the chances of all of them breaking or being stolen on the same day is remote (unless my home burns down!). If I drop my cell phone and crack the screen, I'll be annoyed, but it's certainly not the end of the world. This behavior will make perfect sense to most people. But it doesn't scale up or down. I have quite a few electronic devices, but only a single home, so technically I'm taking risks way more often than I am playing it safe here. Am I risk tolerant when it comes to insurance? Conservative? I myself don't think that it is sensible to apply either term here. It's easier to just look at the specifics. A home is a pretty important thing to almost everybody; we can afford to treat it as a special case. > If that is accurate, I think the big question is how common are cases > where the outer side can't be proven to have zero or one row and nested > loops are enough of a win to risk its greater sensitivity to > misestimation. If it is uncommon, seems we could just code the > optimizer to use hash joins in those cases without a user-visible knob, > beyond the knob that already turns off nested loop joins. I think it's possible that Robert's proposal will lead to very slightly slower plans in the vast majority of cases that are affected, while still being a very good idea. Why should insurance be 100% free, though? Maybe it can be in some cases where we get lucky, but why should that be the starting point? It just has to be very cheap relative to what we do today for us to come out ahead, certainly, but that seems quite possible in at least this case. -- Peter Geoghegan
On 6/22/21 2:25 AM, Peter Geoghegan wrote: > On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian <bruce@momjian.us> wrote: >> There were a lot of interesting ideas in this thread and I want to >> analyze some of them. First, there is the common assumption (not >> stated) that over-estimating by 5% and underestimating by 5% cause the >> same harm, which is clearly false. If I go to a restaurant and estimate >> the bill to be 5% higher or 5% lower, assuming I have sufficient funds, >> under or over estimating is probably fine. If I am driving and >> under-estimate the traction of my tires, I am probably fine, but if I >> over-estimate their traction by 5%, I might crash. > > My favorite analogy is the home insurance one: > > It might make sense to buy home insurance because losing one's home > (say through fire) is a loss that usually just cannot be tolerated -- > you are literally ruined. We can debate how likely it is to happen, > but in the end it's not so unlikely that it can't be ruled out. At the > same time I may be completely unwilling to buy insurance for personal > electronic devices. I can afford to replace all of them if I truly > have to. And the chances of all of them breaking or being stolen on > the same day is remote (unless my home burns down!). If I drop my cell > phone and crack the screen, I'll be annoyed, but it's certainly not > the end of the world. > > This behavior will make perfect sense to most people. But it doesn't > scale up or down. I have quite a few electronic devices, but only a > single home, so technically I'm taking risks way more often than I am > playing it safe here. Am I risk tolerant when it comes to insurance? > Conservative? > > I myself don't think that it is sensible to apply either term here. > It's easier to just look at the specifics. A home is a pretty > important thing to almost everybody; we can afford to treat it as a > special case. > >> If that is accurate, I think the big question is how common are cases >> where the outer side can't be proven to have zero or one row and nested >> loops are enough of a win to risk its greater sensitivity to >> misestimation. If it is uncommon, seems we could just code the >> optimizer to use hash joins in those cases without a user-visible knob, >> beyond the knob that already turns off nested loop joins. > > I think it's possible that Robert's proposal will lead to very > slightly slower plans in the vast majority of cases that are affected, > while still being a very good idea. Why should insurance be 100% free, > though? Maybe it can be in some cases where we get lucky, but why > should that be the starting point? It just has to be very cheap > relative to what we do today for us to come out ahead, certainly, but > that seems quite possible in at least this case. > Yeah, I like the insurance analogy - it gets to the crux of the problem, because insurance is pretty much exactly about managing risk. But making everything slower will be a hard sell, because the vast majority of workloads already running on Postgres don't have this issue at all, so for them it's not worth the expense. Following the insurance analogy, selling tornado insurance in Europe is mostly pointless. Insurance is also about personal preference / risk tolerance. Maybe I'm fine with accepting risk that my house burns down, or whatever ... And the lack of data also plays a role - the insurance company will ask for higher rates when it does not have enough accurate data about the phenomenon, or when there's a lot of unknowns. 
Maybe this would allow some basic measure of uncertainty, based on the number and type of restrictions, joins, etc. The more restrictions we have, the less certain the estimates are. Some conditions are estimated less accurately, and using default estimates makes things much less accurate. So maybe some fairly rough measure of uncertainty might work, and the user might specify how much risk they're willing to tolerate. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> But making everything slower will be a hard sell, because vast majority of > workloads already running on Postgres don't have this issue at all, so > for them it's not worth the expense. Following the insurance analogy, > selling tornado insurance in Europe is mostly pointless. > Agree. I've been surprised about NOT hearing complaints from PostgreSQL customers about a particular "bad" plan choice that was common in other rdbms products where large, complex queries were the norm. The situation occurs late in a plan with many joins where a hash join can be used and where either side is estimated to fit in memory. On one side is a base table with cardinality that we have statistics for, while the other side has an estimated cardinality that is the result of many estimates each of which has error that can compound, and that in some cases amounts to a wild guess. (e.g. what is the selectivity of SUM(x) < 12 ?). If the planner's point estimate of cardinality is such that both sides could fit in memory, then a bad plan can easily be made. As Peter said, [ most ] humans have no trouble dealing with these kind of situations. They take the risk of being wrong into account. So in our world, the useful numbers are 0, 1, measured N, and estimated N, but we don't distinguish between measured N and estimated N. But that doesn't mean that OLTP customers would be willing to accept slightly suboptimal plans to mitigate a risk they don't experience. > Insurance is also about personal preference / risk tolerance. Maybe I'm > fine with accepting risk that my house burns down, or whatever ... Right, and that's why the problem mentioned above is still out there annoying customers who have complex plans. To them it looks like an obviously bad plan choice. Something that might help is to have the planner cost be a structure instead of a number. Costs of plans that are deemed "risky" are accumulated separately from plans that make no risky choices, and for a given set of join items you keep the minimum cost plan of both types. It may happen that all plans eventually make a risky choice, in which case you take the plan with the minimum cost, but if there exists a plan with no risky choices, then the minimum cost plan with no risky choices is chosen, with a GUC that enables a customer to ignore risk when making this choice. This is not in the spirit of the hoped for simple heuristic, and it would be heuristic in its classification of plans that are risky, but in the NLJ case the cost of an unparameterized NLJ could be deemed risky if the cardinality of the inner relation is not 0, 1, or measured N.
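A minimal sketch of the "cost as a structure" idea from the message above, in plain C with invented names. This is not a patch against the PostgreSQL planner (which would need to track the risky and risk-free minima throughout join search rather than in a single comparator); it only shows the shape of the proposed choice rule.

/*
 * Sketch of the "cost as a structure" idea: carry a "risky" flag alongside
 * the cost, and prefer risk-free paths when one exists.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct PlanCost
{
    double cost;   /* the usual estimated total cost */
    bool   risky;  /* true if any node made a "risky" choice, e.g. an
                    * unparameterized nestloop whose inner cardinality is
                    * neither 0, 1, nor measured N */
} PlanCost;

static bool ignore_risk = false;    /* the GUC suggested above */

/* Does candidate "a" beat current best "b"? */
static bool
plancost_is_better(PlanCost a, PlanCost b)
{
    if (!ignore_risk && a.risky != b.risky)
        return !a.risky;            /* a risk-free plan wins regardless of cost */
    return a.cost < b.cost;
}

int main(void)
{
    PlanCost risky_nestloop = { 900.0, true };
    PlanCost safe_hashjoin  = { 1000.0, false };

    printf("prefer nestloop? %s\n",
           plancost_is_better(risky_nestloop, safe_hashjoin) ? "yes" : "no");
    return 0;
}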
On Mon, Jun 21, 2021 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm willing to take some flak if there's not an easy proof that the outer > scan is single-row, but I don't think we should just up and change it > for cases where there is. I think that's a reasonable request. I'm not sure that I believe it's 100% necessary, but it's certainly an improvement on a technical level, and given that the proposed change could impact quite a lot of plans, it's fair to want to see some effort being put into mitigating the possible downsides. Now, I'm not sure when I might have time to actually try to do the work, which kind of sucks, but that's how it goes sometimes. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 22, 2021 at 2:53 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Yeah, I like the insurance analogy - it gets to the crux of the problem, > because insurance is pretty much exactly about managing risk. The user's exposure to harm is what truly matters. I admit that that's very hard to quantify, but we should at least try to do so. We sometimes think about a plan that is 10x slower as if it's infinitely slow, or might as well be. But it's usually not like that -- it is generally meaningfully much better than the plan being 100x slower, which is itself sometimes appreciably better than 1000x slower. And besides, users often don't get anything like the optimal plan, even on what they would consider to be a good day (which is most days). So maybe 10x slower is actually the baseline good case already, without anybody knowing it. Most individual queries are not executed very often, even on the busiest databases. The extremes really do matter a lot. If a web app or OLTP query is ~10x slower than optimal then it might be the practical equivalent of an outage that affects the query alone (i.e. "infinitely slow") -- but probably not. I think that it is worth paying more than nothing to avoid plans that are so far from optimal that they might as well take forever to execute. This is not meaningfully different from a database outage affecting one particular query. It kind of is in a category of its own that surpasses "slow plan", albeit one that is very difficult or impossible to define formally. There may be a huge amount of variation in risk tolerance among basically reasonable people. For example, if somebody chooses to engage in some kind of extreme sport, to me it seems understandable. It's just not my cup of tea. Whereas if somebody chooses to never wear a seatbelt while driving, then to me they're simply behaving foolishly. They're not willing to incur the tiniest inconvenience in order to get a huge reduction in potential harm -- including a very real risk of approximately the worst thing that can happen to you. Sure, refusing to wear a seatbelt can theoretically be classified as just another point on the risk tolerance spectrum, but that seems utterly contrived to me. Some things (maybe not that many) really are like that, or can at least be assumed to work that way as a practical matter. > But making > everything slower will be a hard sell, because wast majority of > workloads already running on Postgres don't have this issue at all, so > for them it's not worth the expense. I think that we're accepting too much risk here. But I bet it's also true that we're not taking enough risk in other areas. That was really my point with the insurance analogy -- we can afford to take lots of individual risks as long as they don't increase our exposure to truly disastrous outcomes -- by which I mean queries that might as well take forever to execute as far as the user is concerned. (Easier said than done, of course.) A simple trade-off between fast and robust doesn't seem like a universally helpful thing. Sometimes it's a very unhelpful way of looking at the situation. If you make something more robust to extreme bad outcomes, then you may have simultaneously made it *faster* (not slower) for all practical purposes. This can happen when the increase in robustness allows the user to tune the system aggressively, and only take on new risks that they can truly live with (which wouldn't have been possible without the increase in robustness). 
For example, I imagine that the failsafe mechanism added to VACUUM will actually make it possible to tune VACUUM much more aggressively -- it might actually end up significantly improving performance for all practical purposes, even though technically it has nothing to do with performance. Having your indexes a little more bloated because the failsafe kicked-in is a survivable event -- the DBA lives to fight another day, and *learns* to tune vacuum/the app so it doesn't happen again and again. An anti-wraparound failure is perhaps not a survivable event -- the DBA gets fired. This really does seem like a fundamental difference to me. > Following the insurance analogy, > selling tornado insurance in Europe is mostly pointless. Principled skepticism of this kind of thing is of course necessary and welcome. It *could* be taken too far. > And the lack of data also plays role - the insurance company will ask > for higher rates when it does not have enough accurate data about the > phenomenon, or when there's a lot of unknowns. Maybe this would allow > some basic measure of uncertainty, based on the number and type of > restrictions, joins, etc. I don't think that you can really model uncertainty. But you can have true certainty (or close to it) about a trade-off that makes the system fundamentally more robust over time. You can largely be certain about both the cost of the insurance, as well as how it ameliorates the problem in at least some cases. > So maybe some fairly rough measure of uncertainty might work, and the > user might specify how much risk it's willing to tolerate. I think that most or all of the interesting stuff is where you have this extreme asymmetry -- places where it's much more likely to be true that basically everybody wants that. Kind of like wearing seatbelts -- things that we really can claim are a universal good without too much controversy. There might be as few as one or two things in the optimizer that this could be said of. But they matter. -- Peter Geoghegan
On Tue, Jun 22, 2021 at 4:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think that's a reasonable request. I'm not sure that I believe it's > 100% necessary, but it's certainly an improvement on a technical > level, and given that the proposed change could impact quite a lot of > plans, it's fair to want to see some effort being put into mitigating > the possible downsides. Now, I'm not sure when I might have time to > actually try to do the work, which kind of sucks, but that's how it > goes sometimes. Should I take it that you've dropped this project? I was rather hoping that you'd get back to it, FWIW. -- Peter Geoghegan
On Wed, Jan 12, 2022 at 7:20 PM Peter Geoghegan <pg@bowt.ie> wrote: > Should I take it that you've dropped this project? I was rather hoping > that you'd get back to it, FWIW. Honestly, I'd forgotten all about it. In theory I'd like to do something about this, but I just have too many other things going on. -- Robert Haas EDB: http://www.enterprisedb.com
I'd like to revamp this important discussion. As is well described in this fairly recent paper here https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) "estimation errors quickly grow as the number of joins increases, and that these errors are usually the reason for bad plans" - I think we can all get behind that statement. While nested loop joins work great when cardinality estimates are correct, they are notoriously bad when the optimizer underestimates and they degrade very fast in such cases - the good vs. bad here is very asymmetric. On the other hand hash joins degrade much more gracefully - they are considered very robust against underestimation. The above mentioned paper illustrates that all major DBMS (including Postgres) tend to underestimate and usually that underestimation increases drastically with the number of joins (see Figures 3+4 of the paper). Now, a simple approach to guarding against bad plans that arise from underestimation could be to use what I would call a nested-loop-conviction-multiplier based on the current depth of the join tree, e.g. for a base table that multiplier would obviously be 1, but would then grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew the cardinality estimates themselves, but rather be applied to the overall nested loop join cost at each particular stage of the plan when comparing it to other more robust join strategies like hash or sort-merge joins. That way when we can be sure to have a good estimate at the bottom of the join tree we treat all things equal, but favor nested loops less and less as we move up the join tree for the sake of robustness. Also, we can expand the multiplier whenever we fall back to using the default cardinality constant as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance and that could easily be incorporated by simply increasing the conviction-multiplier. What are your thoughts on this simple idea - is it perhaps too simple? Cheers, Ben
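A rough sketch of what the conviction-multiplier heuristic described above could look like. The function names, quadratic growth rate, and constants are all invented for illustration: the cardinality estimates are left alone, and only the unparameterized nested loop's cost is inflated more and more with join depth before being compared against the alternative.

/*
 * Sketch of the conviction-multiplier idea: penalize an unparameterized
 * nested loop by a factor that grows with join depth, and jumps when a
 * default selectivity constant was used somewhere below.
 */
#include <stdbool.h>
#include <stdio.h>

static double
nestloop_conviction_multiplier(int join_depth, bool used_default_selectivity)
{
    double m = 1.0 + 0.5 * join_depth * join_depth;     /* quadratic in depth */

    if (used_default_selectivity)
        m *= 10.0;      /* all bets are off: strongly disfavor the nestloop */
    return m;
}

int main(void)
{
    double nestloop_cost = 900.0;   /* raw estimate, slightly cheaper ... */
    double hashjoin_cost = 1000.0;  /* ... than the hash join alternative */

    for (int depth = 0; depth <= 3; depth++)
    {
        double penalized = nestloop_cost *
            nestloop_conviction_multiplier(depth, false);

        printf("depth %d: penalized nestloop cost %.0f -> %s\n",
               depth, penalized,
               penalized < hashjoin_cost ? "keep nestloop" : "prefer hashjoin");
    }
    return 0;
}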
On Thu, Sep 29, 2022 at 7:32 AM Benjamin Coutu <ben.coutu@zeyos.com> wrote: > Also, we can expand the multiplier whenever we fall back to using the default cardinality constant as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance and that could easily be incorporated by simply increasing the conviction-multiplier. > > What are your thoughts on this simple idea - is it perhaps too simple? Offhand I'd say it's more likely to be too complicated. Without meaning to sound glib, the first question that occurs to me is "will we also need a conviction multiplier conviction multiplier?". Anything like that is going to have unintended consequences that might very well be much worse than the problem that you set out to solve. Personally I still like the idea of just avoiding unparameterized nested loop joins altogether when an "equivalent" hash join plan is available. I think of it as preferring the hash join plan because it will have virtually the same performance characteristics when you have a good cardinality estimate (probably very often), but far better performance characteristics when you don't. We can perhaps be approximately 100% sure that something like that will be true in all cases, no matter the details. That seems like a very different concept to what you've proposed. -- Peter Geoghegan
On Thu, Sep 29, 2022 at 04:12:06PM -0700, Peter Geoghegan wrote: > Offhand I'd say it's more likely to be too complicated. Without > meaning to sound glib, the first question that occurs to me is "will > we also need a conviction multiplier conviction multiplier?". Anything > like that is going to have unintended consequences that might very > well be much worse than the problem that you set out to solve. > > Personally I still like the idea of just avoiding unparameterized > nested loop joins altogether when an "equivalent" hash join plan is > available. I think of it as preferring the hash join plan because it > will have virtually the same performance characteristics when you have > a good cardinality estimate (probably very often), but far better > performance characteristics when you don't. We can perhaps be > approximately 100% sure that something like that will be true in all > cases, no matter the details. That seems like a very different concept > to what you've proposed. I think the point the original poster was making, and I have made in the past, is that even if two optimizer costs are the same, one might be more penalized by misestimation than the other, and we don't have a good way of figuring that into our plan choices. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Thu, Sep 29, 2022 at 4:27 PM Bruce Momjian <bruce@momjian.us> wrote: > I think the point the original poster as making, and I have made in the > past, is that even of two optimizer costs are the same, one might be > more penalized by misestimation than the other, and we don't have a good > way of figuring that into our plan choices. Right. But that seems fraught with difficulty. I suspect that the costs that the planner attributes to each plan often aren't very reliable in any absolute sense, even when everything is working very well by every available metric. Even a very noisy cost model with somewhat inaccurate selectivity estimates will often pick the cheapest plan, or close enough. Having a cost-based optimizer that determines the cheapest plan quite reliably is one thing. Taking the same underlying information and adding the dimension of risk to it and expecting a useful result is quite another -- that seems far harder. -- Peter Geoghegan
Bruce Momjian <bruce@momjian.us> writes: > I think the point the original poster as making, and I have made in the > past, is that even of two optimizer costs are the same, one might be > more penalized by misestimation than the other, and we don't have a good > way of figuring that into our plan choices. Agreed, but dealing with uncertainty in those numbers is an enormous task if you want to do it right. "Doing it right", IMV, would start out by extending all the selectivity estimation functions to include error bars; then we could have error bars on rowcount estimates and then costs; then we could start adding policies about avoiding plans with too large a possible upper-bound cost. Trying to add such policy with no data to go on is not going to work well. I think Peter's point is that a quick-n-dirty patch is likely to make as many cases worse as it makes better. That's certainly my opinion about the topic. regards, tom lane
On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I think the point the original poster as making, and I have made in the > > past, is that even of two optimizer costs are the same, one might be > > more penalized by misestimation than the other, and we don't have a good > > way of figuring that into our plan choices. > > Agreed, but dealing with uncertainty in those numbers is an enormous > task if you want to do it right. "Doing it right", IMV, would start > out by extending all the selectivity estimation functions to include > error bars; then we could have error bars on rowcount estimates and > then costs; then we could start adding policies about avoiding plans > with too large a possible upper-bound cost. Trying to add such > policy with no data to go on is not going to work well. > > I think Peter's point is that a quick-n-dirty patch is likely to make > as many cases worse as it makes better. That's certainly my opinion > about the topic. Agreed on all points --- I was thinking error bars too. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
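For concreteness, a minimal sketch of the error-bars idea being agreed on here: carry a [lo, hi] interval next to each selectivity estimate and combine the intervals as clauses are multiplied up the join tree. The interval arithmetic assumes independent clauses, the numbers are made up, and nothing like this exists in the selectivity code today.

/*
 * Minimal sketch of selectivity estimates with error bars.  Illustrative
 * only; not existing PostgreSQL infrastructure.
 */
#include <stdio.h>

typedef struct Selectivity
{
    double est;     /* point estimate, as today */
    double lo;      /* plausible lower bound */
    double hi;      /* plausible upper bound */
} Selectivity;

static Selectivity
sel_and(Selectivity a, Selectivity b)
{
    /* Independence assumption: multiply the point estimates and both bounds. */
    Selectivity r = { a.est * b.est, a.lo * b.lo, a.hi * b.hi };
    return r;
}

int main(void)
{
    /* Two join clauses that are each individually well estimated. */
    Selectivity s1 = { 0.001, 0.0002, 0.005 };
    Selectivity s2 = { 0.002, 0.0004, 0.010 };
    Selectivity joint = sel_and(s1, s2);
    double cross_product = 1e6 * 1e6;   /* two 1M-row inputs */

    /*
     * The bounds widen multiplicatively, which is where a policy like
     * "avoid plans whose upper-bound cost is too large" could hook in.
     */
    printf("join rows: est %.0f, bounds %.0f .. %.0f\n",
           joint.est * cross_product,
           joint.lo * cross_product,
           joint.hi * cross_product);
    return 0;
}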
On Thu, Sep 29, 2022 at 4:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Agreed, but dealing with uncertainty in those numbers is an enormous > task if you want to do it right. "Doing it right", IMV, would start > out by extending all the selectivity estimation functions to include > error bars; then we could have error bars on rowcount estimates and > then costs; then we could start adding policies about avoiding plans > with too large a possible upper-bound cost. Trying to add such > policy with no data to go on is not going to work well. In general I suspect that we'd be better off focussing on mitigating the impact at execution time. There are at least a few things that we could do there, at least in theory. Mostly very ambitious, long term things. I like the idea of just avoiding unparameterized nested loop joins altogether when an "equivalent" hash join plan is available because it's akin to an execution-time mitigation, despite the fact that it happens during planning. While it doesn't actually change anything in the executor, it is built on the observation that we have virtually everything to gain and nothing to lose during execution, no matter what happens. It seems like a very small oasis of certainty in a desert of uncertainty -- which seems nice, as far as it goes. > I think Peter's point is that a quick-n-dirty patch is likely to make > as many cases worse as it makes better. That's certainly my opinion > about the topic. Right. Though I am actually sympathetic to the idea that users might gladly pay a cost for performance stability -- even a fairly large cost. That part doesn't seem like the problem. -- Peter Geoghegan
On Thu, Sep 29, 2022 at 07:51:47PM -0400, Bruce Momjian wrote: > On Thu, Sep 29, 2022 at 07:46:18PM -0400, Tom Lane wrote: > > Agreed, but dealing with uncertainty in those numbers is an enormous > > task if you want to do it right. "Doing it right", IMV, would start > > out by extending all the selectivity estimation functions to include > > error bars; then we could have error bars on rowcount estimates and > > then costs; then we could start adding policies about avoiding plans > > with too large a possible upper-bound cost. Trying to add such > > policy with no data to go on is not going to work well. > > > > I think Peter's point is that a quick-n-dirty patch is likely to make > > as many cases worse as it makes better. That's certainly my opinion > > about the topic. > > Agreed on all points --- I was thinking error bars too. Actually, if we wanted to improve things in this area, we should have a set of queries that don't choose optimal plans we can test with. We used to see them a lot before we had extended statistics, but I don't remember seeing many recently, let alone a collection of them. I guess that is good. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Fri, 30 Sept 2022 at 13:06, Peter Geoghegan <pg@bowt.ie> wrote: > I like the idea of just avoiding unparameterized nested loop joins > altogether when an "equivalent" hash join plan is available because > it's akin to an execution-time mitigation, despite the fact that it > happens during planning. While it doesn't actually change anything in > the executor, it is built on the observation that we have virtually > everything to gain and nothing to lose during execution, no matter > what happens. I'm not sure if it's a good idea to assume that performing non-parameterised Nested Loops when we shouldn't is the only shape of plan that causes us problems. We also have the case where we assume early start-up plans are favourable. For example: SELECT * FROM t WHERE a = 1 ORDER BY b LIMIT 10; where we have two indexes, one on t(a) and another on t(b). Should we use the t(b) index and filter out the rows that don't match a = 1 and hope we get 10 a=1 rows soon in the t(b) index? Or do we use t(a) and then perform a sort? Best case for using the t(b) index is that we find 10 a=1 rows in the first 10 rows of the index scan; the worst case is that there are no rows with a=1. Having something coded into the cost model is a more generic way of addressing this issue. Providing we design the cost model correctly, we'd be able to address future issues we discover using whichever cost model infrastructure we design for this. I understand that what you propose would be a fast way to fix this issue. However, if we went and changed the join path creation code to not add non-parameterised nested loop paths when other paths exist, then how could we ever dare to put that code back again when we come up with a better solution? David
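A back-of-the-envelope model of David's LIMIT example, with illustrative row counts and formulas: the t(b)-index plan expects to visit roughly limit / selectivity(a = 1) index rows before stopping, which degrades to the whole index when there are no matching rows at all, while the t(a)-index plan always fetches the matching rows and sorts them.

/*
 * Toy model of the early start-up plan risk in the LIMIT example above.
 * Row counts and formulas are illustrative only.
 */
#include <stdio.h>

int main(void)
{
    double table_rows = 10e6;
    double limit = 10.0;
    double selectivities[] = { 0.01, 0.0001, 0.0 };   /* fraction of a = 1 rows */

    for (int i = 0; i < 3; i++)
    {
        double sel = selectivities[i];

        /* Plan 1: walk the t(b) index in order, filter a = 1, stop at LIMIT. */
        double b_plan_rows = (sel > 0.0) ? limit / sel : table_rows;
        if (b_plan_rows > table_rows)
            b_plan_rows = table_rows;

        /* Plan 2: fetch all a = 1 rows via the t(a) index, then sort them. */
        double a_plan_rows = sel * table_rows;

        printf("sel=%.4f: t(b) plan visits ~%.0f rows, t(a) plan sorts ~%.0f rows\n",
               sel, b_plan_rows, a_plan_rows);
    }
    return 0;
}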
> Right. But that seems fraught with difficulty. I suspect that the > costs that the planner attributes to each plan often aren't very > reliable in any absolute sense, even when everything is working very > well by every available metric. Even a very noisy cost model with > somewhat inaccurate selectivity estimates will often pick the cheapest > plan, or close enough. Sure, the absolute cost of a complex plan will always be inaccurate at best. My point is that we can be very confident in the cardinalities of base tables. As the paper states in "3.1. Estimates for Base Tables": "The median q-error is close to the optimal value of 1 for all systems, indicating that the majority of all selections are estimated correctly." Thanks to the statistics we will practically never be off by an order of magnitude when estimating base table cardinalities. The paper also clearly shows (and that certainly coincides with my experience) that those cardinality underestimations grow exponentially as they propagate up the join tree. Given the research I'd stipulate that at any given level of the join tree, the current depth is a reasonable indicator of underestimation. Taking that into account (even if only to mitigate nested loops on higher levels) is IMV a principled approach, and not necessarily a hack. Obviously having something like error bars as proposed by Tom would be even better and perhaps more general, but that is on a whole different level in terms of complexity and I certainly have no idea how we would easily get there.
> In general I suspect that we'd be better off focussing on mitigating > the impact at execution time. There are at least a few things that we > could do there, at least in theory. Mostly very ambitious, long term > things. I think these things are orthogonal. No matter how good the cost model ever gets, we will always have degenerate cases. Having some smarts about that in the executor is surely a good thing, but it shouldn't distract us from improving on the planner front. > > I like the idea of just avoiding unparameterized nested loop joins > altogether when an "equivalent" hash join plan is available because > it's akin to an execution-time mitigation, despite the fact that it > happens during planning. While it doesn't actually change anything in > the executor, it is built on the observation that we have virtually > everything to gain and nothing to lose during execution, no matter > what happens. I agree with you that those plans are too risky. But let's maybe find a more general way of dealing with this. > Right. Though I am actually sympathetic to the idea that users might > gladly pay a cost for performance stability -- even a fairly large > cost. That part doesn't seem like the problem.
> Agreed, but dealing with uncertainty in those numbers is an enormous > task if you want to do it right. "Doing it right", IMV, would start > out by extending all the selectivity estimation functions to include > error bars; then we could have error bars on rowcount estimates and > then costs; then we could start adding policies about avoiding plans > with too large a possible upper-bound cost. Trying to add such > policy with no data to go on is not going to work well. Error bars would be fantastic, no question. But that would make things very complex. A lot of judgment calls would be necessary for the policy behind upper-bound pruning, picking up on Peter's comment about "conviction multiplier of conviction multiplier" ;) Also, the math in deriving those bounds based on the stats and how they propagate up the join tree doesn't seem trivial either. > I think Peter's point is that a quick-n-dirty patch is likely to make > as many cases worse as it makes better. That's certainly my opinion > about the topic. As in my reply to Peter, I think the join level/depth metric is a simple but principled way of dealing with it, given the referenced research. In the first step, we'd use this merely to be more risk-averse towards nested loop joins as we climb up the join tree - we are not fiddling with the cost model itself, nor the join ordering, just when it comes to considering that particular join algorithm. Later this could be expanded to be more broadly scoped. Please do not give up on a simple way to reap most of the fruits just yet.
> Actually, if we wanted to improve things in this area, we should have a > set of queries that don't choose optimal plans we can test with. We used > to see them a lot before we had extended statistics, but I don't > remember seeing many recently, let alone a collection of them. I guess > that is good. In the VLDB paper they actually created their own "Join Order Benchmark", which is publicly available under https://github.com/gregrahn/join-order-benchmark It would probably be more suited for this kind of testing than, e.g. the TPC benchmarks. If there is interest, I could also compile a set of relevant cases based on the message history of the performance mailing list.
On Thu, Sep 29, 2022 at 7:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Agreed, but dealing with uncertainty in those numbers is an enormous > task if you want to do it right. "Doing it right", IMV, would start > out by extending all the selectivity estimation functions to include > error bars; then we could have error bars on rowcount estimates and > then costs; then we could start adding policies about avoiding plans > with too large a possible upper-bound cost. Trying to add such > policy with no data to go on is not going to work well. I think that the point of the paper which started the thread that this discussion branched from was essentially that trying to add such a policy with no data to go on worked extremely well in practice, and other database systems are already doing it, and we're hurting ourselves by not doing it. And specifically what they did was to disfavor unparameterized nested loops. And I think that actually makes a lot of sense. If you think through parameterized nested loops, unparameterized nested loops, hash joins, and merge joins, which is basically all the strategies, and you imagine having many more or fewer rows on one side of the join or the other than you thought, unparameterized nested loops are the standout case. It's the only kind of join where the expense grows as the product of the sizes of the two inputs. The other cases tend to be more like O(n+m) rather than O(nm), and even if there are some lg n or lg m terms in there too they don't tend to make a whole lot of difference in practice. So I think what you would find if you did all of this analysis is that, basically, every time you did the cost computation for a parameterized nested loop, a hash join, or a merge join, the error bars would be whatever they were, and then when you did a cost computation for an unparameterized nested loop, the error bars would be way worse. Like, if we assume that the estimates for each side of a hash join are off by 100x, then the cost will be off by roughly 100x if it still fits in work_mem and by several hundred x if it now spills to disk. But if we assume the same thing for an unparameterized nested loop, the cost is now off by 10000x. And that is massively, massively more, so clearly we should avoid the unparameterized nested loop. But it's unnecessary to do this computation at runtime for every separate unparameterized nested loop: we can do it right here, in a generic way, for every such loop. Now, it's true that the actual risk depends on how certain we are of the estimates for the input rels, but that's difficult to quantify and I'm not convinced it really matters at all. Like, if the input is a base relation with no filter condition, then the error should be small, unless the table size changes a lot between planning and execution. If it's the output of a user-defined aggregate, the error could be really, really large. But that has no impact on the *relative* dangers of the unparameterized nested loop vs. some other join method. If join A is between two input rels whose sizes are probably known fairly precisely, and join B is between two input rels whose sizes we might be drastically wrong about, then a hash join is riskier for join B than it is for join A. But that really does not matter. What *does* matter is that an unparameterized nested loop is riskier for join A than a hash join is for join A; and likewise for join B. 
I think we're kind of just making life complicated for ourselves here by pretending that unparameterized nested loops are part of some general class of uncertainty problems that we need to worry about. In some sense, they are, and David Rowley is right to mention the other one that comes up pretty frequently. But like that list of two is pretty much the whole list. I think we've talked ourselves into believing that this problem is much harder than it really is. Maybe a blanket ban on unparameterized nested loops is too strong (or maybe it's exactly the right thing) but it can't possibly be wrong to think about that case in particular as something we need to solve. It's the only join method that can go quadratic in easy, realistic scenarios -- and it often does. -- Robert Haas EDB: http://www.enterprisedb.com
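To spell out the arithmetic in Robert's message with made-up numbers: when both input cardinalities are underestimated by 100x, a join whose cost grows as n + m is off by about 100x, while an unparameterized nested loop, whose cost grows as n * m, is off by 10000x.

/*
 * The misestimation asymmetry described above, with invented row counts.
 * This is plain arithmetic, not the planner's actual cost model.
 */
#include <stdio.h>

int main(void)
{
    double n_est = 1000.0, m_est = 1000.0;  /* estimated input rows */
    double err = 100.0;                     /* both sides are really 100x bigger */
    double n_act = n_est * err, m_act = m_est * err;

    /* Hash or merge join: roughly linear in the two input sizes. */
    printf("O(n+m): est %.0f, actual %.0f -> off by %.0fx\n",
           n_est + m_est, n_act + m_act, (n_act + m_act) / (n_est + m_est));

    /* Unparameterized nested loop: product of the two input sizes. */
    printf("O(n*m): est %.0f, actual %.0f -> off by %.0fx\n",
           n_est * m_est, n_act * m_act, (n_act * m_act) / (n_est * m_est));
    return 0;
}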
On Fri, Sep 30, 2022 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > But it's unnecessary to do this computation at runtime for every separate > unparameterized nested loop: we can do it right here, in a generic > way, for every such loop. It's not just that the risks are ludicrously high, of course. The potential benefits must *also* be very low. It's both factors, together. > Now, it's true that the actual risk depends on how certain we are of > the estimates for the input rels, but that's difficult to quantify and > I'm not convinced it really matters at all. We're talking about a problem that is fairly unlikely to occur in general, I think -- let's not forget that. These are presumably rare events that nevertheless cause many real practical problems. If we're going to add error bars, why wouldn't we also need error bars for our error bars? > I think we're kind of just making life complicated for ourselves here > by pretending that unparameterized nested loops are part of some > general class of uncertainty problems that we need to worry about. In > some sense, they are, and David Rowley is right to mention the other > one that comes up pretty frequently. But like that list of two is > pretty much the whole list. I think we've talked ourselves into > believing that this problem is much harder than it really is. +1 -- Peter Geoghegan
On Thu, Sep 29, 2022 at 9:00 PM David Rowley <dgrowleyml@gmail.com> wrote: > I understand that what you propose would be a fast way to fix this > issue. However, if we went and changed the join path creation code to > not add non-parameterised nested loop paths when other paths exist, > then how could we ever dare to put that code back again when we come > up with a better solution? But why would it matter, even then? I don't deny that something like that could make sense, but I don't see why it should be in tension with this proposal. We're talking about a plan shape that is (in practical terms) inherently unreasonable, given the availability of an alternative plan shape. Why wouldn't that continue to be true in every such case, forever? To put it another way, the proposal seems like taking away something that we don't want to have, ever. It seems like a subtractive thing to me. The potential upside of allowing unparameterized nestloop joins seems infinitesimal; zero for all practical purposes. So even with a far more sophisticated framework for "plan riskiness" in place, it would still make sense to treat unparameterized nestloop joins as inherently undesirable. There is perhaps a theoretical sense in which that isn't quite true, but it's true for all practical purposes, which should be enough. -- Peter Geoghegan
On Thu, Sep 29, 2022 at 11:44 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote: > I think these things are orthogonal. I agree that they're orthogonal. I just meant that execution time strategies seem underexplored in general. > No matter how good the cost model ever gets, we will always have degenerate cases. Sure, but the model isn't the problem here, really -- not to me. The problem is that the planner can in some cases choose a plan that is inherently unreasonable, at least in practical terms. You're talking about uncertainties. But I'm actually talking about the opposite thing -- certainty (albeit a limited kind of certainty that applies only to one narrow set of conditions). It's theoretically possible that bogosort will be faster than quicksort in some individual cases. After all, bogosort is O(n) in the best case, which is impossible to beat! But there is no practical sense in which bogosort could ever be better than quicksort. Having fewer choices by just not offering inherently bad choices seems quite unrelated to what you're talking about. For all I know you might be onto something. But it really seems independent to me. -- Peter Geoghegan
> Sure, but the model isn't the problem here, really -- not to me. The > problem is that the planner can in some cases choose a plan that is > inherently unreasonable, at least in practical terms. You're talking > about uncertainties. But I'm actually talking about the opposite thing > -- certainty (albeit a limited kind of certainty that applies only to > one narrow set of conditions). I absolutely agree and support your proposal to simply not generate those paths at all unless necessary. > For all I know you might be onto something. But it really seems > independent to me. Yeah, I'm sorry if I hijacked this thread for something related but technically different. I just wanted to expand on your proposal by taking into account the join depth and also not just talking about unparameterized nested loop joins. The research is very clear that the uncertainty is proportional to the join level, and that is the point I am trying to focus the discussion on. I really encourage everyone to read the VLDB paper. BTW, the unnamed proprietary DBMSs in that paper are the 3 big ones from Washington, California and NY, in that order.
On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It's not just that the risks are ludicrously high, of course. The potential benefits must *also* be very low. It's both factors, together.

Hmm, maybe. But it also wouldn't surprise me very much if someone can come up with a test case where a nested loop with a single row (or maybe no rows) on one side or the other is significantly faster than any alternative plan. I believe, though, that even if such cases exist, they are probably relatively few in number compared to the cases where unparameterized nested loops hurt, and the savings are probably small compared to the multiple-orders-of-magnitude slowdowns that you can get when a nested loop goes bad. But they might still be fairly large -- 2x, 3x? -- in relative terms.

In the prior discussion, the only person who showed a case in which he thought that an unparameterized nested loop might be a clear winner was Tom, but it was just sort of a general "this kind of case might be a problem" thing rather than a fully worked out example with real timings. Perhaps someone ought to try to characterize the kinds of cases he mentioned, to help us get a clearer feeling about what, if anything, we're gaining from the current scheme.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Oct 2, 2022 at 3:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 30, 2022 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > It's not just that the risks are ludicrously high, of course. The potential benefits must *also* be very low. It's both factors, together.
>
> Hmm, maybe. But it also wouldn't surprise me very much if someone can come up with a test case where a nested loop with a single row (or maybe no rows) on one side or the other is significantly faster than any alternative plan.

That's certainly possible, but wouldn't the difference all come from fixed startup costs? If we're talking about a single row, with a minimal test case, then the potential downside of this more conservative strategy might indeed amount to something like a 2x or 3x slowdown, if we look at it in isolation.

But why measure it that way? I think that absolute differences like milliseconds of execution time are much more relevant. Real production databases have many queries with very diverse characteristics -- there is a lot going on at any given moment. The proportion of queries that will be affected either way by avoiding unparameterized nested loop joins is probably going to be very small. Nobody is going to notice if only a small subset of all queries is maybe 1 ms or 2 ms slower. As long as it's well within the margin of noise in 100% of all cases, it really shouldn't matter.

AFAICT the choice is one of "bounded, low upside versus unbounded, high downside".

> I believe, though, that even if such cases exist, they are probably relatively few in number compared to the cases where unparameterized nested loops hurt, and the savings are probably small compared to the multiple-orders-of-magnitude slowdowns that you can get when a nested loop goes bad. But they might still be fairly large -- 2x, 3x? -- in relative terms.

I suspect it won't even matter if disallowing unparameterized nested loop joins loses on average. I am reminded of this:

https://en.wikipedia.org/wiki/St._Petersburg_paradox#Expected_utility_theory

--
Peter Geoghegan
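To put toy numbers on the bounded-upside point, consider the following back-of-the-envelope calculation. The formulas and constants are invented for illustration (this is not the planner's cost model), but they show why the gap in the single-row case comes out as a small constant factor of per-tuple overhead rather than orders of magnitude:

#include <stdio.h>

/*
 * Toy illustration of why the upside of an unparameterized nested loop is
 * bounded when the inner side really is tiny.  All constants and formulas
 * are simplified stand-ins, not PostgreSQL's costing code.
 */
int
main(void)
{
    double outer_rows = 1000.0;
    double inner_rows = 1.0;            /* the "single row" case */
    double scan_cost_per_row = 0.01;    /* assumed per-row scan cost */
    double hash_build_per_row = 0.02;   /* assumed hashing overhead */
    double hash_probe_per_row = 0.015;  /* assumed probe overhead */

    /* Nested loop: rescan the (tiny) inner once per outer row. */
    double nl = outer_rows * inner_rows * scan_cost_per_row;

    /* Single-batch hash join: build once, probe once per outer row. */
    double hj = inner_rows * hash_build_per_row +
                outer_rows * hash_probe_per_row;

    printf("nested loop: %.2f   hash join: %.2f   ratio: %.2fx\n",
           nl, hj, hj / nl);
    return 0;
}

With these made-up numbers the hash join is about 1.5x more expensive, which is the shape of the argument above: a bounded, modest relative penalty in the best case for the nested loop, versus an unbounded penalty when its cardinality estimate is wrong.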
On Fri, Sep 30, 2022 at 3:19 PM Benjamin Coutu <ben.coutu@zeyos.com> wrote:
> > For all I know you might be onto something. But it really seems independent to me.
>
> Yeah, I'm sorry if I hijacked this thread for something related but technically different.

I certainly wouldn't say that you hijacked the thread. I'm glad that you revived the discussion, in fact.

The way that I've framed the problem is probably at least somewhat controversial. In fact I'm almost certain that at least one or two people will flat out disagree with me. But even if everybody else already thought about unparameterized nested loop joins in the same terms, it might still be useful to make the arguments that you've made.

What I'm saying is that the probability of "getting it right" is virtually irrelevant in the case of these unparameterized nested loop join plans specifically. Any probability that's less than 1.0 is already unacceptable, more or less. A probability of 1.0 is never attainable in the real world, no matter what, so why should the true probability (whatever that means) matter at all? The appropriate course of action will still be "just don't do that, ever".

To me this dynamic seems qualitatively different to other cases, where we might want to give some weight to uncertainty. Understanding where the boundaries lie between those trickier cases and this simpler case seems important and relevant to me.

--
Peter Geoghegan
On 29/9/2022 21:32, Benjamin Coutu wrote:
> I'd like to revamp this important discussion.
>
> As is well described in this fairly recent paper here https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres), "estimation errors quickly grow as the number of joins increases, and that these errors are usually the reason for bad plans" - I think we can all get behind that statement.
>
> While nested loop joins work great when cardinality estimates are correct, they are notoriously bad when the optimizer underestimates, and they degrade very fast in such cases - the good vs. bad here is very asymmetric. On the other hand, hash joins degrade much more gracefully - they are considered very robust against underestimation. The above-mentioned paper illustrates that all major DBMS (including Postgres) tend to underestimate, and usually that underestimation increases drastically with the number of joins (see Figures 3+4 of the paper).
>
> Now, a simple approach to guarding against bad plans that arise from underestimation could be to use what I would call a nested-loop-conviction-multiplier based on the current depth of the join tree, e.g. for a base table that multiplier would obviously be 1, but would then grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew the cardinality estimates themselves, but rather be applied to the overall nested loop join cost at each particular stage of the plan when comparing it to other more robust join strategies like hash or sort-merge joins. That way, when we can be sure to have a good estimate at the bottom of the join tree we treat all things equal, but favor nested loops less and less as we move up the join tree, for the sake of robustness.
> Also, we can expand the multiplier whenever we fall back to using the default cardinality constant, as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance, and that could easily be incorporated by simply increasing the conviction-multiplier.
>
> What are your thoughts on this simple idea - is it perhaps too simple?

In my practice, a parameterized nested loop reduces execution time, sometimes drastically. If your query touches a lot of tables but extracts only a tiny part of the data, and you have good index coverage, PNL works great. Moreover, I have pondered extending parameterization through subqueries and groupings.

What could you say about a different way: hybrid join? In MS SQL Server, they have such a feature [1], and, according to their description, it requires low overhead. They start from HashJoin and switch to NestLoop if the inner input contains too few tuples. It solves the issue, doesn't it?

[1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

--
regards, Andrey Lepikhov
Postgres Professional
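As a thought experiment, the conviction-multiplier heuristic quoted above might look roughly like the sketch below. The quadratic growth rate, the extra penalty when a default selectivity was used, and all of the names are assumptions made purely for illustration; this is not planner code, and the cardinality estimates themselves are deliberately left untouched:

#include <stdio.h>

/*
 * Toy sketch of the "conviction multiplier": the nested loop's cost is
 * inflated by a factor that grows with the join depth before it is
 * compared against more robust strategies.  Illustrative values only.
 */
static double
nestloop_conviction_multiplier(int join_depth, int used_default_selectivity)
{
    double mult = 1.0 + 0.5 * (double) join_depth * (double) join_depth;

    /* All bets are off once we fell back to a default selectivity. */
    if (used_default_selectivity)
        mult *= 10.0;
    return mult;
}

static const char *
choose_join(double nestloop_cost, double hashjoin_cost,
            int join_depth, int used_default_selectivity)
{
    double penalized = nestloop_cost *
        nestloop_conviction_multiplier(join_depth, used_default_selectivity);

    return (penalized < hashjoin_cost) ? "nestloop" : "hashjoin";
}

int
main(void)
{
    /* Same raw costs, compared at increasing join depth. */
    for (int depth = 0; depth <= 4; depth++)
        printf("depth %d: %s\n", depth,
               choose_join(900.0, 1000.0, depth, 0));
    return 0;
}

With these particular constants, the same raw costs flip from favoring the nested loop at the bottom of the join tree to favoring the hash join one level up, which is the asymmetry Benjamin is after.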
On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> What could you say about a different way: hybrid join? In MS SQL Server, they have such a feature [1], and, according to their description, it requires low overhead. They start from HashJoin and switch to NestLoop if the inner input contains too few tuples. It solves the issue, doesn't it?

A complexity which you may not be considering here is that Nested Loop joins always preserve the tuple order from the outer side of the join, whereas hash joins will not do this when multi-batching. I've no idea how the SQL Server engineers solved that.

David

> [1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411
On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:
> On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>> What could you say about a different way: hybrid join? In MS SQL Server, they have such a feature [1], and, according to their description, it requires low overhead. They start from HashJoin and switch to NestLoop if the inner input contains too few tuples. It solves the issue, doesn't it?
>
> A complexity which you may not be considering here is that Nested Loop joins always preserve the tuple order from the outer side of the join, whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL guys did: prefetch some predefined number of tuples from the HashJoin inner and, if the planner has made a mistake and overestimated it, move the hash join's inner to the NestLoop as its outer.

The opposite strategy for underestimation -- starting with a NestLoop and switching to a HashJoin -- looks more difficult, but the main question is: is it worth researching?

> I've no idea how the SQL Server engineers solved that.
>> [1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

--
Regards, Andrei Lepikhov
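A minimal sketch of that prefetch-and-switch idea, loosely modeled on the adaptive join described in [1]: buffer up to a threshold of inner tuples, and only commit to building the hash table once the threshold is exceeded. Everything here (Tuple, the stub row source, the threshold, the join entry points) is hypothetical and not PostgreSQL executor code:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define ADAPTIVE_THRESHOLD 4   /* switch point; an assumption, not a tuned value */

typedef struct Tuple { int id; } Tuple;

/* Stub row source for the join's inner (would-be hash build) side. */
static const Tuple inner_rel[] = {{1}, {2}};
static size_t inner_pos = 0;

static bool
fetch_inner_tuple(Tuple *out)
{
    if (inner_pos >= sizeof(inner_rel) / sizeof(inner_rel[0]))
        return false;
    *out = inner_rel[inner_pos++];
    return true;
}

/* Stub join strategies: real code would join against the outer side here. */
static void
run_nested_loop(const Tuple *outer_rows, size_t nrows)
{
    printf("nested loop, using %zu buffered inner rows as its outer side\n", nrows);
}

static void
run_hash_join(const Tuple *buffered, size_t nbuffered)
{
    printf("hash join, hash table seeded with %zu buffered rows\n", nbuffered);
}

static void
adaptive_join(void)
{
    Tuple   buffer[ADAPTIVE_THRESHOLD];
    size_t  n = 0;

    /* Prefetch up to the threshold from the would-be hash build side. */
    while (n < ADAPTIVE_THRESHOLD && fetch_inner_tuple(&buffer[n]))
        n++;

    if (n < ADAPTIVE_THRESHOLD)
        /* Inner turned out tiny: reuse the buffered rows as the outer side
         * of a nested loop, as in the proposal above. */
        run_nested_loop(buffer, n);
    else
        /* Inner is at least threshold-sized: fall through to the hash join,
         * seeding the build with the rows already buffered. */
        run_hash_join(buffer, n);
}

int
main(void)
{
    adaptive_join();
    return 0;
}

As David points out above, a real implementation would also have to ensure the switch cannot change properties the upper plan nodes rely on, such as the tuple ordering that a multi-batch hash join would not preserve.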