Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers

From Robert Haas
Subject Re: disfavoring unparameterized nested loops
Date
Msg-id CA+Tgmob3Cpm5dSYwLfT7F7cHMtOfeZv-Kni_hxDxT7vyzUC=nw@mail.gmail.com
Whole thread Raw
In response to Re: disfavoring unparameterized nested loops  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: disfavoring unparameterized nested loops
List pgsql-hackers
On Thu, Sep 29, 2022 at 7:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

I think that the point of the paper which started the thread that this
discussion branched from was essentially that trying to add such a
policy with no data to go on worked extremely well in practice, and
other database systems are already doing it, and we're hurting
ourselves by not doing it. And specifically what they did was to
disfavor unparameterized nested loops.

And I think that actually makes a lot of sense. If you think through
parameterized nested loops, unparameterized nested loops, hash joins,
and merge joins, which is basically all the strategies, and you
imagine having many more or fewer rows on one side of the join or the
other than you thought, unparameterized nested loops are the standout
case. It's the only kind of join where the expense grows as the
product of the sizes of the two inputs. The other cases tend to be
more like O(n+m) rather than O(nm), and even if there are some lg n or
lg m terms in there too they don't tend to make a whole lot of
difference in practice.

So I think what you would find if you did all of this analysis is
that, basically, every time you did the cost computation for a
parameterized nested loop, a hash join, or a merge join, the error
bars would be whatever they were, and then when you did a cost
computation for an unparameterized nested loop, the error bars would
be way worse. Like, if we assume that the estimates for each side of a
hash join are off by 100x, then the cost will be off by roughly 100x
if it still fits in work_mem and by several hundred x if it now spills
to disk. But if we assume the same thing for an unparameterized nested
loop, the cost is now off by 10000x. And that is massively, massively
more, so clearly we should avoid the unparameterized nested loop. But
it's unnecessary to do this computation at runtime for every separate
unparameterized nested loop: we can do it right here, in a generic
way, for every such loop.

Now, it's true that the actual risk depends on how certain we are of
the estimates for the input rels, but that's difficult to quantify and
I'm not convinced it really matters at all. Like, if the input is a
base relation with no filter condition, then the error should be
small, unless the table size changes a lot between planning and
execution. If it's the output of a user-defined aggregate, the error
could be really, really large. But that has no impact on the
*relative* dangers of the unparameterized nested loop vs. some other
join method. If join A is between two input rels whose sizes are
probably known fairly precisely, and join B is between two input rels
whose sizes we might be drastically wrong about, then a hash join is
riskier for join B than it is for join A. But that really does not
matter. What *does* matter is that an unparameterized nested loop is
riskier for join A than a hash join is for join A; and likewise for
join B.

I think we're kind of just making life complicated for ourselves here
by pretending that unparameterized nested loops are part of some
general class of uncertainty problems that we need to worry about. In
some sense, they are, and David Rowley is right to mention the other
one that comes up pretty frequently. But like that list of two is
pretty much the whole list. I think we've talked ourselves into
believing that this problem is much harder than it really is. Maybe a
blanket ban on unparameterized nested loops is too strong (or maybe
it's exactly the right thing) but it can't possibly be wrong to think
about that case in particular as something we need to solve. It's the
only join method that can go quadratic in easy, realistic scenarios --
and it often does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [PATCH v1] [meson] add a default option prefix=/usr/local/pgsql
Next
From: "Drouvot, Bertrand"
Date:
Subject: Re: [PATCH] Add peer authentication TAP test