Re: disfavoring unparameterized nested loops - Mailing list pgsql-hackers

From Robert Haas
Subject Re: disfavoring unparameterized nested loops
Date
Msg-id CA+TgmoapLgEee-LGzVJh250ZM6vhZVo95T-eit=iBCGHo_HTYQ@mail.gmail.com
Whole thread Raw
In response to Re: disfavoring unparameterized nested loops  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: disfavoring unparameterized nested loops
List pgsql-hackers
On Mon, Jun 21, 2021 at 6:41 AM David Rowley <dgrowleyml@gmail.com> wrote:
> For example, in an unparameterized Nested Loop that estimates the
> outer Path to have 1 row will cost an entire additional inner cost if
> there are 2 rows.  With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap.   I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.

It seems like everyone agrees that it would be good to do something
about this problem, but the question is whether it's best to do
something that tries to be general, or whether we should instead do
something about this specific case. I favor the latter approach. Risk
and uncertainty exist all over the place, but dealing with that in a
general way seems difficult, and maybe unnecessary. Addressing the
case of unparameterized nest loops specifically seems simpler, because
it's easier to reason about what the alternatives are. Your last
sentence here seems right on point to me.

Basically, what you argue for there is disabling unparameterized
nested loops entirely except when we can prove that the inner side
will never generate more than one row. But, that's almost never going
to be something that we can prove. If the inner side is coming from a
table or sub-join, it can turn out to be big. As far as I can see, the
only way that this doesn't happen is if it's something like a subquery
that aggregates everything down to one row, or has LIMIT 1, but those
are such special cases that I don't even think we should be worrying
about them.

So my counter-proposal is: how about if we split
merge_unsorted_outer() into two functions, one of which generates
nested loops only based on parameterized paths and the other of which
generates nested loops based only on unparameterized paths, and then
rejigger add_paths_to_joinrel() so that we do the latter between the
steps that are currently number 5 and 6 and only if we haven't got any
other paths yet? If somebody later comes up with logic for proving
that the inner side can never have more than 1 row, we can let this be
run in those cases as well. In the meantime, if somebody does
something like a JOIN b ON a.x < b.x, we will still generate these
paths because there's no other approach, or similarly if it's a.x =
b.x but for some strange type that doesn't have a hash-joinable or
merge-joinable equality operator. But otherwise we omit those paths
entirely.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Next
From: Robert Haas
Date:
Subject: Re: Use simplehash.h instead of dynahash in SMgr