On Mon, Jun 21, 2021 at 6:41 AM David Rowley <dgrowleyml@gmail.com> wrote:
> For example, in an unparameterized Nested Loop that estimates the
> outer Path to have 1 row will cost an entire additional inner cost if
> there are 2 rows. With Hash Join the cost is just an additional
> hashtable lookup, which is dead cheap. I don't know exactly how
> add_path() would weigh all that up, but it seems to me that I wouldn't
> take the risk unless I was 100% certain that the Nested Loop's outer
> Path would only return 1 row exactly, if there was any chance at all
> it could return more, I'd be picking some other join method.
It seems like everyone agrees that it would be good to do something
about this problem, but the question is whether it's best to do
something that tries to be general, or whether we should instead do
something about this specific case. I favor the latter approach. Risk
and uncertainty exist all over the place, but dealing with that in a
general way seems difficult, and maybe unnecessary. Addressing the
case of unparameterized nest loops specifically seems simpler, because
it's easier to reason about what the alternatives are. Your last
sentence here seems right on point to me.
Basically, what you argue for there is disabling unparameterized
nested loops entirely except when we can prove that the inner side
will never generate more than one row. But, that's almost never going
to be something that we can prove. If the inner side is coming from a
table or sub-join, it can turn out to be big. As far as I can see, the
only way that this doesn't happen is if it's something like a subquery
that aggregates everything down to one row, or has LIMIT 1, but those
are such special cases that I don't even think we should be worrying
about them.
So my counter-proposal is: how about if we split
merge_unsorted_outer() into two functions, one of which generates
nested loops only based on parameterized paths and the other of which
generates nested loops based only on unparameterized paths, and then
rejigger add_paths_to_joinrel() so that we do the latter between the
steps that are currently number 5 and 6 and only if we haven't got any
other paths yet? If somebody later comes up with logic for proving
that the inner side can never have more than 1 row, we can let this be
run in those cases as well. In the meantime, if somebody does
something like a JOIN b ON a.x < b.x, we will still generate these
paths because there's no other approach, or similarly if it's a.x =
b.x but for some strange type that doesn't have a hash-joinable or
merge-joinable equality operator. But otherwise we omit those paths
entirely.
--
Robert Haas
EDB: http://www.enterprisedb.com