Re: Hybrid Hash/Nested Loop joins and caching results from subplans - Mailing list pgsql-hackers

From David Rowley
Subject Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date
Msg-id CAApHDvr=qNLbYV=P3tk-xQ9S25MvKZFekLeOKC4kH3heQ48=Yw@mail.gmail.com
Whole thread Raw
In response to Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (Zhihong Yu <zyu@yugabyte.com>)
Responses Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On Wed, 31 Mar 2021 at 05:34, Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> In paraminfo_get_equal_hashops(),
>
> +       /* Reject if there are any volatile functions */
> +       if (contain_volatile_functions(expr))
> +       {
>
> You can move the above code to just ahead of:
>
> +       if (IsA(expr, Var))
> +           var_relids = bms_make_singleton(((Var *) expr)->varno);
>
> This way, when we return early, var_relids doesn't need to be populated.

Thanks for having another look.  I did a bit more work in that area
and removed that code. I dug a little deeper and I can't see any way
that a lateral_var on a rel can refer to anything inside the rel.  It
looks like that code was just a bit over paranoid about that.

I also added some additional caching in RestrictInfo to cache the hash
equality operator to use for the result cache. This saves checking
this each time we consider a join during the join search. In many
cases we would have used the value cached in
RestrictInfo.hashjoinoperator, however, for non-equaliy joins, that
would have be set to InvalidOid. We can still use Result Cache for
non-equality joins.

I've now pushed the main patch.

There's a couple of things I'm not perfectly happy with:

1.  The name.  There's a discussion on [1] if anyone wants to talk about that.
2.  For lateral joins, there's no place to cache the hash equality
operator.  Maybe there's some rework to do to add the ability to check
things for those like we use RestrictInfo for regular joins.
3.  No ability to cache n_distinct estimates.  This must be repeated
each time we consider a join.  RestrictInfo allows caching for this to
speed up clauselist_selectivity() for other join types.

There was no consensus reached on the name of the node. "Tuple Cache"
seems like the favourite so far, but there's not been a great deal of
input. At least not enough that I was motivated to rename everything.
People will perhaps have more time to consider names during beta.

Thank you to everyone who gave input and reviewed this patch. It would
be great to get feedback on the performance with real workloads. As
mentioned in the commit message, there is a danger that it causes
performance regressions when n_distinct estimates are significantly
underestimated.

I'm off to look at the buildfarm now.

David

[1] https://www.postgresql.org/message-id/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com



pgsql-hackers by date:

Previous
From: Zhihong Yu
Date:
Subject: Re: using extended statistics to improve join estimates
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Issue with point_ops and NaN