On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?
I'm interested in what Jim has to say, but to me it seems like we
should try to find a way to add a tlist entry for the hash value to
avoid recomputing it. That's likely to require some tricky planner
surgery, but it's probably doable.
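To make the reuse idea concrete, here's a minimal standalone sketch --
not the actual executor or planner API; AnnotatedTuple, hash_join_key,
bloom_probe, and hashtable_probe are all hypothetical stand-ins. The
point is just: hash once where the tuple is produced (conceptually, the
extra tlist entry), carry the hash with the tuple, and let both the
pushed-down filter and the join probe consume it.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical carrier: a tuple annotated with its precomputed join-key
 * hash, standing in for an extra (junk) tlist entry on the outer path. */
typedef struct AnnotatedTuple
{
    void     *tuple;    /* the emitted tuple */
    uint32_t  keyhash;  /* hash of the join key, computed once at the scan */
} AnnotatedTuple;

extern uint32_t hash_join_key(void *tuple);                 /* assumption */
extern bool bloom_probe(const void *filter, uint32_t h);    /* assumption */
extern void *hashtable_probe(void *ht, uint32_t h,
                             void *tuple);                  /* assumption */

/* At the scan: hash once, filter early. */
static bool
scan_emit(void *tuple, const void *bloom, AnnotatedTuple *out)
{
    uint32_t h = hash_join_key(tuple);

    if (!bloom_probe(bloom, h))
        return false;          /* tuple can't match; reject it cheaply */

    out->tuple = tuple;
    out->keyhash = h;          /* reused at the Hash Join, no rehash */
    return true;
}

/* At the join: probe with the carried hash instead of recomputing. */
static void *
join_probe(void *hashtable, const AnnotatedTuple *at)
{
    return hashtable_probe(hashtable, at->keyhash, at->tuple);
}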
What really seems finicky to me about this whole project is the
costing. In the best case it's a huge win; in the worst case it's a
significant loss; and whether it's a gain or a loss is not easy to
figure out from the information we have available. We generally do
not have an accurate estimate of the number of distinct values we're
likely to see, and that number matters a lot here.
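As a back-of-envelope illustration of why that estimate matters (this
is not PostgreSQL's cost model; every parameter name below is made
up): the filter pays a hash-and-probe on every outer row, and wins
back the per-row join work only for rows it correctly rejects. The
false-positive rate for a filter of m bits, k hash functions, and n
distinct inner keys is the standard (1 - e^(-k*n/m))^k, so a bad
n_distinct or match-fraction estimate can flip the decision.

#include <math.h>
#include <stdbool.h>

/* Standard Bloom filter false-positive rate. */
static double
bloom_fpr(double m_bits, double k_hashes, double n_distinct)
{
    return pow(1.0 - exp(-k_hashes * n_distinct / m_bits), k_hashes);
}

/*
 * match_frac: fraction of outer rows with a true match on the inner
 * side -- exactly the quantity our statistics estimate poorly.
 */
static bool
bloom_filter_wins(double outer_rows, double match_frac,
                  double probe_cost, double saved_join_cost,
                  double m_bits, double k_hashes, double n_distinct)
{
    double fpr = bloom_fpr(m_bits, k_hashes, n_distinct);
    double filtered_frac = (1.0 - match_frac) * (1.0 - fpr);

    double cost = outer_rows * probe_cost;
    double benefit = outer_rows * filtered_frac * saved_join_cost;

    return benefit > cost;
}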
Worse, when you start to consider pushdown, you realize that the cost
of the scan depends on the Bloom filter we push down to it. So
consider something like A IJ B IJ C (inner joins). It could be that
once we decide to do the A-B join as a hash join with a Bloom filter,
it also makes sense to do the join to C as a hash join and push its
Bloom filter down, because we can apply both filters in the same scan
and the extra probes will be basically free. But if we weren't
already doing the A-B join with a Bloom filter, then maybe a filter
wouldn't make sense for C either.
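Sketch of why the second filter is nearly free once the scan is
already paying for the first (again hypothetical names; hash_ab_key,
hash_c_key, and bloom_probe are not real functions): the second probe
costs one more hash-and-lookup only on rows that survive the first
probe, with no extra pass over the relation.

#include <stdbool.h>
#include <stdint.h>

extern uint32_t hash_ab_key(void *tuple);  /* assumption: A-B join key */
extern uint32_t hash_c_key(void *tuple);   /* assumption: C join key   */
extern bool bloom_probe(const void *filter, uint32_t h);  /* assumption */

/* Apply both pushed-down filters in one pass, short-circuiting. */
static bool
scan_qual(void *tuple, const void *bloom_ab, const void *bloom_c)
{
    if (bloom_ab != NULL && !bloom_probe(bloom_ab, hash_ab_key(tuple)))
        return false;
    if (bloom_c != NULL && !bloom_probe(bloom_c, hash_c_key(tuple)))
        return false;
    return true;
}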
Maybe I'm worrying over nothing here, or about the wrong things, but
costing this well enough to avoid regressions really looks hard.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company