Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Hash Joins vs. Bloom Filters / take 2
Date
Msg-id CA+Tgmoa4M4tOv93EM10CcMJ0h0T1mp9fZm9bpnhsh7qOsY_q+Q@mail.gmail.com
In response to Re: Hash Joins vs. Bloom Filters / take 2  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Hash Joins vs. Bloom Filters / take 2  (Jim Finnerty <jfinnert@amazon.com>)
List pgsql-hackers
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?

I'm interested in what Jim has to say, but to me it seems like we
should try to find a way to add a tlist entry for the hash value to
avoid recomputing it.  That's likely to require some tricky planner
surgery, but it's probably doable.
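
To make the idea concrete, here is a rough sketch (in Python rather than
executor C, and not a proposal for the actual planner change) of a scan that
computes the hash once, uses it for the filter check, and hands it upward
with the row so the join can probe without rehashing.  All names here
(hash_key, build, scan_with_filter, hash_join) are made up for illustration,
and the "bloom filter" is a single-hash bitmap to keep it short.

```python
from collections import defaultdict

def hash_key(key):
    """Stand-in for the join's hash function; counts how often it runs."""
    hash_key.calls += 1
    return hash(key) & 0xFFFFFFFF

hash_key.calls = 0

def build(inner_rows, key_index, nbits=1024):
    """Build the hash table and a one-hash bitmap filter from the inner side."""
    table = defaultdict(list)
    bloom = 0  # bitmap stored in a Python int
    for row in inner_rows:
        h = hash_key(row[key_index])
        table[h].append(row)
        bloom |= 1 << (h % nbits)
    return table, bloom

def scan_with_filter(outer_rows, key_index, bloom, nbits=1024):
    """Outer scan: hash once, test the filter, emit (row, hash) pairs."""
    for row in outer_rows:
        h = hash_key(row[key_index])
        if (bloom >> (h % nbits)) & 1:
            # The hash rides along like an extra target-list column.
            yield row, h

def hash_join(filtered, table, outer_key, inner_key):
    """Probe with the hash carried up from the scan; no rehashing here."""
    for row, h in filtered:
        for inner in table.get(h, ()):
            if inner[inner_key] == row[outer_key]:
                yield row + inner
```

With this shape the hash function runs once per inner row and once per outer
row, regardless of how many outer rows survive the filter.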

What really seems finicky to me about this whole project is the
costing.  In the best case it's a huge win; in the worst case it's a
significant loss; and whether it's a gain or a loss is not easy to
figure out from the information that we have available.  We generally
do not have an accurate count of the number of distinct values we're
likely to see (which is important).
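
A toy model shows why the estimate matters.  The false-positive formula below
is the standard approximation for a bloom filter with m bits, k hash
functions and n distinct keys; everything else (the cost constants, the
function names, the shape of the cost formula) is an assumption made up for
illustration and has nothing to do with the real costing code.

```python
import math

def bloom_fpr(ndistinct, nbits, nhashes):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-nhashes * ndistinct / nbits)) ** nhashes

def probe_cost(n_outer, match_frac, fpr, c_filter, c_probe, use_filter):
    """Hypothetical per-join cost, in arbitrary units.

    match_frac: fraction of outer tuples that really have a join partner.
    c_filter:   assumed cost of one filter check.
    c_probe:    assumed cost of passing a tuple up and probing the hash table.
    """
    if not use_filter:
        return n_outer * c_probe
    # Tuples that pass: true matches plus false positives among non-matches.
    pass_frac = match_frac + (1.0 - match_frac) * fpr
    return n_outer * (c_filter + pass_frac * c_probe)

if __name__ == "__main__":
    # Same filter size, but the actual ndistinct is 10x the estimate.
    for n in (1_000, 10_000):
        fpr = bloom_fpr(n, nbits=8192, nhashes=2)
        for match_frac in (0.01, 1.0):
            with_f = probe_cost(1e6, match_frac, fpr, 0.2, 1.0, True)
            without = probe_cost(1e6, match_frac, fpr, 0.2, 1.0, False)
            print(n, match_frac, round(fpr, 4), with_f, without)
```

In this model the filter is pure overhead when nearly every outer tuple
matches, and a filter sized for an underestimated distinct count lets far
more non-matching tuples through than planned.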

Worse, when you start to consider pushdown, you realize that the cost
of the scan depends on the bloom filter we push down to it.  So
consider something like A IJ B IJ C.  It seems like it could be the
case that once we decide to do the A-B join as a hash join with a
bloom filter, it makes sense to also do the join to C as a hash join
and push down the bloom filter, because we'll be able to combine the
two filters and the extra probes will be basically free.  But if we
weren't already doing the A-B join with a bloom filter, then maybe the
filter wouldn't make sense for C either.
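
One possible reading of "combine the two filters", sketched very loosely: the
scan of A is handed a list of pushed-down filters, one per join, and checks
them all in a single pass, so a second filter adds only one more bit test per
row that survived the first.  The names (make_filter, scan) and the
single-hash bitmap are hypothetical simplifications, not how an actual patch
would have to do it.

```python
def make_filter(keys, nbits=1024):
    """One-hash bitmap over the inner side's join keys (simplified bloom)."""
    bits = 0
    for k in keys:
        bits |= 1 << (hash(k) % nbits)
    return bits

def scan(rows, pushed_filters, nbits=1024):
    """Scan applying every pushed-down filter in one pass.

    pushed_filters is a list of (column_index, filter_bits) pairs, one per
    join that chose to push a filter down to this scan.
    """
    for row in rows:
        # all() short-circuits: later filters only see rows that passed
        # the earlier ones.
        if all((bits >> (hash(row[col]) % nbits)) & 1
               for col, bits in pushed_filters):
            yield row

if __name__ == "__main__":
    a_rows = [(1, 10), (2, 20), (3, 30), (4, 40)]
    b_filter = make_filter([1, 2, 3])      # from the A-B join, on column 0
    c_filter = make_filter([20, 30, 50])   # from the join to C, on column 1
    print(list(scan(a_rows, [(0, b_filter)])))
    print(list(scan(a_rows, [(0, b_filter), (1, c_filter)])))
```

The costing problem is visible even here: what the scan of A costs and how
many rows it emits depend on which joins above it decided to push a filter
down, so the joins can't be costed independently of each other.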

Maybe I'm worrying over nothing here, or the wrong things, but costing
this well enough to avoid regressions really looks hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company