Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Hash Joins vs. Bloom Filters / take 2
Msg-id CAH2-Wzn9xgoo4Xu624wKmsE4B20e77Ew0BW_d9tUor4OFr69JQ@mail.gmail.com
In response to Hash Joins vs. Bloom Filters / take 2  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In 2015/2016 I explored whether we could improve hash joins by
> leveraging bloom filters [1], and I was reminded of this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Cool.

> The issues are essentially about predicting when the bloom filter can
> improve anything, which is more a question of the hash join selectivity
> than the bloom filter parameters.

Absolutely. A Bloom filter could be considered an opportunistic thing.
The false positive rate that you estimate is going to be based on the
planner's rowcount estimate for the inner side of the join, which is
liable to be wrong anyway. It's important to be resilient to
misestimations, which Bloom filters are good at.

> This is not the same as the "false positive rate", which is a feature of the
> bloom filter itself, and can be measured by simply counting bits set in
> the filter. That is important too, of course, but simple to determine.

Perhaps I'm being pedantic, but it's not exactly true that you can
measure the false positive rate by counting the bits. I do agree with
what I think you meant here, though, which is that the precise false
positive rate is not necessarily that important, while the selectivity
of the join is very important.

In general, hash joins work best when the join qual is highly
selective -- that is, when most outer-side rows turn out to have no
match. This is not the same thing as having a small hash table, of
course.

> After thinking about this a bit more, I think this is pretty much what
> we do during join cardinality estimation - we estimate how many of the
> rows in "fact" have a matching row in "dim". I do think we can reuse
> that to decide whether to use a bloom filter.
>
> Of course, the decision may need to be more elaborate (and perhaps
> formulated as part of costing), but the filter selectivity is the
> crucial piece. The attached patch does not do that yet, though.

I suspect that it could make sense to use a Bloom filter to summarize
the entire inner side of the join all at once, even when there are
multiple batches. I also suspect that this is particularly beneficial
with parallel hash joins, where IPC/synchronization overhead can be a
big problem.
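
To make this concrete, here is a toy standalone C sketch (invented
names and toy hashes -- none of this is PostgreSQL code): one filter
summarizes every inner-side hash value regardless of which batch it
falls into, and each outer-side hash is tested against the filter
before we go to the expense of writing a batch file or probing the
hash table.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy Bloom filter with two probes per element; nbits is a power of 2 */
typedef struct
{
    uint64_t   *words;
    uint64_t    nbits;
} BloomFilter;

/* splitmix64 finalizer -- derives a second hash from the join hash */
static uint64_t
rehash(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static void
bloom_add(BloomFilter *f, uint64_t hash)
{
    uint64_t    i1 = hash & (f->nbits - 1);
    uint64_t    i2 = rehash(hash) & (f->nbits - 1);

    f->words[i1 >> 6] |= 1ULL << (i1 & 63);
    f->words[i2 >> 6] |= 1ULL << (i2 & 63);
}

static bool
bloom_test(const BloomFilter *f, uint64_t hash)
{
    uint64_t    i1 = hash & (f->nbits - 1);
    uint64_t    i2 = rehash(hash) & (f->nbits - 1);

    return (f->words[i1 >> 6] & (1ULL << (i1 & 63))) &&
           (f->words[i2 >> 6] & (1ULL << (i2 & 63)));
}

int
main(void)
{
    /* 2^20 bits = 128kB -- far smaller than a typical hash table */
    BloomFilter f = {calloc(1 << 14, sizeof(uint64_t)), 1 << 20};
    int         skipped = 0;

    /* build: every inner tuple's hash goes in, whatever its batch */
    for (uint64_t k = 0; k < 100000; k++)
        bloom_add(&f, rehash(k * 3));

    /* probe: reject outer tuples before any batching or probing */
    for (uint64_t k = 0; k < 100000; k++)
        if (!bloom_test(&f, rehash(k * 7)))
            skipped++;
    printf("outer tuples rejected without touching the join: %d\n", skipped);
    free(f.words);
    return 0;
}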

> 4) Initially, the patch was aimed at hashjoins with batches, on the
> premise that it can save some of the serialize/deserialize costs. But as
> mentioned in the discussion, bloom filters can be beneficial even in the
> single-batch case, when three conditions are met:
>
> a) the join is selective (i.e. some rows don't have a match in the hash table)
>
> b) hash table > CPU cache
>
> c) bloom filter < CPU cache

Seems plausible. CPU cache efficiency is also affected by how many
hash functions you end up using -- use too many, and it slows things
down.

> We don't have a good way to determine the size of the CPU cache, but I think
> we can use some crude arbitrary limits. E.g. require that the hash
> table is larger than 8MB and the bloom filter is smaller than 4MB, or
> something like that.

FWIW, the logical way to size the Bloom filter is based on the number
of (distinct) values, not based on the projected size of the hash
table. The lossy nature of Bloom filters makes the average size of the
original, hashed elements irrelevant to a sizing calculation that
targets a certain final false positive rate. Making Bloom filter size
any fixed fraction of hash table size seems wrong to me for that
reason alone.
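
To spell out what I mean by sizing on distinct values: the textbook
formulas need nothing but the expected number of distinct elements n
and a target false positive rate p. A minimal sketch (made-up function
name, not from the patch; compile with -lm):

#include <math.h>
#include <stdio.h>

/*
 * Standard Bloom filter sizing:
 *
 *    m = -n * ln(p) / (ln 2)^2      total bits in the filter
 *    k = (m / n) * ln 2             number of hash functions
 */
static void
bloom_sizing(double n, double p, double *m_bits, int *k)
{
    *m_bits = -n * log(p) / (M_LN2 * M_LN2);
    *k = (int) round((*m_bits / n) * M_LN2);
}

int
main(void)
{
    double      m_bits;
    int         k;

    /* e.g. 1 million distinct values at a 1% false positive rate */
    bloom_sizing(1e6, 0.01, &m_bits, &k);
    printf("%.2f MiB, %d hash functions\n",
           m_bits / 8 / (1024 * 1024), k);   /* ~1.14 MiB, 7 functions */
    return 0;
}

Note that the width of the hashed values never appears in the
calculation -- only n and p matter, which is the whole point.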

You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that even the smallest
possible hash table of 40 million elements would require far more memory
than 22.85MB -- certainly closer to 20x than to 8x as much. Even that is
pretty conservative. I bet there are plenty of hash joins where the ratio
could be 30x or more. Not to mention cases with multiple batches.
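
(Those numbers fall straight out of the sizing sketch above: with n =
40 million and p = 0.10, m = -n * ln(0.1) / (ln 2)^2 is about 191.7
million bits, or about 22.85MB, and k = (m / n) * ln 2 is about 3.3,
hence 3 hash functions.)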

-- 
Peter Geoghegan

