Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Hash Joins vs. Bloom Filters / take 2
Date
Msg-id CAGTBQpYioi71inVQSFMtP2BVBE_PUd6Goh2nfTU+wQHG36Yzrw@mail.gmail.com
Whole thread Raw
In response to Re: Hash Joins vs. Bloom Filters / take 2  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Hash Joins vs. Bloom Filters / take 2  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.


pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Next
From: Peter Geoghegan
Date:
Subject: Re: Hash Joins vs. Bloom Filters / take 2