Home > mailing lists

Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From	Claudio Freire
Subject	Re: Hash Joins vs. Bloom Filters / take 2
Date	February 21, 2018 05:17:28
Msg-id	CAGTBQpYioi71inVQSFMtP2BVBE_PUd6Goh2nfTU+wQHG36Yzrw@mail.gmail.com Whole thread Raw
In response to	Re: Hash Joins vs. Bloom Filters / take 2 (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: Hash Joins vs. Bloom Filters / take 2 (Peter Geoghegan <pg@bowt.ie>)
List	pgsql-hackers

Tree view

On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.

pgsql-hackers by date:

From: Peter Geoghegan
Date: 21 February 2018, 05:06:50
Subject: Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan
Date: 21 February 2018, 05:23:58
Subject: Re: Hash Joins vs. Bloom Filters / take 2

Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

Previous

Next