Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Hash Joins vs. Bloom Filters / take 2
Date
Msg-id 2a2dd710-767c-8974-1ba7-f2da0f5dab80@2ndquadrant.com
In response to Re: Hash Joins vs. Bloom Filters / take 2  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: Hash Joins vs. Bloom Filters / take 2  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
On 02/22/2018 08:33 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>>> ...
>>>
>>> An HLL can be used to estimate set size, the paper makes no
>>> mention of it, probably assuming only distinct items are added to
>>> the set.
>>>
>>
>> The problem with HLL is, it's only an estimate of how many entries
>> you saw so far. It only tells you that after observing the items,
>> and it only tells you how many items you saw so far. What we need
>> for sizing a bloom filter is an estimate of number of distinct
>> values in advance.
>>
>> In other words, HLL is entirely useless for sizing Bloom Filters.
> 
> Normal BFs, yes. But that's exactly what you need for scalable BFs. 
> You need an estimate of the amount of distinct entries you've added
> to your current filter, not the total set size.
> 

No, you don't need that. For the SBF you need to know when the current BF
gets full, and that can be determined solely from the number of bits set
to 1. Essentially, the BF reaches its target false positive rate when
roughly 50% of its bits are set.

Which is exactly what the SBF paper says on page 4: ... when a filter gets
full due to the limit on the fill ratio, a new one is added.
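
To illustrate (the struct and names here are made up for this sketch, not
taken from the actual patch), the "full" test is nothing more than a
fill-ratio check:

#include <stdbool.h>
#include <stdint.h>

/* one slice of a scalable Bloom filter (hypothetical layout) */
typedef struct BloomSlice
{
    uint64_t    nbits;      /* M: number of bits in the bit array */
    int         nhashes;    /* K: number of hash functions */
    uint64_t    nbits_set;  /* number of bits currently set to 1 */
    uint8_t    *bits;       /* the bit array itself */
} BloomSlice;

/*
 * The slice counts as full once the fill ratio reaches 50%, i.e. the
 * decision is based purely on the number of bits set to 1 - no
 * estimate of distinct additions (HLL or otherwise) is involved.
 */
static bool
bloom_slice_is_full(const BloomSlice *slice)
{
    return slice->nbits_set * 2 >= slice->nbits;
}

When that returns true, the SBF appends a new (larger, tighter-error)
slice and inserts into that one from then on - which is the behavior the
paper describes.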

>> Furthermore, we could estimate number of observed distinct values from
>> the number of 1s in the bloom filter
> 
> That's kinda slow to do per-item. I tried to "count" distinct items by
> checking the BF before adding (don't add redundantly), but that's less
> precise than a HLL in my experience.

But you don't need to do that for each item you try to add - with M bits
and K hash functions you know you can't reach the 50% fill ratio in fewer
than M/(2*K) additions. And you don't really need to track the number of
bits exactly - if it gets to 55% full, it's not going to explode.

So just wait until M/(2*K) additions, count the bits, see how many remain
until the threshold - rinse and repeat. And enforce some reasonable
minimum distance between checks (say, 5% of the capacity), so the check
does not happen too often.
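
A rough sketch of that scheduling (again, hypothetical names, just to show
the arithmetic): each addition sets at most K bits, so from the last exact
count we get a lower bound on how many additions it takes to reach the 50%
threshold.

#include <stdint.h>

/*
 * How many additions can we safely skip before recounting the bits?
 * With nbits_set bits currently set and at most nhashes new bits per
 * addition, we cannot reach the nbits/2 threshold in fewer than
 * (nbits/2 - nbits_set) / nhashes additions.  The floor of ~5% of the
 * M/(2*K) capacity keeps us from recounting too often near the
 * threshold.
 */
static uint64_t
bloom_additions_until_next_check(uint64_t nbits, int nhashes,
                                 uint64_t nbits_set)
{
    uint64_t    threshold = nbits / 2;              /* 50% fill ratio */
    uint64_t    remaining = (nbits_set < threshold)
                            ? threshold - nbits_set : 0;
    uint64_t    min_gap = nbits / (2 * nhashes) / 20;  /* ~5% of capacity */
    uint64_t    gap = remaining / nhashes;

    return (gap > min_gap) ? gap : min_gap;
}

With nbits_set = 0 this yields exactly the initial M/(2*K) wait, and the
5% floor means we may overshoot 50% a little - which, as said above, is
not going to explode.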

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

