Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Hash Joins vs. Bloom Filters / take 2
Date
Msg-id 2a2dd710-767c-8974-1ba7-f2da0f5dab80@2ndquadrant.com
In response to Re: Hash Joins vs. Bloom Filters / take 2  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: Hash Joins vs. Bloom Filters / take 2  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
On 02/22/2018 08:33 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>>> ...
>>>
>>> An HLL can be used to estimate set size, the paper makes no
>>> mention of it, probably assuming only distinct items are added to
>>> the set.
>>>
>>
>> The problem with HLL is, it's only an estimate of how many entries
>> you saw so far. It only tells you that after observing the items,
>> and it only tells you how many items you saw so far. What we need
>> for sizing a bloom filter is an estimate of number of distinct
>> values in advance.
>>
>> In other words, HLL is entirely useless for sizing Bloom Filters.
> 
> Normal BFs, yes. But that's exactly what you need for scalable BFs. 
> You need an estimate of the amount of distinct entries you've added
> to your current filter, not the total set size.
> 

No, you don't need that. For the SBF you need to know when the current BF
gets full, and that can be determined solely from the number of bits set
to 1. Essentially, the BF reaches its target false positive rate when
roughly 50% of its bits are set.

Which is exactly what the SBF paper says on page 4: ... when a filter gets
full due to the limit on the fill ratio, a new one is added.
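
To illustrate (the struct and names here are made up for this sketch, not
taken from the actual patch), the "full" test is nothing more than a
fill-ratio check:

#include <stdbool.h>
#include <stdint.h>

/* one slice of a scalable Bloom filter (hypothetical layout) */
typedef struct BloomSlice
{
    uint64_t    nbits;      /* M: number of bits in the bit array */
    int         nhashes;    /* K: number of hash functions */
    uint64_t    nbits_set;  /* number of bits currently set to 1 */
    uint8_t    *bits;       /* the bit array itself */
} BloomSlice;

/*
 * The slice counts as full once the fill ratio reaches 50%, i.e. the
 * decision is based purely on the number of bits set to 1 - no
 * estimate of distinct additions (HLL or otherwise) is involved.
 */
static bool
bloom_slice_is_full(const BloomSlice *slice)
{
    return slice->nbits_set * 2 >= slice->nbits;
}

When that returns true, the SBF appends a new (larger, tighter-error)
slice and inserts into that one from then on - which is the behavior the
paper describes.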

>> Furthermore, we could estimate number of observed distinct values from
>> the number of 1s in the bloom filter
> 
> That's kinda slow to do per-item. I tried to "count" distinct items by
> checking the BF before adding (don't add redundantly), but that's less
> precise than a HLL in my experience.

But you don't need to do that for each item you try to add - with M bits
and K hash functions you know you can't reach the 50% fill ratio in fewer
than M/(2*K) additions. And you don't really need to track the number of
bits exactly - if it gets to 55% full, it's not going to explode.

So just wait until M/(2*K) additions, count the bits, see how many remain
until the threshold - rinse and repeat. And enforce some reasonable
minimum distance between checks (say, 5% of the capacity), so the check
does not happen too often.
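
A rough sketch of that scheduling (again, hypothetical names, just to show
the arithmetic): each addition sets at most K bits, so from the last exact
count we get a lower bound on how many additions it takes to reach the 50%
threshold.

#include <stdint.h>

/*
 * How many additions can we safely skip before recounting the bits?
 * With nbits_set bits currently set and at most nhashes new bits per
 * addition, we cannot reach the nbits/2 threshold in fewer than
 * (nbits/2 - nbits_set) / nhashes additions.  The floor of ~5% of the
 * M/(2*K) capacity keeps us from recounting too often near the
 * threshold.
 */
static uint64_t
bloom_additions_until_next_check(uint64_t nbits, int nhashes,
                                 uint64_t nbits_set)
{
    uint64_t    threshold = nbits / 2;              /* 50% fill ratio */
    uint64_t    remaining = (nbits_set < threshold)
                            ? threshold - nbits_set : 0;
    uint64_t    min_gap = nbits / (2 * nhashes) / 20;  /* ~5% of capacity */
    uint64_t    gap = remaining / nhashes;

    return (gap > min_gap) ? gap : min_gap;
}

With nbits_set = 0 this yields exactly the initial M/(2*K) wait, and the
5% floor means we may overshoot 50% a little - which, as said above, is
not going to explode.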

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

