Re: Bloom index - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Bloom index
Date
Msg-id 407d949e1001180513g4216844dl48f3701ab32310d1@mail.gmail.com
Whole thread Raw
In response to Re: Bloom index  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
2010/1/18 Teodor Sigaev <teodor@sigaev.ru>:
>> So for example if your bloom filter is 4 bits per column your error
>> rate for a single column where clause will be 1/2^(4/1.44) or a little
>> worse than returning 1 record in 7. If you test two or three columns
>> then it'll be about 1 in 49 or 1 in 343.
>
> Hmm, I don't understand your calculations. In your example (4 bits per
> column, index has only one column) each row is hashed by 4 bits in 80-bit
> vector (signature), so, probability of collision is (1/80)^4

Sorry, I meant if you had n columns in the index, and the signature
was 4*n bits, and the where clause has one key in the search
condition. According to some guy editing the wikipedia page the number
of bits per value in the set a bloom filter needs to achieve an error
rate of e is 1.44*log_2(1/e). So the error rate for 4 bits per column
would be 1/2^(1.44*nbits). That's larger than the error rate for a
simple hash which would be 1/2^nbits.

The reason bloom filters are advantageous when checking for a value
*anywhere* in the set is that with a simple hash of each value you
would have that error rate for each check, so you would have a much
higher false positive rate for finding it somewhere. But in your case
you know which position you want to check.

-- 
greg


pgsql-hackers by date:

Previous
From: Teodor Sigaev
Date:
Subject: Re: Bloom index
Next
From: Greg Stark
Date:
Subject: Bloom filters bloom filters bloom filters