Home > mailing lists

Re: Bloom index - Mailing list pgsql-hackers

From	Greg Stark
Subject	Re: Bloom index
Date	January 18, 2010 12:13:51
Msg-id	407d949e1001180513g4216844dl48f3701ab32310d1@mail.gmail.com Whole thread Raw
In response to	Re: Bloom index (Teodor Sigaev <teodor@sigaev.ru>)
List	pgsql-hackers

Tree view

2010/1/18 Teodor Sigaev <teodor@sigaev.ru>:
>> So for example if your bloom filter is 4 bits per column your error
>> rate for a single column where clause will be 1/2^(4/1.44) or a little
>> worse than returning 1 record in 7. If you test two or three columns
>> then it'll be about 1 in 49 or 1 in 343.
>
> Hmm, I don't understand your calculations. In your example (4 bits per
> column, index has only one column) each row is hashed by 4 bits in 80-bit
> vector (signature), so, probability of collision is (1/80)^4

Sorry, I meant if you had n columns in the index, and the signature
was 4*n bits, and the where clause has one key in the search
condition. According to some guy editing the wikipedia page the number
of bits per value in the set a bloom filter needs to achieve an error
rate of e is 1.44*log_2(1/e). So the error rate for 4 bits per column
would be 1/2^(1.44*nbits). That's larger than the error rate for a
simple hash which would be 1/2^nbits.

The reason bloom filters are advantageous when checking for a value
*anywhere* in the set is that with a simple hash of each value you
would have that error rate for each check, so you would have a much
higher false positive rate for finding it somewhere. But in your case
you know which position you want to check.

-- 
greg

pgsql-hackers by date:

From: Teodor Sigaev
Date: 18 January 2010, 11:48:08
Subject: Re: Bloom index

From: Greg Stark
Date: 18 January 2010, 12:23:54
Subject: Bloom filters bloom filters bloom filters

Re: Bloom index - Mailing list pgsql-hackers

Previous

Next