Re: Bloom index - Mailing list pgsql-hackers

From Greg Stark
Subject Re: Bloom index
Date
Msg-id 407d949e1001180037k575aeb16v572ac64a569c14f0@mail.gmail.com
Whole thread Raw
In response to Re: Bloom index  (Greg Stark <gsstark@mit.edu>)
Responses Re: Bloom index  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
On Mon, Jan 18, 2010 at 2:29 AM, Greg Stark <gsstark@mit.edu> wrote:
> Bloom filters need to be sized to have n bits per set element to
> maintain a targeted false positive rate. And that false positive rate
> would be higher than just having an n bit hash for each set element.
> Normally they have the advantage that you can test for membership
> anywhere in the set quickly -- but in this case you actually only want
> to test a specific position in the set anyways.
>
> So I think you can get a lower false positive rate by just truncating
> the each column's hash value to n bits and storing an array of them.

So for example if your bloom filter is 4 bits per column your error
rate for a single column where clause will be 1/2^(4/1.44) or a little
worse than returning 1 record in 7. If you test two or three columns
then it'll be about 1 in 49 or 1 in 343.

If you just stored a nibble for each column then you could look up the
specific nibble for the column in a single column where clause and get
an error rate of 1 in 16. If you test two or three columns then it
would be 1 in 256 or 1 in 4,096.

If you're aiming at very large numbers of columns with very large
where clauses then perhaps you only need, say, 2 bits per column. But
for any number of bits per column I think you're always better off
just storing that many bits of the each hash rather than using a bloom
filter for this.


Also, as it stands now it's an unordered list. I assume you have no
plans to implement vacuum for this? Perhaps it would be better to
store a btree of signatures ordered by htid. That would let you easily
vacuum dead entries. Though it would necessitate random seeks to do
the full scan of the index unless you can solve the full index scan
concurrent with page splits problem.

-- 
greg


pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Hot Standby and handling max_standby_delay
Next
From: Simon Riggs
Date:
Subject: Re: Hot Standby and handling max_standby_delay