Bloom Filter improvements in postgesql - Mailing list pgsql-hackers
From | Ross Heaney |
---|---|
Subject | Bloom Filter improvements in postgesql |
Date | |
Msg-id | CAJMh_DJQLmUONKyDhZPhpWboqRV_+i0JbwtADB1R6PZX9uacwA@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
Hi All,
I would like to propose some improvements to the bloom filter implementation that exists in postgres at the moment. I will describe the changes along with a patch with some tests. I wanted to get feedback before I spend more time on this so the patch is not production ready(WIP) but is sufficient for feedback. The patch code compiles and tests successfully and doesn't break anything(to the best of my knowledge!)
Problem Statement
The existing bloom filter uses an integer number of hash functions. This is completely fine but we can do better. In a recent paper its described how bloom filters can be tuned to use non-integer values of hash functions. This is particularly nice as often the optimal number of hash functions to use is not an integer number. When the hash functions(K) to be used is calculated, it could be something like 2.3. Then we would round up to 3 and use k=3 hash functions for the bloom filter. This means that the filter size will be ceil(k)/k = 1/k times larger than the optimal filter size. If we take floor(k) instead we get an increase in the filter's density which increases the false positive rate which may not be desirable.
Proposed Improvements
Instead we can relax the assumption that it must be an integer value for k. To do this we split the k value into an integer and a rational part. Call this k and r. So in the example above where we calculated k to be equal to 2.3 this would mean k = 2 and r = 0.3. We always use k hash functions but we apply a third hash function 30% of the time(corresponding to r = 0.3 in this case). This is done deterministically so the same element always gets the same treatment(keeps the bloom filter promise of no false negatives). The paper has more information but this is the gist of the rational part of the bloom filter.
However we can take this further and look to make the bloom filter lossless. Using the rational part described above we've given ourselves some wriggle room so to speak. We can compliment the rational 'lossy' bloom filter with an additional array called a witness. This allows us to reconstruct the original input, which just means a false positive rate of 0. The witness(W) is constructed concurrently with the bloom filter bitmap. The witness encodes information for every element that passes the initial bloom filter test(both the true and false positives). Clearly this establishes a direct relationship between the false positive rate and the size of the witness. As this is all done deterministically the decoding process is just the reverse of the encoding process. We can easily figure out true positives from false positives by checking the witness. By extension of the deterministic process there can be no false negatives.
Tradeoffs and Considerations
As stated in the previous section there is a direct relationship between the false positive rate and the size of the witness. The lossless capability comes at the price of more memory for the witness. The expected witness size can be calculated t * (1 + r ^2) where t = true positives and r = 1/ln2. Let's give a concrete example
t = 1000000 (a million items in the filter)
fp = 0.01
size of the filter = 9,585,059 (bits or approx 1.14 mb)
k = 6.65 (assuming we use rational number of hfc's)
For a bloom filter setup for these parameters we can expect a witness size of approximately 3,000,000 bits. The total space complexity is still 0(N). The main tradeoffs are of course the memory overhead of the witness. I have done some preliminary benchmarking and as expected the use of the witness does slow down inserts and lookups. The insert speed is around 3x slower and the lookup speed 2x slower. I definitely think this can be made faster but in any case it will always be slower than the traditional bloom filter if we wanted to also have a 'lossless' option.
Final thoughts and comments
The rational bloom filter would be a nice improvement to the existing bloom filter implementation. The lossless option could be useful in some scenarios at the expense of extra memory and slower lookup and insertion times. I have attached two patches showing a WIP of the changes(mainly to do with the lossless option). I would appreciate any feedback on whether any/all of this proposal would be considered. I'd be happy to make the changes myself.
Attachment
pgsql-hackers by date: