Thread: Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !

Hi All,

I was testing bloom indexes today. I understand bloom indexes uses bloom filters.

As i understand, a bloom filter is a bit array of m bits and a constant "k" number of hash functions are used to generate hashes for the data. And then the appropriate bits are set to 1. 

I was doing the following test where i was generating 10 million records and testing bloom indexes. 

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accepts 2 parameters. length and the number of bits for each column.

Here is the problem or the question i have. 

I have created the following 2 Indexes. 

Index 1
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);

Index 2
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2, col8=2);

With change in length, we of course see a difference in the Index size which is understandable. Here i have the same length for both indexes. But, i reduced the number of bits per each index column from 4 to 2. Both the above indexes are of the same size. But, there is a very slight difference in the execution time between Index 1 and Index 2 but with the same cost. 

So the question here is -
I assume - number of bits = k. Where k is the total number of hash functions used on top of the data that needs to validated. Is that correct ? If yes, why do we see the Index 1 performing better than Index 2 ? Because, the data has to go through more hash functions (4 vs 2) in Index 1 than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives. 
Please let me know if there is anything simple that i am missing here. 

Query 

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833 and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5 = 326756 and id6 = 597560 and zipcode = 10545; 

With Index 1 

           QUERY PLAN                                              
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=307.000..307.000 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=306.997..306.997 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.106 ms
 Execution Time: 307.030 ms
(9 rows)

With Index 2 

           QUERY PLAN                                              
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=420.881..420.881 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=420.878..420.878 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.104 ms
 Execution Time: 420.913 ms
(9 rows)

Thanks,
Avinash. 



Hello Avinash,

> I was testing bloom indexes today. I understand bloom indexes uses bloom
> filters. [...]
>
> So the question here is -
> I assume - number of bits = k. Where k is the total number of hash
> functions used on top of the data that needs to validated. Is that correct
> ? If yes, why do we see the Index 1 performing better than Index 2 ?
> Because, the data has to go through more hash functions (4 vs 2) in Index 1
> than Index 2. So, with Index 1 it should take more time.
> Also, both the indexes have ZERO false positives.
> Please let me know if there is anything simple that i am missing here.

You may have a look at the blog entry about these parameters I redacted a 
few year ago:

   http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html

-- 
Fabien.



Thanks Fabien,

But the 2 direct questions i have are : 

1. What is the structure of the Bloom Index ? Can you please let me know what are the fields of a Bloom Index ? Is it just the Item Pointer and BloomSignatureWord ?
When i describe my bloom index it looks like following. 

postgres=# \d+ foo.idx_bloom_bar 
                   Index "foo.idx_bloom_bar"
 Column  |  Type   | Key? | Definition | Storage | Stats target 
---------+---------+------+------------+---------+--------------
 id      | integer | yes  | id         | plain   | 
 dept    | integer | yes  | dept       | plain   | 
 id2     | integer | yes  | id2        | plain   | 
 id3     | integer | yes  | id3        | plain   | 
 id4     | integer | yes  | id4        | plain   | 
 id5     | integer | yes  | id5        | plain   | 
 id6     | integer | yes  | id6        | plain   | 
 zipcode | integer | yes  | zipcode    | plain   | 
bloom, for table "foo.bar"
Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4


2. If we are storing just one signature word per row, how is this aggregated for all column values of that row into one signature in high level ? 
For example, if length = 64, does it mean that a bit array of 64 bits is generated per each row ? 
If col1=4, does it mean the value of col1 is passed to 4 hash functions that generate 4 positions that can be set to 1 in the bit array ?

On Sat, Jun 8, 2019 at 3:11 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Avinash,

> I was testing bloom indexes today. I understand bloom indexes uses bloom
> filters. [...]
>
> So the question here is -
> I assume - number of bits = k. Where k is the total number of hash
> functions used on top of the data that needs to validated. Is that correct
> ? If yes, why do we see the Index 1 performing better than Index 2 ?
> Because, the data has to go through more hash functions (4 vs 2) in Index 1
> than Index 2. So, with Index 1 it should take more time.
> Also, both the indexes have ZERO false positives.
> Please let me know if there is anything simple that i am missing here.

You may have a look at the blog entry about these parameters I redacted a
few year ago:

   http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html

--
Fabien.




--
9000799060
> But the 2 direct questions i have are :
>
> 1. What is the structure of the Bloom Index ? Can you please let me know
> what are the fields of a Bloom Index ? Is it just the Item Pointer
> and BloomSignatureWord ?

I'm not sure of Postgres actual implementation, I have just looked at the 
underlying hash functions implementation.

> When i describe my bloom index it looks like following.
>
> postgres=# \d+ foo.idx_bloom_bar
>                   Index "foo.idx_bloom_bar"
> Column  |  Type   | Key? | Definition | Storage | Stats target
> ---------+---------+------+------------+---------+--------------
> id      | integer | yes  | id         | plain   |
> dept    | integer | yes  | dept       | plain   |
> id2     | integer | yes  | id2        | plain   |
> id3     | integer | yes  | id3        | plain   |
> id4     | integer | yes  | id4        | plain   |
> id5     | integer | yes  | id5        | plain   |
> id6     | integer | yes  | id6        | plain   |
> zipcode | integer | yes  | zipcode    | plain   |
> bloom, for table "foo.bar"

The bloom index associates a signature, i.e. a bitfield the size of which 
is specified by the parameter "length", to the tuple location. The 
bitfield is computed by hashing the value of columns which are listed 
above in the index definition. The many values are somehow compressed into 
the small signature.

> Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
> col8=4

I doubt that these parameters are good. The is no point to include a 
unique column into a bloom index. If you look at my blog, the number of 
bits associated to each field should depend on the expected selectivity, 
which depends on the entropy of each field and the signature size. The 
column entropy can be computed with a query.

> 2. If we are storing just one signature word per row, how is this
> aggregated for all column values of that row into one signature in high
> level ?

The aggregation, if performed, is not very useful in practice because it 
can only be effective on a few (first) bits, which are randomly computed 
anyway and the value of the query is not likely to hit them.

Fundamentally all bitfields are scanned to extract which tuples are 
possibly of interest, i.e. are not excluded by the index. The "full scan" 
of the bloom index is not a bad thing if it is much smaller than the table 
itself.

> For example, if length = 64, does it mean that a bit array of 64 bits is
> generated per each row ?

Yes.

> If col1=4, does it mean the value of col1 is passed to 4 hash functions
> that generate 4 positions that can be set to 1 in the bit array ?

Yes.

Try to apply the formula in the blog to see what you get for your 
parameters, but it is likely to be < 4.

-- 
Fabien.