Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) ! - Mailing list pgsql-hackers

From Avinash Kumar
Subject Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !
Date
Msg-id CAN0TujeD-DxQ=TjdQkEPXdVSxtt6MpAhJ=TazsAHxuRQ4_E5qQ@mail.gmail.com
Whole thread Raw
Responses Re: Bloom Indexes - bit array length and the total number of bits(or hash functions ?? ) !
List pgsql-hackers
Hi All,

I was testing bloom indexes today. I understand bloom indexes uses bloom filters.

As i understand, a bloom filter is a bit array of m bits and a constant "k" number of hash functions are used to generate hashes for the data. And then the appropriate bits are set to 1. 

I was doing the following test where i was generating 10 million records and testing bloom indexes. 

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accepts 2 parameters. length and the number of bits for each column.

Here is the problem or the question i have. 

I have created the following 2 Indexes. 

Index 1
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);

Index 2
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2, col8=2);

With change in length, we of course see a difference in the Index size which is understandable. Here i have the same length for both indexes. But, i reduced the number of bits per each index column from 4 to 2. Both the above indexes are of the same size. But, there is a very slight difference in the execution time between Index 1 and Index 2 but with the same cost. 

So the question here is -
I assume - number of bits = k. Where k is the total number of hash functions used on top of the data that needs to validated. Is that correct ? If yes, why do we see the Index 1 performing better than Index 2 ? Because, the data has to go through more hash functions (4 vs 2) in Index 1 than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives. 
Please let me know if there is anything simple that i am missing here. 

Query 

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833 and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5 = 326756 and id6 = 597560 and zipcode = 10545; 

With Index 1 

           QUERY PLAN                                              
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=307.000..307.000 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=306.997..306.997 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.106 ms
 Execution Time: 307.030 ms
(9 rows)

With Index 2 

           QUERY PLAN                                              
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=420.881..420.881 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=420.878..420.878 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.104 ms
 Execution Time: 420.913 ms
(9 rows)

Thanks,
Avinash. 



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Temp table handling after anti-wraparound shutdown (Was: BUG#15840)
Next
From: Arthur Zakirov
Date:
Subject: Re: [PROPOSAL] Drop orphan temp tables in single-mode