Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) ! - Mailing list pgsql-hackers
From | Avinash Kumar |
---|---|
Subject | Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) ! |
Date | |
Msg-id | CAN0TujeD-DxQ=TjdQkEPXdVSxtt6MpAhJ=TazsAHxuRQ4_E5qQ@mail.gmail.com Whole thread Raw |
Responses |
Re: Bloom Indexes - bit array length and the total number of bits(or hash functions ?? ) !
|
List | pgsql-hackers |
Hi All,
I was testing bloom indexes today. I understand bloom indexes uses bloom filters.
As i understand, a bloom filter is a bit array of m bits and a constant "k" number of hash functions are used to generate hashes for the data. And then the appropriate bits are set to 1.
I was doing the following test where i was generating 10 million records and testing bloom indexes.
CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);
INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;
As per the documentation, bloom indexes accepts 2 parameters. length and the number of bits for each column.
Here is the problem or the question i have.
I have created the following 2 Indexes.
Index 1
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);
Index 2
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2, col8=2);
With change in length, we of course see a difference in the Index size which is understandable. Here i have the same length for both indexes. But, i reduced the number of bits per each index column from 4 to 2. Both the above indexes are of the same size. But, there is a very slight difference in the execution time between Index 1 and Index 2 but with the same cost.
So the question here is -
I assume - number of bits = k. Where k is the total number of hash functions used on top of the data that needs to validated. Is that correct ? If yes, why do we see the Index 1 performing better than Index 2 ? Because, the data has to go through more hash functions (4 vs 2) in Index 1 than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.
Query
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833 and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5 = 326756 and id6 = 597560 and zipcode = 10545;
With Index 1
QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=307.000..307.000 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1 width=0) (actual time=306.997..306.997 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.106 ms
Execution Time: 307.030 ms
(9 rows)
With Index 2
QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=420.881..420.881 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1 width=0) (actual time=420.878..420.878 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.104 ms
Execution Time: 420.913 ms
(9 rows)
Thanks,
Avinash.
Avinash.
pgsql-hackers by date: