Re: Bloom Indexes - bit array length and the total number of bits(or hash functions ?? ) ! - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: Bloom Indexes - bit array length and the total number of bits(or hash functions ?? ) !
Date
Msg-id alpine.DEB.2.21.1906100758230.9922@lancre
Whole thread Raw
In response to Re: Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !  (Avinash Kumar <avinash.vallarapu@gmail.com>)
List pgsql-hackers
> But the 2 direct questions i have are :
>
> 1. What is the structure of the Bloom Index ? Can you please let me know
> what are the fields of a Bloom Index ? Is it just the Item Pointer
> and BloomSignatureWord ?

I'm not sure of Postgres actual implementation, I have just looked at the 
underlying hash functions implementation.

> When i describe my bloom index it looks like following.
>
> postgres=# \d+ foo.idx_bloom_bar
>                   Index "foo.idx_bloom_bar"
> Column  |  Type   | Key? | Definition | Storage | Stats target
> ---------+---------+------+------------+---------+--------------
> id      | integer | yes  | id         | plain   |
> dept    | integer | yes  | dept       | plain   |
> id2     | integer | yes  | id2        | plain   |
> id3     | integer | yes  | id3        | plain   |
> id4     | integer | yes  | id4        | plain   |
> id5     | integer | yes  | id5        | plain   |
> id6     | integer | yes  | id6        | plain   |
> zipcode | integer | yes  | zipcode    | plain   |
> bloom, for table "foo.bar"

The bloom index associates a signature, i.e. a bitfield the size of which 
is specified by the parameter "length", to the tuple location. The 
bitfield is computed by hashing the value of columns which are listed 
above in the index definition. The many values are somehow compressed into 
the small signature.

> Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
> col8=4

I doubt that these parameters are good. The is no point to include a 
unique column into a bloom index. If you look at my blog, the number of 
bits associated to each field should depend on the expected selectivity, 
which depends on the entropy of each field and the signature size. The 
column entropy can be computed with a query.

> 2. If we are storing just one signature word per row, how is this
> aggregated for all column values of that row into one signature in high
> level ?

The aggregation, if performed, is not very useful in practice because it 
can only be effective on a few (first) bits, which are randomly computed 
anyway and the value of the query is not likely to hit them.

Fundamentally all bitfields are scanned to extract which tuples are 
possibly of interest, i.e. are not excluded by the index. The "full scan" 
of the bloom index is not a bad thing if it is much smaller than the table 
itself.

> For example, if length = 64, does it mean that a bit array of 64 bits is
> generated per each row ?

Yes.

> If col1=4, does it mean the value of col1 is passed to 4 hash functions
> that generate 4 positions that can be set to 1 in the bit array ?

Yes.

Try to apply the formula in the blog to see what you get for your 
parameters, but it is likely to be < 4.

-- 
Fabien.



pgsql-hackers by date:

Previous
From: Zhenghua Lyu
Date:
Subject: Questions of 'for update'
Next
From: Alex
Date:
Subject: Why to index a "Recently DEAD" tuple when creating index