Extending array intersection ops to bloom indexes - Mailing list pgsql-hackers

From Lauri Svan
Subject Extending array intersection ops to bloom indexes
Date
Msg-id CAAsqVvWbz0n7JYMNRJCc_-10wFsVSmWz+Ws0L5gS99kFedz75w@mail.gmail.com
Whole thread Raw
List pgsql-hackers
The current Bloom index implementation only supports = operator, but states in the documentation that it could be extended to handle array intersection or union queries in future. As the original authors are hanging out in this list, I would like to discuss what such implementation might be.

I just started the investigation and most of what I am writing here is likely faulty. What I currently ponder is if it were sufficient to OR the hashed array values together and store in one go to the index, or if each array value would need to be stored separately. The index insertion looks somewhat costly, and it does not feel right to repeat that work for each array element. The same would apply to the set membership queries - single or many. If I am not totally mistaken, this depends on the number of different hashing functions used, as each function would yield different values. However, if I understood the operator class for the current bloom index, we only give one hash function?

My own C skills are getting a bit rusty and I am new to pg codebase, but I'll take that as a positive challenge. Any help in the algorithm and/or implementation would be appreciated (even a small paid contribution would not be out of question).

As to the problem I aim to solve, our small startup recently bumped into a performance problem where I believe bloom index might help. We use arrays extensively in our activity log, which stores a reference to pretty much every record that changed during an activity. It is a somewhat heavy structure, and while we index it using GIN indexes and partition by timestamp, it already becomes slow to query with >10M rows. Surprisingly the GIN index works slower than a full table scan to the partition within the query time range. The table structure could be changed (e.g. unroll the arrays, resulting in 10-20x the rows), but I'd like to exercise if the problem could be solved with proper indexes, first.

I believe bloom index, supported by array intersection queries would be an ideal solution - the hashed values would be neatly packed into one hash (with the obvious need to do re-checks, esp. when the arrays have a lot of values).

Thanks in advance,
Lauri

pgsql-hackers by date:

Previous
From: "Tharakan, Robins"
Date:
Subject: RE: track_planning causing performance regression
Next
From: Asim Praveen
Date:
Subject: Re: Parallelize stream replication process