Thread: bloom filter indexes?

bloom filter indexes?

From
"Mason Hale"
Date:
I've been working on partitioning a rather large dataset into multiple
tables. One limitation I've run into the lack of cross-partition-table
unique indexes. In my case I need to guarantee the uniqueness of a
two-column pair across all partitions -- and this value is not used to
partition the tables. The table is partitioned based on a insert date
timestamp.

To check the uniqueness of this value I've added an insert/update
trigger to search for matches in the other partitions. This trigger is
adding significant overhead to inserts and updates.

This sort of 'membership test' where I need only need to know if the
key exists in the table is a perfect match for bloom filter. (see:
http://en.wikipedia.org/wiki/Bloom_filter).

The Bloom filter can give false positives so using it alone won't
provide the uniqueness check I need, but it should greatly speed up
this process.

Searching around for "postgresql bloom filter" I found this message
from 2005 along the same lines:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01475.php

This thread indicates bloom filters are used in the intarray contrib
module and the tsearch2 (and I assume the built-in 8.3 full-text
search features).

I also found this assignment for CS course at the University of
Toronto, when entails using bloom filters to speed up large joins:
http://queens.db.toronto.edu/~koudas/courses/cscd43/hw2.pdf

So, my question: are there any general-purpose bloom filter
implementations for postgresql? I'm particularly interested
implementations that would be useful for partitioned tables. Is anyone
working on something like this?

thanks,
- Mason

Re: bloom filter indexes?

From
Jeff Davis
Date:
On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote:
> I've been working on partitioning a rather large dataset into multiple
> tables. One limitation I've run into the lack of cross-partition-table
> unique indexes. In my case I need to guarantee the uniqueness of a
> two-column pair across all partitions -- and this value is not used to
> partition the tables. The table is partitioned based on a insert date
> timestamp.

You're looking for a constraint across tables.

> To check the uniqueness of this value I've added an insert/update
> trigger to search for matches in the other partitions. This trigger is
> adding significant overhead to inserts and updates.

Do you lock all of the tables before doing the check? If not, then you
have a race condition.

> This sort of 'membership test' where I need only need to know if the
> key exists in the table is a perfect match for bloom filter. (see:
> http://en.wikipedia.org/wiki/Bloom_filter).

This is more of an implementation detail. Is a bloom filter faster than
BTree in your case?

> The Bloom filter can give false positives so using it alone won't
> provide the uniqueness check I need, but it should greatly speed up
> this process.

False positives are OK, that's what RECHECK is for.

It's possible this index strategy will be better for your case.
However, I think what you really want is some kind of multi-table
primary key. Have you considered storing the key in its own two-column
table with a UNIQUE index and having the partitions reference it?

Regards,
    Jeff Davis


Re: bloom filter indexes?

From
"Mason Hale"
Date:
On Tue, Jun 3, 2008 at 12:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote:
>> I've been working on partitioning a rather large dataset into multiple
>> tables. One limitation I've run into the lack of cross-partition-table
>> unique indexes. In my case I need to guarantee the uniqueness of a
>> two-column pair across all partitions -- and this value is not used to
>> partition the tables. The table is partitioned based on a insert date
>> timestamp.
>
> You're looking for a constraint across tables.
>

Yes, for this particular case. But I'm also interested in speeding up
cross-partition queries whether it is for a uniqueness check or not.
This uniqueness check is just one (important) instance of a
cross-partition query.

>> To check the uniqueness of this value I've added an insert/update
>> trigger to search for matches in the other partitions. This trigger is
>> adding significant overhead to inserts and updates.
>
> Do you lock all of the tables before doing the check? If not, then you
> have a race condition.
>

Yes, I was concerned about that.

> It's possible this index strategy will be better for your case.
> However, I think what you really want is some kind of multi-table
> primary key. Have you considered storing the key in its own two-column
> table with a UNIQUE index and having the partitions reference it?

Thanks for the suggestion -- I'll explore maintaining the compound key
in its own non-partitioned table. I was trying to avoid any
application-layer code changes. I guess I can still accomplish that by
updating this table via an insert/update trigger.

But to reiterate, having bloom filter-based index would allow constant
time determination of whether a given partition *may* contain the
data. This would be very useful for large partitioned data-sets,
especially in (very common) cases where performance is critical.

This feature would also be useful for applications where data is
partitioned (aka 'federated') across multiple servers.

Re: bloom filter indexes?

From
"Joshua D. Drake"
Date:

On Tue, 2008-06-03 at 13:06 -0500, Mason Hale wrote:
> On Tue, Jun 3, 2008 at 12:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote:
> >> I've been working on partitioning a rather large dataset into multiple
> >> tables. One limitation I've run into the lack of cross-partition-table
> >> unique indexes. In my case I need to guarantee the uniqueness of a
> >> two-column pair across all partitions -- and this value is not used to
> >> partition the tables. The table is partitioned based on a insert date
> >> timestamp.
> >
> > You're looking for a constraint across tables.
> >
>
> Yes, for this particular case. But I'm also interested in speeding up
> cross-partition queries whether it is for a uniqueness check or not.
> This uniqueness check is just one (important) instance of a
> cross-partition query.

I simple way to do this (potentially) would be to push the trigger to C.

Joshua D. Drake