Thread: bloom filter indexes?
I've been working on partitioning a rather large dataset into multiple tables. One limitation I've run into the lack of cross-partition-table unique indexes. In my case I need to guarantee the uniqueness of a two-column pair across all partitions -- and this value is not used to partition the tables. The table is partitioned based on a insert date timestamp. To check the uniqueness of this value I've added an insert/update trigger to search for matches in the other partitions. This trigger is adding significant overhead to inserts and updates. This sort of 'membership test' where I need only need to know if the key exists in the table is a perfect match for bloom filter. (see: http://en.wikipedia.org/wiki/Bloom_filter). The Bloom filter can give false positives so using it alone won't provide the uniqueness check I need, but it should greatly speed up this process. Searching around for "postgresql bloom filter" I found this message from 2005 along the same lines: http://archives.postgresql.org/pgsql-hackers/2005-05/msg01475.php This thread indicates bloom filters are used in the intarray contrib module and the tsearch2 (and I assume the built-in 8.3 full-text search features). I also found this assignment for CS course at the University of Toronto, when entails using bloom filters to speed up large joins: http://queens.db.toronto.edu/~koudas/courses/cscd43/hw2.pdf So, my question: are there any general-purpose bloom filter implementations for postgresql? I'm particularly interested implementations that would be useful for partitioned tables. Is anyone working on something like this? thanks, - Mason
On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote: > I've been working on partitioning a rather large dataset into multiple > tables. One limitation I've run into the lack of cross-partition-table > unique indexes. In my case I need to guarantee the uniqueness of a > two-column pair across all partitions -- and this value is not used to > partition the tables. The table is partitioned based on a insert date > timestamp. You're looking for a constraint across tables. > To check the uniqueness of this value I've added an insert/update > trigger to search for matches in the other partitions. This trigger is > adding significant overhead to inserts and updates. Do you lock all of the tables before doing the check? If not, then you have a race condition. > This sort of 'membership test' where I need only need to know if the > key exists in the table is a perfect match for bloom filter. (see: > http://en.wikipedia.org/wiki/Bloom_filter). This is more of an implementation detail. Is a bloom filter faster than BTree in your case? > The Bloom filter can give false positives so using it alone won't > provide the uniqueness check I need, but it should greatly speed up > this process. False positives are OK, that's what RECHECK is for. It's possible this index strategy will be better for your case. However, I think what you really want is some kind of multi-table primary key. Have you considered storing the key in its own two-column table with a UNIQUE index and having the partitions reference it? Regards, Jeff Davis
On Tue, Jun 3, 2008 at 12:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote: >> I've been working on partitioning a rather large dataset into multiple >> tables. One limitation I've run into the lack of cross-partition-table >> unique indexes. In my case I need to guarantee the uniqueness of a >> two-column pair across all partitions -- and this value is not used to >> partition the tables. The table is partitioned based on a insert date >> timestamp. > > You're looking for a constraint across tables. > Yes, for this particular case. But I'm also interested in speeding up cross-partition queries whether it is for a uniqueness check or not. This uniqueness check is just one (important) instance of a cross-partition query. >> To check the uniqueness of this value I've added an insert/update >> trigger to search for matches in the other partitions. This trigger is >> adding significant overhead to inserts and updates. > > Do you lock all of the tables before doing the check? If not, then you > have a race condition. > Yes, I was concerned about that. > It's possible this index strategy will be better for your case. > However, I think what you really want is some kind of multi-table > primary key. Have you considered storing the key in its own two-column > table with a UNIQUE index and having the partitions reference it? Thanks for the suggestion -- I'll explore maintaining the compound key in its own non-partitioned table. I was trying to avoid any application-layer code changes. I guess I can still accomplish that by updating this table via an insert/update trigger. But to reiterate, having bloom filter-based index would allow constant time determination of whether a given partition *may* contain the data. This would be very useful for large partitioned data-sets, especially in (very common) cases where performance is critical. This feature would also be useful for applications where data is partitioned (aka 'federated') across multiple servers.
On Tue, 2008-06-03 at 13:06 -0500, Mason Hale wrote: > On Tue, Jun 3, 2008 at 12:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2008-06-03 at 09:43 -0500, Mason Hale wrote: > >> I've been working on partitioning a rather large dataset into multiple > >> tables. One limitation I've run into the lack of cross-partition-table > >> unique indexes. In my case I need to guarantee the uniqueness of a > >> two-column pair across all partitions -- and this value is not used to > >> partition the tables. The table is partitioned based on a insert date > >> timestamp. > > > > You're looking for a constraint across tables. > > > > Yes, for this particular case. But I'm also interested in speeding up > cross-partition queries whether it is for a uniqueness check or not. > This uniqueness check is just one (important) instance of a > cross-partition query. I simple way to do this (potentially) would be to push the trigger to C. Joshua D. Drake