* Atri Sharma (atri.jiit@gmail.com) wrote:
> My point is that I would like to help in the implementation, if possible. :)
Feel free to go ahead and implement it. I'm not sure when I'll have a
chance to (probably not in the next week or two anyway). Unfortunately,
the bigger issue here is really about testing the results and
determining if it's actually faster/better with various data sets
(including ones which have duplicates). I've got one test data set
which has some interesting characteristics (for one thing, hashing the
"large" side and then seq-scanning the "small" side is actually faster
than going the other way, which is quite 'odd' in my view for a hashing
system): http://snowman.net/~sfrost/test_case2.sql
You might also look at the other emails that I sent regarding this
subject and NTUP_PER_BUCKET. Having someone confirm what I saw when
changing that parameter would be helpful, and it would provide a good
comparison point against any kind of pre-filtering that we're doing.
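For anyone reproducing those measurements, here's a rough back-of-the-envelope
model of how NTUP_PER_BUCKET drives the hash table's bucket count. This is only
an illustrative sketch; the real sizing logic (which also accounts for work_mem,
batching, and skew) lives in ExecChooseHashTableSize in PostgreSQL's nodeHash.c.

```python
def estimated_nbuckets(ntuples: int, ntup_per_bucket: int) -> int:
    """Estimate buckets so each chain holds ~ntup_per_bucket tuples,
    rounded up to a power of two (as the executor does).

    Illustrative model only -- not PostgreSQL's actual code."""
    target = max(1, ntuples // ntup_per_bucket)
    nbuckets = 1
    while nbuckets < target:
        nbuckets *= 2
    return nbuckets

# Lowering NTUP_PER_BUCKET trades memory for shorter bucket chains:
print(estimated_nbuckets(1_000_000, 10))  # 131072 buckets
print(estimated_nbuckets(1_000_000, 1))   # 1048576 buckets
```

The point of the comparison is that shorter chains (smaller NTUP_PER_BUCKET)
cut probe-time comparisons in much the same way a pre-filter would, so the two
approaches need to be measured against each other.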
One thing that re-reading the bloom filter description reminded me of is
that it's at least conceivable that we could take the existing hash
functions for each data type and do double-hashing or perhaps seed the
value to be hashed with additional data to produce an "independent" hash
result to use. Again, there are a lot of things that need to be tested
and measured to see if they improve overall performance.
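To make the idea concrete, here's a small sketch of deriving several
"independent" hash results from one base hash function by seeding it with
additional data, then combining the two via double hashing (the standard
Kirsch-Mitzenmacher construction for Bloom filters). All names here are
illustrative; this uses SHA-256 as a stand-in for the per-datatype hash
functions, not any PostgreSQL API.

```python
import hashlib

def base_hash(value: bytes, seed: bytes = b"") -> int:
    # Stand-in for a datatype's hash function; the seed prefix simulates
    # "seeding the value to be hashed with additional data".
    return int.from_bytes(hashlib.sha256(seed + value).digest()[:8], "big")

def bloom_positions(value: bytes, k: int, nbits: int) -> list[int]:
    # Double hashing: g_i(x) = h1(x) + i * h2(x) gives k pseudo-independent
    # hash values from just two base hash computations.
    h1 = base_hash(value)
    h2 = base_hash(value, seed=b"\x01") | 1  # force odd to avoid short cycles
    return [(h1 + i * h2) % nbits for i in range(k)]

class BloomFilter:
    def __init__(self, nbits: int, k: int):
        self.bits = bytearray(nbits)  # one byte per bit, for clarity
        self.nbits, self.k = nbits, k

    def add(self, value: bytes) -> None:
        for pos in bloom_positions(value, self.k, self.nbits):
            self.bits[pos] = 1

    def might_contain(self, value: bytes) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[pos]
                   for pos in bloom_positions(value, self.k, self.nbits))

bf = BloomFilter(nbits=1 << 16, k=4)
bf.add(b"42")
print(bf.might_contain(b"42"))  # True
```

Whether computing the extra seeded hash is cheaper than the probe-side work it
saves is exactly the kind of thing that would need measuring on data sets like
the one above, especially ones with duplicates.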
Thanks,
Stephen