Hi,
attached is v2 of the patch, with a number of improvements:
0) This relies on the the other hashjoin patches (delayed build of
buckets and batching), as it allows sizing the bloom filter.
1) enable_hashjoin_bloom GUC
This is mostly meant for debugging and testing, not for committing.
2) Outer joins should be working fine now. That is, the results should
be correct and faster as the outer rows without matches should not
be batched at all.
3) The bloom filter is now built for all hash joins, not just when
batching is happening. I've been a bit skeptical about the
non-batched cases, but it seems that I can get a sizable speedup
(~20-30%, depending on the selectivity of the join).
4) The code is refactored quite a bit, adding BloomFilterData instead
of just sprinkling the fields on HashJoinState or HashJoinTableData.
5) To size the bloom filter, we now use HyperLogLog couter, which we
now have in core thanks to the sorting improvements done by Peter
Geoghegan. This allows making the bloom filter much smaller when
possible.
The patch also extends the HyperLogLog API a bit (which I'll submit
to the CF independently).
There's a bunch of comments in the code, mostly with ideas about more
possible improvements.
The main piece missing in the patch (IMHO) is optimizer code making
decisions whether to enable bloom filters for the hash join, based on
cardinality estimates. And also the executor code disabling the bloom
filter if they turn inefficient. I don't think that's a major issue at
this point, though, and I think it'll be easier to do based on testing
the current patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services