Thread: WIP: Hash Join-Filter Pruning using Bloom Filters

WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Jonah H. Harris"
Date:
All,

Attached is an initial patch I've been playing with which uses Bloom
filters to reduce unnecessary processing of outer tuples in hash
joins.  In short, this works by creating a Bloom filter, adding all
relevant tuples for the inner relation, and querying the filter (for
existence) when retrieving tuples from the outer relation.  This
avoids unnecessary tuple movement and bucket searches for matches we
already know can't exist.  Currently it works only for JOIN_INNER, but
could be modified to optimize anti/semi joins as well.  Similarly, I
created a GUC to enable pruning, named bloom_pruning.

Rather than performing k hash functions, this implementation simply
sets a bit based on the already-computed hash value.  I wanted to send
this around for reviews and comments before working on it further.  As
this isn't overly intrusive, if someone can commit to reviewing and
providing input, I'll commit to having this ready for 8.4.

--
Jonah H. Harris, Senior DBA
myYearbook.com

Attachment

Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Hannes Eder"
Date:
On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Similarly, I
> created a GUC to enable pruning, named bloom_pruning.

I guess calls to bloom_filter_XXX should be surrounded by "if
(bloom_pruning) ..." or a similar construct, i.e. make use of the GUC
variable bloom_pruning in the rest of the code.

Can you provide some figures on the performance impact of the bloom filter?

Best,
-Hannes


Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Jonah H. Harris"
Date:
On Sun, Nov 2, 2008 at 5:36 PM, Hannes Eder <hannes@hanneseder.net> wrote:
> On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> Similarly, I
>> created a GUC to enable pruning, named bloom_pruning.
>
> I guess calls to bloom_filter_XXX should be surrounded by "if
> (bloom_pruning) ..." or a similar construct, i.e. make use of the GUC
> variable bloom_pruning in the rest of the code.

It's effective as-is for a preliminary patch.  The GUC code is the
least of my worries.

> Can you provide some figures on the performance impact of the bloom filter?

It depends on the queries.  I've been trying to find a good suite of
hash join tests... but not much luck.

CREATE TABLE t1 (id INTEGER PRIMARY KEY, x INTEGER);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, x INTEGER);
INSERT INTO t1 (SELECT ge, ge % 100 FROM generate_series(1, 1000000) ge);
INSERT INTO t2 (SELECT * FROM t1);
VACUUM ANALYZE;
SELECT COUNT(*) FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
SET bloom_pruning TO off;
EXPLAIN
SELECT COUNT(*) FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
SELECT COUNT(*) FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
EXPLAIN
SELECT * FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
SELECT * FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
SET bloom_pruning TO on;
\timing
SELECT COUNT(*) FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
EXPLAIN
SELECT * FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing
SELECT * FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
\timing

-- Without Pruning
Time: 1142.843 ms
Time: 1567.355 ms

-- With Pruning
Time: 891.557 ms
Time: 1269.634 ms



-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
Greg Stark
Date:
I think the k hash functions are actually normally just different  
slices of bits taken from one actual hash function anyways so it  
sounds like you've done the right thing.

This sounds most interesting for multibatch hash joins if you could  
build a bloom filter for the future batches to avoid having to spool  
tuples that will never match.

greg

On 2 Nov 2008, at 09:49 PM, "Jonah H. Harris" <jonah.harris@gmail.com>  
wrote:

> All,
>
> Attached is an initial patch I've been playing with which uses Bloom
> filters to reduce unnecessary processing of outer tuples in hash
> joins.  In short, this works by creating a Bloom filter, adding all
> relevant tuples for the inner relation, and querying the filter (for
> existence) when retrieving tuples from the outer relation.  This
> avoids unnecessary tuple movement and bucket searches for matches we
> already know can't exist.  Currently it works only for JOIN_INNER, but
> could be modified to optimize anti/semi joins as well.  Similarly, I
> created a GUC to enable pruning, named bloom_pruning.
>
> Rather than performing k hash functions, this implementation simply
> sets a bit based on the already-computed hash value.  I wanted to send
> this around for reviews and comments before working on it further.  As
> this isn't overly intrusive, if someone can commit to reviewing and
> providing input, I'll commit to having this ready for 8.4.
>
> -- 
> Jonah H. Harris, Senior DBA
> myYearbook.com
> <bloompruning_v1.patch>
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Lawrence, Ramon"
Date:
> -----Original Message-----
> On Sun, Nov 2, 2008 at 10:49 PM, Jonah H. Harris
> <jonah.harris@gmail.com> wrote:
> It's effective as-is for a preliminary patch.  The GUC code is the
> least of my worries.
>
> > Can you provide some figures on the performance impact of the bloom
> filter?

I have tested the Bloom filter patch.  It compiles cleanly against HEAD.

As indicated, the performance improvements for hash join are good,
especially when the build table is filtered with a selection condition.
Performance improvements range from a couple of percent up to 20% for
multi-batch joins.  Note that the bloom filter will slightly slow
queries where the filter has no benefit.

I have not looked at the actual implementation of the Bloom filter, but
will proceed to do that next.  One issue to be considered is how the
space used for the bloom filter is related to the work_mem allocated to
the join. That is, does the bloom filter consume some of the work_mem
space or is it treated as additional memory allocated to the join.

Experimental Results (in ms)
Query    Time With Filter    Time Without Filter    % Faster
1    2,166            2,648                18%
2    1,665            1,772                6%
3    5,308            6,374                17%
4    63,690        75,715            15%
5    87,864        81,552            -8%
6    12,492        11,696            -7%


Query 1: (provided by patch author) = 190,000 results
CREATE TABLE t1 (id INTEGER PRIMARY KEY, x INTEGER); CREATE TABLE t2 (id
INTEGER PRIMARY KEY, x INTEGER); INSERT INTO t1 (SELECT ge, ge % 100
FROM generate_series(1, 1000000) ge); INSERT INTO t2 (SELECT * FROM t1);
VACUUM ANALYZE;
SELECT COUNT(*) FROM t1, t2WHERE t1.id = t2.id      AND t1.x < 30      AND t2.x > 10;
The next five queries are on the TPCH 1GB data set.

Query 2: (in-memory join with restrictive build filter) = 56,110 results
select count(*) from customer, orders where c_custkey = o_custkey and
c_nationkey = 10;

Query 3: (multi-batch join with less restrictive build filter) = 841,767
results
select count(*) from customer, orders where c_custkey = o_custkey and
c_nationkey > 10;

Query 4: (large multi-batch join) = 3,215,402 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey and
o_totalprice > 150000;

Query 5: (large multi-batch join with no filter - hard case for BLOOM
filter) = 6,000,003 results
select count(*) from orders, lineitem where o_orderkey = l_orderkey;

Query 6: (large probe, in-memory build with no filter - hard case for
BLOOM filter) = 6,000,003 results
select count(*) from supplier, lineitem where s_suppkey = l_suppkey;

All tests were run 4 times and the times were averaged.  The initial run
time was discarded to deal with buffering issues.

--
Dr. Ramon Lawrence
University of British Columbia Okanagan


Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Jonah H. Harris"
Date:
On Mon, Nov 10, 2008 at 2:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:
> I have tested the Bloom filter patch.  It compiles cleanly against HEAD.

Thank you for testing this!

> As indicated, the performance improvements for hash join are good,
> especially when the build table is filtered with a selection condition.
> Performance improvements range from a couple of percent up to 20% for
> multi-batch joins.  Note that the bloom filter will slightly slow
> queries where the filter has no benefit.

I have a new patch which does not create a bloom filter unless it sees
that the hash join is going to batch.  I'll send it along later
tonight.

> I have not looked at the actual implementation of the Bloom filter, but
> will proceed to do that next.  One issue to be considered is how the
> space used for the bloom filter is related to the work_mem allocated to
> the join. That is, does the bloom filter consume some of the work_mem
> space or is it treated as additional memory allocated to the join.

Currently it's additional space not accounted for by work_mem.
Additionally, it's a good amount more space than is required.  This is
fixed in the newer patch as well.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
"Lawrence, Ramon"
Date:
> -----Original Message-----
> From: Jonah H. Harris [mailto:jonah.harris@gmail.com]
> I have a new patch which does not create a bloom filter unless it sees
> that the hash join is going to batch.  I'll send it along later
> tonight.
>
> Currently it's additional space not accounted for by work_mem.
> Additionally, it's a good amount more space than is required.  This is
> fixed in the newer patch as well.

I think that the bloom filter will also improve the performance of
in-memory joins as well.  The basic trade-off in that case is the time
to probe multiple entries in a bucket in the hash table (which currently
defaults to 10) versus the cost of building/probing the bloom filter.
The bloom filter should win in this case as long as there are tuples in
the probe relation that cannot find a match in the build relation.

My suggestion would be to keep it enabled for all joins.  If possible,
it would be valuable to try to estimate what percentage of tuples that
the bloom filter filters out.  A simple estimate would be to determine
the percentage of the build table that is involved in the join.  For
instance, the good test cases had between 40-90% of the customer
relation filtered out and a corresponding percentage of the probe
relation, lineitem, was filtered out by the bloom filter.  The bad case
used all of customer, so the bloom filter stopped no probe tuples.

It would be useful for testing to track the number and percentage of
probe tuples that the bloom filter prevents a probe for.  You may
further record which of these tuples were in the in-memory batch and
on-disk batches.  These statistics may help you get the bloom filter
optimized for all cases.

--
Ramon Lawrence




Re: WIP: Hash Join-Filter Pruning using Bloom Filters

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> I have a new patch which does not create a bloom filter unless it sees
> that the hash join is going to batch.  I'll send it along later
> tonight.

So ... where's the updated patch?
        regards, tom lane