Home > mailing lists

Re: WIP: bloom filter in Hash Joins with batches - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: WIP: bloom filter in Hash Joins with batches
Date	January 10, 2016 19:12:57
Msg-id	5692AD27.6010300@2ndquadrant.com Whole thread Raw
In response to	Re: WIP: bloom filter in Hash Joins with batches (Peter Geoghegan <pg@heroku.com>)
List	pgsql-hackers

Tree view

Hi,

On 01/10/2016 01:08 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, this seems to bring reasonable speedup, as long as the selectivity is
>> below 50%, and the data set is sufficiently large.
>
> What about semijoins? Apparently they can use bloom filters
> particularly effectively. Have you considered them as a special case?

You mean to handle them in a special way in the code, or just to perform 
benchmark semijoins (and not just regular joins)?

> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:
>
> regression=# set enable_mergejoin = off;
> SET
> regression=# explain analyze select * from tenk1 o join tenk2 t on
> o.twenty = t.twenty and t.hundred = o.hundred;
>                                                         QUERY PLAN
> ──────────────────────────────────────────────────────────────────────
>   Hash Join  (cost=595.00..4103.00 rows=50000 width=488) (actual
> time=12.086..1026.194 rows=1000000 loops=1)
>     Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
>     ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=10000 width=244)
> (actual time=0.017..4.212 rows=10000 loops=1)
>     ->  Hash  (cost=445.00..445.00 rows=10000 width=244) (actual
> time=12.023..12.023 rows=10000 loops=1)
>           Buckets: 16384  Batches: 1  Memory Usage: 2824kB
>           ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=10000
> width=244) (actual time=0.006..3.453 rows=10000 loops=1)
>   Planning time: 0.567 ms
>   Execution time: 1116.094 ms
> (8 rows)
>
> (Note that while the optimizer has a slight preference for a merge
> join in this case, the plan I show here is a bit faster on my
> machine).

The patch I posted actually does not build bloom filter on the values 
directly, but on the hashvalue we already use. So it handles hashjoins 
with arbitrary number of attributes just fine.

Or perhaps you're thinking about some special optimization that might 
improve such cases (I can't think of any)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Tom Lane
Date: 10 January 2016, 18:25:48
Subject: Why doesn't src/backend/port/win32/socket.c implement bind()?

From: Steve Singer
Date: 10 January 2016, 19:57:27
Subject: Re: pglogical - logical replication contrib module

Re: WIP: bloom filter in Hash Joins with batches - Mailing list pgsql-hackers

Previous

Next