Home > mailing lists

Re: New style of hash join proposal - Mailing list pgsql-hackers

From	Gregory Stark
Subject	Re: New style of hash join proposal
Date	March 17, 2008 13:12:17
Msg-id	877ig1cpbd.fsf@oxford.xeocode.com Whole thread Raw
In response to	Re: New style of hash join proposal (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: New style of hash join proposal Re: New style of hash join proposal
List	pgsql-hackers

Tree view

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> Please give an example of what you're talking about that you think we
>>> can't do now.
>
>> Note that we're doing a full sequential scan of "a" even though we've already
>> finished hashing "b" and know full well which keys we'll need. If we have an
>> index on "a" and "b" is sufficiently smaller than "a", as in this case, then
>> we could do a bitmap index scan on "a" and pull out just those keys.
>
> You mean like this?
>
> regression=# explain select * from tenk1 a where unique1 in (select f1 from int4_tbl b);
>                                      QUERY PLAN                                      
> -------------------------------------------------------------------------------------
>  Nested Loop  (cost=1.06..42.52 rows=5 width=244)
>    ->  HashAggregate  (cost=1.06..1.11 rows=5 width=4)
>          ->  Seq Scan on int4_tbl b  (cost=0.00..1.05 rows=5 width=4)
>    ->  Index Scan using tenk1_unique1 on tenk1 a  (cost=0.00..8.27 rows=1 width=244)
>          Index Cond: (a.unique1 = b.f1)
> (5 rows)
>
> In the example you give, this type of plan was rejected because there
> were too many rows in the subplan (or so I suppose anyway; you might
> play around with the cost constants and see what happens).

Sort of, except using a bitmap index scan which would be precisely what allows
it to work for much larger sets of records.

It doesn't have to be an IN join either. Hash joins are used for normal joins
for even quite large sets of records. If the join is very selective then a
nested loop is fine, and if it covers most of the table then a sequential scan
might be fine. But in many cases you're dealing with a smallish percentage of
a very large table. That's what bitmap index scans are tailor made for.

Cases like:

select * from invoice join invoice_detail on (invoice_id) where invoice.quarter='Q4'

Is going to pull out thousands of invoices from the invoice table, then want
to pull out all the matching invoice_detail records from the invoice_detail
table. It'll probably be 5-10% of the invoice_detail table making a sequential
scan a big waste but enough records that a nested loop with a plain index scan
will be very slow. Worse, if it does the hash join there may be enough
invoices to force the hash join to work in batches.

It would be ideal if it could scan the invoices using an index, toss them all
in a hash, then do a bitmap index scan to pull out all the matching detail
records. If there are multiple batches it can start a whole new index scan for
the each of the batches.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!

pgsql-hackers by date:

From: Mike Aubury
Date: 17 March 2008, 13:08:58
Subject: Re: postgresql.org dns problems

From: "Dave Page"
Date: 17 March 2008, 13:13:31
Subject: Re: postgresql.org dns problems

Re: New style of hash join proposal - Mailing list pgsql-hackers

Previous

Next