Re: New style of hash join proposal - Mailing list pgsql-hackers

From Gregory Stark
Subject Re: New style of hash join proposal
Date
Msg-id 877ig1cpbd.fsf@oxford.xeocode.com
Whole thread Raw
In response to Re: New style of hash join proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: New style of hash join proposal
Re: New style of hash join proposal
List pgsql-hackers
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> Please give an example of what you're talking about that you think we
>>> can't do now.
>
>> Note that we're doing a full sequential scan of "a" even though we've already
>> finished hashing "b" and know full well which keys we'll need. If we have an
>> index on "a" and "b" is sufficiently smaller than "a", as in this case, then
>> we could do a bitmap index scan on "a" and pull out just those keys.
>
> You mean like this?
>
> regression=# explain select * from tenk1 a where unique1 in (select f1 from int4_tbl b);
>                                      QUERY PLAN                                      
> -------------------------------------------------------------------------------------
>  Nested Loop  (cost=1.06..42.52 rows=5 width=244)
>    ->  HashAggregate  (cost=1.06..1.11 rows=5 width=4)
>          ->  Seq Scan on int4_tbl b  (cost=0.00..1.05 rows=5 width=4)
>    ->  Index Scan using tenk1_unique1 on tenk1 a  (cost=0.00..8.27 rows=1 width=244)
>          Index Cond: (a.unique1 = b.f1)
> (5 rows)
>
> In the example you give, this type of plan was rejected because there
> were too many rows in the subplan (or so I suppose anyway; you might
> play around with the cost constants and see what happens).

Sort of, except using a bitmap index scan which would be precisely what allows
it to work for much larger sets of records.

It doesn't have to be an IN join either. Hash joins are used for normal joins
for even quite large sets of records. If the join is very selective then a
nested loop is fine, and if it covers most of the table then a sequential scan
might be fine. But in many cases you're dealing with a smallish percentage of
a very large table. That's what bitmap index scans are tailor made for.

Cases like:

select * from invoice join invoice_detail on (invoice_id) where invoice.quarter='Q4'

Is going to pull out thousands of invoices from the invoice table, then want
to pull out all the matching invoice_detail records from the invoice_detail
table. It'll probably be 5-10% of the invoice_detail table making a sequential
scan a big waste but enough records that a nested loop with a plain index scan
will be very slow. Worse, if it does the hash join there may be enough
invoices to force the hash join to work in batches.

It would be ideal if it could scan the invoices using an index, toss them all
in a hash, then do a bitmap index scan to pull out all the matching detail
records. If there are multiple batches it can start a whole new index scan for
the each of the batches.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


pgsql-hackers by date:

Previous
From: Mike Aubury
Date:
Subject: Re: postgresql.org dns problems
Next
From: "Dave Page"
Date:
Subject: Re: postgresql.org dns problems