Thread: Hash Join is very slooow in some cases

Hash Join is very slooow in some cases

From
"Hiroshi Inoue"
Date:
Hi all,

We are converting Oracle system to PostgreSQL.
But some queries takes very looong time.

I have tried variously and made a typical example
in PostgreSQL 6.5.2 .
It's already known ?

select count(*) from a,b where a.id1=b.id1;

returns immeidaitely and EXPLAIN shows

Aggregate  (cost=3380.24 rows=24929 width=8) ->  Hash Join  (cost=3380.24 rows=24929 width=8)       ->  Seq Scan on a
(cost=1551.41rows=27558 width=4)       ->  Hash  (cost=604.21 rows=8885 width=4)             ->  Seq Scan on b
(cost=604.21rows=8885 width=4)
 

But

select count(*) from a,b where a.id1=b.id1 and a.id2=b.id2;

takes very looong time.
EXPLAIN shows

Aggregate  (cost=3382.24 rows=8195 width=12) ->  Hash Join  (cost=3382.24 rows=8195 width=12)       ->  Seq Scan on a
(cost=1551.41rows=27558 width=6)       ->  Hash  (cost=604.21 rows=8885 width=6)             ->  Seq Scan on b
(cost=604.21rows=8885 width=6)
 

Query plans are almost same.
Why is there a such difference ?

I examined an output by EXPLAIN VERBOSE and found that
the 1st query uses id1 as its hashkey and 2nd query uses id2
as its hashkey.

So I tried the following query.

select count(*) from a,b where a.id2=b.id2 and a.id1=b.id1;

This returns immediately as I expected.

id1-s are type int4 and their disbursions are 0.00010181/ 
0.000203409.
id2-s are type int2 and their disbursions are 0.523526/
0.328712 .
Is id2 suitable for a hashkey ?

I can't try it in current now,sorry.
But seems the following code in createplan.c is unchanged.
/* Now the righthand op of the sole hashclause is the inner hash key. */       innerhashkey =
get_rightop(lfirst(hashclauses)); 
 

Comments ?

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] Hash Join is very slooow in some cases

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> select count(*) from a,b where a.id1=b.id1;
> returns immeidaitely ...
> But
> select count(*) from a,b where a.id1=b.id1 and a.id2=b.id2;
> takes very looong time.
> I examined an output by EXPLAIN VERBOSE and found that
> the 1st query uses id1 as its hashkey and 2nd query uses id2
> as its hashkey.

Yes, and since id2 has terrible disbursion, most of the hashtable
entries end up in a small number of hash buckets, resulting in
an unexpectedly large number of comparisons done for each outer
tuple.  I've seen this effect before.

I have a TODO item to make the optimizer pay attention to disbursion
when estimating the cost of a hashjoin.  That would cause it to make
the right choice of key in this example.  Not done yet though :-(.
Feel free to jump in if you need it today...
        regards, tom lane