Thread: Hash Join is very slooow in some cases
Hi all, We are converting Oracle system to PostgreSQL. But some queries takes very looong time. I have tried variously and made a typical example in PostgreSQL 6.5.2 . It's already known ? select count(*) from a,b where a.id1=b.id1; returns immeidaitely and EXPLAIN shows Aggregate (cost=3380.24 rows=24929 width=8) -> Hash Join (cost=3380.24 rows=24929 width=8) -> Seq Scan on a (cost=1551.41rows=27558 width=4) -> Hash (cost=604.21 rows=8885 width=4) -> Seq Scan on b (cost=604.21rows=8885 width=4) But select count(*) from a,b where a.id1=b.id1 and a.id2=b.id2; takes very looong time. EXPLAIN shows Aggregate (cost=3382.24 rows=8195 width=12) -> Hash Join (cost=3382.24 rows=8195 width=12) -> Seq Scan on a (cost=1551.41rows=27558 width=6) -> Hash (cost=604.21 rows=8885 width=6) -> Seq Scan on b (cost=604.21rows=8885 width=6) Query plans are almost same. Why is there a such difference ? I examined an output by EXPLAIN VERBOSE and found that the 1st query uses id1 as its hashkey and 2nd query uses id2 as its hashkey. So I tried the following query. select count(*) from a,b where a.id2=b.id2 and a.id1=b.id1; This returns immediately as I expected. id1-s are type int4 and their disbursions are 0.00010181/ 0.000203409. id2-s are type int2 and their disbursions are 0.523526/ 0.328712 . Is id2 suitable for a hashkey ? I can't try it in current now,sorry. But seems the following code in createplan.c is unchanged. /* Now the righthand op of the sole hashclause is the inner hash key. */ innerhashkey = get_rightop(lfirst(hashclauses)); Comments ? Hiroshi Inoue Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > select count(*) from a,b where a.id1=b.id1; > returns immeidaitely ... > But > select count(*) from a,b where a.id1=b.id1 and a.id2=b.id2; > takes very looong time. > I examined an output by EXPLAIN VERBOSE and found that > the 1st query uses id1 as its hashkey and 2nd query uses id2 > as its hashkey. Yes, and since id2 has terrible disbursion, most of the hashtable entries end up in a small number of hash buckets, resulting in an unexpectedly large number of comparisons done for each outer tuple. I've seen this effect before. I have a TODO item to make the optimizer pay attention to disbursion when estimating the cost of a hashjoin. That would cause it to make the right choice of key in this example. Not done yet though :-(. Feel free to jump in if you need it today... regards, tom lane