Home > mailing lists

Why is a hash join being used? - Mailing list pgsql-performance

From	Tim Jacobs
Subject	Why is a hash join being used?
Date	June 20, 2012 04:54:22
Msg-id	9D1AF2AE-9605-41B5-8BCC-177B5EF6F1A5@gmail.com Whole thread Raw
Responses	Re: Why is a hash join being used? Re: Why is a hash join being used?
List	pgsql-performance

Tree view

I am running the following query:

SELECT res1.x, res1.y, res1.z
FROM test t
JOIN residue_atom_coords res1 ON
        t.struct_id_1 = res1.struct_id AND
        res1.atomno IN (1,2,3,4) AND
        (res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
WHERE
t.compare_id BETWEEN 1 AND 10000;

The 'test' table is very large (~270 million rows) as is the residue_atom_coords table (~540 million rows).

The number of compare_ids I select in the 'WHERE' clause determines the join type in the following way:

t.compare_id BETWEEN 1 AND 5000;

 Nested Loop  (cost=766.52..15996963.12 rows=3316307 width=24)
   ->  Index Scan using test_pkey on test t  (cost=0.00..317.20 rows=5372 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 5000))
   ->  Bitmap Heap Scan on residue_atom_coords res1  (cost=766.52..2966.84 rows=625 width=44)
         Recheck Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <=
t.pair_1_helix_1_end)AND (atomno = ANY ('{1,2,3,4}'::integer[]))) 
         ->  Bitmap Index Scan on residue_atom_coords_pkey  (cost=0.00..766.36 rows=625 width=0)
               Index Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <=
t.pair_1_helix_1_end)AND (atomno = ANY ('{1,2,3,4}'::integer[]))) 

t.compare_id BETWEEN 1 AND 10000;

 Hash Join  (cost=16024139.91..20940899.94 rows=6633849 width=24)
   Hash Cond: (t.struct_id_1 = res1.struct_id)
   Join Filter: ((res1.seqpos >= t.pair_1_helix_1_begin) AND (res1.seqpos <= t.pair_1_helix_1_end))
   ->  Index Scan using test_pkey on test t  (cost=0.00..603.68 rows=10746 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 10000))
   ->  Hash  (cost=13357564.16..13357564.16 rows=125255660 width=44)
         ->  Seq Scan on residue_atom_coords res1  (cost=0.00..13357564.16 rows=125255660 width=44)
               Filter: (atomno = ANY ('{1,2,3,4}'::integer[]))

The nested loop join performs very quickly, whereas the hash join is incredibly slow. If I disable the hash join
temporarilythen a nested loop join is used in the second case and is the query runs much more quickly. How can I change
myconfiguration to favor the nested join in this case? Is this a bad idea? Alternatively, since I will be doing
selectionslike this many times, what indexes can be put in place to expedite the query without mucking with the query
optimizer?I've already created an index on the struct_id field of residue_atom_coords (each unique struct_id should
onlyhave a small number of rows for the residue_atom_coords table). 

Thanks in advance,
Tim

pgsql-performance by date:

From: Eyal Wilde
Date: 20 June 2012, 04:54:10
Subject: index-only scan is missing the INCLUDE feature

From: "Strange, John W"
Date: 20 June 2012, 04:54:36
Subject: pgbouncer - massive overhead?

Why is a hash join being used? - Mailing list pgsql-performance

Previous

Next