Why is a hash join being used? - Mailing list pgsql-performance
From | Tim Jacobs |
---|---|
Subject | Why is a hash join being used? |
Date | |
Msg-id | 9D1AF2AE-9605-41B5-8BCC-177B5EF6F1A5@gmail.com |
Responses |
Re: Why is a hash join being used?
(Sergey Konoplev <sergey.konoplev@postgresql-consulting.com>)
Re: Why is a hash join being used? ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
List | pgsql-performance |
I am running the following query:

    SELECT res1.x, res1.y, res1.z
    FROM test t
    JOIN residue_atom_coords res1
      ON t.struct_id_1 = res1.struct_id
     AND res1.atomno IN (1,2,3,4)
     AND (res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
    WHERE t.compare_id BETWEEN 1 AND 10000;

The 'test' table is very large (~270 million rows) as is the residue_atom_coords table (~540 million rows). The number of compare_ids I select in the 'WHERE' clause determines the join type in the following way:

t.compare_id BETWEEN 1 AND 5000;

    Nested Loop  (cost=766.52..15996963.12 rows=3316307 width=24)
      ->  Index Scan using test_pkey on test t  (cost=0.00..317.20 rows=5372 width=24)
            Index Cond: ((compare_id >= 1) AND (compare_id <= 5000))
      ->  Bitmap Heap Scan on residue_atom_coords res1  (cost=766.52..2966.84 rows=625 width=44)
            Recheck Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <= t.pair_1_helix_1_end) AND (atomno = ANY ('{1,2,3,4}'::integer[])))
            ->  Bitmap Index Scan on residue_atom_coords_pkey  (cost=0.00..766.36 rows=625 width=0)
                  Index Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <= t.pair_1_helix_1_end) AND (atomno = ANY ('{1,2,3,4}'::integer[])))

t.compare_id BETWEEN 1 AND 10000;

    Hash Join  (cost=16024139.91..20940899.94 rows=6633849 width=24)
      Hash Cond: (t.struct_id_1 = res1.struct_id)
      Join Filter: ((res1.seqpos >= t.pair_1_helix_1_begin) AND (res1.seqpos <= t.pair_1_helix_1_end))
      ->  Index Scan using test_pkey on test t  (cost=0.00..603.68 rows=10746 width=24)
            Index Cond: ((compare_id >= 1) AND (compare_id <= 10000))
      ->  Hash  (cost=13357564.16..13357564.16 rows=125255660 width=44)
            ->  Seq Scan on residue_atom_coords res1  (cost=0.00..13357564.16 rows=125255660 width=44)
                  Filter: (atomno = ANY ('{1,2,3,4}'::integer[]))

The nested loop join performs very quickly, whereas the hash join is incredibly slow.
If I disable the hash join temporarily, then a nested loop join is used in the second case and the query runs much more quickly. How can I change my configuration to favor the nested loop join in this case? Is this a bad idea? Alternatively, since I will be doing selections like this many times, what indexes can be put in place to expedite the query without mucking with the query optimizer? I've already created an index on the struct_id field of residue_atom_coords (each unique struct_id should only have a small number of rows in the residue_atom_coords table).

Thanks in advance,
Tim
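For reference, a minimal sketch of what "disabling the hash join temporarily" can look like, assuming PostgreSQL's standard enable_hashjoin planner setting is what gets toggled. SET LOCAL limits the change to the current transaction, so the rest of the session and the server configuration are left alone. The query is the one from this message; nothing about the resulting plan or timing is implied here.

```sql
BEGIN;

-- Assumption: enable_hashjoin is the setting being toggled.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_hashjoin = off;

-- Show the plan the optimizer picks with hash joins discouraged.
EXPLAIN
SELECT res1.x, res1.y, res1.z
FROM test t
JOIN residue_atom_coords res1
  ON t.struct_id_1 = res1.struct_id
 AND res1.atomno IN (1,2,3,4)
 AND (res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
WHERE t.compare_id BETWEEN 1 AND 10000;

COMMIT;
```

Setting enable_hashjoin to off does not forbid hash joins outright; it only makes the planner treat them as very expensive, so it is better suited to diagnosis than to a permanent configuration change.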