Thread: Why is a hash join being used?

Why is a hash join being used?

From
Tim Jacobs
Date:
I am running the following query:

SELECT res1.x, res1.y, res1.z
FROM test t
JOIN residue_atom_coords res1 ON
        t.struct_id_1 = res1.struct_id AND
        res1.atomno IN (1,2,3,4) AND
        (res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
WHERE
t.compare_id BETWEEN 1 AND 10000;

The 'test' table is very large (~270 million rows) as is the residue_atom_coords table (~540 million rows).

The number of compare_ids I select in the 'WHERE' clause determines the join type in the following way:

t.compare_id BETWEEN 1 AND 5000;

 Nested Loop  (cost=766.52..15996963.12 rows=3316307 width=24)
   ->  Index Scan using test_pkey on test t  (cost=0.00..317.20 rows=5372 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 5000))
   ->  Bitmap Heap Scan on residue_atom_coords res1  (cost=766.52..2966.84 rows=625 width=44)
         Recheck Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <=
t.pair_1_helix_1_end)AND (atomno = ANY ('{1,2,3,4}'::integer[]))) 
         ->  Bitmap Index Scan on residue_atom_coords_pkey  (cost=0.00..766.36 rows=625 width=0)
               Index Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <=
t.pair_1_helix_1_end)AND (atomno = ANY ('{1,2,3,4}'::integer[]))) 

t.compare_id BETWEEN 1 AND 10000;

 Hash Join  (cost=16024139.91..20940899.94 rows=6633849 width=24)
   Hash Cond: (t.struct_id_1 = res1.struct_id)
   Join Filter: ((res1.seqpos >= t.pair_1_helix_1_begin) AND (res1.seqpos <= t.pair_1_helix_1_end))
   ->  Index Scan using test_pkey on test t  (cost=0.00..603.68 rows=10746 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 10000))
   ->  Hash  (cost=13357564.16..13357564.16 rows=125255660 width=44)
         ->  Seq Scan on residue_atom_coords res1  (cost=0.00..13357564.16 rows=125255660 width=44)
               Filter: (atomno = ANY ('{1,2,3,4}'::integer[]))

The nested loop join performs very quickly, whereas the hash join is incredibly slow. If I disable the hash join
temporarilythen a nested loop join is used in the second case and is the query runs much more quickly. How can I change
myconfiguration to favor the nested join in this case? Is this a bad idea? Alternatively, since I will be doing
selectionslike this many times, what indexes can be put in place to expedite the query without mucking with the query
optimizer?I've already created an index on the struct_id field of residue_atom_coords (each unique struct_id should
onlyhave a small number of rows for the residue_atom_coords table). 

Thanks in advance,
Tim



Re: Why is a hash join being used?

From
Sergey Konoplev
Date:
On Wed, Jun 20, 2012 at 1:34 AM, Tim Jacobs <tjacobs2@email.unc.edu> wrote:
> The nested loop join performs very quickly, whereas the hash join is incredibly slow. If I disable the hash join
temporarilythen a nested loop join is used in the second case and is the query runs much more quickly. How can I change
myconfiguration to favor the nested join in this case? Is this a bad idea? 

First do ANALYZE the tables and try the tests again.

If it helped check your autovacuum configuration. Look at
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#AUTOVACUUM
and the pg_stat_user_tables table (last_* and *_count fields).

If it still produces wrong plan then try to increase statistics
entries by ALTER TABLE SET STATISTICS (do not forget to ANALYZE after
doing it) or by the default_statistics_target configuration parameter.
Read more about it here
http://www.postgresql.org/docs/9.1/static/planner-stats.html.

> Alternatively, since I will be doing selections like this many times, what indexes can be put in place to expedite
thequery without mucking with the query optimizer? I've already created an index on the struct_id field of
residue_atom_coords(each unique struct_id should only have a small number of rows for the residue_atom_coords table). 

As I can see everything is okay with indexes.

>
> Thanks in advance,
> Tim
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



--
Sergey Konoplev

a database architect, software developer at PostgreSQL-Consulting.com
http://www.postgresql-consulting.com

Jabber: gray.ru@gmail.com Skype: gray-hemp Phone: +79160686204

Re: Why is a hash join being used?

From
"Kevin Grittner"
Date:
Tim Jacobs <tjacobs2@email.unc.edu> wrote:

> The nested loop join performs very quickly, whereas the hash join
> is incredibly slow. If I disable the hash join temporarily then a
> nested loop join is used in the second case and is the query runs
> much more quickly. How can I change my configuration to favor the
> nested join in this case? Is this a bad idea?

Before anyone can make solid suggestions on what you might want to
change in your configuration, they would need to know more.  Please
read this page:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

... and repost with your PostgreSQL version, your configuration
overrides, a description of your hardware, and EXPLAIN ANALYZE
output from the query (rather than just EXPLAIN output).

You might not be modeling your costs correctly, you might not be
allocating resources well, you might be on an old version without an
optimizer as smart as more recent versions, your statistics might be
out of date, or you might be running into an optimizer weakness of
some sort.

-Kevin