Re: Why hash join instead of nested loop? - Mailing list pgsql-performance

Rhett Garber <rhettg@gmail.com> writes:
> This is postgres 7.4.1
> All the rows involved are integers.

Hmph.  There is something really strange going on here.  I tried to
duplicate your problem in 7.4.*, thus:

regression=# create table rtmessagestate(id int, f1 char(6));
CREATE TABLE
regression=# insert into rtmessagestate values(1,'z');
INSERT 559399 1
regression=# insert into rtmessagestate values(2,'z');
INSERT 559400 1
regression=# insert into rtmessagestate values(3,'z');
INSERT 559401 1
regression=# insert into rtmessagestate values(4,'z');
INSERT 559402 1
regression=# insert into rtmessagestate values(5,'z');
INSERT 559403 1
regression=# vacuum analyze rtmessagestate;
VACUUM
regression=# create table connection(connection_registry_id int, obj1 int, obj2 int);
CREATE TABLE
regression=# create index connection_regid_obj1_index on connection(connection_registry_id,obj1);
CREATE INDEX
regression=# insert into  connection values(40105,73582,3);
INSERT 559407 1
regression=# explain analyze select rtmessagestate.* from rtmessagestate,connection where (connection_registry_id =
40105)AND (obj1  = 73582) and id = obj2; 
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=4.83..5.91 rows=1 width=14) (actual time=0.498..0.544 rows=1 loops=1)
   Hash Cond: ("outer".id = "inner".obj2)
   ->  Seq Scan on rtmessagestate  (cost=0.00..1.05 rows=5 width=14) (actual time=0.030..0.072 rows=5 loops=1)
   ->  Hash  (cost=4.83..4.83 rows=1 width=4) (actual time=0.305..0.305 rows=0 loops=1)
         ->  Index Scan using connection_regid_obj1_index on connection  (cost=0.00..4.83 rows=1 width=4) (actual
time=0.236..0.264rows=1 loops=1) 
               Index Cond: ((connection_registry_id = 40105) AND (obj1 = 73582))
 Total runtime: 1.119 ms
(7 rows)

This duplicates your example as to plan and row counts:

> Hash Join  (cost=5.96..7.04 rows=1 width=14) (actual
> time=10.591..10.609 rows=1 loops=1)
> Hash Cond: ("outer".id = "inner".obj2)
> ->  Seq Scan on rtmessagestate  (cost=0.00..1.05 rows=5 width=14)
> (actual time=0.011..0.022 rows=5 loops=1)
> ->  Hash  (cost=5.96..5.96 rows=1 width=4) (actual
> time=0.109..0.109 rows=0 loops=1)
> ->  Index Scan using connection_regid_obj1_index on
> connection  (cost=0.00..5.96 rows=1 width=4) (actual time=0.070..0.076
> rows=1 loops=1)
> Index Cond: ((connection_registry_id = 40105) AND (obj1
> = 73582)) Total runtime: 11.536 ms
> (7 rows)

My machine is considerably slower than yours, to judge by the actual
elapsed times in the scan nodes ... so why is it beating the pants
off yours in the join step?

Can you try the above script verbatim in a scratch database and see
what you get?  (Note it's worth trying the explain two or three
times to be sure the values have settled out.)

I'm testing a fairly recent 7.4-branch build (7.4.8 plus), so that's one
possible reason for the discrepancy between my results and yours, but I
do not see anything in the 7.4 CVS logs that looks like it's related to
hashjoin performance.

I'd be interested to see results from other people using 7.4.* too.

            regards, tom lane

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Finding bottleneck
Next
From: "Steinar H. Gunderson"
Date:
Subject: Re: Why hash join instead of nested loop?