Thread: Hash Join not using hashed index?

Hash Join not using hashed index?

From
Ang Chin Han
Date:
I'm using Postgresql 7.02.

======================================================================
# explain select city.name, country.name from country, city
where city.country_id = country.country_id;
NOTICE:  QUERY PLAN:
Hash Join  (cost=8.85..16.76 rows=75 width=18) ->  Seq Scan on city  (cost=0.00..1.75 rows=75 width=16) ->  Hash
(cost=5.53..5.53rows=253 width=2)       ->  Seq Scan on country  (cost=0.00..5.53 rows=253 width=2)         
 

EXPLAIN 
# create index country_id_idx on country using hash (country_id);
CREATE
# vacuum analyze;
VACUUM
# explain select city.name, country.name from country, city
where city.country_id = country.country_id;
NOTICE:  QUERY PLAN:
Hash Join  (cost=8.85..16.76 rows=75 width=18) ->  Seq Scan on city  (cost=0.00..1.75 rows=75 width=16) ->  Hash
(cost=5.53..5.53rows=253 width=2)       ->  Seq Scan on country  (cost=0.00..5.53 rows=253 width=2)
 
EXPLAIN
======================================================================

Why doesn't PostgreSQL use country_id_idx, but rather rehashing 
country_id?


Re: Hash Join not using hashed index?

From
Tom Lane
Date:
Hash joins don't have anything to do with hash indexes.

A hash join is a join that makes use of a temporary hashtable
built on-the-fly *in memory* for that join.

The planner could choose to use an indexscan on a hash index
as an input for the join, but it'd only be likely to do so
if there is a restriction clause matching the index.  In your
example you have only a join WHERE clause.

Plain btree indexes on city.country_id and country.country_id
might work better --- at least they'd offer the option of
a merge join without doing explicit sort.
        regards, tom lane


Re: Hash Join not using hashed index?

From
Tom Lane
Date:
Ang Chin Han <angch@pollcities.com> writes:
>> The planner could choose to use an indexscan on a hash index
>> as an input for the join, but it'd only be likely to do so
>> if there is a restriction clause matching the index.  In your
>> example you have only a join WHERE clause.

> Well, in my original query, there was, but the plan's the same.
> Probably the clause wasn't restrictive enough (" and region < n").

If it was like that then a hash index wouldn't have been applicable
anyway; hashes are only good for strict equality checks.  If you want
something that can do ordering checks you need a btree index.

(There are good reasons why btree is the default index type ;-))

> Original cost est:
> Hash Join  (cost=8.85..16.76 rows=75 width=18)
>   -> Seq Scan on city  (cost=0.00..1.75 rows=75 width=16) 
>   -> Hash  (cost=5.53..5.53 rows=253 width=2)
>        -> Seq Scan on country  (cost=0.00..5.53 rows=253 width=2) 

> I guess the problem is that country-city is a one-to-many relation,
> BUT I've more countries than cities (note the # of rows above), thus
> throwing the planner off...

Off what?  This looks like a pretty reasonable plan to me, given the
fairly small table sizes.  Do you have evidence that another plan
type would be quicker for this problem?
        regards, tom lane


Re: Hash Join not using hashed index?

From
Tom Lane
Date:
Ang Chin Han <angch@pollcities.com> writes:
> No evidence, but I was hoping that having a prehashed country_id
> would speed things up a bit, since the seq scan on country could
> be redundant, requring only a seq scan on city and a index (hash)
> lookup on country.

> Or maybe this is a related question (just curious):

> pintoo=# explain select country_id from country order by country_id;
> Sort  (cost=15.63..15.63 rows=253 width=2)
>   -> Seq Scan on country  (cost=0.00..5.53 rows=253 width=2)

> If there is already in b-tree index on country_id, why bother
> re-sorting it, when it could be output'd by traversing the tree?

In fact the planner does consider both of those alternatives (at
least in 7.0), but it estimates that they are *slower* than doing
it in what appears to be the hard way.  Accessing an index is not
particularly cheap.

You can investigate what the planner thinks of these different
alternatives using the rather crude tool of disabling particular
scan types.  For example:

regression=# create table country (country_id int2);
CREATE
regression=# create index country_id_index on country (country_id);
CREATE
regression=# explain select country_id from country order by country_id;
NOTICE:  QUERY PLAN:

Index Scan using country_id_index on country  (cost=0.00..60.00 rows=1000 width=2)

(Of course, since I haven't vacuumed this newly-created table, these
cost numbers are bogus; they are based on default assumptions about
table size, which are 1000 rows in 10 disk pages IIRC.  With those
particular numbers an indexscan does look the cheapest.  Anyway,
we're coming to the point now:)

regression=# set enable_indexscan = off;
SET VARIABLE
regression=# explain select country_id from country order by country_id;
NOTICE:  QUERY PLAN:

Sort  (cost=69.83..69.83 rows=1000 width=2) ->  Seq Scan on country  (cost=0.00..20.00 rows=1000 width=2)

EXPLAIN

The planner chooses the indexscan here because 60.00 < 69.83, so it
thinks the indexscan is faster.  But different values for the table
size and number of rows can easily drive it to make the other choice.

You could experiment with changing enable_indexscan and enable_seqscan
to see what the planner's estimates are for your tables and then
measure what the actual runtime is like for each plan type.

From my point of view, it's only a bug if the ratio of the cost
estimates is radically out of line with the ratio of the actual
runtimes.
        regards, tom lane


Re: Hash Join not using hashed index?

From
Ang Chin Han
Date:
On Wed, Jun 28, 2000 at 10:56:17AM -0400, Tom Lane wrote:
> Ang Chin Han <angch@pollcities.com> writes:
> If it was like that then a hash index wouldn't have been applicable
> anyway; hashes are only good for strict equality checks.  If you want
> something that can do ordering checks you need a btree index.
> 
> (There are good reasons why btree is the default index type ;-))

There _was_ a btree index, before I added the extra hash index:

pintoo=# \dcountry_pkeyIndex "country_pkey"Attribute  |   Type
------------+----------country_id | smallint
unique btree (primary key)

> > Original cost est:
> > Hash Join  (cost=8.85..16.76 rows=75 width=18)
> >   -> Seq Scan on city  (cost=0.00..1.75 rows=75 width=16) 
> >   -> Hash  (cost=5.53..5.53 rows=253 width=2)
> >        -> Seq Scan on country  (cost=0.00..5.53 rows=253 width=2) 
> 
> > I guess the problem is that country-city is a one-to-many relation,
> > BUT I've more countries than cities (note the # of rows above), thus
> > throwing the planner off...
> 
> Off what?  This looks like a pretty reasonable plan to me, given the
> fairly small table sizes.  Do you have evidence that another plan
> type would be quicker for this problem?

No evidence, but I was hoping that having a prehashed country_id
would speed things up a bit, since the seq scan on country could
be redundant, requring only a seq scan on city and a index (hash)
lookup on country.


Or maybe this is a related question (just curious):

pintoo=# explain select country_id from country order by country_id;
NOTICE:  QUERY PLAN:
Sort  (cost=15.63..15.63 rows=253 width=2) ->  Seq Scan on country  (cost=0.00..5.53 rows=253 width=2)

pintoo=# explain select name from country order by name;
NOTICE:  QUERY PLAN:
Sort  (cost=15.63..15.63 rows=253 width=12) ->  Seq Scan on country  (cost=0.00..5.53 rows=253 width=12)

If there is already in b-tree index on country_id, why bother
re-sorting it, when it could be output'd by traversing the tree?
Comparing with an unindexed column, we can see that the index
is not used at all.


Re: Hash Join not using hashed index?

From
Ang Chin Han
Date:
On Wed, Jun 28, 2000 at 03:00:04AM -0400, Tom Lane wrote:
> Hash joins don't have anything to do with hash indexes.

> A hash join is a join that makes use of a temporary hashtable
> built on-the-fly *in memory* for that join.

Oh, I see.

> The planner could choose to use an indexscan on a hash index
> as an input for the join, but it'd only be likely to do so
> if there is a restriction clause matching the index.  In your
> example you have only a join WHERE clause.

Well, in my original query, there was, but the plan's the same.
Probably the clause wasn't restrictive enough (" and region < n").

Original cost est:
Hash Join  (cost=8.85..16.76 rows=75 width=18) ->  Seq Scan on city  (cost=0.00..1.75 rows=75 width=16)  ->  Hash
(cost=5.53..5.53rows=253 width=2)       ->  Seq Scan on country  (cost=0.00..5.53 rows=253 width=2) 
 

I guess the problem is that country-city is a one-to-many relation,
BUT I've more countries than cities (note the # of rows above), thus
throwing the planner off...

OTOH, what's bugging me is that Postgresql could have used
pre-generated hash index rather rebuilding it on the fly again.

> Plain btree indexes on city.country_id and country.country_id
> might work better --- at least they'd offer the option of
> a merge join without doing explicit sort.

I tried, and it did worse.


Hmmm... I think I'm better off creating a temporary table
to store the results, since the table is seldom updated
but that query is run often. Rules to update that temp. table, too, 
of course.

(cost is now 1.75, if anyone cares)