Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why? - Mailing list pgsql-performance

From Tom Lane
Subject Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Date
Msg-id 7570.1288151284@sss.pgh.pa.us
Whole thread Raw
In response to Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?  (Scott Carey <scott@richrelevance.com>)
List pgsql-performance
Robert Haas <robertmhaas@gmail.com> writes:
> I'm also a bit suspicious of the fact that the hash condition has a
> cast to text on both sides, which implies, to me anyway, that the
> underlying data types are not text.  That might mean that the query
> planner doesn't have very good statistics, which might mean that the
> join selectivity estimates are wackadoo, which can apparently cause
> this problem:

Um ... you're guilty of the same thing as the OP, ie not showing how
you got this example.  But I'm guessing that it was something like

create table little as select * from generate_series(1,10) a;
create table big as select * from generate_series(1,100000) a;
... wait for auto-analyze of big ...
explain select * from little, big where little.a = big.a;

Here, big is large enough to prod autovacuum into analyzing it,
whereas little isn't.  So when the planner runs, it sees

(1) big is known to have 100000 rows, and big.a is known unique;
(2) little is estimated to have many fewer rows, but nothing is
    known about the distribution of little.a.

In this situation, it's going to prefer to hash big, because hash join
behaves pretty nicely when the inner rel is uniformly distributed and
the outer not, but not nicely at all when it's the other way round.
It'll change its mind as soon as you analyze little, but it doesn't
like taking a chance on an unknown distribution.  See cost_hashjoin
and particularly estimate_hash_bucketsize.

I'm not convinced this explains Scott's results though --- the numbers
he's showing don't seem to add up even if you assume a pretty bad
distribution for the smaller rel.

            regards, tom lane

pgsql-performance by date:

Previous
From: Robert Haas
Date:
Subject: Re: odd postgresql performance (excessive lseek)
Next
From: Tom Lane
Date:
Subject: Re: odd postgresql performance (excessive lseek)