Thread: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

8.4.5

I consistently see HashJoin plans that hash the large table, and scan the small table.  This is especially puzzling in
some cases where I have 30M rows in the big table and ~ 100 in the small... shouldn't it hash the small table and scan
the big one?

Here is one case I saw just recently

               Hash Cond: ((a.e_id)::text = (ta.name)::text)
               ->  Index Scan using c_a_s_e_id on a  (cost=0.00..8.21 rows=14 width=27)
                     Index Cond: (id = 12)
               ->  Hash  (cost=89126.79..89126.79 rows=4825695 width=74)
                     ->  Seq Scan on p_a_1287446030 tmp  (cost=0.00..89126.79 rows=4825695 width=74)
                           Filter: (id = 12)

Does this ever make sense?  Isn't it always better to hash the smaller side of the join, or at least predominantly so?
Maybe if you want the order of elements returned from the join to coincide with the order of the outer side of the
join for a join higher up the plan tree.  In this specific case, I want the order to be based on the larger table for
the join higher up (not shown) in the plan, so that its index scan is in the order that tmp already is.

Certainly, for very small hash tables (< 1000 entries) the cache effects strongly favor small tables -- the lookup
should be very cheap.  Building a very large hash is not cheap, and wastes lots of memory.  I suppose at very large
sizes something else might come into play that favors hashing the bigger table, but I can't think of what that would be
for the general case.

Any ideas?  I've seen this with dozens of queries, some simple, some with 5 or 6 tables and joins.  I even tried making
work_mem very small in a 30M row to 500 row join, and it STILL hashed the big table.  At first I thought that I was
reading the plan wrong, but Google suggests it's doing what it looks like it's doing.  Perhaps this is a bug?

Scott Carey <scott@richrelevance.com> writes:
> I consistently see HashJoin plans that hash the large table, and scan
> the small table.

Could we see a self-contained test case?  And what cost parameters are
you using, especially work_mem?

> This is especially puzzling in some cases where I have 30M rows in the big table and ~ 100 in the small... shouldn't
> it hash the small table and scan the big one?

Well, size of the table isn't the only factor; in particular, a highly
nonuniform distribution of the key value will inflate the cost estimate
for using a table on the inner side of the hash.  But the example you
show here seems a bit extreme.

            regards, tom lane

On Oct 18, 2010, at 8:43 PM, Tom Lane wrote:

> Scott Carey <scott@richrelevance.com> writes:
>> I consistently see HashJoin plans that hash the large table, and scan
>> the small table.
>
> Could we see a self-contained test case?  And what cost parameters are
> you using, especially work_mem?

I'll see if I can make a test case.

Tough to do since I catch these on a production machine when a query is taking a long time, and some of the tables are
transient.
work_mem is 800MB; I tried 128K and it didn't matter.  It will switch to a merge join at some point, but not to a smaller
hash, and the merge join is definitely slower.
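
Roughly, the experiment was per session (the actual production query can't be shared, so it stays a placeholder comment):

SET work_mem = '128kB';           -- the "128K" try; the normal setting here is 800MB
-- EXPLAIN <the problem query>;   -- plan still hashes the 30M-row side
RESET work_mem;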

>
>> This is especially puzzling in some cases where I have 30M rows in the big table and ~ 100 in the small... shouldn't
>> it hash the small table and scan the big one?
>
> Well, size of the table isn't the only factor; in particular, a highly
> nonuniform distribution of the key value will inflate the cost estimate
> for using a table on the inner side of the hash.  But the example you
> show here seems a bit extreme.
>

In the case today, the index scan returns unique values, and the large table has only a little skew on the join key.

Another case I ran into a few weeks ago is odd.
(8.4.3 this time)

rr=> explain INSERT INTO pav2 (p_id, a_id, values, last_updated, s_id) SELECT av.p_id, av.a_id, av.values,
av.last_updated, a.s_id
FROM pav av, attr a where av.a_id = a.id;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Hash Join  (cost=2946093.92..631471410.73 rows=1342587125 width=69)
   Hash Cond: (a.id = av.a_id)
   ->  Seq Scan on attr a  (cost=0.00..275.21 rows=20241 width=8)
   ->  Hash  (cost=1200493.44..1200493.44 rows=70707864 width=65)
         ->  Seq Scan on pav av  (cost=0.00..1200493.44 rows=70707864 width=65)

If the cost to hash is 1200493, and it needs to probe the hash 20241 times, why would the total cost be 631471410?  The
cost to probe can't be that big!  A cost of 500 to probe and join?
Why favor hashing the large table and probing with the small values rather than the other way around?

In this case, I turned enable_mergejoin off in a test because it was deciding to sort the 70M rows instead of hashing 20k
rows and scanning 70M, and then got this 'backwards' hash.  The merge join is much slower, but its cost estimate is much
lower, and no combination of cost parameters will make it switch (both estimates are affected up and down similarly by the
cost parameters).
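
The test was roughly this (the INSERT ... SELECT is the one shown above):

SET enable_mergejoin = off;
EXPLAIN INSERT INTO pav2 (p_id, a_id, values, last_updated, s_id)
  SELECT av.p_id, av.a_id, av.values, av.last_updated, a.s_id
  FROM pav av, attr a WHERE av.a_id = a.id;
RESET enable_mergejoin;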

Both tables analyzed, etc.  One of them is a bulk-operation staging table with no indexes (the big one), but it is
analyzed.  The (av.p_id, av.a_id) pair is unique in it.  a.id is unique (primary key).  The plan above thinks it is going to
match 20 times on average (but it actually matches only 1 -- a PK join).  av.a_id is somewhat skewed, but that is
irrelevant if it always matches one.  Even if it did match 20 on average, is it worse to probe a hash table 70M times
and retrieve 20 matches each time than to probe 20k times and retrieve 70000 matches each time?  (The total output is the
same either way -- roughly 1.4B rows.)  It's the same number of hash function calls and comparisons, but a different
memory profile.




>             regards, tom lane


Scott Carey wrote:
>
> If the cost to hash is 1200493, and it needs to probe the hash 20241 times, why would the total cost be 631471410?
> The cost to probe can't be that big!  A cost of 500 to probe and join?
> Why favor hashing the large table and probing with the small values rather than the other way around?
>
>

May I ask a stupid question: how is the query cost calculated? What are
the units? I/O requests? CPU cycles? Monopoly money?


--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:

> how is the query cost calculated? What are
> the units? I/O requests? CPU cycles? Monopoly money?

http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
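
The short answer: the unit is arbitrary -- by convention seq_page_cost = 1.0 (one sequential page fetch) and the other
constants are scaled relative to that.  Something like this shows the values in effect:

SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost', 'effective_cache_size');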

-Kevin

On Mon, Oct 18, 2010 at 9:40 PM, Scott Carey <scott@richrelevance.com> wrote:
> 8.4.5
>
> I consistently see HashJoin plans that hash the large table, and scan the small table.  This is especially puzzling
> in some cases where I have 30M rows in the big table and ~ 100 in the small... shouldn't it hash the small table and
> scan the big one?
>
> Here is one case I saw just recently
>
>               Hash Cond: ((a.e_id)::text = (ta.name)::text)
>               ->  Index Scan using c_a_s_e_id on a  (cost=0.00..8.21 rows=14 width=27)
>                     Index Cond: (id = 12)
>               ->  Hash  (cost=89126.79..89126.79 rows=4825695 width=74)
>                     ->  Seq Scan on p_a_1287446030 tmp  (cost=0.00..89126.79 rows=4825695 width=74)
>                           Filter: (id = 12)

Can we have the complete EXPLAIN output here, please?  And the query?
For example, this would be perfectly sensible if the previous line
started with "Hash Semi Join" or "Hash Anti Join".

rhaas=# explain select * from little where exists (select * from big
where big.a = little.a);
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Semi Join  (cost=3084.00..3478.30 rows=10 width=4)
   Hash Cond: (little.a = big.a)
   ->  Seq Scan on little  (cost=0.00..1.10 rows=10 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=4)
         ->  Seq Scan on big  (cost=0.00..1443.00 rows=100000 width=4)
(5 rows)
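
The anti-join flavor would be something like this (same toy tables; if the planner picks a hash anti join here, big is
necessarily the hashed side, since the NOT EXISTS rel is the only candidate for the inner):

explain select * from little where not exists (select * from big
where big.a = little.a);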

I'm also a bit suspicious of the fact that the hash condition has a
cast to text on both sides, which implies, to me anyway, that the
underlying data types are not text.  That might mean that the query
planner doesn't have very good statistics, which might mean that the
join selectivity estimates are wackadoo, which can apparently cause
this problem:

rhaas=# explain select * from little, big where little.a = big.a;
                        QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=3084.00..3577.00 rows=2400 width=8)
   Hash Cond: (little.a = big.a)
   ->  Seq Scan on little  (cost=0.00..34.00 rows=2400 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=4)
         ->  Seq Scan on big  (cost=0.00..1443.00 rows=100000 width=4)
(5 rows)

rhaas=# analyze;
ANALYZE
rhaas=# explain select * from little, big where little.a = big.a;
                            QUERY PLAN
-------------------------------------------------------------------
 Hash Join  (cost=1.23..1819.32 rows=10 width=8)
   Hash Cond: (big.a = little.a)
   ->  Seq Scan on big  (cost=0.00..1443.00 rows=100000 width=4)
   ->  Hash  (cost=1.10..1.10 rows=10 width=4)
         ->  Seq Scan on little  (cost=0.00..1.10 rows=10 width=4)
(5 rows)

This doesn't appear to make a lot of sense, but...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas <robertmhaas@gmail.com> writes:
> I'm also a bit suspicious of the fact that the hash condition has a
> cast to text on both sides, which implies, to me anyway, that the
> underlying data types are not text.  That might mean that the query
> planner doesn't have very good statistics, which might mean that the
> join selectivity estimates are wackadoo, which can apparently cause
> this problem:

Um ... you're guilty of the same thing as the OP, ie not showing how
you got this example.  But I'm guessing that it was something like

create table little as select * from generate_series(1,10) a;
create table big as select * from generate_series(1,100000) a;
... wait for auto-analyze of big ...
explain select * from little, big where little.a = big.a;

Here, big is large enough to prod autovacuum into analyzing it,
whereas little isn't.  So when the planner runs, it sees

(1) big is known to have 100000 rows, and big.a is known unique;
(2) little is estimated to have many fewer rows, but nothing is
    known about the distribution of little.a.

In this situation, it's going to prefer to hash big, because hash join
behaves pretty nicely when the inner rel is uniformly distributed and
the outer not, but not nicely at all when it's the other way round.
It'll change its mind as soon as you analyze little, but it doesn't
like taking a chance on an unknown distribution.  See cost_hashjoin
and particularly estimate_hash_bucketsize.
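
(If in doubt about which tables autovacuum has gotten to, something like this shows it:

select relname, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname in ('little', 'big');
)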

I'm not convinced this explains Scott's results though --- the numbers
he's showing don't seem to add up even if you assume a pretty bad
distribution for the smaller rel.

            regards, tom lane

On Oct 26, 2010, at 8:48 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm also a bit suspicious of the fact that the hash condition has a
>> cast to text on both sides, which implies, to me anyway, that the
>> underlying data types are not text.  That might mean that the query
>> planner doesn't have very good statistics, which might mean that the
>> join selectivity estimates are wackadoo, which can apparently cause
>> this problem:
>
> Um ... you're guilty of the same thing as the OP, ie not showing how
> you got this example.  But I'm guessing that it was something like
>
> create table little as select * from generate_series(1,10) a;
> create table big as select * from generate_series(1,100000) a;
> ... wait for auto-analyze of big ...
> explain select * from little, big where little.a = big.a;
>
> Here, big is large enough to prod autovacuum into analyzing it,
> whereas little isn't.  So when the planner runs, it sees
>
> (1) big is known to have 100000 rows, and big.a is known unique;
> (2) little is estimated to have many fewer rows, but nothing is
>    known about the distribution of little.a.
>
> In this situation, it's going to prefer to hash big, because hash join
> behaves pretty nicely when the inner rel is uniformly distributed and
> the outer not, but not nicely at all when it's the other way round.
> It'll change its mind as soon as you analyze little, but it doesn't
> like taking a chance on an unknown distribution.  See cost_hashjoin
> and particularly estimate_hash_bucketsize.

The type of hash table will make a difference in how it behaves with skew.  Open addressing versus chaining, etc.

>
> I'm not convinced this explains Scott's results though --- the numbers
> he's showing don't seem to add up even if you assume a pretty bad
> distribution for the smaller rel.

Answering both messages partially:

The join is on varchar columns.  So no, they aren't really text; they are cast to ::text because the two sides have
slightly different varchar declarations.

The small relation in this case is unique on the key.  But I think postgres doesn't know that, because the unique
index is on

(id, name) and the query filters for id = 12 in the example, leaving name unique.  But postgres does not identify this
as a unique condition for the key.  However, there are only about 150 distinct values of id, so even if 'name' collides
sometimes across ids, there can be no more than 150 rows that map to any one key.
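
The shape is roughly this (hypothetical stand-in names, since the real ones are anonymized above):

create table small_side (
    id       int,
    join_key varchar(64)
);
create unique index small_side_uk on small_side (id, join_key);

-- With the scan pinned to id = 12, join_key is unique among the returned rows,
-- but the planner does not deduce that from the two-column unique index.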

I gave a partial plan; the parent joins are sometimes anti-joins.  The general query form is two statements against the
temp table (roughly sketched below):
An update to a main table where the unique index keys match the temp (inner join).
An insert into the main table where the unique index keys do not exist (NOT EXISTS query, which results in an anti-join).  The
large relation is the one being inserted/updated into the main table.
Both have the same subplan that takes all the time in the middle -- a hash that hashes the large table and probes from
the small side.  I am still completely confused about how the hashjoin is calculating costs.  In one of my examples it
seems to be way off.
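
The rough shape, with made-up names ("main" is the permanent table, "tmp" the bulk staging table, and (id, name)
standing in for the unique-index keys):

update main m
   set val = tmp.val, last_updated = tmp.last_updated
  from tmp
 where m.id = tmp.id and m.name = tmp.name;

insert into main (id, name, val, last_updated)
select tmp.id, tmp.name, tmp.val, tmp.last_updated
  from tmp
 where not exists (select 1 from main m
                   where m.id = tmp.id and m.name = tmp.name);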


Why does hashjoin behave poorly when the inner relation is not uniformly distributed and the outer is?  In particular
for anti-join this should be the best case scenario.

My assumption is this:  Imagine the worst possible skew on the small table.  Every row has the same key, so there
are 20,000 entries in one list under one key.  The large table is the outer relation, probes for every element,
and has a uniform distribution: 20 million rows -- one thousand rows for each of 20,000 keys.
Inner Join:
  The values that match the key join against the list.  There is no faster way to join once the list is identified.  It
is essentially nested loops per matching key.  1000 matches each output 20,000 rows.  If the hashjoin were reversed, the
inner relation would be the large table and the outer relation would be the small table.  There would be 20,000
matches that each output 1000 rows.
Semi-Join:
  The same thing happens, but additionally rows that match nothing are emitted.  If the relation that is always kept is
the large one, it makes more sense to hash the small.
Anti-Join:
  Two relations; I'll call one "select" and the other "not exists".  The "select" relation is the one we are keeping,
but only if its key does not match any key in the "not exists" relation.
Case 1:  The large uniform table is "select" and the small one "not exists"
  It is always optimal to have the small one as the inner hashed relation, no matter what the skew.  If the inner
relation is the "not exists" table, only the key needs to be kept in the inner relation hash; the values or number of
matches do not matter, so skew is irrelevant and actually makes it faster than an even distribution of keys.
Case 2:  The small relation is the "select"
  Using the large "not exists" relation as the inner relation works well, as only the existence of a key needs to be
kept.  Hashing the small "select" relation also works if it is small enough: whenever a key from the outer relation
matches, remove it from the hash, and at the end take what remains in the hash as the result.  This is also generally
immune to skew.  (Both cases are sketched as queries below.)
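
In Robert's little/big terms from upthread, the two cases are queries shaped like this (shapes only -- I'm not claiming
these reproduce the plans above):

-- Case 1: the large table is the "select", the small table is the "not exists"
explain select * from big
 where not exists (select 1 from little where little.a = big.a);

-- Case 2: the small table is the "select", the large table is the "not exists"
explain select * from little
 where not exists (select 1 from big where big.a = little.a);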

Am I missing something?  Is the hash that is built storing each tuple in an open-addressed entry, and thus sensitive to
key-value skew, or something like that?  If the hash only has one entry per key, with a linked list of values that match
the key, I don't see how skew is a factor for hashjoin.  I am probably missing something.


>
>             regards, tom lane


Scott Carey <scott@richrelevance.com> writes:
> Why does hashjoin behave poorly when the inner relation is not
> uniformly distributed and the outer is?

Because a poorly distributed inner relation leads to long hash chains.
In the very worst case, all the keys are on the same hash chain and it
degenerates to a nested-loop join.  (There is an assumption in the
costing code that the longer hash chains also tend to get searched more
often, which maybe doesn't apply if the outer rel is flat, but it's not
obvious that it's safe to not assume that.)
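
Roughly speaking, the inputs it works from are what shows up in pg_stats for the join column (ndistinct and the
most-common-value frequencies), so for the toy example something like this shows the raw material:

select tablename, attname, n_distinct, most_common_freqs[1] as top_mcv_freq
from pg_stats
where tablename in ('little', 'big') and attname = 'a';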

> In particular for anti-join this should be the best case scenario.

Not really.  It's still searching a long hash chain; maybe it will find
an exact match early in the chain, or maybe not.  It's certainly not
*better* than antijoin with a well-distributed inner rel.  Although the
point is moot, anyway, since if it's an antijoin there is only one
candidate for which rel to put on the outside.

            regards, tom lane