Planner chose a much slower plan in hashjoin, using a large table as the inner table. - Mailing list pgsql-hackers

From Jinbao Chen
Subject Planner chose a much slower plan in hashjoin, using a large table as the inner table.
Msg-id CACLUs_hGEDPyciO6Y-js2tYVo7_qNcWajvLKBKEh=6VT+xvhsw@mail.gmail.com
Responses Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi Hackers, 

The planner will use the big table as the inner table in a hash join if the small table has fewer unique values.
But this plan is much slower than using the small table as the inner table. This problem occurs on the master
branch without parallel scan.

For example:

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000) i;
insert into t_big select i%100000 from generate_series(1, 100000000)i ;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

The plan chosen by the planner is:
demo2=# explain select * from t_small, t_big where a = b;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Hash Join  (cost=3083104.72..3508073.65 rows=3045990 width=8)
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)
   ->  Hash  (cost=1442478.32..1442478.32 rows=100000032 width=4)
         ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)

It runs in nearly 58 s:
demo2=# select * from t_small, t_big where a = b;
Time: 58544.525 ms (00:58.545)

But if we hack the planner to use the small table as the inner, it runs in about 19 s:
demo2=# explain select * from t_small, t_big where a = b;
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=81.52..1723019.82 rows=3045990 width=8)
   Hash Cond: (t_big.b = t_small.a)
   ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=100000032 width=4)
   ->  Hash  (cost=44.01..44.01 rows=3001 width=4)
         ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)

demo2=# select * from t_small, t_big where a = b;
Time: 18751.588 ms (00:18.752)


RCA (root cause analysis):

The cost of the inner table mainly comes from building the hash table:
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
                * inner_path_rows;

The cost of the outer table mainly comes from probing the hash table.
Calculate the hash value:
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

Traverse the linked list in the bucket and compare:
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

In general, the cost of building a hash table is higher than the cost of probing one,
so we tend to use the small table as the inner table. But if the average chain length
of the buckets is large, the situation is just the opposite.
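To see how these formula terms flip the planner's choice in this example, here is a rough back-of-envelope sketch (hypothetical Python, not PostgreSQL code). It plugs the row counts from the test case into just the three terms quoted above, using the default cost constants cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01, and assuming innerbucketsize is estimated as 1/ndistinct; it ignores batching I/O, scan costs, and output-row costs, so the numbers only illustrate the relative comparison:

```python
# Simplified sketch of the hash join cost terms quoted above
# (assumed defaults; NOT the planner's full cost model).
cpu_operator_cost = 0.0025
cpu_tuple_cost = 0.01
num_hashclauses = 1

def hashjoin_cost(inner_rows, outer_rows, innerbucketsize):
    # build the hash table on the inner side
    startup = (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_rows
    # compute the hash value for each outer tuple
    run = cpu_operator_cost * num_hashclauses * outer_rows
    # walk half of the expected bucket chain per probe
    run += cpu_operator_cost * outer_rows * max(1.0, inner_rows * innerbucketsize) * 0.5
    return startup + run

small, big = 3001, 100_000_000
# assumed bucket-size estimates: 1/ndistinct on the inner column
cost_small_inner = hashjoin_cost(small, big, 1 / 100)       # chain length ~30
cost_big_inner = hashjoin_cost(big, small, 1 / 100_000)     # chain length ~1000

print(f"small as inner: {cost_small_inner:,.0f}")   # chain term dominates
print(f"big as inner:   {cost_big_inner:,.0f}")     # build term dominates
```

With the small table as inner, the chain-compare term (0.0025 * 1e8 * 30 * 0.5 ≈ 3.75M) dwarfs everything else, so the model prefers paying ~1.25M to build a 100M-row hash table instead, which matches the plan the planner actually chose.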

In the test case above, the small table has 3000 tuples and 100 distinct values on column ‘a’.
If we use the small table as the inner table, the chain length per bucket is 30, and we need to
search the whole chain when probing the hash table. So the estimated cost of probing is bigger
than that of building the hash table, and the planner uses the big table as the inner.

But in fact this is not true. We initialize 620,000 buckets in the hash table, but only 100 buckets
have chains of length 30; the other buckets are empty. For those, only hash values need to be
compared, which is very cheap. The outer table has 100,000 distinct keys and 100,000,000 tuples.
Only (100/100000) * tuple_num tuples search the whole chain; the other tuples
(number = (99900/100000) * tuple_num) in the outer table just compare the hash value.
So the actual cost is much smaller than what the planner calculated. This is the reason
why using the small table as the inner is faster.
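The gap between the planner's assumption and the actual probe work can be sketched numerically (hypothetical Python, using only the row counts and distinct counts stated above):

```python
# Back-of-envelope check of the claim above: with the small table as inner,
# only outer tuples whose key matches one of the 100 inner values walk a
# ~30-entry chain; the other probes land in empty buckets.
outer_tuples = 100_000_000
outer_distinct = 100_000
inner_tuples = 3001
inner_distinct = 100

chain_len = inner_tuples / inner_distinct          # ~30 tuples per non-empty bucket
match_fraction = inner_distinct / outer_distinct   # 100/100000 = 0.001

actual_avg_compares = match_fraction * chain_len   # ~0.03 comparisons per probe
planner_avg_compares = chain_len * 0.5             # ~15 comparisons per probe

print(f"planner assumes ~{planner_avg_compares:.1f} comparisons per probe")
print(f"actual is ~{actual_avg_compares:.3f} per probe")
```

Under these assumptions the planner's chain-compare term overestimates the real probe work by roughly 500x, which is why the forced small-inner plan beats the chosen plan by a factor of three in wall-clock time.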
