Number of buckets in a hash join - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Number of buckets in a hash join
Date |
Msg-id | 5106575E.4010103@vmware.com
Responses | Re: Number of buckets in a hash join
 | Re: Number of buckets in a hash join
List | pgsql-hackers
While testing Alexander's gistchoose patch, "perf report" showed that the test case spent a surprisingly large amount of CPU time in ExecScanHashBucket. That function just scans a hash bucket for matches, and it should be very quick as long as there are not many collisions. It turns out that the planner estimated the number of rows in the hash to be much smaller than it actually contained, and the hash table was initialized with too few buckets as a result. The target is that each bucket contains 10 tuples (NTUP_PER_BUCKET), but in this case the average was about 100.

The first question is: why do we aim at 10 tuples per bucket? My gut feeling is that that's way too high. I would expect the target to be 1 tuple per bucket, or maybe a little more, like 2-3. Each bucket consumes one pointer's worth of RAM, which is not much. There's also some overhead from empty buckets when scanning the hash table, but as long as all the buckets have at least one entry, there's no gain from having more than one entry per bucket.

However, lowering NTUP_PER_BUCKET would not have helped in this case, because we also have a minimum of 1024 buckets. The estimate was so bad that even after setting NTUP_PER_BUCKET to 1, the bucket count was still pegged at that minimum of 1024.

Ideally, the planner would always make a good guess at the number of rows, but for the situations where it doesn't, it would be good if the hash table were enlarged when it becomes too full. It's a well-known technique to double the size of a hash table once the load factor reaches a certain threshold, and rehash the existing entries. Another idea is to just collect all the entries in, e.g., a linked list as tuples are inserted into the hash table, and create the buckets lazily, after all the tuples have been inserted.

Here's an extreme example of this phenomenon. According to perf, about 95% of the CPU time is spent in ExecScanHashBucket. That would be eliminated by sizing the hash table correctly:

create table numbers (id int4);
insert into numbers select g from generate_series(1, 10000000) g;
explain analyze select * from numbers a, generate_series(1, 100000) b where b = a.id;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=22.50..2035430.50 rows=53097600 width=8) (actual time=32.307..29141.348 rows=100000 loops=1)
   Hash Cond: (a.id = b.b)
   ->  Seq Scan on numbers a  (cost=0.00..150443.20 rows=10619520 width=4) (actual time=0.017..716.503 rows=10000000 loops=1)
   ->  Hash  (cost=10.00..10.00 rows=1000 width=4) (actual time=32.268..32.268 rows=100000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 3516kB
         ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=17.966..22.519 rows=100000 loops=1)
 Total runtime: 29146.011 ms
(7 rows)

- Heikki
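For illustration, here is a minimal, self-contained sketch of the grow-and-rehash idea mentioned above: a chained hash table of int keys that doubles its bucket array and redistributes the existing entries whenever the load factor passes a target of one entry per bucket. This is not the nodeHash.c code; the names (NTUP_PER_BUCKET_TARGET, rehash, ht_insert) are made up for the example, and error handling is omitted.

/*
 * Sketch of "grow and rehash" for a chained hash table.  Illustration only,
 * not PostgreSQL's hash join implementation; all names are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

#define NTUP_PER_BUCKET_TARGET 1    /* aim for ~1 entry per bucket */

typedef struct Entry
{
    int           key;
    struct Entry *next;
} Entry;

typedef struct HashTable
{
    Entry  **buckets;
    size_t   nbuckets;      /* always a power of two */
    size_t   nentries;
} HashTable;

static size_t
bucket_for(int key, size_t nbuckets)
{
    /* cheap multiplicative hash; nbuckets is a power of two */
    return ((unsigned int) key * 2654435761u) & (nbuckets - 1);
}

static void
rehash(HashTable *ht)
{
    /* double the bucket array and move every entry to its new chain */
    size_t   new_nbuckets = ht->nbuckets * 2;
    Entry  **new_buckets = calloc(new_nbuckets, sizeof(Entry *));
    size_t   i;

    for (i = 0; i < ht->nbuckets; i++)
    {
        Entry *e = ht->buckets[i];

        while (e != NULL)
        {
            Entry  *next = e->next;
            size_t  b = bucket_for(e->key, new_nbuckets);

            e->next = new_buckets[b];
            new_buckets[b] = e;
            e = next;
        }
    }
    free(ht->buckets);
    ht->buckets = new_buckets;
    ht->nbuckets = new_nbuckets;
}

static void
ht_insert(HashTable *ht, int key)
{
    Entry  *e;
    size_t  b;

    /* grow first if we are already at the target load factor */
    if (ht->nentries >= ht->nbuckets * NTUP_PER_BUCKET_TARGET)
        rehash(ht);

    b = bucket_for(key, ht->nbuckets);
    e = malloc(sizeof(Entry));
    e->key = key;
    e->next = ht->buckets[b];
    ht->buckets[b] = e;
    ht->nentries++;
}

int
main(void)
{
    /* start at the same 1024-bucket minimum discussed above */
    HashTable ht = {calloc(1024, sizeof(Entry *)), 1024, 0};
    int       i;

    /* the "estimate" was 1000 rows, but 100000 actually arrive */
    for (i = 0; i < 100000; i++)
        ht_insert(&ht, i);

    printf("buckets=%zu entries=%zu avg chain=%.2f\n",
           ht.nbuckets, ht.nentries,
           (double) ht.nentries / ht.nbuckets);
    return 0;
}

With a start of 1024 buckets and 100000 insertions, the table ends up with 131072 buckets and an average chain length below one, instead of roughly 100 entries per bucket when the bucket count is fixed at the underestimated size.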