Number of buckets in a hash join - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Number of buckets in a hash join
Date |
Msg-id | 5106575E.4010103@vmware.com
Responses | Re: Number of buckets in a hash join
 | Re: Number of buckets in a hash join
List | pgsql-hackers
While testing Alexander's gistchoose patch, "perf report" showed that the test case spent a surprisingly large amount of CPU time in ExecScanHashBucket. That function just scans a hash bucket for matches, and it should be very quick as long as there are not many collisions. It turns out that the planner estimated the number of rows in the hash to be much smaller than it actually contained, and the hash table was initialized with too few buckets as a result. The target is that each bucket contains 10 tuples (NTUP_PER_BUCKET), but in this case the average was about 100.

The first question is: why do we aim at 10 tuples per bucket? My gut feeling is that that's way too high. I would expect the target to be 1 tuple per bucket, or maybe a little more, like 2-3. Each bucket consumes one pointer's worth of RAM, which is not much. There's also some overhead from empty buckets when scanning the hash table, but as long as all the buckets have at least one entry, there's no gain from having more than one entry per bucket.

However, lowering NTUP_PER_BUCKET would not have helped in this case, because we also have a minimum of 1024 buckets. The estimate was so bad that even after setting NTUP_PER_BUCKET to 1, the bucket count was still pegged at that minimum of 1024.

Ideally, the planner would always make a good guess at the number of rows, but for the situations where it doesn't, it would be good if the hash table were enlarged when it becomes too full. It's a well-known technique to double the size of a hash table once the load factor reaches a certain threshold, and rehash the existing entries. Another idea is to just collect all the entries in, e.g., a linked list as tuples are inserted into the hash table, and create the buckets lazily, after all the tuples have been inserted.

Here's an extreme example of this phenomenon. According to perf, about 95% of the CPU time is spent in ExecScanHashBucket. That would be eliminated by sizing the hash table correctly:

create table numbers (id int4);
insert into numbers select g from generate_series(1, 10000000) g;
explain analyze select * from numbers a, generate_series(1, 100000) b where b = a.id;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=22.50..2035430.50 rows=53097600 width=8) (actual time=32.307..29141.348 rows=100000 loops=1)
   Hash Cond: (a.id = b.b)
   ->  Seq Scan on numbers a  (cost=0.00..150443.20 rows=10619520 width=4) (actual time=0.017..716.503 rows=10000000 loops=1)
   ->  Hash  (cost=10.00..10.00 rows=1000 width=4) (actual time=32.268..32.268 rows=100000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 3516kB
         ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=17.966..22.519 rows=100000 loops=1)
 Total runtime: 29146.011 ms
(7 rows)

- Heikki
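For illustration, here is a minimal, self-contained sketch of the grow-and-rehash idea mentioned above: a chained hash table of int keys that doubles its bucket array and redistributes the existing entries whenever the load factor passes a target of one entry per bucket. This is not the nodeHash.c code; the names (NTUP_PER_BUCKET_TARGET, rehash, ht_insert) are made up for the example, and error handling is omitted.

/*
 * Sketch of "grow and rehash" for a chained hash table.  Illustration only,
 * not PostgreSQL's hash join implementation; all names are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

#define NTUP_PER_BUCKET_TARGET 1    /* aim for ~1 entry per bucket */

typedef struct Entry
{
    int           key;
    struct Entry *next;
} Entry;

typedef struct HashTable
{
    Entry  **buckets;
    size_t   nbuckets;      /* always a power of two */
    size_t   nentries;
} HashTable;

static size_t
bucket_for(int key, size_t nbuckets)
{
    /* cheap multiplicative hash; nbuckets is a power of two */
    return ((unsigned int) key * 2654435761u) & (nbuckets - 1);
}

static void
rehash(HashTable *ht)
{
    /* double the bucket array and move every entry to its new chain */
    size_t   new_nbuckets = ht->nbuckets * 2;
    Entry  **new_buckets = calloc(new_nbuckets, sizeof(Entry *));
    size_t   i;

    for (i = 0; i < ht->nbuckets; i++)
    {
        Entry *e = ht->buckets[i];

        while (e != NULL)
        {
            Entry  *next = e->next;
            size_t  b = bucket_for(e->key, new_nbuckets);

            e->next = new_buckets[b];
            new_buckets[b] = e;
            e = next;
        }
    }
    free(ht->buckets);
    ht->buckets = new_buckets;
    ht->nbuckets = new_nbuckets;
}

static void
ht_insert(HashTable *ht, int key)
{
    Entry  *e;
    size_t  b;

    /* grow first if we are already at the target load factor */
    if (ht->nentries >= ht->nbuckets * NTUP_PER_BUCKET_TARGET)
        rehash(ht);

    b = bucket_for(key, ht->nbuckets);
    e = malloc(sizeof(Entry));
    e->key = key;
    e->next = ht->buckets[b];
    ht->buckets[b] = e;
    ht->nentries++;
}

int
main(void)
{
    /* start at the same 1024-bucket minimum discussed above */
    HashTable ht = {calloc(1024, sizeof(Entry *)), 1024, 0};
    int       i;

    /* the "estimate" was 1000 rows, but 100000 actually arrive */
    for (i = 0; i < 100000; i++)
        ht_insert(&ht, i);

    printf("buckets=%zu entries=%zu avg chain=%.2f\n",
           ht.nbuckets, ht.nentries,
           (double) ht.nentries / ht.nbuckets);
    return 0;
}

With a start of 1024 buckets and 100000 insertions, the table ends up with 131072 buckets and an average chain length below one, instead of roughly 100 entries per bucket when the bucket count is fixed at the underestimated size.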