Number of buckets in a hash join

From: Heikki Linnakangas
Date:
While testing Alexander's gistchoose patch, "perf report" showed that 
the test case spent a surprisingly large amount of CPU time in 
ExecScanHashBucket. That function just scans a hash bucket for matches, 
and it should be very quick as long as there are not many collisions.

It turns out that the planner estimated the number of rows in the hash 
table to be much smaller than the number it actually contained, and the 
hash table was initialized with too few buckets as a result. The target 
is that each bucket contains 10 tuples (NTUP_PER_BUCKET), but in this 
case, the average was about 100.

The first question is, why do we aim at 10 tuples per bucket? My gut 
feeling is that that's way too high. I would expect the target to be 1 
tuple per bucket, or maybe a little more, like 2-3. Each bucket consumes 
one pointer's worth of RAM, which is not much. There's also some 
overhead from empty buckets when scanning the hash table, but as long as 
all the buckets have at least one entry, there's no gain from having 
more than one entry per bucket.
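The effect of the load factor on probe cost is easy to see in a toy
simulation (illustrative Python, not PostgreSQL's executor; the
uniform-hashing assumption is mine): a probe for a key with no match
has to walk the whole chain in its bucket, so the expected work per
probe is roughly the average chain length.

```python
import random

def probe_cost(ntuples, nbuckets, nprobes=10000, seed=1):
    """Simulate a chained hash table with uniformly hashed keys and
    count how many tuples a probe for an absent key must examine on
    average (the whole bucket's chain)."""
    rng = random.Random(seed)
    chain_len = [0] * nbuckets
    for _ in range(ntuples):
        chain_len[rng.randrange(nbuckets)] += 1
    total = sum(chain_len[rng.randrange(nbuckets)] for _ in range(nprobes))
    return total / nprobes

# At NTUP_PER_BUCKET = 10, a miss examines ~10 tuples; at 1, ~1:
cost_at_10 = probe_cost(100000, 10000)
cost_at_1 = probe_cost(100000, 100000)
```

In this model the probe cost scales linearly with tuples per bucket,
which is why the gap between a target of 10 and a target of 1 matters.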

However, lowering NTUP_PER_BUCKET would not have helped in this case, 
because we also have a minimum of 1024 buckets. The estimate was so bad 
that even after setting NTUP_PER_BUCKET to 1, it was still pegged at 
that minimum of 1024 buckets.
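The interaction between NTUP_PER_BUCKET and the 1024-bucket floor can
be sketched with a simplified model (this is not the actual
ExecChooseHashTableSize code; the power-of-two rounding detail is an
assumption on my part):

```python
def choose_nbuckets(est_rows, ntup_per_bucket, min_buckets=1024):
    """Simplified model of initial bucket sizing: aim for
    ntup_per_bucket tuples per bucket, round up to a power of two,
    and never go below the minimum bucket count."""
    target = max(1, est_rows // ntup_per_bucket)
    nbuckets = min_buckets
    while nbuckets < target:
        nbuckets *= 2
    return nbuckets

# With the planner's 1000-row estimate, the result is pegged at the
# floor no matter what NTUP_PER_BUCKET is set to:
assert choose_nbuckets(1000, 10) == 1024
assert choose_nbuckets(1000, 1) == 1024
```

So with 100000 actual rows landing in those 1024 buckets, the average
chain length ends up near 100 regardless of the target.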

Ideally, the planner would always make a good guess of the number of rows, 
but for the situations where it doesn't, it would be good if the hash 
table were enlarged when it becomes too full. It's a well-known technique 
to double the size of a hash table once the load factor reaches a 
certain threshold, and rehash the existing entries. Another idea is to 
just collect all the entries in e.g. a linked list as tuples are 
inserted into the hash table, and create the buckets lazily, after all 
the tuples have been inserted.
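The doubling technique mentioned above can be sketched in a few lines
(a minimal illustrative chained table, not a proposal for the actual
nodeHash.c implementation):

```python
class GrowingHashTable:
    """Minimal chained hash table that doubles its bucket array once
    the load factor exceeds a threshold, rehashing existing entries."""

    def __init__(self, nbuckets=1024, max_load=1.0):
        self.buckets = [[] for _ in range(nbuckets)]
        self.count = 0
        self.max_load = max_load

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            # Double and rehash: every entry moves to its new bucket.
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for chain in old:
                for k, v in chain:
                    self.buckets[hash(k) % len(self.buckets)].append((k, v))

    def lookup(self, key):
        return [v for k, v in self.buckets[hash(key) % len(self.buckets)]
                if k == key]
```

The rehash pass is the cost of this approach; the lazy-bucket-creation
idea avoids it by deferring bucket assignment until all tuples are in.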

Here's an extreme example of this phenomenon. According to perf, about 
95% of the CPU time is spent in ExecScanHashBucket. That would be 
eliminated by sizing the hash table correctly:

create table numbers (id int4);
insert into numbers select g from generate_series(1, 10000000) g;

explain analyze select * from numbers a, generate_series(1, 100000) b 
where b = a.id;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=22.50..2035430.50 rows=53097600 width=8) (actual time=32.307..29141.348 rows=100000 loops=1)
   Hash Cond: (a.id = b.b)
   ->  Seq Scan on numbers a  (cost=0.00..150443.20 rows=10619520 width=4) (actual time=0.017..716.503 rows=10000000 loops=1)
   ->  Hash  (cost=10.00..10.00 rows=1000 width=4) (actual time=32.268..32.268 rows=100000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 3516kB
         ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=17.966..22.519 rows=100000 loops=1)
 Total runtime: 29146.011 ms
(7 rows)


- Heikki



Re: Number of buckets in a hash join

From: Simon Riggs
Date:
On 28 January 2013 10:47, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> There's also some overhead from empty
> buckets when scanning the hash table

Seems like we should measure that overhead. That way we can plot the
cost against the number of tuples per bucket, which sounds like it has
a minimum at 1.0, but that doesn't mean it's symmetrical about that
point. We can then see where the optimal setting should be.
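The asymmetry Simon describes can be illustrated with a toy cost model
(the weights below are made up purely for illustration; a real
measurement would replace them): probing scans roughly one chain's
worth of tuples, while each tuple also carries a share of the bucket
headers' memory and scan overhead.

```python
def toy_cost(tuples_per_bucket, probe_weight=1.0, scan_weight=0.1):
    """Toy per-tuple cost: a probe examines ~tuples_per_bucket entries,
    while each tuple carries 1/tuples_per_bucket of a bucket header's
    overhead. The weights are invented; the point is the curve's
    asymmetric shape around its minimum."""
    return (probe_weight * tuples_per_bucket
            + scan_weight / tuples_per_bucket)

# The curve rises linearly above the minimum but as 1/x below it,
# so the two sides are not symmetric:
for x in (0.25, 0.5, 1.0, 2.0, 10.0):
    print(x, toy_cost(x))
```

With these particular weights the minimum actually sits below 1.0
tuples per bucket, which is consistent with the intuition that buckets
are cheap relative to extra tuple comparisons.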

Having said that, the hash bucket estimate is based on ndistinct, which
we know is frequently under-estimated, so it would be useful to err on
the low side.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Number of buckets in a hash join

From: Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> The first question is, why do we aim at 10 tuples per bucket?

I see nothing particularly wrong with that.  The problem here is with
having 1000 tuples per bucket.

> Ideally, the planner would always make a good guess of the number of rows, 
> but for the situations where it doesn't, it would be good if the hash 
> table were enlarged when it becomes too full.

Yeah, possibly.  The proposed test case actually doesn't behave very
badly if work_mem is small, because there is logic in there to adjust
the number of batches.  You didn't say what work_mem you're testing at,
but it's clearly more than the default 1MB.  I think the issue arises if
the initial estimate of hashtable size is a good bit less than work_mem,
so the number of buckets is set to something a good bit less than what
would be optimal if we're using more of work_mem.  This seems a little
reminiscent of what we did recently in tuplesort to make better use of
work_mem --- in both cases we have to choose a pointer-array size that
will make best use of work_mem after the tuples themselves are added.
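Tom's suggestion, sizing the pointer array for what could fit in
work_mem rather than for the row estimate, might look something like
this (a hypothetical sketch; the constants and the rounding rule are
assumptions, not PostgreSQL's actual code):

```python
def buckets_for_work_mem(work_mem_bytes, tuple_bytes,
                         ptr_bytes=8, ntup_per_bucket=1):
    """Size the bucket array for the number of tuples that could fill
    work_mem, not for the (possibly low) row estimate: solve
    nbuckets*ptr_bytes + ntuples*tuple_bytes <= work_mem with
    ntuples = nbuckets * ntup_per_bucket, rounded down to a power
    of two, with a 1024-bucket floor."""
    per_bucket = ptr_bytes + ntup_per_bucket * tuple_bytes
    target = work_mem_bytes // per_bucket
    nbuckets = 1024
    while nbuckets * 2 <= target:
        nbuckets *= 2
    return nbuckets

# e.g. 4MB work_mem, 40-byte tuples, 10 tuples per bucket:
nbuckets = buckets_for_work_mem(4 * 1024 * 1024, 40, ntup_per_bucket=10)
```

The parallel with tuplesort is that in both cases the pointer array and
the tuples compete for the same work_mem budget, so the array size has
to be chosen from the budget, not from the estimate alone.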
        regards, tom lane