Thread: hash join hashtable size and work_mem
In nodeHash.c, the function ExecChooseHashTableSize() uses two different methods for determining the number of buckets to use. The current code looks something like:

    if (ntuples * tupsize > work_mem * 1024L)
        buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET);
    else
        buckets = ntuples / NTUP_PER_BUCKET;

So for the case where a spill is expected, we use work_mem to decide on our hash size. For the case where a spill isn't expected, we rely on the row estimate alone and make no provision for speeding the join by using the memory that we're allowed to use.

When profiling large hash joins, scanning the hash buckets often turns out to be a bottleneck; it would be nice for the user to be able to "throw memory" at a join to improve performance.

Am I missing something about the current implementation? I would expect the bucket count to be calculated something like:

    buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET);

for both cases. Making this change appears to improve hash-join performance substantially in some cases, and as far as I can tell doesn't hurt anything (apart from using memory that it is "allowed" to use given a particular work_mem setting).

-Tim

--
tkordas@greenplum.com
"Timothy J. Kordas" <tkordas@greenplum.com> writes:
> Am I missing something about the current implementation ?

If the planner has correctly predicted the number of rows, the table loading should be about NTUP_PER_BUCKET in either regime. Are you sure you aren't just wishing that NTUP_PER_BUCKET were smaller?

I don't see that making the hashtable much larger than ntuples is a good idea --- that just spreads out the live entries over more cache lines, resulting in more cache thrashing.

regards, tom lane
Tom Lane wrote:
> If the planner has correctly predicted the number of rows, the table
> loading should be about NTUP_PER_BUCKET in either regime. Are you
> sure you aren't just wishing that NTUP_PER_BUCKET were smaller?

Maybe I wish NTUP_PER_BUCKET were smaller. But I don't think that's the whole story.

The planner estimates definitely play a role in my concern here. For mis-estimated inner relations, the current calculation may over-subscribe the hash table even when more work_mem is available -- that is, there are too many hash collisions *and* memory isn't being used to the fullest extent allowed.

I've been tracking the number of tuples which land in each bucket, and I'd like to see that number go down as I increase work_mem. I would expect, for the same data, a hash join with a work_mem of 256MB to run faster than one run with 32MB, even if the inner relation is only 30MB.

The implementation I've been experimenting with actually takes the average of the current bucket count (ntuples / NTUP_PER_BUCKET) and the spill version (work_mem * 1024 / (tupsize * NTUP_PER_BUCKET)).

-Tim
"Timothy J. Kordas" <tkordas@greenplum.com> writes:
> I would expect for the same data a hash-join with a work_mem of 256MB
> to run faster than one run with 32MB; even if the inner relation is
> only 30MB.

Once you get to the point where each tuple is in a different bucket, it is clearly impossible for further increases in hashtable size to improve matters. All you can do is waste RAM and cache lines.

Now if we set NTUP_PER_BUCKET = 1 we would not be exactly at that critical point, because of uneven bucket loading and other factors ... but I question whether there's enough incremental improvement available to justify making the hashtable much larger than that.

regards, tom lane
On Wed, 2007-03-14 at 10:28 -0700, Timothy J. Kordas wrote:
> I would expect for the same data a hash-join with a work_mem of 256MB
> to run faster than one run with 32MB; even if the inner relation is
> only 30MB.

Certainly not for all data, but for some distributions, yes, probably.

The easiest thing to do is prove that's true, and then work out how to spot that case ahead of time, or at least find a place where you can adjust your assumptions cheaply enough to improve things.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com