Thread: hash join hashtable size and work_mem
In nodeHash.c, the function ExecChooseHashTableSize() uses two different methods for determining the number of buckets to use. The current code looks something like:

    if (ntuples * tupsize > work_mem * 1024L)
        buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET);
    else
        buckets = ntuples / NTUP_PER_BUCKET;

So for the case where a spill is expected, we use work_mem to decide on our hash size. For the case where a spill isn't expected, we rely on the row estimate alone and make no provision for speeding the join by using the memory that we're allowed to use.

When profiling large hash joins, scanning the hash buckets often turns out to be a bottleneck; it would be nice for the user to be able to "throw memory" at a join to improve performance.

Am I missing something about the current implementation? I would expect the bucket count to be calculated something like:

    buckets = (work_mem * 1024L) / (tupsize * NTUP_PER_BUCKET);

for both cases. Making this change appears to improve hash-join performance substantially in some cases, and as far as I can tell doesn't hurt anything (apart from using memory that it is "allowed" to use given a particular work_mem setting).

-Tim

--
tkordas@greenplum.com
"Timothy J. Kordas" <tkordas@greenplum.com> writes:
> Am I missing something about the current implementation ?

If the planner has correctly predicted the number of rows, the table loading should be about NTUP_PER_BUCKET in either regime. Are you sure you aren't just wishing that NTUP_PER_BUCKET were smaller?

I don't see that making the hashtable much larger than ntuples is a good idea --- that just spreads out the live entries over more cache lines, resulting in more cache thrashing.

regards, tom lane
Tom Lane wrote:
> If the planner has correctly predicted the number of rows, the table
> loading should be about NTUP_PER_BUCKET in either regime. Are you
> sure you aren't just wishing that NTUP_PER_BUCKET were smaller?

Maybe I wish NTUP_PER_BUCKET were smaller. But I don't think that's the whole story.

The planner estimates definitely play a role in my concern here. For mis-estimated inner relations, the current calculation may over-subscribe the hash table even when more work_mem is available -- that is, there are too many hash collisions *and* memory isn't being used to the fullest extent allowed.

I've been tracking the number of tuples which land in each bucket, and I'd like to see that number go down as I increase work_mem. I would expect, for the same data, a hash join with a work_mem of 256MB to run faster than one run with 32MB, even if the inner relation is only 30MB.

The implementation I've been experimenting with actually takes the average of the current bucket count (ntuples / NTUP_PER_BUCKET) and the spill version (work_mem * 1024 / (tupsize * NTUP_PER_BUCKET)).

-Tim
"Timothy J. Kordas" <tkordas@greenplum.com> writes:
> I would expect for the same data a hash-join with a work_mem of 256MB
> to run faster than one run with 32MB; even if the inner relation is
> only 30MB.

Once you get to the point where each tuple is in a different bucket, it is clearly impossible for further increases in hashtable size to improve matters. All you can do is waste RAM and cache lines.

Now if we set NTUP_PER_BUCKET = 1 we would not be exactly at that critical point, because of uneven bucket loading and other factors ... but I question whether there's enough incremental improvement available to justify making the hashtable much larger than that.

regards, tom lane
On Wed, 2007-03-14 at 10:28 -0700, Timothy J. Kordas wrote:
> I would expect for the same data a hash-join with a work_mem of 256MB
> to run faster than one run with 32MB; even if the inner relation is
> only 30MB.

Certainly not for all data, but for some distributions, yes, probably.

The easiest thing to do is prove that's true, and then work out how to spot that case ahead of time, or at least find a place where you can adjust your assumptions cheaply enough to improve things.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com