Re: Adjusting hash join memory limit to handle batch explosion

From: Melanie Plageman
On Thu, Feb 6, 2025 at 1:48 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> Here's a slightly simplified version of the "balancing" patch. I decided
> to stop increasing the nbucket value at runtime, even if the hashtable
> grows larger than the memory limit (which is what we used to calculate
> the initial nbucket value in ExecChooseHashTableSize).

I started looking at these patches.
First question: is the GUC enable_hashjoin_adjust meant for development,
or do you intend it for users (and for it to be off by default)?

In 0001, in ExecChooseHashTableSize(), I would start the large block
comment with something that indicates this is a continuation of the
calculation above it for getting the required number of batches.

You say:
     * Optimize the total amount of memory consumed by the hash node.
     * The nbatch calculation above focuses on the size of the in-memory hash
     * table, ignoring the memory used by batch files. But that can be a lot
     * of memory - each batch file has a BLCKSZ buffer, and we may need two
     * files per batch (inner and outer side). So with enough batches this can
     * be significantly more memory than the hashtable itself, and it grows
     * quickly as we're adding more batches.

I might make it more explicit:
nbatch is calculated above purely from the size of the inner relation
and the bytes available for the hashtable, assuming no per-batch
overhead. Now, recalibrate the number of batches and the size of the
hashtable to optimize the total amount of memory consumed by the
hash node.

Then go into the rest of your details.
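
FWIW, while reviewing I found it easiest to think of the accounting the
comment describes as a tiny formula. Roughly (the function and names are
mine, purely illustrative, not from the patch):

    #include <stddef.h>

    #define BLCKSZ 8192		/* PostgreSQL's default block size */

    /*
     * Total memory the hash node consumes under the model the comment
     * describes: the in-memory hash table, plus a BLCKSZ buffer for each
     * batch file, assuming two files (inner and outer) per batch.
     */
    static size_t
    hash_node_total_memory(size_t hashtable_bytes, int nbatch)
    {
        return hashtable_bytes + (size_t) nbatch * 2 * BLCKSZ;
    }

The second term is what the existing nbatch calculation ignores, and it
doubles every time nbatch doubles.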

This paragraph, while essential, could probably use a bit of massaging:

     * This means we're only ever reducing nbatch values, we'll never increase
     * it (as we're not considering nbatch*2). We could consider that too,
     * depending on which part of the [nbatch,work_mem] table we're in. And
     * for cases with high work_mem values, we would find that adding batches
     * reduces memory usage. But the hashtable size is what we consider when
     * calculating the initial nbatch value, and if it's dominating the memory
     * usage, it means we're not exceeding the expected memory limit (at least
     * not significantly). There is little risk of OOM or memory overruns. Our
     * goal is not to minimize the memory usage, but to enforce the limit set
     * by the user. Minimizing the memory usage would result in spilling many
     * more batch files, which does not seem great for performance. So we only
     * ever reduce nbatch, never increase it.

The point is that if we aren't exceeding the expected memory limit,
then we won't increase the number of batches to try and save memory
because it will probably hurt performance in other ways. All the other
details are useful, but I found myself a bit lost in them (the way
they are phrased now).
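
To make that concrete (numbers made up): with work_mem = 256MB, the
initial sizing picks nbatch so the hash table is roughly at that limit,
and 32 batches add only 32 * 2 * 8kB = 512kB of file buffers on top of
it. Doubling nbatch might roughly halve the hash table, but we were
already within the limit the user asked for, so all the doubling buys
us is more spilling.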

     * While growing the hashtable, we also adjust the number of buckets, to
     * not have more than one tuple per bucket. We can only do this during

What does "to not have more than one tuple per bucket" mean?

     * So after the initial sizing (here in ExecChooseHashTableSize), the
     * number of buckets is effectively fixed. ExecHashGetBucketAndBatch
     * could calculate batchno/bucketno in a different way, but that's
     * left as a separate improvement. To some extent this is a preexisting
     * issue - if we set growEnabled=false, this allows the hashtable to
     * exceed the memory limit too, and we don't adjust the bucket count.
     * However, that likely happens due to duplicate values and/or hash
     * collisions, so it's not clear if increasing the bucket count would
     * actually spread the tuples through the buckets. It would help with
     * skewed data sets, when we may disable the growth early, and then
     * add more tuples with distinct hash values.

After "is effectively fixed", I'm not sure how much more of this
detail I would include in this comment. There is already quite a lot
of information. Especially the sentence "ExecHashGetBucketAndBatch
could calculate batchno/bucketno in a different way, but that's left
as a separate improvement" seems like it would need more information
to be clear enough to the reader -- so maybe just omit it.

If nothing else, I would move the discussion about why we don't
increase the number of buckets to a place where we are actually _not_
increasing the number of buckets (ExecHashIncreaseBatchSize()). In
this location, we are increasing nbuckets.
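
For anyone following along, the reason nbuckets ends up effectively
fixed is that batchno is derived from the hash bits above the bucket
bits, so changing nbuckets once batching has started would move tuples
into different batches. From memory (so treat this as a sketch rather
than the exact current source), ExecHashGetBucketAndBatch does roughly:

    #include <stdint.h>

    /*
     * Sketch from memory, not the exact source: the low log2_nbuckets
     * bits of the hash pick the bucket, the bits above them pick the
     * batch.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets,
                         int log2_nbuckets, uint32_t nbatch,
                         uint32_t *bucketno, uint32_t *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = (nbatch > 1) ?
            (hashvalue >> log2_nbuckets) & (nbatch - 1) : 0;
    }

The "different way" the comment alludes to would presumably have to
decouple batchno from log2_nbuckets.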

As for ExecHashIncreaseBatchSize():

     * XXX We're comparing the current spaceAllowed/batchSpace values, because
     * if we double either of those this is the new memory we'll use.

I don't get this. Firstly, why is it an XXX comment? Secondly, why are
we using the current spaceAllowed value?
In fact, I don't quite understand how this is actually increasing the
size of the hashtable at all. All it does is cause us to dump out of
ExecHashIncreaseNumBatches() without increasing the number of batches.
Is the reason you don't increase the actual spaceAllowed value because
you don't want it to be larger for other batches that might benefit
from a batch doubling?
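
My best guess at the rule the check is implementing -- spelled out with
my own names, so correct me if I've misread the patch -- is that the
current values are compared because each one is exactly the increment
we'd pay for doubling it:

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192		/* PostgreSQL's default block size */

    /*
     * Illustrative only, not the patch's code: doubling nbatch adds
     * roughly batch_space more bytes of file buffers, while letting the
     * hash table double adds roughly space_allowed more bytes.  Prefer
     * whichever addition is smaller.
     */
    static bool
    prefer_growing_hashtable(size_t space_allowed, int nbatch)
    {
        size_t batch_space = (size_t) nbatch * 2 * BLCKSZ;

        return space_allowed <= batch_space;
    }

If that is the intent, spelling it out in the comment would help a lot.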

- Melanie