Thread: Re: Adjusting hash join memory limit to handle batch explosion

Re: Adjusting hash join memory limit to handle batch explosion

From: Robert Haas

Hi Tomas,

Thanks for working on this. I haven't studied this problem recently,
but here are some ideas that occur to me:

1. Try to reduce the per-batch overhead.

2. Stop increasing the number of batches when the per-batch overhead
exceeds a small percentage of work_mem (10%? 5%? 1%?).

If you've reached a point where the per-batch overhead is using up
>=10% of your work_mem, then at the next doubling it's going to be
using >=20%, which is pretty insane, and the next doubling after that
is going to be >=40%, which is really silly. For 1MB of work_mem and
what I gather from your remarks is about 16kB/batch, we've already
exceeded the 10% threshold by the time we reach 16 batches. Somebody
might claim that capping the number of batches at 16 is insane, but
given that those 16 batches are already using 256kB of memory while
we're supposed to finish the entire operation using <= 1MB of memory,
it really isn't. We pretty obviously are not going to be able to stay
within 1MB no matter what we do.
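
To make that concrete, the check I'm imagining is something as dumb as
this (just a sketch; the 16kB constant and all the names are
placeholders I made up, not what nodeHash.c actually has):

#include <stdbool.h>

/* Sketch only: the 16kB figure and the cutoff are placeholders. */
#define BATCH_OVERHEAD_BYTES    16384   /* rough per-batch BufFile overhead */
#define MAX_OVERHEAD_DIVISOR    10      /* allow up to work_mem / 10 */

static bool
batch_increase_allowed(long nbatch, long work_mem_bytes)
{
    long    cur_overhead = nbatch * BATCH_OVERHEAD_BYTES;

    /*
     * Once the per-batch bookkeeping alone eats more than the allowed
     * fraction of work_mem, doubling nbatch only makes the memory
     * situation worse, so stop increasing the batch count.
     */
    return cur_overhead * MAX_OVERHEAD_DIVISOR <= work_mem_bytes;
}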

I think your proposal might be a more refined version of this, where
instead of just completely ceasing to create new batches, you try to
balance creating new batches with overrunning work_mem to get the best
outcome possible overall. Maybe that's a good approach, although
perhaps it is more complicated than we need? At any rate, I found the
vadjust-size patch to be quite hard to understand. I think if you
want to go that route it would need more comments and to have the
existing ones rewritten so that they are understandable without
needing to scour this email thread (e.g. "Try to move on the
anti-diagonal and see if we'd consume less memory" doesn't seem like
something most people are going to understand without a lot of
context).

...Robert



Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra

On 1/6/25 16:42, Robert Haas wrote:
> Hi Tomas,
> 
> Thanks for working on this. I haven't studied this problem recently,
> but here are some ideas that occur to me:
> 
> 1. Try to reduce the per-batch overhead.
> 

Yeah. The "use files without buffering" approach may be seen as an
extreme version of this, but it didn't perform well enough. The "shared"
buffering was an attempt to have a buffer that doesn't need to scale
linearly with the number of batches, but that has issues too (I'm sure
some of that is due to flaws in my PoC patch).

I wonder if maybe a better solution would be to allow BufFiles with
smaller buffers, not just the hard-coded 8kB. OTOH I'm not sure how much
that helps before the buffering stops being effective as the buffer
gets smaller. I mean, we only have an 8kB buffer, so if we cut the
buffer in half for every nbatch doubling, we'd be down to 1B after 13
rounds (but the buffer is useless once it gets too small to hold
multiple tuples, so there are really only about 5 useful rounds).

Maybe it'd still work well enough if we only did that for large nbatch
values, and ensured the buffer can't get too small (say, less than 1kB).
But that only gives 3 doubling cycles - i.e. instead of 8GB of memory
we'd only use 1GB. That's an improvement, but also not very different
from what the "balancing" achieves, except that it's way more invasive
and complex.
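
For the record, the sizing rule I was imagining is roughly this
(hypothetical; buffile.c has no such knob today, and the thresholds
are made up):

#include <stddef.h>

/* Hypothetical sizing rule; buffile.c has no such knob today. */
#define BUFFILE_DEFAULT_BUFSIZE 8192    /* the current fixed buffer size */
#define BUFFILE_MIN_BUFSIZE     1024    /* don't shrink below this */
#define NBATCH_SHRINK_START     1024    /* start shrinking past this nbatch */

static size_t
batch_buffer_size(long nbatch)
{
    size_t  bufsize = BUFFILE_DEFAULT_BUFSIZE;

    /* halve the per-file buffer for each nbatch doubling past the start */
    for (long n = NBATCH_SHRINK_START; n < nbatch; n *= 2)
    {
        if (bufsize <= BUFFILE_MIN_BUFSIZE)
            break;
        bufsize /= 2;
    }

    return bufsize;
}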

> 2. Stop increasing the number of batches when the per-batch overhead
> exceeds a small percentage of work_mem (10%? 5%? 1%?).
> 
> If you've reached a point where the per-batch overhead is using up
> >=10% of your work_mem, then at the next doubling it's going to be
> using >=20%, which is pretty insane, and the next doubling after that
> is going to be >=40%, which is really silly. For 1MB of work_mem and
> what I gather from your remarks is about 16kB/batch, we've already
> exceeded the 10% threshold by the time we reach 16 batches. Somebody
> might claim that capping the number of batches at 16 is insane, but
> given that those 16 batches are already using 256kB of memory while
> we're supposed to finish the entire operation using <= 1MB of memory,
> it really isn't. We pretty obviously are not going to be able to stay
> within 1MB no matter what we do.
> 

Agreed.

> I think your proposal might be a more refined version of this, where
> instead of just completely ceasing to create new batches, you try to
> balance creating new batches with overrunning work_mem to get the best
> outcome possible overall. Maybe that's a good approach, although
> perhaps it is more complicated than we need? At any rate, I found the
> vadjust-size patch to be quite hard to understand. I think if you
> want to go that route it would need more comments and to have the
> existing ones rewritten so that they are understandable without
> needing to scour this email thread (e.g. "Try to move on the
> anti-diagonal and see if we'd consume less memory" doesn't seem like
> something most people are going to understand without a lot of
> context).
> 

Yes, the proposal does essentially this. And you're certainly right that
some of the comments are hard to understand without reading parts of the
thread, so that would need to improve.


regards

-- 
Tomas Vondra




Re: Adjusting hash join memory limit to handle batch explosion

From: Robert Haas

On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
> I wonder if maybe a better solution would be to allow BufFiles with
> smaller buffers, not just the hard-coded 8kB. OTOH I'm not sure how much
> that helps before the buffering stops being effective as the buffer
> gets smaller. I mean, we only have an 8kB buffer, so if we cut the
> buffer in half for every nbatch doubling, we'd be down to 1B after 13
> rounds (but the buffer is useless once it gets too small to hold
> multiple tuples, so there are really only about 5 useful rounds).

I was more thinking about whether we need to keep all of those buffers
around all the time. If the number of batches doesn't increase, then
after we finish moving things into batches they should never need to
be moved into a different batch. If it does, then things are
different, but for example if we initially plan on 64 batches and then
later decide we need 256 batches, we should really only need 3 buffers
at a time, except for the initial work during batch 0. (In this
example, a tuple that is initially assigned to batch 1 might need to
be moved to batch 65, 129, or 193, but it can't need to go anywhere
else.)
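
Here's the arithmetic behind that as a trivial standalone program, just
to illustrate the congruence (the executor derives batchno from hash
bits rather than doing a literal modulo, but it works out the same way):

#include <stdio.h>

/*
 * Illustration only: when nbatch grows from 64 to 256, a tuple's new
 * batch number must be congruent to its old batch number mod 64, so a
 * tuple from batch 1 can only stay in 1 or move to 65, 129 or 193.
 */
int
main(void)
{
    unsigned int nbatch_old = 64;
    unsigned int nbatch_new = 256;
    unsigned int oldbatch = 1;

    for (unsigned int b = oldbatch; b < nbatch_new; b += nbatch_old)
        printf("possible destination: batch %u\n", b);

    return 0;
}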

But I don't quite know how we could avoid needing all the buffers at
once during batch 0. That said, it's questionable whether it really
makes sense to have an initial number of batches that is very large.
Does partitioning the input data into 64k batches really make sense,
or would it be more efficient to partition it 256 ways initially and
then do a second pass over each of those to split them up another 256
ways? It's a lot more I/O, but trying to split 64k ways at once is
presumably going to thrash the File table as well as do a lot of
completely random physical I/O, so maybe it's worth considering.

Another thought is that, if we really do want to partition 64k ways
all at once, maybe 16kB set aside for each batch is not the right
approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
memory available for partitioning, wouldn't it make sense to read a
gigabyte of tuples, sort them by batch #, and then open each file that
needs to get at least 1 tuple, write all the tuples into that file,
and close it? This seems more scalable than what we do today, because
it doesn't require a certain amount of memory per batch. The
performance might not be great if you're using a really small amount
of memory for a really large number of batches, but it might still be
better than the current algorithm, which could easily leave a lot of
that memory idling a lot of the time.
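
In pseudo-code, I'm thinking of something along these lines (a pure
sketch: the tuple layout, the helpers, and the file naming are all
invented here, not what nodeHash.c does):

#include <stdio.h>
#include <stdlib.h>

/* Pure sketch: the tuple layout and file naming are invented. */
typedef struct SpillTuple
{
    int         batchno;    /* destination batch number */
    const char *data;       /* serialized tuple */
    size_t      len;        /* length of the serialized tuple */
} SpillTuple;

static int
cmp_by_batch(const void *a, const void *b)
{
    return ((const SpillTuple *) a)->batchno - ((const SpillTuple *) b)->batchno;
}

/*
 * Flush a large in-memory buffer of tuples: sort by destination batch,
 * then open each batch file once, append its run of tuples, and close
 * it again.  No per-batch buffer has to be kept around between flushes,
 * so memory use doesn't grow with the number of batches.
 */
static void
flush_spill_buffer(SpillTuple *tuples, size_t ntuples)
{
    qsort(tuples, ntuples, sizeof(SpillTuple), cmp_by_batch);

    for (size_t i = 0; i < ntuples;)
    {
        int         batchno = tuples[i].batchno;
        char        path[64];
        FILE       *f;

        snprintf(path, sizeof(path), "batch-%d.tmp", batchno);
        f = fopen(path, "ab");  /* append, earlier flushes may exist */
        if (f == NULL)
            exit(1);

        while (i < ntuples && tuples[i].batchno == batchno)
        {
            fwrite(tuples[i].data, 1, tuples[i].len, f);
            i++;
        }

        fclose(f);
    }
}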

Said another way, I think the current algorithm is optimized for small
numbers of batches. Repeatedly filling and flushing a 16kB buffer
makes sense if the number of buffers isn't that big so that flushes
are regular and a buffer is typically going to spend a lot of its time
approximately half full. But when the number of batches becomes large,
buffers will start to be flushed less and less often, especially if
there is skew in the data but to some degree even if there isn't. Any
buffer that sits there for "a long time" -- whatever that means
exactly -- without getting flushed is not a good use of memory.

I'm just spitballing here. Don't confuse anything in this email with a
demand for you to do something different than you are.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra

On 1/6/25 19:50, Robert Haas wrote:
> On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
>> I wonder if maybe a better solution would be to allow BufFiles with
>> smaller buffers, not just the hard-coded 8kB. OTOH I'm not sure how much
>> that helps before the buffering stops being effective as the buffer
>> gets smaller. I mean, we only have an 8kB buffer, so if we cut the
>> buffer in half for every nbatch doubling, we'd be down to 1B after 13
>> rounds (but the buffer is useless once it gets too small to hold
>> multiple tuples, so there are really only about 5 useful rounds).
> 
> I was more thinking about whether we need to keep all of those buffers
> around all the time. If the number of batches doesn't increase, then
> after we finish moving things into batches they should never need to
> be moved into a different batch. If it does, then things are
> different, but for example if we initially plan on 64 batches and then
> later decide we need 256 batches, we should really only need 3 buffers
> at a time, except for the initial work during batch 0. (In this
> example, a tuple that is initially assigned to batch 1 might need to
> be moved to batch 65, 129, or 193, but it can't need to go anywhere
> else.)
> 

Right.

> But I don't quite know how we could avoid needing all the buffers at
> once during batch 0. That said, it's questionable whether it really
> makes sense to have an initial number of batches that is very large.
> Does partitioning the input data into 64k batches really make sense,
> or would it be more efficient to partition it 256 ways initially and
> then do a second pass over each of those to split them up another 256
> ways? It's a lot more I/O, but trying to split 64k ways at once is
> presumably going to thrash the File table as well as do a lot of
> completely random physical I/O, so maybe it's worth considering.
> 

True, but as soon as you limit the number of batches you can generate,
isn't that more or less the same as not enforcing the limit on the
amount of memory consumed by the hash table? After all, you have to keep
the tuples that belong to the current batch in memory ...

I suppose you could do this recursively, i.e. split into 256 batches,
and if you still can't keep the current batch in memory, spill it to
disk too. Then read it back from the file and split it into 256 more
batches. I think we'd need to remember the nbatch value for each batch
(at the time it was created), and then go through all the stages up to
the current nbatch. But it could work, I guess.
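
The bookkeeping I mean would be something like this (hypothetical
struct and helper, purely to check I understand the shape of it):

/*
 * Hypothetical bookkeeping, just to illustrate: each spill file
 * remembers the nbatch value in effect when it was created, so when we
 * come back to it we know how many further split passes it still needs.
 */
typedef struct SpillBatch
{
    int     batchno;            /* batch number within nbatch_at_creation */
    int     nbatch_at_creation; /* nbatch when this file was written */
    char    path[64];           /* file holding the batch's tuples */
} SpillBatch;

/* number of additional 256-way split passes needed to reach nbatch_now */
static int
split_passes_remaining(const SpillBatch *batch, int nbatch_now)
{
    int     passes = 0;

    for (int n = batch->nbatch_at_creation; n < nbatch_now; n *= 256)
        passes++;

    return passes;
}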

The thing is, I don't think increasing work_mem is bad - in fact, it's
exactly the thing that may stop the batch explosion when there are hash
collisions / overlaps. That's what the test script does to trigger the
explosion, although I admit it's an artificial / adversarial case. But
similar things can happen in production, and we blindly increase nbatch
even when it can't possibly help, stopping only after running out of
hash bits.

> Another thought is that, if we really do want to partition 64k ways
> all at once, maybe 16kB set aside for each batch is not the right
> approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
> memory available for partitioning, wouldn't it make sense to read a
> gigabyte of tuples, sort them by batch #, and then open each file that
> needs to get at least 1 tuple, write all the tuples into that file,
> and close it? This seems more scalable than what we do today, because
> it doesn't require a certain amount of memory per batch. The
> performance might not be great if you're using a really small amount
> of memory for a really large number of batches, but it might still be
> better than the current algorithm, which could easily leave a lot of
> that memory idling a lot of the time.
> 

This is pretty much the idea behind the 0002 patch in the "raw files"
PoC series, although I tried to use a much smaller batch. Maybe with 1GB
(and better coding than in my PoC patch) it would work better.

Still, if we have 1GB for a buffer, maybe it'd be better to use some of
that for a larger hash table, and not need that many batches ...

> Said another way, I think the current algorithm is optimized for small
> numbers of batches. Repeatedly filling and flushing a 16kB buffer
> makes sense if the number of buffers isn't that big so that flushes
> are regular and a buffer is typically going to spend a lot of its time
> approximately half full. But when the number of batches becomes large,
> buffers will start to be flushed less and less often, especially if
> there is skew in the data but to some degree even if there isn't. Any
> buffer that sits there for "a long time" -- whatever that means
> exactly -- without getting flushed is not a good use of memory.
> 

Right. FWIW I suspect we had similar discussions for the hashagg.

> I'm just spitballing here. Don't confuse anything in this email with a
> demand for you to do something different than you are.
> 

I won't - and thanks. It's good to have these discussions and be forced
to think about it from a different angle.

regards

-- 
Tomas Vondra