Thread: Re: Adjusting hash join memory limit to handle batch explosion
Hi Tomas, Thanks for working on this. I haven't studied this problem recently, but here are some ideas that occur to me: 1. Try to reduce the per-batch overhead. 2. Stop increasing the number of batches when the per-batch overhead exceeds a small percentage of work_mem (10%? 5%? 1%?). If you've reached a point where the per-batch overhead is using up >=10% of your work_mem, then at the next doubling it's going to be using >=20%, which is pretty insane, and the next doubling after that is going to be >=40%, which is really silly. For 1MB of work_mem and what I gather from your remarks is 16kB/batch, we exceed the 10% threshold at 16 batches. Somebody might claim that capping the number of batches to 16 is insane, but those 16 batches are using 256kB of memory and we're supposed to finish the entire operation using <= 1MB of memory, it really isn't. We pretty obviously are not going to be able to stay within 1MB no matter what we do. I think your proposal might be a more refined version of this, where instead of just completely ceasing to create new batches, you try to balance creating new batches with overrunning work_mem to get the best outcome possible overall. Maybe that's a good approach, although perhaps it is more complicated than we need? At any rate, I found the vadjust-size patch to be quite hard to understand. I think you if you want to go that route it would need more comments and to have the existing ones rewritten so that they are understandable without needing to scour this email thread (e.g. "Try to move on the anti-diagonal and see if we'd consume less memory" doesn't seem like something most people are going to understand without a lot of context). ...Robert
On 1/6/25 16:42, Robert Haas wrote: > Hi Tomas, > > Thanks for working on this. I haven't studied this problem recently, > but here are some ideas that occur to me: > > 1. Try to reduce the per-batch overhead. > Yeah. The "use files without buffering" approach may be seen as an extreme version of this, but it didn't perform well enough. The "shared" buffering was an attempt to have a buffer that doesn't need to scale linearly with the number of batches, but that has issues too (I'm sure some of that is due to my faults in the PoC patch). I wonder if maybe a better solution would be to allow BufFiles with smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much that helps, before the buffering stops being effective as the buffer gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer in half for every nbatch doubling, we'd be down to 1B after 13 rounds (but the buffer is useless once it gets too small to hold multiple tuples, it's only like 5 cycles). Maybe it'd still work well enough if we only did that for large nbatch values, and ensured the buffer can't get too small (say, less than 1kB). But that only gives 3 doubling cycles - i.e. instead of 8GB of memory we'd only use 1GB. That's an improvement, but also not very different from what the "balancing" achieves, except that it's way more invasive and complex. > 2. Stop increasing the number of batches when the per-batch overhead > exceeds a small percentage of work_mem (10%? 5%? 1%?). > > If you've reached a point where the per-batch overhead is using up >> =10% of your work_mem, then at the next doubling it's going to be > using >=20%, which is pretty insane, and the next doubling after that > is going to be >=40%, which is really silly. For 1MB of work_mem and > what I gather from your remarks is 16kB/batch, we exceed the 10% > threshold at 16 batches. Somebody might claim that capping the number > of batches to 16 is insane, but those 16 batches are using 256kB of > memory and we're supposed to finish the entire operation using <= 1MB > of memory, it really isn't. We pretty obviously are not going to be > able to stay within 1MB no matter what we do. > Agreed. > I think your proposal might be a more refined version of this, where > instead of just completely ceasing to create new batches, you try to > balance creating new batches with overrunning work_mem to get the best > outcome possible overall. Maybe that's a good approach, although > perhaps it is more complicated than we need? At any rate, I found the > vadjust-size patch to be quite hard to understand. I think you if you > want to go that route it would need more comments and to have the > existing ones rewritten so that they are understandable without > needing to scour this email thread (e.g. "Try to move on the > anti-diagonal and see if we'd consume less memory" doesn't seem like > something most people are going to understand without a lot of > context). > Yes, the proposal does essentially this. And you're certainly right some of the comments are hard to understand without reading some of the thread, so that would need to improve. regards -- Tomas Vondra
On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote: > I wonder if maybe a better solution would be to allow BufFiles with > smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much > that helps, before the buffering stops being effective as the buffer > gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer > in half for every nbatch doubling, we'd be down to 1B after 13 rounds > (but the buffer is useless once it gets too small to hold multiple > tuples, it's only like 5 cycles). I was more thinking about whether we need to keep all of those buffers around all the time. If the number of batches doesn't increase, then after we finish moving things into batches they should never need to be moved into a different batch. If it does, then things are different, but for example if we initially plan on 64 batches and then later decide we need 256 batches, we should really only need 3 buffers at a time, except for the initial work during batch 0. (In this example, a tuple that is initially assigned to batch 1 might need to be moved to batch 65, 129, or 193, but it can't need to go anywhere else.) But I don't quite know how we could avoid needing all the buffers at once during batch 0. That said, it's questionable whether it really make sense to have an initial number of batches that is very large. Does partitioning the input data into 64k batches really make sense, or would it be more efficient to partition it 256 ways initially and then do a second pass over each of those to split them up another 256 ways? It's a lot more I/O, but trying to split 64k ways at once is presumably going to thrash the File table as well as do a lot of completely random physical I/O, so maybe it's worth considering. Another thought is that, if we really do want to partition 64k ways all at once, maybe 16kb set aside for each batch is not the right approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of memory available for partitioning, wouldn't it make sense to read a gigabyte of tuples, sort them by batch #, and then open each file that needs to get at least 1 tuple, write all the tuples into that file, and close it? This seems more scalable than what we do today, because it doesn't require a certain amount of memory per batch. The performance might not be great if you're using a really small amount of memory for a really large number of batches, but it might still be better than the current algorithm, which could easily leave a lot of that memory idling a lot of the time. Said another way, I think the current algorithm is optimized for small numbers of batches. Repeatedly filling and flushing a 16kB buffer makes sense if the number of buffers isn't that big so that flushes are regular and a buffer is typically going to spend a lot of its time approximately half full. But when the number of batches becomes large, buffers will start to be flushed less and less often, especially if there is skew in the data but to some degree even if there isn't. Any buffer that sits there for "a long time" -- whatever that means exactly -- without getting flushed is not a good use of memory. I'm just spitballing here. Don't confuse anything in this email with a demand for you to do something different than you are. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/6/25 19:50, Robert Haas wrote: > On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote: >> I wonder if maybe a better solution would be to allow BufFiles with >> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much >> that helps, before the buffering stops being effective as the buffer >> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer >> in half for every nbatch doubling, we'd be down to 1B after 13 rounds >> (but the buffer is useless once it gets too small to hold multiple >> tuples, it's only like 5 cycles). > > I was more thinking about whether we need to keep all of those buffers > around all the time. If the number of batches doesn't increase, then > after we finish moving things into batches they should never need to > be moved into a different batch. If it does, then things are > different, but for example if we initially plan on 64 batches and then > later decide we need 256 batches, we should really only need 3 buffers > at a time, except for the initial work during batch 0. (In this > example, a tuple that is initially assigned to batch 1 might need to > be moved to batch 65, 129, or 193, but it can't need to go anywhere > else.) > Right. > But I don't quite know how we could avoid needing all the buffers at > once during batch 0. That said, it's questionable whether it really > make sense to have an initial number of batches that is very large. > Does partitioning the input data into 64k batches really make sense, > or would it be more efficient to partition it 256 ways initially and > then do a second pass over each of those to split them up another 256 > ways? It's a lot more I/O, but trying to split 64k ways at once is > presumably going to thrash the File table as well as do a lot of > completely random physical I/O, so maybe it's worth considering. > True, but as soon as you limit the number of batches you could generate, it's that more or less the same as not enforcing the limit on the amount of memory consumed by the hash table? Because you have to keep the tuples that belong to the current batch in memory ... I suppose you could do this recursively, i.e. split to 256 batches, and once you can keep the current batch in memory, spill it to disk too. And then read it from file, and split it into 256 more batches. I think we'd need to remember the minimum nbatch value for each batch (when it was created), and then go through all the stages up to current nbatch. But it could work, I guess. The thing is - I don't think increasing the work_mem is bad - in fact, it's exactly the thing that may stop the batch explosion when there are hash collisions / overlaps. That's what the test script does to trigger the explosion, although I admit it's an artificial / adversary case. But similar stuff can happen in PROD, and we blindly increase nbatch when it can't possibly help, stopping only after running out of hash bits. > Another thought is that, if we really do want to partition 64k ways > all at once, maybe 16kb set aside for each batch is not the right > approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of > memory available for partitioning, wouldn't it make sense to read a > gigabyte of tuples, sort them by batch #, and then open each file that > needs to get at least 1 tuple, write all the tuples into that file, > and close it? This seems more scalable than what we do today, because > it doesn't require a certain amount of memory per batch. The > performance might not be great if you're using a really small amount > of memory for a really large number of batches, but it might still be > better than the current algorithm, which could easily leave a lot of > that memory idling a lot of the time. > This is pretty much the idea behind the 0002 patch in the "raw files" PoC patch, although I tried to use a much smaller batch. Maybe with 1GB (and better coding than in my PoC patch) it would work better. Still, if we have 1GB for a buffer, maybe it'd be better to use some of that for a larger hash table, and not need that many batches ... > Said another way, I think the current algorithm is optimized for small > numbers of batches. Repeatedly filling and flushing a 16kB buffer > makes sense if the number of buffers isn't that big so that flushes > are regular and a buffer is typically going to spend a lot of its time > approximately half full. But when the number of batches becomes large, > buffers will start to be flushed less and less often, especially if > there is skew in the data but to some degree even if there isn't. Any > buffer that sits there for "a long time" -- whatever that means > exactly -- without getting flushed is not a good use of memory. > Right. FWIW I suspect we had similar discussions for the hashagg. > I'm just spitballing here. Don't confuse anything in this email with a > demand for you to do something different than you are. > No, thanks. It's good to have these discussions and be forced to think about it from a different angle. regards -- Tomas Vondra