Re: Adjusting hash join memory limit to handle batch explosion - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Adjusting hash join memory limit to handle batch explosion |
Date | |
Msg-id | a53ba073-320a-4fc5-bb5f-da5db4fa973d@vondra.me Whole thread Raw |
In response to | Re: Adjusting hash join memory limit to handle batch explosion (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Adjusting hash join memory limit to handle batch explosion
|
List | pgsql-hackers |
On 1/6/25 16:42, Robert Haas wrote: > Hi Tomas, > > Thanks for working on this. I haven't studied this problem recently, > but here are some ideas that occur to me: > > 1. Try to reduce the per-batch overhead. > Yeah. The "use files without buffering" approach may be seen as an extreme version of this, but it didn't perform well enough. The "shared" buffering was an attempt to have a buffer that doesn't need to scale linearly with the number of batches, but that has issues too (I'm sure some of that is due to my faults in the PoC patch). I wonder if maybe a better solution would be to allow BufFiles with smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much that helps, before the buffering stops being effective as the buffer gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer in half for every nbatch doubling, we'd be down to 1B after 13 rounds (but the buffer is useless once it gets too small to hold multiple tuples, it's only like 5 cycles). Maybe it'd still work well enough if we only did that for large nbatch values, and ensured the buffer can't get too small (say, less than 1kB). But that only gives 3 doubling cycles - i.e. instead of 8GB of memory we'd only use 1GB. That's an improvement, but also not very different from what the "balancing" achieves, except that it's way more invasive and complex. > 2. Stop increasing the number of batches when the per-batch overhead > exceeds a small percentage of work_mem (10%? 5%? 1%?). > > If you've reached a point where the per-batch overhead is using up >> =10% of your work_mem, then at the next doubling it's going to be > using >=20%, which is pretty insane, and the next doubling after that > is going to be >=40%, which is really silly. For 1MB of work_mem and > what I gather from your remarks is 16kB/batch, we exceed the 10% > threshold at 16 batches. Somebody might claim that capping the number > of batches to 16 is insane, but those 16 batches are using 256kB of > memory and we're supposed to finish the entire operation using <= 1MB > of memory, it really isn't. We pretty obviously are not going to be > able to stay within 1MB no matter what we do. > Agreed. > I think your proposal might be a more refined version of this, where > instead of just completely ceasing to create new batches, you try to > balance creating new batches with overrunning work_mem to get the best > outcome possible overall. Maybe that's a good approach, although > perhaps it is more complicated than we need? At any rate, I found the > vadjust-size patch to be quite hard to understand. I think you if you > want to go that route it would need more comments and to have the > existing ones rewritten so that they are understandable without > needing to scour this email thread (e.g. "Try to move on the > anti-diagonal and see if we'd consume less memory" doesn't seem like > something most people are going to understand without a lot of > context). > Yes, the proposal does essentially this. And you're certainly right some of the comments are hard to understand without reading some of the thread, so that would need to improve. regards -- Tomas Vondra
pgsql-hackers by date: