Re: Adjusting hash join memory limit to handle batch explosion - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adjusting hash join memory limit to handle batch explosion
Date
Msg-id a53ba073-320a-4fc5-bb5f-da5db4fa973d@vondra.me
Whole thread Raw
In response to Re: Adjusting hash join memory limit to handle batch explosion  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Adjusting hash join memory limit to handle batch explosion
List pgsql-hackers

On 1/6/25 16:42, Robert Haas wrote:
> Hi Tomas,
> 
> Thanks for working on this. I haven't studied this problem recently,
> but here are some ideas that occur to me:
> 
> 1. Try to reduce the per-batch overhead.
> 

Yeah. The "use files without buffering" approach may be seen as an
extreme version of this, but it didn't perform well enough. The "shared"
buffering was an attempt to have a buffer that doesn't need to scale
linearly with the number of batches, but that has issues too (I'm sure
some of that is due to my faults in the PoC patch).

I wonder if maybe a better solution would be to allow BufFiles with
smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
that helps, before the buffering stops being effective as the buffer
gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
in half for every nbatch doubling, we'd be down to 1B after 13 rounds
(but the buffer is useless once it gets too small to hold multiple
tuples, it's only like 5 cycles).

Maybe it'd still work well enough if we only did that for large nbatch
values, and ensured the buffer can't get too small (say, less than 1kB).
But that only gives 3 doubling cycles - i.e. instead of 8GB of memory
we'd only use 1GB. That's an improvement, but also not very different
from what the "balancing" achieves, except that it's way more invasive
and complex.

> 2. Stop increasing the number of batches when the per-batch overhead
> exceeds a small percentage of work_mem (10%? 5%? 1%?).
> 
> If you've reached a point where the per-batch overhead is using up
>> =10% of your work_mem, then at the next doubling it's going to be
> using >=20%, which is pretty insane, and the next doubling after that
> is going to be >=40%, which is really silly. For 1MB of work_mem and
> what I gather from your remarks is 16kB/batch, we exceed the 10%
> threshold at 16 batches. Somebody might claim that capping the number
> of batches to 16 is insane, but those 16 batches are using 256kB of
> memory and we're supposed to finish the entire operation using <= 1MB
> of memory, it really isn't. We pretty obviously are not going to be
> able to stay within 1MB no matter what we do.
> 

Agreed.

> I think your proposal might be a more refined version of this, where
> instead of just completely ceasing to create new batches, you try to
> balance creating new batches with overrunning work_mem to get the best
> outcome possible overall. Maybe that's a good approach, although
> perhaps it is more complicated than we need? At any rate, I found the
> vadjust-size patch to be quite hard to understand. I think you if you
> want to go that route it would need more comments and to have the
> existing ones rewritten so that they are understandable without
> needing to scour this email thread (e.g. "Try to move on the
> anti-diagonal and see if we'd consume less memory" doesn't seem like
> something most people are going to understand without a lot of
> context).
> 

Yes, the proposal does essentially this. And you're certainly right some
of the comments are hard to understand without reading some of the
thread, so that would need to improve.


regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Fwd: Re: A new look at old NFS readdir() problems?
Next
From: Fujii Masao
Date:
Subject: Re: Enhancing Memory Context Statistics Reporting