Re: Adjusting hash join memory limit to handle batch explosion - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Msg-id: CA+TgmoZBWC-x4VY+OdGNh4Rz=LWeLytm+ZX745a058mmJWPy+A@mail.gmail.com
In response to: Re: Adjusting hash join memory limit to handle batch explosion (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
> I wonder if maybe a better solution would be to allow BufFiles with
> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
> that helps, before the buffering stops being effective as the buffer
> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
> in half for every nbatch doubling, we'd be down to 1B after 13 rounds
> (but the buffer is useless once it gets too small to hold multiple
> tuples, it's only like 5 cycles).

I was more thinking about whether we need to keep all of those buffers
around all the time. If the number of batches doesn't increase, then
after we finish moving things into batches they should never need to
be moved into a different batch. If it does, then things are
different, but for example if we initially plan on 64 batches and then
later decide we need 256 batches, we should really only need 3 buffers
at a time, except for the initial work during batch 0. (In this
example, a tuple that is initially assigned to batch 1 might need to
be moved to batch 65, 129, or 193, but it can't need to go anywhere
else.)
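
Just to spell out the invariant I'm relying on, here's a rough sketch
(not the real ExecHashGetBucketAndBatch code, and batch_no is a made-up
name):

    #include <stdint.h>

    /*
     * Sketch only: assume batchno is just the low-order bits of the hash
     * value, with nbatch a power of two.  Growing nbatch from 64 to 256
     * adds two more bits, so a tuple whose old batch was b can only end
     * up in b, b + 64, b + 128, or b + 192 -- and one of those is b
     * itself, so at most three new destination files are live while any
     * one old batch is being split.
     */
    static inline int
    batch_no(uint32_t hashvalue, int nbatch)
    {
        return (int) (hashvalue & (uint32_t) (nbatch - 1));
    }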

But I don't quite know how we could avoid needing all the buffers at
once during batch 0. That said, it's questionable whether it really
makes sense to have an initial number of batches that is very large.
Does partitioning the input data into 64k batches really make sense,
or would it be more efficient to partition it 256 ways initially and
then do a second pass over each of those to split them up another 256
ways? It's a lot more I/O, but trying to split 64k ways at once is
presumably going to thrash the File table as well as do a lot of
completely random physical I/O, so maybe it's worth considering.
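
Roughly what I have in mind, as a sketch with made-up names
(partition_two_pass, Tuple) and no error handling, not anything
resembling the actual executor code:

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for a spilled tuple: just its hash value, for illustration. */
    typedef struct { uint32_t hash; } Tuple;

    /*
     * Pass 1 splits the input 256 ways on the high batch bits; pass 2
     * re-reads each coarse file and splits it 256 ways on the low bits.
     * We never have more than 256 spill files (and their buffers) open
     * at once, but the end result is 64k-way partitioned data.
     */
    static void
    partition_two_pass(const Tuple *tuples, size_t ntuples)
    {
        FILE *coarse[256];

        for (int i = 0; i < 256; i++)
            coarse[i] = tmpfile();

        for (size_t t = 0; t < ntuples; t++)
            fwrite(&tuples[t], sizeof(Tuple), 1,
                   coarse[(tuples[t].hash >> 8) & 0xFF]);

        for (int i = 0; i < 256; i++)
        {
            FILE *fine[256];
            Tuple tup;

            for (int j = 0; j < 256; j++)
                fine[j] = tmpfile();    /* holds final batch (i << 8) | j */

            rewind(coarse[i]);
            while (fread(&tup, sizeof(Tuple), 1, coarse[i]) == 1)
                fwrite(&tup, sizeof(Tuple), 1, fine[tup.hash & 0xFF]);

            /* ... hand the 256 fine files off as finished batches ... */
            for (int j = 0; j < 256; j++)
                fclose(fine[j]);
            fclose(coarse[i]);
        }
    }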

Another thought is that, if we really do want to partition 64k ways
all at once, maybe 16kB set aside for each batch is not the right
approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
memory available for partitioning, wouldn't it make more sense to read a
gigabyte of tuples, sort them by batch #, and then open each file that
needs to receive at least one tuple, write that batch's tuples into it,
and close it? This seems more scalable than what we do today, because
it doesn't require a certain amount of memory per batch. The
performance might not be great if you're using a really small amount
of memory for a really large number of batches, but it might still be
better than the current algorithm, which could easily leave a lot of
that memory idling a lot of the time.
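
A sketch of that, again with made-up names (flush_chunk, cmp_by_batch)
and no pretense of matching the BufFile machinery:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { uint32_t hash; } Tuple;    /* payload omitted */

    static int nbatch;          /* power of two, set elsewhere */

    static int
    cmp_by_batch(const void *a, const void *b)
    {
        int ba = (int) (((const Tuple *) a)->hash & (uint32_t) (nbatch - 1));
        int bb = (int) (((const Tuple *) b)->hash & (uint32_t) (nbatch - 1));

        return (ba > bb) - (ba < bb);
    }

    /*
     * Accumulate roughly a gigabyte of tuples, then call this: sort them
     * by batch number and walk the sorted array, appending each run of
     * equal batch numbers to its file, opening and closing one file at a
     * time.  No memory is reserved per batch, and only batches that
     * actually received tuples ever get a file opened.  (Error handling
     * omitted; "batch.%d" is a placeholder for however the spill files
     * are really named.)
     */
    static void
    flush_chunk(Tuple *tuples, size_t ntuples)
    {
        qsort(tuples, ntuples, sizeof(Tuple), cmp_by_batch);

        for (size_t i = 0; i < ntuples; )
        {
            int   batch = (int) (tuples[i].hash & (uint32_t) (nbatch - 1));
            char  path[32];
            FILE *f;

            snprintf(path, sizeof(path), "batch.%d", batch);
            f = fopen(path, "ab");
            while (i < ntuples &&
                   (int) (tuples[i].hash & (uint32_t) (nbatch - 1)) == batch)
                fwrite(&tuples[i++], sizeof(Tuple), 1, f);
            fclose(f);
        }
    }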

Said another way, I think the current algorithm is optimized for small
numbers of batches. Repeatedly filling and flushing a 16kB buffer
makes sense when the number of buffers isn't that big: flushes happen
regularly, and a buffer typically spends a lot of its time approximately
half full. But when the number of batches becomes large,
buffers will start to be flushed less and less often, especially if
there is skew in the data but to some degree even if there isn't. Any
buffer that sits there for "a long time" -- whatever that means
exactly -- without getting flushed is not a good use of memory.
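
As a back-of-the-envelope illustration, assuming a perfectly uniform
distribution: with 64k batches, each batch's 16kB of buffer sees only
1/65536 of the spilled tuples, so something like 64k * 16kB = 1GB of
data has to spill before a given buffer fills even once. Until then,
that buffer is mostly-empty memory we can't use for anything else.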

I'm just spitballing here. Don't confuse anything in this email with a
demand for you to do something different from what you're doing.

--
Robert Haas
EDB: http://www.enterprisedb.com


