Re: Adjusting hash join memory limit to handle batch explosion

From: Robert Haas
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Msg-id: CA+TgmoZBWC-x4VY+OdGNh4Rz=LWeLytm+ZX745a058mmJWPy+A@mail.gmail.com
In response to: Re: Adjusting hash join memory limit to handle batch explosion (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Adjusting hash join memory limit to handle batch explosion
List: pgsql-hackers
On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas@vondra.me> wrote:
> I wonder if maybe a better solution would be to allow BufFiles with
> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
> that helps, before the buffering stops being effective as the buffer
> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
> in half for every nbatch doubling, we'd be down to 1B after 13 rounds
> (but the buffer is useless once it gets too small to hold multiple
> tuples, it's only like 5 cycles).

I was thinking more about whether we need to keep all of those buffers around all the time. If the number of batches doesn't increase, then after we finish moving things into batches, they should never need to be moved into a different batch. If it does increase, things are different, but if, for example, we initially plan on 64 batches and later decide we need 256, we should really only need 3 buffers at a time, except for the initial work during batch 0. (In this example, a tuple that is initially assigned to batch 1 might need to be moved to batch 65, 129, or 193, but it can't need to go anywhere else; see the first sketch below.) But I don't quite know how we could avoid needing all the buffers at once during batch 0.

That said, it's questionable whether it really makes sense to have an initial number of batches that is very large. Does partitioning the input data into 64k batches really make sense, or would it be more efficient to partition it 256 ways initially and then do a second pass over each of those partitions to split it another 256 ways? That's a lot more I/O, but trying to split 64k ways at once is presumably going to thrash the File table as well as do a lot of completely random physical I/O, so maybe it's worth considering.

Another thought is that, if we really do want to partition 64k ways all at once, maybe setting aside 16kB for each batch is not the right approach. 64k batches * 16kB/batch = 1GB, but if we have 1GB of memory available for partitioning, wouldn't it make more sense to read a gigabyte of tuples, sort them by batch #, and then, for each batch that needs to receive at least one tuple, open its file, write all of that batch's tuples into it, and close it (see the second sketch below)? This seems more scalable than what we do today, because it doesn't require a certain amount of memory per batch. The performance might not be great if you're using a really small amount of memory for a really large number of batches, but it might still be better than the current algorithm, which can easily leave a lot of that memory idling a lot of the time.

Said another way, I think the current algorithm is optimized for small numbers of batches. Repeatedly filling and flushing a 16kB buffer makes sense if the number of buffers isn't that big, so that flushes are regular and a buffer typically spends a lot of its time approximately half full. But when the number of batches becomes large, buffers will be flushed less and less often, especially if there is skew in the data, but to some degree even if there isn't. Any buffer that sits there for "a long time" -- whatever that means exactly -- without getting flushed is not a good use of memory.

I'm just spitballing here. Don't confuse anything in this email with a demand that you do something different than what you're doing.

--
Robert Haas
EDB: http://www.enterprisedb.com
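
[Editor's sketch 1] A minimal sketch of the batch-renumbering arithmetic described in the message, assuming the batch number is simply the hash value taken modulo nbatch with nbatch a power of two. This is a simplification of what nodeHash.c actually does, but it has the same property the argument relies on: raising nbatch from 64 to 256 only consults two more hash bits, so a tuple in batch 1 can only end up in batch 1, 65, 129, or 193.

#include <stdint.h>
#include <stdio.h>

static uint32_t
batchno(uint32_t hash, uint32_t nbatch)
{
    return hash & (nbatch - 1);     /* hash % nbatch, nbatch a power of two */
}

int
main(void)
{
    /*
     * Every hash that maps to batch 1 when nbatch = 64 has the form
     * 64k + 1.  With nbatch = 256, the two extra bits pick one of the
     * four destinations 1, 65, 129, or 193 -- nothing else is possible.
     */
    uint32_t examples[] = {1, 65, 129, 193, 257, 4097, 999937};

    for (int i = 0; i < 7; i++)
    {
        uint32_t h = examples[i];

        printf("hash %7u: batch %2u of 64 -> batch %3u of 256\n",
               h, batchno(h, 64), batchno(h, 256));
    }
    return 0;
}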
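
[Editor's sketch 2] A rough sketch of the "read a chunk of tuples, sort them by batch number, then write each batch's run in one open/write/close pass" idea. The Tuple layout and the per-batch temp-file naming are illustrative assumptions, not PostgreSQL's BufFile machinery.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    batchno;        /* destination batch for this tuple */
    char        payload[64];    /* stand-in for the real tuple data */
} Tuple;

static int
cmp_by_batch(const void *a, const void *b)
{
    uint32_t    ba = ((const Tuple *) a)->batchno;
    uint32_t    bb = ((const Tuple *) b)->batchno;

    return (ba > bb) - (ba < bb);
}

/*
 * Flush one memory-sized chunk of tuples.  Sorting groups the tuples by
 * destination batch, so each batch file that receives at least one tuple
 * is opened once, written sequentially, and closed -- no per-batch buffer
 * has to stay resident for the whole partitioning pass.
 */
static void
flush_chunk(Tuple *tuples, size_t ntuples)
{
    qsort(tuples, ntuples, sizeof(Tuple), cmp_by_batch);

    for (size_t i = 0; i < ntuples;)
    {
        uint32_t    batch = tuples[i].batchno;
        char        name[64];
        FILE       *f;

        snprintf(name, sizeof(name), "batch_%u.tmp", batch);
        f = fopen(name, "ab");      /* append: later chunks extend the run */
        if (f == NULL)
        {
            perror("fopen");
            exit(1);
        }

        while (i < ntuples && tuples[i].batchno == batch)
            fwrite(&tuples[i++], sizeof(Tuple), 1, f);

        fclose(f);
    }
}

int
main(void)
{
    /* Toy chunk: six tuples spread over four batches. */
    Tuple chunk[6] = {
        {193, ""}, {1, ""}, {65, ""}, {1, ""}, {129, ""}, {65, ""}
    };

    flush_chunk(chunk, 6);
    return 0;
}

The memory cost here is per chunk rather than per batch, which is the scalability point made above: only the batches that actually receive tuples from the current chunk cause any file activity.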