The root cause is that hash join treats batches as pretty much free, but
that's not really true - we do allocate two BufFile structs per batch,
and each BufFile is ~8kB because it embeds a PGAlignedBlock buffer.
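For illustration, a minimal self-contained sketch of why a single BufFile
weighs ~8kB (abridged from memory, not the exact definitions in
src/backend/storage/file/buffile.c and src/include/c.h; most bookkeeping
fields are elided):

    #include <stdio.h>

    #define BLCKSZ 8192

    typedef union PGAlignedBlock   /* modeled on src/include/c.h */
    {
        char    data[BLCKSZ];
        double  force_align_d;
    } PGAlignedBlock;

    struct BufFile
    {
        int             numFiles;  /* number of physical temp files */
        /* ... other bookkeeping fields elided ... */
        PGAlignedBlock  buffer;    /* the 8kB I/O buffer */
    };

    int main(void)
    {
        printf("abridged sizeof(BufFile) = %zu bytes (real struct is larger)\n",
               sizeof(struct BufFile));
        return 0;
    }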
The OOM is not very surprising: with 524288 batches that works out to
524288 x 2 x 8kB = 8GB of memory just for the BufFiles, and the system
only has 8GB RAM installed.
The second patch tries to enforce work_mem more strictly. That would be
impossible if we were to keep all the BufFile structs in memory, so
instead it slices the batches into chunks that fit into work_mem, and
then uses a single "overflow" file for slices currently not in memory.
These extra overflow files can't be counted into work_mem, but we should
need only very few of them. For example, with work_mem=4MB a slice is 128
batches, so we need 128x fewer overflow files (compared to one file per
batch).
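To make the arithmetic concrete, here is a toy model of the slice mapping
(names and structure made up for illustration; this is not the patch code):

    #include <stdio.h>

    int main(void)
    {
        long nbatch = 524288;        /* total batches from the report */
        long nbatch_inmemory = 128;  /* batches per slice at work_mem=4MB */
        long nslices = nbatch / nbatch_inmemory;

        /* which slice a given batch's tuples get routed to */
        long batchno = 200000;
        long slice = batchno / nbatch_inmemory;

        /* only the current slice keeps its BufFiles in memory; tuples
         * for any other slice are appended to that slice's overflow
         * file instead */
        printf("slices (overflow files): %ld, batch %ld -> slice %ld\n",
               nslices, batchno, slice);
        return 0;
    }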
Hi Tomas
I read your second patch, which uses overflow buf files to limit how many
per-batch BufFiles are kept in memory at once.
It would solve the hash join OOM problem you discussed above: ~8kB per
batch leads to the batch bloating problem.
As I mentioned in another thread, there is another hash join OOM problem,
caused by disabling batch splitting too early. PG uses the flag
hashtable->growEnabled to decide whether to split batches, and once one
split fails (all the tuples land in just one of the two new batches), the
growEnabled flag is turned off forever.
This is the opposite side of the batch bloating problem: the join ends up
with too few batches, and the in-memory hash table becomes too large to
fit into memory.
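For reference, a reconstruction (from memory, abridged, with types pared
down so the sketch compiles standalone) of the check near the end of
ExecHashIncreaseNumBatches() in src/backend/executor/nodeHash.c:

    #include <stdbool.h>

    typedef struct HashJoinTableData
    {
        bool    growEnabled;    /* may we increase nbatch? */
        /* ... many other fields elided ... */
    } HashJoinTableData;

    static void
    maybe_disable_growth(HashJoinTableData *hashtable,
                         long nfreed, long ninmemory)
    {
        /*
         * If the split moved none (or all) of the in-memory tuples to
         * the new batch, repartitioning achieved nothing; stop trying
         * to grow nbatch for the rest of the join.
         */
        if (nfreed == 0 || nfreed == ninmemory)
            hashtable->growEnabled = false;
    }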
Here is the tradeoff: each batch takes more than 8kB (8kB makes sense for
performance), the in-memory hash table takes memory as well, and splitting
batches may (but does not always) reduce the in-memory hash table size
while introducing more batches (and thus more memory usage, 8kB * #batches).
Can we conclude that splitting is only worth doing when:
(reduction in in-memory hash table size) - (8kB * number of new batches) > 0
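A minimal sketch of that rule as code (worth_splitting and its parameters
are hypothetical names for illustration, not existing PostgreSQL code):

    #include <stdbool.h>

    #define BATCH_OVERHEAD 8192     /* ~8kB of BufFile space per new batch */

    static bool
    worth_splitting(long hashtable_bytes_freed, long new_batches)
    {
        /* split only if it frees more memory than the new batches cost */
        return hashtable_bytes_freed - BATCH_OVERHEAD * new_batches > 0;
    }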
So I'm considering combining our patch with yours to fix the hash join OOM
problem, no matter whether the OOM comes from the in-memory hash table or
from the 8kB * #batches overhead.
The nbatch_inmemory value in your patch could also be derived from the
rule above.
What's your opinion?