Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs

From: Thomas Munro
Subject: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Msg-id: CA+hUKGK-donWK+AiJtmpnxbq4tyTG0Mr-xZyNei08Eu1sh2uWA@mail.gmail.com
In response to: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List: pgsql-bugs
On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 21/2/2024 19:52, Tomas Vondra wrote:
> > It's a bit weird it needs 1.8GB of memory, but perhaps that's also
> > linked to the number of batches, somehow?

> I found one possible weak point in the code of PHJ:
> ExecParallelHashJoinSetUpBatches:
>
> pstate->batches = dsa_allocate0(hashtable->area,
>         EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> It could explain why we have such a huge memory allocation with a size
> not bound to a power of 2.
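
For what it's worth, here's a back-of-the-envelope check of how the
reported request size could fall out of that line (standalone C, not
PostgreSQL code; the 432-byte per-batch figure is only a guess chosen
to make the arithmetic land on the reported number, since the real
result of EstimateParallelHashJoinBatch() depends on the number of
participants):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t per_batch = 432;             /* hypothetical per-batch shared state, in bytes */
        uint64_t nbatch = UINT64_C(1) << 22;  /* 4194304 batches, after repeated doubling */

        /* 432 * 4194304 = 1811939328: larger than MaxAllocSize (1 GB - 1),
         * so the DSA allocation is rejected, yet not itself a power of 2. */
        printf("%llu\n", (unsigned long long) (per_batch * nbatch));
        return 0;
    }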

Hmm, a couple of short-term ideas:

1.  Maybe the planner should charge a high cost for exceeding some
soft limit on the expected number of batches.  The limit could perhaps
be linked to the number of file descriptors we can open (something
like 1000), because during the build phase we'll be opening and
closing random file descriptors like crazy due to vfd pressure, which
is not free.  That should hopefully discourage the planner from
reaching cases like this, but of course only in cases the planner can
predict.
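
Purely to illustrate the shape of that idea (standalone C, not planner
code; the soft limit and the penalty formula are invented):

    /* Hypothetical extra cost for a hash join expected to need many batches. */
    static double
    batch_fd_churn_penalty(double expected_batches, double per_batch_io_cost)
    {
        const double soft_limit = 1000.0;   /* roughly, file descriptors we can keep open */

        if (expected_batches <= soft_limit)
            return 0.0;

        /* Beyond the cap, every extra batch implies vfd open/close churn. */
        return (expected_batches - soft_limit) * per_batch_io_cost;
    }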

2.  It sounds like we should clamp nbatches.  We don't want the
per-partition state to exceed MaxAllocSize, which I guess is what
happened here if the above-quoted line produced the error?  (The flag
that would allow "huge" allocations exceeding MaxAllocSize seems
likely to make the world worse while we have so many footguns -- I
think once a query has reached this stage of terrible execution, it's
better to limit the damage.)
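
A rough sketch of what the clamp could look like (standalone C with
invented names, not the real code; the actual per-batch size would
come from EstimateParallelHashJoinBatch()):

    #include <stdint.h>

    #define MAX_ALLOC_SIZE UINT64_C(0x3fffffff)   /* MaxAllocSize: 1 GB - 1 */

    /* Halve nbatch, keeping it a power of two, until the per-batch state
     * array fits in an ordinary, non-"huge" allocation. */
    static uint64_t
    clamp_nbatch(uint64_t nbatch, uint64_t per_batch_size)
    {
        while (nbatch > 1 && per_batch_size * nbatch > MAX_ALLOC_SIZE)
            nbatch /= 2;
        return nbatch;
    }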

The performance and memory usage will still be terrible.  We just
don't deal well with huge numbers of partitions that are tiny relative
to the input size, and the input size is effectively unbounded.

Thoughts for later:  A lower limit would of course be possible and
likely desirable.  Once the partition-bookkeeping memory exceeds
work_mem * hash_mem_multiplier, it becomes stupid to double it (by
doubling the number of batches) just because one hash table has hit
that size, because the doubling itself increases total memory usage
(and fast, because it's quadratic).  At that point we'd be better off
in practice giving up on the hash table size limit and hoping for the
best (a small sketch of that threshold is at the end of this message).

But the real long-term question is what strategy we're going to use to
actually deal with this situation properly *without* giving up our
memory usage policies and hoping for the best, and that remains an
open question.  To summarise the two main ideas put forward so far:

(1) allow a very high number of batches, but process at most N of the
M batches at a time, using a temporary "all-the-rest" batch that gets
re-partitioned in a later cycle to feed the next N batches plus a new
remainder;

(2) fall back to looping over batches multiple times, keeping nbatches
under a small limit while also not exceeding a hash table size limit.

Both have some tricky edge cases, especially with parallelism in the
picture, but probably even without it.  I'm willing to work more on
exploring this some time after the 17 cycle.
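
To make that threshold concrete (standalone C with invented names, not
PostgreSQL code; hash_table_limit stands in for work_mem *
hash_mem_multiplier): doubling the batch count adds roughly nbatch *
per_batch_overhead of bookkeeping, while at best it halves one
over-limit hash table, so once the bookkeeping alone exceeds the limit
the doubling can no longer pay for itself.

    #include <stdbool.h>
    #include <stdint.h>

    /* Could doubling the batch count still reduce total memory use? */
    static bool
    worth_doubling_batches(uint64_t nbatch, uint64_t per_batch_overhead,
                           uint64_t hash_table_limit)
    {
        /* Extra bookkeeping added by going from nbatch to 2 * nbatch ... */
        uint64_t extra_overhead = nbatch * per_batch_overhead;

        /* ... versus the most the split can shave off the one in-memory
         * hash table that has just hit the limit. */
        uint64_t best_case_saving = hash_table_limit / 2;

        return extra_overhead < best_case_saving;
    }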


