Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs

From Andrei Lepikhov
Subject Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id cc488d1c-6477-4e20-8bd0-cf152a2e59bf@gmail.com
Whole thread Raw
In response to Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Craig Milhiser <craig@milhiser.com>)
Responses Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List pgsql-bugs
On 10/9/24 23:28, Craig Milhiser wrote:
> On Oct 8 Andrei Lepikhov wrote
>  > I'm a bit confused: the thread subject named ' invalid DSA memory alloc
>  > request size ...', but you write about issue with OOM killer. It is two
>  > different issues, which one do you have exactly?
> 
> 
> I started with the Invalid DSA memory allocation error. I was asked to 
> try an experimental patch above. Then I got OOM with the patch only 
> running parallel. You will see below, there was an OOM but I do not 
> believe it is the query.
So, I think the patch works, but you found out one more issue at the 
same query. Awesome!

> I ran a new test on an instance with 512 GiB of memory.  After I applied 
> the patch, the Invalid DSA memory allocation message was not 
> replicated.  Running max_parallel_workers_per_gather = 0, the query took 
> ~9.5 seconds and used <1 GiB of memory.  With 
> max_parallel_workers_per_gather = 2 the query used  ~170 GiB of memory, 
> ~70 GB of temp files were written and the query ran for more than 1 hour 
> until I ran out of disk space.
It's fascinating. I have one user report like that, but they also didn't 
provide any synthetic test. I think it is almost impossible to create 
such reproduction without a minimal understanding of what's happening. I 
can imagine only a data skew or a logical bug in this part of the code. 
But without direct perf and gdb touch, it is hard to resolve the issue 
by just gazing into the code.
Additional actions can provide some food for thought:
1. If you remove aggregates (STRING_AGG, count) from the selection list, 
will the problem remain? What about OFFSET 0?
2. Can you build extended statistics on account_id,disabled,user_type_id 
and provide an explain (and explain analyse)?
3. Can you use pg_query_state (unfortunately, it needs a patch and 
re-compilation) and show us intermediate execution state snapshots?
4. I see a duplicate clause in the query: SGM.OBJECT_ID = U.OBJECT_ID. 
For what reason you have it here? can you remove it from the query?
5. One more wild guess: can you analyse how much NULLS contains column 
u.users_id at the moment when HashJoin evaluates clause (u.users_id = 
cf.user_id)?

[1] https://github.com/postgrespro/pg_query_state

-- 
regards, Andrei Lepikhov




pgsql-bugs by date:

Previous
From: Richard Guo
Date:
Subject: Re: BUG #18652: Planner can not find pathkey item to sort for query with expression and expression index
Next
From: Tom Lane
Date:
Subject: Re: BUG #18652: Planner can not find pathkey item to sort for query with expression and expression index