Thread: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
Has the referenced bug in this discussion[1] been released? The discussion mentions it was fixed. I read the release notes but do not recognize this fix as documented. I do not want to hijack that thread.
I am using v16.3 - AWS Aurora. I have opened a support case with AWS also.
I have 4 queries that get this error message. The symptoms are the same - thousands of temp files are created before the error is returned. Just because I am getting the same error/symptoms does not mean it is the same problem. If the fix has been released I can check against the AWS Aurora version with their support staff.
I investigated one of the queries so far. When I turn off parallel execution this query completes.
I am working on a reproducible example to send to the community. The query is sensitive to the values in the predicates. Most of the time these queries work. So far, I have not been able to get an MRE but still working on it.
Thanks
Craig
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Tom Lane
Date:
Craig Milhiser <craig@milhiser.com> writes:
> Has the referenced bug in this discussion[1] been released?
I don't see any indication, either in that thread or in the commit log, that anything has been done in this area since about 16.2. It's not an easy problem in general.
Having said that, Aurora is not Postgres, and I don't know how closely they track us. Can you reproduce this problem on a stock build of community Postgres?
regards, tom lane
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Milhiser <craig@milhiser.com> writes:
> Has the referenced bug in this discussion[1] been released?
I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.
Having said that, Aurora is not Postgres, and I don't know
how closely they track us. Can you reproduce this problem
on a stock build of community Postgres?
regards, tom lane
Thanks. I will work on setting that up. Also getting the aws team involved.
I rewrote the one query I investigated. It took 15 seconds without parallel to avoid this issue; after the rewrite the query completes in 0.2 seconds. For this query I can avoid the issue, at least temporarily, by making a better query. But we need to solve the real problem. And I have not looked at the other queries affecting me. I may not get so lucky again.
I will post when I get the stock Postgres setup and running.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Mon, Sep 23, 2024 at 12:52 PM Craig Milhiser <craig@milhiser.com> wrote:
> On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Craig Milhiser <craig@milhiser.com> writes:
>> > Has the referenced bug in this discussion[1] been released?
>>
>> I don't see any indication, either in that thread or in the commit
>> log, that anything has been done in this area since about 16.2.
>> It's not an easy problem in general.
>>
>> Having said that, Aurora is not Postgres, and I don't know
>> how closely they track us. Can you reproduce this problem
>> on a stock build of community Postgres?
>>
>> regards, tom lane
>
> Thanks. I will work on setting that up. Also getting the aws team involved.
>
> I rewrote the one query I investigated. It took 15 seconds without parallel to avoid this issue; after the rewrite the query completes in 0.2 seconds. For this query I can avoid the issue, at least temporarily, by making a better query. But we need to solve the real problem. And I have not looked at the other queries affecting me. I may not get so lucky again.
>
> I will post when I get the stock Postgres setup and running.

Hi, FYI this is on my radar and it would be good to try to make a small back-patchable improvement.

I would need to swap the problem back into memory to be sure, but from memory the problem is that parallel hash join partitions take 432 bytes of bookkeeping memory each (there is also the problem that they each have output buffers, but those are not allocated in one big array, and for non-parallel hash join there is also a per-partition overhead, but it's smaller due to less bookkeeping state, so we don't hear about it). Hash joins use partition files (AKA batches) to try to keep each hash table under work_mem * hash_mem_multiplier, and if you have 2^22 (~4 million) partitions we therefore try to allocate 1811939328 bytes of memory (the number in $SUBJECT), exceeding our arbitrary 1GB allocation limit.

It's possible to turn that arbitrary allocation limit off, but that'd be treating a symptom and certainly not really produce good performance. If you increased (probably at least by double) work_mem or hash_mem_multiplier, you might have better luck: at some cross-over point, doubling the size of the array of partitions uses more memory than you can save by (potentially) halving the size of the hash table! Even aside from that arithmetic problem, anything more than around 2^10 partitions (~1 thousand) will start to perform worse unless you also increase max_files_per_process to match, because every flush of a page of spill file will likely have to close and open a file (on stock PostgreSQL at least, but Aurora may have a completely different scheme for temporary spill data for all I know).

So we could simply cap the number of partitions, and start ignoring the work_mem * hash_mem_multiplier limit beyond that cap, but we haven't done it yet because it's hard to pick a number and reports are rare (ie very large queries run with low work_mem, if that is indeed the problem here). 2^21 would be the highest plausible candidate (2^21 * 432 = ~900MB), but it's still very high. There is a related problem of extreme skew (squillions of tuples in one bucket), which is much harder to treat, but ideas were mentioned in that and other threads...
For the non-skewed version of the problem, which may be more common, at one level at least the problem is the defaults being set for small memory machines, people running increasingly huge joins on huge memory machines, and the machinery to make it work being a bit naive and excessively expensive. We could and should invent better strategies for coping.
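The arithmetic behind the failing request is easy to check; a quick illustrative SQL sketch, taking the ~432-byte per-partition bookkeeping figure above at face value:

-- ~432 bytes of shared bookkeeping per partition, 2^22 (~4 million) partitions:
SELECT 432 * (2 ^ 22)::bigint AS requested_bytes,    -- 1811939328, the number in $SUBJECT
       1024 * 1024 * 1024     AS alloc_limit_bytes;  -- the arbitrary 1GB allocation limit it exceeds

Raising work_mem or hash_mem_multiplier, as suggested above, reduces the number of batches needed and therefore the size of this allocation.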
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 432 bytes

Oh, as Tomas pointed out in the referenced thread, the actual number depends on the number of workers because there is some per-worker state for partitions, but that number does seem consistent with your reported case.

Perhaps the correct answer is simply to give up partitioning when the partition state size would exceed the potential hash table savings from further partitioning. Another question is whether it's arriving at the problematic number by underestimating and then repeatedly expanding, which is very inefficient, or planning the high number from the outset; either way, that'd be two different code paths that would need to respect a new cap. If it's the former, there may also be ways to improve initial estimates with statistics.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
On Sun, Sep 22, 2024 at 10:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 432 bytes
Oh, as Tomas pointed out in the referenced thread,
Thanks for working on it and for the detailed explanation. I tested setting max_parallel_workers_per_gather = 0, as suggested in the original thread, and it worked. We are putting that into the application for our largest customers: set it to 0 before the query, then back to 2 after.
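A minimal sketch of that per-query toggle, assuming the statement can be wrapped in its own transaction so that SET LOCAL reverts automatically:

BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;  -- disable parallelism for this query only
-- ... run the affected query here ...
COMMIT;  -- the previous setting is restored automatically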
Your explanation also shows why rewriting the query works. I reduced the number of rows being processed much earlier in the query. The query was originally written as one set of many joins that worked on millions of rows and then reduced them to a handful. I broke this into a materialized CTE that forces Postgres to reduce the rows early and then do the joins. Rewriting the query is better regardless of this issue.
I am working on getting a stock Postgres in our production protected enclave with our production database. Probably a full day of work that I need to splice in. We have a similar mechanism in our development environment. Once working I can help test and debug any changes. I can also work on a reproducible example.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
> Having said that, Aurora is not Postgres, and I don't know
> how closely they track us. Can you reproduce this problem
> on a stock build of community Postgres?
I reproduced the issue on v17. I downloaded the source tarball, built it, passed tests, put my production database, analyzed and ran the query. As you expected, the same issue occurred. I have opened the incident with the AWS team as well.
select version();
version
--------------------------------------------------------------------------------------------------
PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0, 64-bit
Since I have this build saved, if you need logs or have an experiment, let me know. I tried to reproduce the issue simply with artificial data, but the query completed; a different plan was created since the data skew was very different.
I have workarounds of turning parallel execution off for the known queries and when possible rewriting the queries.
Thanks for the help.
Craig
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:
> I reproduced the issue on v17. I downloaded the source tarball, built it, passed tests, put my production database, analyzed and ran the query. As you expected, the same issue occurred. I have opened the incident with the AWS team as well.

Since you're building from source, you could try applying the patch posted by Andrei Lepikhov:

https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

I suspect we may want to limit it to a smaller number than that, as mentioned already, and I think we should also apply the same cap to the initial estimate (Andrei's patch only caps it when it decides to increase it, not for the initial nbatch number). I can write a patch like that in a few days when I return from travelling, and we can aim to get it into the November release, but I suspect Andrei's patch might already avoid the error for your case.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
> Since you're building from source, you could try applying the patch
> posted by Andrei Lepikhov:
> https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru
This did not work for me. I am running out of memory.
I applied the patch, make clean, make, make check, sudo make install. I am running out of the box Postgres configuration.
Memory below uses "free -m".
Before loading Postgres
total used free shared buff/cache available
Mem: 31388 669 30467 2 639 30719
Swap: 0 0 0
After loading
total used free shared buff/cache available
Mem: 31388 672 30464 14 651 30715
Swap: 0 0 0
I go into psql
set max_parallel_workers_per_gather = 0;
run the query multiple times, takes 9.5 seconds at steady state, returns 20 rows.
Memory is still available
total used free shared buff/cache available
Mem: 31388 921 22547 142 8460 30466
Swap: 0 0 0
In the same psql session, set max_parallel_workers_per_gather = 2; then run the query again. This runs for 1 minute then:
2024-10-01 18:28:45.883 UTC [2586] LOG: background worker "parallel worker" (PID 4465) was terminated by signal 9: Killed
2024-10-01 18:28:45.883 UTC [2586] DETAIL: Failed process was running: SELECT
...
2024-10-01 18:28:45.883 UTC [2586] LOG: terminating any other active server processes
2024-10-01 18:28:46.620 UTC [2586] LOG: all server processes terminated; reinitializing
I got this as close to the end as I could
total used free shared buff/cache available
Mem: 31388 31014 535 1955 2156 373
Swap: 0 0 0
Though OOM conditions often mean all bets are off for behavior, I tried something different. I rebooted, started Postgres, and ran the query without setting max_parallel_workers_per_gather = 0; that first run also populated the cache. The machine exhausts memory again but usually "hangs", and I need to restart. Below is the frozen screen:
total used free shared buff/cache available
Mem: 31388 31317 240 1955 2140 70
Swap: 0 0 0
I ran these sequences multiple times. I also analyzed the data again just to make sure.
I reverted the patch to make sure I am reproducing the issue. I get the same 1.8GB allocation failure with parallel. Without parallel the query takes ~10 seconds. The patch increased the single worker performance for this query for out of the box configuration by 5%.
Thanks
On Sun, Sep 29, 2024 at 9:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:
> I reproduced the issue on v17. I downloaded the source tarball, built it, passed tests, put my production database, analyzed and ran the query. As you expected, the same issue occurred. I have opened the incident with the AWS team as well.
Since you're building from source, you could try applying the patch
posted by Andrei Lepikhov:
https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru
I suspect we may want to limit it to a smaller number than that, as
mentioned already, and I think we should also apply the same cap to
the initial estimate (Andrei's patch only caps it when it decides to
increase it, not for the initial nbatch number). I can write a patch
like that in a few days when I return from travelling, and we can aim
to get it into the November release, but I suspect Andrei's patch
might already avoid the error for your case.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 2/10/2024 02:12, Craig Milhiser wrote:
> > Since you're building from source, you could try applying the patch
> > posted by Andrei Lepikhov:
> > https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru
>
> This did not work for me. I am running out of memory.

Can you provide an explain of this query? Also, can you remove unnecessary details from the query text, such as temporary views or CASE .. WHEN constructs, as long as the OOM still reproduces?

--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
On Oct 1, 2024 Andrei Lepikhov wrote
> Can you provide an explain of this query?
Apologies for the delay. I have been travelling since Wednesday night. Thanks for your help and time with this issue.
Below are the query, with specific values redacted, an EXPLAIN with max_parallel_workers_per_gather = 2, and an EXPLAIN ANALYZE with max_parallel_workers_per_gather = 0.
In this case, the number of rows from the users table based on account_id is in the 99th percentile for this table and it is a long and sparse right tail.
This is using v17.0 stock source code and the stock configuration on Linux.
The query
SELECT
CF.NUMERIC_VALUE AS CF_COL,
U.USERS_ID,
U.OBJECT_ID,
U.ACCOUNT_ID,
U.EXTERNAL_ID,
U.FIRST_NAME,
U.MIDDLE_NAME,
U.LAST_NAME,
U.DISABLED,
U.DEACTIVATED AS SUSPEND_DATE,
U.CREATED,
U.UPDATED,
U.IS_BLE_TWO_FACTOR_EXEMPT,
U.HAS_THUMBNAIL,
U.USER_TYPE_ID,
UI.USER_IMAGE_ID,
UI.CONTENT_TYPE AS USER_IMAGE_CONTENT_TYPE,
COUNT(*) OVER () AS TOTAL_USERS_COUNT,
STRING_AGG(SG.object_ID::CHARACTER VARYING, ';') AS GROUPS,
STRING_AGG(SG.NAME, ' ') AS GROUPNAMES
FROM
USERS U
LEFT JOIN USER_IMAGE UI ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
LEFT JOIN SECURITY_GROUP_MEMBER SGM ON SGM.OBJECT_ID = U.OBJECT_ID
AND SGM.OBJECT_ID = U.OBJECT_ID
LEFT JOIN SECURITY_GROUP SG
ON SGM.SECURITY_GROUP_ID = SG.SECURITY_GROUP_ID
AND SG.DISABLED = 0
AND SG.ACCOUNT_ID = U.ACCOUNT_ID
AND SG.SECURITY_GROUP_TYPE_ID = 2
LEFT JOIN CUSTOM_FIELD_VALUE CF
ON U.USERS_ID = CF.USER_ID
AND CF.CUSTOM_FIELD_ID = <craig redacted>
WHERE
U.ACCOUNT_ID = <craig redacted>
AND U.USER_TYPE_ID = 1
AND U.DISABLED = 0
GROUP BY
U.USERS_ID,
UI.USER_IMAGE_ID,
CF.NUMERIC_VALUE
ORDER BY
U.LAST_NAME ASC,
U.FIRST_NAME ASC,
U.USERS_ID ASC
LIMIT
20
OFFSET
0;
EXPLAIN with the stock configuration, which has max_parallel_workers_per_gather = 2:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2529139.77..2529139.82 rows=20 width=187)
-> Sort (cost=2529139.77..2530484.84 rows=538028 width=187)
Sort Key: u.last_name, u.first_name, u.users_id
-> WindowAgg (cost=2514822.88..2514823.03 rows=538028 width=187)
-> Finalize GroupAggregate (cost=2432583.40..2508097.68 rows=538028 width=179)
Group Key: u.users_id, ui.user_image_id, cf.numeric_value
-> Gather Merge (cost=2432583.40..2492181.03 rows=448356 width=179)
Workers Planned: 2
-> Partial GroupAggregate (cost=2431583.37..2439429.60 rows=224178 width=179)
Group Key: u.users_id, ui.user_image_id, cf.numeric_value
-> Sort (cost=2431583.37..2432143.82 rows=224178 width=140)
Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
-> Parallel Hash Left Join (cost=1384936.37..2395567.35 rows=224178 width=140)
Hash Cond: (u.users_id = cf.user_id)
-> Hash Left Join (cost=1124308.04..2134350.56 rows=224178 width=134)
Hash Cond: (sgm.security_group_id = sg.security_group_id)
-> Nested Loop Left Join (cost=1119678.30..2129132.34 rows=224178 width=117)
-> Parallel Hash Right Join (cost=1119677.73..1326436.98 rows=224178 width=109)
Hash Cond: (ui.user_image_id = u.user_image_id)
-> Parallel Seq Scan on user_image ui (cost=0.00..130846.12 rows=3533412 width=18)
-> Parallel Hash (cost=1113372.50..1113372.50 rows=224178 width=99)
-> Parallel Bitmap Heap Scan on users u (cost=8824.42..1113372.50 rows=224178 width=99)
Recheck Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
-> Bitmap Index Scan on u_act_dis_type (cost=0.00..8689.92 rows=538028 width=0)
Index Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
-> Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm (cost=0.57..3.57 rows=1 width=16)
Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
-> Hash (cost=4622.16..4622.16 rows=607 width=41)
-> Index Scan using account_security_group_fk_ind on security_group sg (cost=0.43..4622.16 rows=607 width=41)
Index Cond: (account_id = <craig redacted>)
Filter: ((disabled = 0) AND (security_group_type_id = 2))
-> Parallel Hash (cost=259796.42..259796.42 rows=66553 width=14)
-> Parallel Index Scan using date_value_idx on custom_field_value cf (cost=0.56..259796.42 rows=66553 width=14)
Index Cond: (custom_field_id = <craig redacted>)
(34 rows)
EXPLAIN ANALYZE with max_parallel_workers_per_gather = 0:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4152596.58..4152596.63 rows=20 width=187) (actual time=10192.249..10192.258 rows=20 loops=1)
-> Sort (cost=4152596.58..4153941.65 rows=538028 width=187) (actual time=10192.248..10192.255 rows=20 loops=1)
Sort Key: u.last_name, u.first_name, u.users_id
Sort Method: top-N heapsort Memory: 32kB
-> WindowAgg (cost=4138279.81..4138279.85 rows=538028 width=187) (actual time=9748.632..9958.924 rows=904292 loops=1)
-> GroupAggregate (cost=4112723.52..4131554.50 rows=538028 width=179) (actual time=8482.695..9389.560 rows=904292 loops=1)
Group Key: u.users_id, ui.user_image_id, cf.numeric_value
-> Sort (cost=4112723.52..4114068.59 rows=538028 width=140) (actual time=8482.679..8655.695 rows=1720872 loops=1)
Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
Sort Method: external merge Disk: 199104kB
-> Hash Left Join (cost=602312.67..3984272.46 rows=538028 width=140) (actual time=1955.881..7537.783 rows=1720872 loops=1)
Hash Cond: (u.users_id = cf.user_id)
-> Hash Left Join (cost=340636.13..3721183.60 rows=538028 width=134) (actual time=1806.879..6920.376 rows=1720872 loops=1)
Hash Cond: (sgm.security_group_id = sg.security_group_id)
-> Nested Loop Left Join (cost=336006.39..3715141.53 rows=538028 width=117) (actual time=1804.650..6599.170 rows=1720872 loops=1)
-> Hash Left Join (cost=336005.82..1788669.80 rows=538028 width=109) (actual time=1804.623..3537.213 rows=904292 loops=1)
Hash Cond: (u.user_image_id = ui.user_image_id)
-> Index Scan using u_act_dis_type on users u (cost=0.56..1384749.23 rows=538028 width=99) (actual time=0.033..1133.900 rows=904292 loops=1)
Index Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
-> Hash (cost=180313.89..180313.89 rows=8480189 width=18) (actual time=1804.516..1804.517 rows=8488571 loops=1)
Buckets: 131072 Batches: 128 Memory Usage: 3986kB
-> Seq Scan on user_image ui (cost=0.00..180313.89 rows=8480189 width=18) (actual time=0.011..753.277 rows=8488571 loops=1)
-> Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm (cost=0.57..3.57 rows=1 width=16) (actual time=0.003..0.003 rows=2 loops=904292)
Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
Heap Fetches: 0
-> Hash (cost=4622.16..4622.16 rows=607 width=41) (actual time=2.219..2.220 rows=795 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 78kB
-> Index Scan using account_security_group_fk_ind on security_group sg (cost=0.43..4622.16 rows=607 width=41) (actual time=0.937..2.121 rows=795 loops=1)
Index Cond: (account_id = <craig redacted>)
Filter: ((disabled = 0) AND (security_group_type_id = 2))
Rows Removed by Filter: 764
-> Hash (cost=260262.29..260262.29 rows=113140 width=14) (actual time=148.930..148.931 rows=125986 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 6931kB
-> Index Scan using date_value_idx on custom_field_value cf (cost=0.56..260262.29 rows=113140 width=14) (actual time=0.021..132.508 rows=125986 loops=1)
Index Cond: (custom_field_id = <craig redacted>)
Planning Time: 0.983 ms
Execution Time: 10233.621 ms
(37 rows)
I asked someone else to try to build artificial data for this query. Maybe they will have a different take and be successful compared to me.
Thanks
Craig
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 10/7/24 18:42, Craig Milhiser wrote:
> On Oct 1, 2024 Andrei Lepikhov wrote
> > Can you provide an explain of this query?
>
> Apologies for the delay. I have been travelling since Wednesday night. Thanks for your help and time with this issue.
>
> Below are the query, with specific values redacted, an EXPLAIN with max_parallel_workers_per_gather = 2, and an EXPLAIN ANALYZE with max_parallel_workers_per_gather = 0.

I'm a bit confused: the thread subject is named 'invalid DSA memory alloc request size ...', but you write about an issue with the OOM killer. These are two different issues - which one do you have exactly?

The OOM killer can be explained easily, because I see a huge string_agg aggregate - workers can utilise memory more intensively. For now, the explain of an Aggregate node doesn't show information about the actual sort operation of each aggregate or its memory consumption.

--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
On Oct 8 Andrei Lepikhov wrote
> I'm a bit confused: the thread subject is named 'invalid DSA memory alloc
> request size ...', but you write about an issue with the OOM killer. These are two
> different issues - which one do you have exactly?
I started with the Invalid DSA memory allocation error. I was asked to try the experimental patch above. Then I got an OOM with the patch, but only when running in parallel. As you will see below, there was an OOM, but I do not believe the query itself is the cause.
Thanks for the push on OOM. I should have run this test earlier.
v17.0 and out of the box Postgres configuration.
I ran a new test on an instance with 512 GiB of memory. After I applied the patch, the Invalid DSA memory allocation message was not replicated. Running max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used <1 GiB of memory. With max_parallel_workers_per_gather = 2 the query used ~170 GiB of memory, ~70 GB of temp files were written and the query ran for more than 1 hour until I ran out of disk space.
With the patch, I moved from an Invalid DSA memory allocation of ~2 GB to using ~170 GiB of RAM and 70+ GB of temp files - but only when using 2 parallel workers per gather.
The new test:
This morning I increased the machine size from 32 GiB to 512 GiB RAM.
With the patch applied and max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds at steady state. While it was running I captured memory. I ran the query a few times earlier to get the buffers loaded.
total used free shared buff/cache available
Mem: 493Gi 3.5Gi 484Gi 142Mi 8.4Gi 489Gi
Swap: 0B 0B 0B
With the patch applied and max_parallel_workers_per_gather = 2, the query ran for more than 1 hour. During that time memory settled at:
total used free shared buff/cache available
Mem: 493Gi 178Gi 209Gi 1.9Gi 110Gi 314Gi
Swap: 0B 0B 0B
Then the machine ran out of disk space: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device
I captured top as well during the run.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4951 postgres 20 0 46.8g 45.5g 1.9g D 5.6 9.2 2:40.40 postgres
4942 postgres 20 0 68.9g 65.6g 1.9g D 5.3 13.3 3:25.35 postgres
4952 postgres 20 0 68.4g 65.2g 1.9g D 5.3 13.2 3:07.43 postgres
After rebooting:
df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 266G 197G 70G 74% /
As you mentioned there are string aggregations. I ran with parallel=0 and did some analysis. The aggregations do not seem to be creating something that is out of line.
select max(length(groups)), sum(length(groups)), max(length(groupnames)), sum(length(groupnames)) from milhiser_test;
max | sum | max | sum
-----+---------+-----+----------
143 | 6557620 | 499 | 22790616
(1 row)
Perhaps this is a different problem than the "invalid DSA memory alloc". The patch might have addressed that problem and this is another issue. Going from < 1 GiB to ~170 GiB of memory and ~70 GB of temp files when moving from 0 to 2 parallel workers suggests something is off.
Summary before this test:
Before the patch linked above, I was receiving "ERROR: invalid DSA memory alloc request size 1879048192" when I ran the query with max_parallel_workers_per_gather = 2.
Before the patch with max_parallel_workers_per_gather = 0 the query worked in ~10 seconds at steady state.
I applied the patch to v17.0 source, rebuilt, and passed tests.
With max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds at steady state and took < 1 GiB of memory.
With max_parallel_workers_per_gather = 2, the machine ran out of memory. This was a 32 GiB machine; free memory when running without parallel was ~30 GiB.
Thanks
On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 10/7/24 18:42, Craig Milhiser wrote:
>
> On Oct 1, 2024 Andrei Lepikhov wrote
>
> > Can you provide an explain of this query?
>
> Apologies for the delay. I have been travelling since Wednesday night.
> Thanks for your help and time with this issue.
>
> Below is the query, with specific values redacted. An explain with
> max_parallel_workers_per_gather = 2 and explain analyze
> max_parallel_workers_per_gather = 0.
I'm a bit confused: the thread subject named ' invalid DSA memory alloc
request size ...', but you write about issue with OOM killer. It is two
different issues, which one do you have exactly?
OOM killer can be explained easily, because I see huge string_agg
aggregate - workers can utilise memory more intensively. For now,
explain of an Aggregate node don't show information about factual sort
operation of each aggregate and memory consumption.
--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Thu, Oct 10, 2024 at 5:28 AM Craig Milhiser <craig@milhiser.com> wrote:
> Then the machine ran out of disk space: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device

For that, I have a patch in the queue to unlink temporary files incrementally:

https://www.postgresql.org/message-id/flat/CA+hUKG+RGdvhAdVu5_LH3Ksee+kW-XkTP_nMxBL+Rmgp3Tjb_w@mail.gmail.com

That's just treating a symptom, though. Things have already gone quite wrong if we're repeatedly repartitioning our way up to 2 million batches and only giving up there because of Andrei's patch. I wonder if something could be wrong with Parallel Hash Right Join, which we see in your plan. That's new-ish, and I vaguely recall another case where that seemed to be on the scene in a plan with a high number of batches... hmm. Definitely keen to see a reproducer with synthetic data if you can come up with one...
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 10/9/24 23:28, Craig Milhiser wrote:
> On Oct 8 Andrei Lepikhov wrote
> > I'm a bit confused: the thread subject is named 'invalid DSA memory alloc
> > request size ...', but you write about an issue with the OOM killer. These are two
> > different issues - which one do you have exactly?
>
> I started with the Invalid DSA memory allocation error. I was asked to try the experimental patch above. Then I got an OOM with the patch, but only when running in parallel. As you will see below, there was an OOM, but I do not believe the query itself is the cause.

So, I think the patch works, but you found one more issue with the same query. Awesome!

> I ran a new test on an instance with 512 GiB of memory. After I applied the patch, the Invalid DSA memory allocation message was not replicated. Running max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used <1 GiB of memory. With max_parallel_workers_per_gather = 2 the query used ~170 GiB of memory, ~70 GB of temp files were written and the query ran for more than 1 hour until I ran out of disk space.

It's fascinating. I have one user report like that, but they also didn't provide any synthetic test. I think it is almost impossible to create such a reproduction without a minimal understanding of what's happening. I can imagine only a data skew or a logical bug in this part of the code. But without direct perf and gdb touch, it is hard to resolve the issue by just gazing into the code.

Some additional actions could provide food for thought:
1. If you remove the aggregates (STRING_AGG, count) from the selection list, does the problem remain? What about OFFSET 0?
2. Can you build extended statistics on account_id, disabled, user_type_id and provide an explain (and explain analyse)?
3. Can you use pg_query_state [1] (unfortunately, it needs a patch and re-compilation) and show us intermediate execution state snapshots?
4. I see a duplicate clause in the query: SGM.OBJECT_ID = U.OBJECT_ID. For what reason do you have it here? Can you remove it from the query?
5. One more wild guess: can you analyse how many NULLs column u.users_id contains at the moment when HashJoin evaluates the clause (u.users_id = cf.user_id)?

[1] https://github.com/postgrespro/pg_query_state

--
regards, Andrei Lepikhov
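A minimal sketch of the extended statistics suggested in point 2, assuming the table and column names from Craig's query (the statistics object name is illustrative):

CREATE STATISTICS users_acct_dis_type_stx (ndistinct, dependencies, mcv)
    ON account_id, disabled, user_type_id FROM users;
ANALYZE users;
-- then re-run EXPLAIN (ANALYZE) on the query to see whether the row estimates change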
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
Thomas Munro wrote
> I wonder if something could be wrong with Parallel Hash Right Join... Definitely keen to see a reproducer
> with synthetic data if you can come up with one
Andrei Lepikhov wrote
> I can imagine only a data skew or a logical bug in this part of the code.
> But without direct perf and gdb touch, it is hard to resolve the issue
> by just gazing into the code.
I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.
Both of you are correct.
There are comments in the script. Please let me know of any questions or if you cannot reproduce it. If you want me to file a report via the form, let me know.
Using Postgres v17 with out of the box configuration.
drop table test_users;
create table test_users (account_id bigint not null, users_id bigint not null constraint test_users_pkey primary key, first_name varchar(105), last_name varchar(105), user_image_id bigint);
-- The account we are interested, data numbers are negative to eliminate duplicates and help with debugging
insert into test_users (account_id, users_id, first_name, last_name, user_image_id)
SELECT -1, -1 * i, md5(random()::text), md5(random()::text), case when random() < 0.95 then null else -1 * i end
FROM generate_series(1, 925_000) AS t(i)
;
-- Make enough other records to get the skew to force a Parallel Hash Right Join and the query breaks
-- Change the "< 0.50" to "< 0.95" to get a skew for a Parallel Hash Left Join and the query works
-- 0.50 makes a right join and breaks; 0.95 makes a left join and works
-- Changes how many users are in user_image which, relative to the number of users and accounts, is the key skew that I found
-- Data numbers are positive
insert into test_users(account_id, users_id, first_name, last_name, user_image_id)
SELECT random(10, 50_000)::bigint, i, md5(random()::text), md5(random()::text), case when random() < 0.50 then null else i end
FROM generate_series(1, 50_000_000) AS t(i)
;
create index user_img_fk_idx on test_users using btree (user_image_id);
drop table test_user_image;
create table test_user_image(user_image_id bigint not null constraint test_user_image_pkey primary key);
insert into test_user_image(user_image_id) select user_image_id from test_users where user_image_id is not null;
ALTER TABLE test_users ADD CONSTRAINT users_user_image_fk FOREIGN KEY (user_image_id) REFERENCES test_user_image(user_image_id);
commit;
analyze test_users;
analyze test_user_image;
-- at 0 workers the query will work
set max_parallel_workers_per_gather = 0;
SELECT U.USERS_ID
, U.FIRST_NAME
, U.LAST_NAME
FROM test_USERS U
LEFT JOIN test_USER_IMAGE UI
ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID
, UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME ASC
, U.FIRST_NAME ASC
, U.USERS_ID ASC
LIMIT 20
OFFSET 0
;
set max_parallel_workers_per_gather = 2;
-- Explain the above query. For it to break, a Parallel Hash Right Join is executed.
-- When a Parallel Hash Left Join is executed, the query works. Switch between left and right by changing the skew as noted above.
-- when run with stock Postgres 17, the Invalid DSA memory allocation occurs, which started this thread
-- when run with the patch to fix the Invalid DSA memory allocation, the OOM occurs.
-- I reproduced the Invalid DSA memory allocation with AWS Aurora v16.2. Naturally I cannot try patches there. The above was recreated with stock Postgres v17 on a plain ec2 instance.
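A minimal way to confirm which join shape the planner chose for the query above, assuming the synthetic tables built by this script:

EXPLAIN (COSTS OFF)
SELECT U.USERS_ID, U.FIRST_NAME, U.LAST_NAME
FROM test_USERS U
LEFT JOIN test_USER_IMAGE UI ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID, UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME, U.FIRST_NAME, U.USERS_ID
LIMIT 20 OFFSET 0;
-- per the comments above, the failing shape shows "Parallel Hash Right Join",
-- while the working shape shows "Parallel Hash Left Join"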
Thank you for your time
Craig
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:
> I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

Thanks, repro'd here. At first glance, it looks like it's trying to load this distribution into a hash table and failing to handle the skew as well as non-parallel hash:

postgres=# select user_image_id, count(*) from test_users where account_id = -1 group by 1 order by 2 desc limit 5;
 user_image_id | count
---------------+--------
               | 878823   <-- choking on this?
       -924960 |      1
       -924934 |      1
       -924917 |      1
       -924971 |      1
(5 rows)

   ->  Parallel Hash Right Join  (cost=1027177.72..1368758.97 rows=363544 width=82)
         Hash Cond: (ui.user_image_id = u.user_image_id)
         ->  Parallel Seq Scan on test_user_image ui  (cost=0.00..215192.79 rows=10436379 width=8)
         ->  Parallel Hash  (cost=1017662.42..1017662.42 rows=363544 width=82)
               ->  Parallel Seq Scan on test_users u  (cost=0.00..1017662.42 rows=363544 width=82)
                     Filter: (account_id = '-1'::integer)

Getting coffee and looking more closely...
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 10/14/24 04:08, Thomas Munro wrote:
> On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:
>> I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.
>
> Thanks, repro'd here. At first glance, it looks like it's trying to
> load this distribution into a hash table and failing to handle the
> skew as well as non-parallel hash:
>
> Getting coffee and looking more closely...

Hmm, with a reproduction, it is too easy to solve ;)

My explanation (correct me if I'm wrong): OUTER JOINs allow NULLs to be in a hash table. At the same time, the hash value for NULL is 0, and it goes to batch 0. If batch number 0 gets overfilled, the ExecParallelHashIncreaseNumBatches routine attempts to increase the number of batches - but nothing happens. The initial batch is still too big, and the number of batches doubles up to the limit. At the limit, parallel HashJoin stops this growth and (I didn't trace this part, just a guess) allocates memory for 2097152 batches, which causes the OOM.

To support this chain of thought, you can see the simple example below, which triggers the issue:

DROP TABLE IF EXISTS test;
CREATE TABLE test (n int);
INSERT INTO test (n) SELECT NULL FROM generate_series(1,1E6);
INSERT INTO test (n) VALUES (1);
ANALYZE test;

SET enable_nestloop = 'off';
SET enable_mergejoin = 'off';
SET max_parallel_workers_per_gather = 2;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;
SET parallel_setup_cost = 0.001;
SET parallel_tuple_cost = 0.0001;

EXPLAIN (ANALYZE, VERBOSE, COSTS OFF)
SELECT t1.n FROM test t1 LEFT JOIN test t2 USING (n);

I think, now it is much easier to find a proper solution.

--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Tom Lane
Date:
Andrei Lepikhov <lepihov@gmail.com> writes:
> My explanation (correct me if I'm wrong):
> OUTER JOINs allow NULLs to be in a hash table. At the same time, the hash
> value for NULL is 0, and it goes to batch 0.
> If batch number 0 gets overfilled, the
> ExecParallelHashIncreaseNumBatches routine attempts to increase the
> number of batches - but nothing happens. The initial batch is still too
> big, and the number of batches doubles up to the limit.

Interesting point. If memory serves (I'm too tired to actually look) the planner considers the statistical most-common-value when estimating whether an unsplittable hash bucket is likely to be too big. It does *not* think about null values ... but it ought to.

However, this does not explain why PHJ would be more subject to the problem than non-parallel HJ.

regards, tom lane
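A quick illustrative way to look at the statistics in question, assuming the users table and the user_image_id join key from Craig's plan:

-- null_frac is the fraction of NULLs the planner knows about for the hash key;
-- most_common_vals/most_common_freqs are the MCVs mentioned above
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'users' AND attname = 'user_image_id';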
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Craig Milhiser
Date:
I applied the patch for the parallel hash and ran it against my production data. The query worked with parallel workers. That patch was applied on top of the earlier patch for the invalid DSA memory alloc error that started this thread.
From my view as a user, these are both fixed and can be marked as such. We will wait for the patches to roll through the release cycles.
If you change the patch and need another test with real data please let me know.
Thank you for your time and effort on these issues.
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 10/16/24 16:19, Thomas Munro wrote:
> On Mon, Oct 14, 2024 at 10:16 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> See the attachment for a sketch of the solution.
>
> Thanks Andrei, I mostly agree with your analysis, but I came up with a
> slightly different patch. I think we should check for extreme skew if
> old_batch->space_exhausted (the parent partition). Your sketch always
> does it for batch 0, which works for these examples but I don't think
> it's strictly correct: if batch 0 didn't run out of memory, it might
> falsely report extreme skew just because it had (say) 0 or 1 tuples.

Yeah, I misunderstood the meaning of the estimated_size variable. Your solution is more universal. Also, I confirm, it passes my synthetic test.

Also, it raises the immediate question: What if we have too many duplicates? Sometimes, in user complaints, I see examples where they, analysing the database's logical consistency, pass through millions of duplicates to find an unexpected value. Do we need a top memory consumption limit here? I recall a thread in the mailing list with a general approach to limiting backend memory consumption, but it is finished with no result.

The patch looks good as well as commentary.

--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Thu, Oct 17, 2024 at 9:12 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> Yeah, I misunderstood the meaning of the estimated_size variable. Your
> solution is more universal. Also, I confirm, it passes my synthetic test.
> Also, it raises the immediate question: What if we have too many
> duplicates? Sometimes, in user complaints, I see examples where they,
> analysing the database's logical consistency, pass through millions of
> duplicates to find an unexpected value. Do we need a top memory
> consumption limit here? I recall a thread in the mailing list with a
> general approach to limiting backend memory consumption, but it is
> finished with no result.

It is a hard problem alright[1].

> The patch looks good as well as commentary.

Thanks, I will go ahead and push this now.

[1] https://www.postgresql.org/message-id/flat/CAAKRu_aLMRHX6_y%3DK5i5wBMTMQvoPMO8DT3eyCziTHjsY11cVA%40mail.gmail.com
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 10/17/24 15:57, Thomas Munro wrote:
> On Thu, Oct 17, 2024 at 9:12 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> Yeah, I misunderstood the meaning of the estimated_size variable. Your
>> solution is more universal. Also, I confirm, it passes my synthetic test.
>> Also, it raises the immediate question: What if we have too many
>> duplicates? Sometimes, in user complaints, I see examples where they,
>> analysing the database's logical consistency, pass through millions of
>> duplicates to find an unexpected value. Do we need a top memory
>> consumption limit here? I recall a thread in the mailing list with a
>> general approach to limiting backend memory consumption, but it is
>> finished with no result.
>
> It is a hard problem alright[1].
>
>> The patch looks good as well as commentary.
>
> Thanks, I will go ahead and push this now.
>
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_aLMRHX6_y%3DK5i5wBMTMQvoPMO8DT3eyCziTHjsY11cVA%40mail.gmail.com

Thanks for the link.

BTW, why not use the current case and fix the problem with the 'invalid DSA memory alloc request size 1811939328' itself?

--
regards, Andrei Lepikhov
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Thomas Munro
Date:
On Thu, Oct 17, 2024 at 10:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> BTW, why not use the current case and fix the problem with the 'invalid
> DSA memory alloc request size 1811939328' itself?

I think your patch is good but if you don't mind I'd like to think about how to generalise it a bit first, so that it applies to all places where we choose nbatch, not just repartitioning. Unfortunately that's a bit circular so I'm still thinking about the tidiest way to do it... might take a few days due to travel, and if I don't have something soon I guess your patch is better than nothing (it might be the most common way we finish up in that sort of trouble).

I'll also push that other patch that cleans up temporary files aggressively soon (master only), and try to think about some simple ways to avoid large nbatch values that contradict the goal of reducing memory for non-skew cases at planning and execution time (probably master only)...
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From
Andrei Lepikhov
Date:
On 18/10/2024 06:34, Thomas Munro wrote:
> On Thu, Oct 17, 2024 at 10:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> BTW, why not use the current case and fix the problem with the 'invalid
>> DSA memory alloc request size 1811939328' itself?
>
> I think your patch is good but if you don't mind I'd like to think
> about how to generalise it a bit first, so that it applies to all
> places where we choose nbatch, not just repartitioning. Unfortunately
> that's a bit circular so I'm still thinking about the tidiest way to
> do it... might take a few days due to travel, and if I don't have
> something soon I guess your patch is better than nothing (it might be
> the most common way we finish up in that sort of trouble).

Thanks. I thought about generalisation, but who knows how the caller wants to react in the case of an unsuccessful allocation? Should we force the user to compare the size requested and allocated? Maybe. I'll wait for your solution.

> I'll also push that other patch that cleans up temporary files
> aggressively soon (master only),

Have been waiting for this!

> and try to think about some simple
> ways to avoid large nbatch values that contradict the goal of reducing
> memory for non-skew cases at planning and execution time (probably
> master only)...

Hmm, a cost-based maximum number of batches defined at the optimisation stage?

--
regards, Andrei Lepikhov