Thread: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Has the fix for the bug referenced in this discussion[1] been released? The discussion mentions it was fixed, but I read the release notes and did not see this fix documented. I do not want to hijack that thread.

I am using v16.3 - AWS Aurora. I have opened a support case with AWS also.

I have 4 queries that get this error message. The symptoms are the same - thousands of temp files are created before the error is returned. Just because I am getting the same error/symptoms does not mean it is the same problem. If the fix has been released I can check against the AWS Aurora version with their support staff.

I investigated one of the queries so far. When I turn off parallel execution, this query completes.

I am working on a reproducible example to send to the community. The query is sensitive to the values in the predicates; most of the time these queries work. So far I have not been able to produce an MRE, but I am still working on it.

Thanks
Craig

Craig Milhiser <craig@milhiser.com> writes:
> Has the fix for the bug referenced in this discussion[1] been released?

I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.

Having said that, Aurora is not Postgres, and I don't know
how closely they track us.  Can you reproduce this problem
on a stock build of community Postgres?

            regards, tom lane





On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Milhiser <craig@milhiser.com> writes:
> Has the fix for the bug referenced in this discussion[1] been released?

I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.

Having said that, Aurora is not Postgres, and I don't know
how closely they track us.  Can you reproduce this problem
on a stock build of community Postgres?

                        regards, tom lane

Thanks. I will work on setting that up. I am also getting the AWS team involved.

I rewrote the one query I investigated. Without parallelism it took 15 seconds, which avoided this issue; after the rewrite it completes in 0.2 seconds. For this query I can avoid the issue, at least temporarily, by writing a better query. But we need to solve the real problem, and I have not looked at the other queries affecting me. I may not get so lucky again.

I will post when I get the stock Postgres setup and running. 


On Mon, Sep 23, 2024 at 12:52 PM Craig Milhiser <craig@milhiser.com> wrote:
> On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Craig Milhiser <craig@milhiser.com> writes:
>> > Has the fix for the bug referenced in this discussion[1] been released?
>>
>> I don't see any indication, either in that thread or in the commit
>> log, that anything has been done in this area since about 16.2.
>> It's not an easy problem in general.
>>
>> Having said that, Aurora is not Postgres, and I don't know
>> how closely they track us.  Can you reproduce this problem
>> on a stock build of community Postgres?
>>
>>                         regards, tom lane
>
>
> Thanks. I will work on setting that up. I am also getting the AWS team involved.
>
> I rewrote the one query I investigated. Without parallelism it took 15 seconds, which avoided this issue; after the
> rewrite it completes in 0.2 seconds. For this query I can avoid the issue, at least temporarily, by writing a better
> query. But we need to solve the real problem, and I have not looked at the other queries affecting me. I may not get
> so lucky again.
>
> I will post when I get the stock Postgres setup and running.

Hi,

FYI this is on my radar and it would be good to try to make a small
back-patchable improvement.  I would need to swap the problem back
into memory to be sure, but from memory the problem is that parallel
hash join partitions take 432 bytes of bookkeeping memory each (there
is also the problem that they each have output buffers, but those are
not allocated in one big array, and for non-parallel hash join there
is also a per-partition overhead, but it's smaller due to less
bookkeeping state so we don't hear about it).  Hash joins use
partition files (AKA batches) to try to keep each hash table under
work_mem * hash_mem_multiplier, and if you have 2^22 (~4 million)
partitions we therefore try to allocate 1811939328 bytes of memory
(the number in $SUBJECT), exceeding our arbitrary 1GB allocation
limit.  It's possible to turn that arbitrary allocation limit off, but
that'd be treating a symptom and certainly wouldn't really produce
good performance.

If you increased (probably at least by double) work_mem
or hash_mem_multiplier, you might have better luck: at some cross-over
point, doubling the size of the array of partitions uses more memory
than you can save by (potentially) halving the size of the hash table!
Even aside from that arithmetic problem, anything more than around
2^10 partitions (~1 thousand) will start to perform worse unless you
also increase max_files_per_process to match, because every flush of a
page of spill file will likely have to close and open a file (on stock
PostgreSQL at least, but Aurora may have a completely different scheme
for temporary spill data for all I know).

So we could simply cap the
number of partitions, and start ignoring the work_mem *
hash_mem_multiplier limit beyond that cap, but we haven't done it yet
because it's hard to pick a number and reports are rare (i.e. very large
queries run with low work_mem, if that is indeed the problem here).
2^21 would be the highest plausible candidate (2^21 * 432 = ~900MB),
but it's still very high.  There is a related problem of extreme skew
(squillions of tuples in one bucket), which is much harder to treat,
but ideas were mentioned in that and other threads...  For the
non-skewed version of the problem, which may be more common, at one
level at least the problem is the defaults being set for small-memory
machines, people running increasingly huge joins on huge-memory
machines, and the machinery to make it work being a bit naive and
excessively expensive.  We could and should invent better strategies
for coping.
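
For concreteness, a rough back-of-the-envelope check of those numbers (the 432 bytes of shared bookkeeping per batch is an approximation; as noted in the follow-up message below, the exact figure depends on the number of workers):

-- Approximate size of the per-batch bookkeeping array at 2^22 batches,
-- versus the 1GB allocation cap mentioned above.
SELECT (2 ^ 22)::bigint                       AS batches,
       432 * (2 ^ 22)::bigint                 AS bytes_requested,   -- 1811939328
       pg_size_pretty(432 * (2 ^ 22)::bigint) AS request_size,      -- ~1.7 GB
       pg_size_pretty((2 ^ 30)::bigint)       AS allocation_limit;  -- the 1GB cap

Raising work_mem or hash_mem_multiplier roughly halves the target batch count each time you double them, which is why it can dodge the limit, but only up to the cross-over point described above.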



On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 432 bytes

Oh, as Tomas pointed out in the referenced thread, the actual number
depends on the number of workers because there is some per-worker
state for partitions, but that number does seem consistent with your
reported case.  Perhaps the correct answer is simply to give up
partitioning when the partition state size would exceed the potential
hash table savings by further partitioning.  Another question is
whether it's arriving at the problematic number by underestimating and
then repeatedly expanding, which is very inefficient, or planning the
high number from the outset, but either way that'd be two different
code paths that would need to respect a new cap.  If it's the former,
there may also be ways to improve initial estimates with statistics.
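
One way to tell which of those two paths a given query took is EXPLAIN (ANALYZE): when the batch count grows at run time, the Hash / Parallel Hash node reports both the original and the final count. A small, purely illustrative sketch (the table, sizes, and work_mem value are invented for the example):

-- Toy join that spills under a tiny work_mem.  If the batch count grew at
-- run time the node shows e.g. "Batches: 1024 (originally 16)"; if the high
-- number was planned from the outset only a single number is shown.
CREATE TEMP TABLE big AS
    SELECT g AS id, md5(g::text) AS pad
    FROM generate_series(1, 200000) AS g;
ANALYZE big;

SET work_mem = '64kB';
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM big a JOIN big b USING (id);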





On Sun, Sep 22, 2024 at 10:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 432 bytes

Oh, as Tomas pointed out in the referenced thread,

Thanks for working on it and for the detailed explanation. I tested set max_parallel_workers_per_gather = 0 from the original thread and it works. We are putting that into the application for our largest customers: set it to 0 before the query, then back to 2 afterwards.
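
One way to keep that per-query override from leaking into other statements is to scope it to a transaction; a minimal sketch (SET LOCAL reverts automatically at COMMIT or ROLLBACK, and SELECT 1 stands in for the affected query):

BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;  -- workaround applies only inside this transaction
SELECT 1;  -- placeholder for the affected query
COMMIT;    -- the setting reverts to the session/server default here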

Your explanation also shows why rewriting the query works. I reduced the number of rows being processed much earlier in the query. The query was written as one set of many joins that worked on millions of rows and then reduced them to a handful. I broke this into a materialized CTE that forces Postgres to reduce the rows early and then do the joins. Rewriting the query is better regardless of this issue.
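
A minimal, hypothetical sketch of that rewrite shape with toy tables (placeholder names, not the real schema): the point is only that AS MATERIALIZED keeps the planner from folding the row-reducing filter back into the big join, so the later joins see a handful of rows instead of millions.

-- Toy illustration of the pattern; toy_users / toy_details are invented.
CREATE TEMP TABLE toy_users   (users_id bigint, account_id bigint, disabled int);
CREATE TEMP TABLE toy_details (users_id bigint, payload text);

WITH small_set AS MATERIALIZED (
    SELECT users_id
    FROM toy_users
    WHERE account_id = 42 AND disabled = 0   -- reduces the rows early
)
SELECT s.users_id, d.payload
FROM small_set s
LEFT JOIN toy_details d USING (users_id);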

I am working on getting a stock Postgres in our production protected enclave with our production database.  Probably a full day of work that I need to splice in.  We have a similar mechanism in our development environment. Once working I can help test and debug any changes. I can also work on a reproducible example. 

On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:
> I reproduced the issue on v17. I downloaded the source tarball, built it, passed the tests, loaded my production
> database, analyzed it, and ran the query. As you expected, the same issue occurred. I have opened an incident with
> the AWS team as well.

Since you're building from source, you could try applying the patch
posted by Andrei Lepikhov:

https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

I suspect we may want to limit it to a smaller number than that, as
mentioned already, and I think we should also apply the same cap to
the initial estimate (Andrei's patch only caps it when it decides to
increase it, not for the initial nbatch number).  I can write a patch
like that in a few days when I return from travelling, and we can aim
to get it into the November release, but I suspect Andrei's patch
might already avoid the error for your case.



> Since you're building from source, you could try applying the patch
> posted by Andrei Lepikhov:
> https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

This did not work for me. I am running out of memory.

I applied the patch, then ran make clean, make, make check, and sudo make install. I am running the out-of-the-box Postgres configuration.

The memory figures below are from "free -m".

Before loading Postgres
               total        used        free      shared  buff/cache   available
Mem:           31388         669       30467           2         639       30719
Swap:              0           0           0

After loading
               total        used        free      shared  buff/cache   available
Mem:           31388         672       30464          14         651       30715
Swap:              0           0           0

In psql I run:
set max_parallel_workers_per_gather = 0;
then run the query multiple times; it takes 9.5 seconds at steady state and returns 20 rows.

Memory is still available

               total        used        free      shared  buff/cache   available
Mem:           31388         921       22547         142        8460       30466
Swap:              0           0           0

In the same psql session I set max_parallel_workers_per_gather = 2 and run the query again. It runs for about 1 minute, then:

2024-10-01 18:28:45.883 UTC [2586] LOG:  background worker "parallel worker" (PID 4465) was terminated by signal 9: Killed
2024-10-01 18:28:45.883 UTC [2586] DETAIL:  Failed process was running: SELECT
      ...          
2024-10-01 18:28:45.883 UTC [2586] LOG:  terminating any other active server processes
2024-10-01 18:28:46.620 UTC [2586] LOG:  all server processes terminated; reinitializing

I got this as close to the end as I could
               total        used        free      shared  buff/cache   available
Mem:           31388       31014         535        1955        2156         373
Swap:              0           0           0

Though OOM conditions often mean all bets are off for behavior, I tried something different. I rebooted, started Postgres, and ran the query, this time without setting max_parallel_workers_per_gather = 0; that run populated the cache. The machine exhausts memory again but usually "hangs", and I need to restart. Below is the frozen screen:
               total        used        free      shared  buff/cache   available
Mem:           31388       31317         240        1955        2140          70
Swap:              0           0           0

I ran these sequences multiple times. I also analyzed the data again just to make sure.

I reverted the patch to make sure I was reproducing the original issue, and I got the same ~1.8 GB allocation failure with parallelism enabled. Without parallelism the query takes ~10 seconds. The patch improved single-worker performance for this query, with the out-of-the-box configuration, by about 5%.

Thanks

On Sun, Sep 29, 2024 at 9:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:
> I reproduced the issue on v17. I downloaded the source tarball, built it, passed the tests, loaded my production database, analyzed it, and ran the query. As you expected, the same issue occurred. I have opened an incident with the AWS team as well.

Since you're building from source, you could try applying the patch
posted by Andrei Lepikhov:

https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru

I suspect we may want to limit it to a smaller number than that, as
mentioned already, and I think we should also apply the same cap to
the initial estimate (Andrei's patch only caps it when it decides to
increase it, not for the initial nbatch number).  I can write a patch
like that in a few days when I return from travelling, and we can aim
to get it into the November release, but I suspect Andrei's patch
might already avoid the error for your case.
On 2/10/2024 02:12, Craig Milhiser wrote:
>  > Since you're building from source, you could try applying the patch
>  > posted by Andrei Lepikhov:
>  >
>  > https://www.postgresql.org/message-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb%40postgrespro.ru
> 
> This did not work for me. I am running out of memory.
Can you provide an explain of this query? Also, can you remove
unnecessary details from the query text, like temporary views or CASE ..
WHEN constructions, if the OOM still reproduces?

-- 
regards, Andrei Lepikhov





On Oct 1, 2024 Andrei Lepikhov wrote 
> Can you provide an explain of this query? 

Apologies for the delay. I have been travelling since Wednesday night. Thanks for your help and time with this issue.

Below is the query, with specific values redacted. An explain with max_parallel_workers_per_gather = 2 and explain analyze max_parallel_workers_per_gather = 0.

In this case, the number of rows from the users table based on account_id is in the 99th percentile for this table and it is a long and sparse right tail.  

This is using V17.0 stock source code and stock configuration on linux.

The query

SELECT
    CF.NUMERIC_VALUE AS CF_COL,
    U.USERS_ID,
    U.OBJECT_ID,
    U.ACCOUNT_ID,
    U.EXTERNAL_ID,
    U.FIRST_NAME,
    U.MIDDLE_NAME,
    U.LAST_NAME,
    U.DISABLED,
    U.DEACTIVATED AS SUSPEND_DATE,
    U.CREATED,
    U.UPDATED,
    U.IS_BLE_TWO_FACTOR_EXEMPT,
    U.HAS_THUMBNAIL,
    U.USER_TYPE_ID,
    UI.USER_IMAGE_ID,
    UI.CONTENT_TYPE AS USER_IMAGE_CONTENT_TYPE,
    COUNT(*) OVER () AS TOTAL_USERS_COUNT,
    STRING_AGG(SG.object_ID::CHARACTER VARYING, ';') AS GROUPS,
    STRING_AGG(SG.NAME, ' ') AS GROUPNAMES
FROM
    USERS U
    LEFT JOIN USER_IMAGE UI ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
    LEFT JOIN SECURITY_GROUP_MEMBER SGM ON SGM.OBJECT_ID = U.OBJECT_ID
        AND SGM.OBJECT_ID = U.OBJECT_ID
    LEFT JOIN SECURITY_GROUP SG
        ON SGM.SECURITY_GROUP_ID = SG.SECURITY_GROUP_ID
        AND SG.DISABLED = 0
        AND SG.ACCOUNT_ID = U.ACCOUNT_ID
        AND SG.SECURITY_GROUP_TYPE_ID = 2
    LEFT JOIN CUSTOM_FIELD_VALUE CF
        ON U.USERS_ID = CF.USER_ID
        AND CF.CUSTOM_FIELD_ID = <craig redacted>
WHERE
    U.ACCOUNT_ID = <craig redacted>
    AND U.USER_TYPE_ID = 1
    AND U.DISABLED = 0
GROUP BY
    U.USERS_ID,
    UI.USER_IMAGE_ID,
    CF.NUMERIC_VALUE
ORDER BY
    U.LAST_NAME ASC,
    U.FIRST_NAME ASC,
    U.USERS_ID ASC
LIMIT
    20
OFFSET
    0;


Explain with the stock configuration, which has max_parallel_workers_per_gather = 2:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2529139.77..2529139.82 rows=20 width=187)
   ->  Sort  (cost=2529139.77..2530484.84 rows=538028 width=187)
         Sort Key: u.last_name, u.first_name, u.users_id
         ->  WindowAgg  (cost=2514822.88..2514823.03 rows=538028 width=187)
               ->  Finalize GroupAggregate  (cost=2432583.40..2508097.68 rows=538028 width=179)
                     Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                     ->  Gather Merge  (cost=2432583.40..2492181.03 rows=448356 width=179)
                           Workers Planned: 2
                           ->  Partial GroupAggregate  (cost=2431583.37..2439429.60 rows=224178 width=179)
                                 Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                                 ->  Sort  (cost=2431583.37..2432143.82 rows=224178 width=140)
                                       Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
                                       ->  Parallel Hash Left Join  (cost=1384936.37..2395567.35 rows=224178 width=140)
                                             Hash Cond: (u.users_id = cf.user_id)
                                             ->  Hash Left Join  (cost=1124308.04..2134350.56 rows=224178 width=134)
                                                   Hash Cond: (sgm.security_group_id = sg.security_group_id)
                                                   ->  Nested Loop Left Join  (cost=1119678.30..2129132.34 rows=224178 width=117)
                                                         ->  Parallel Hash Right Join  (cost=1119677.73..1326436.98 rows=224178 width=109)
                                                               Hash Cond: (ui.user_image_id = u.user_image_id)
                                                               ->  Parallel Seq Scan on user_image ui  (cost=0.00..130846.12 rows=3533412 width=18)
                                                               ->  Parallel Hash  (cost=1113372.50..1113372.50 rows=224178 width=99)
                                                                     ->  Parallel Bitmap Heap Scan on users u  (cost=8824.42..1113372.50 rows=224178 width=99)
                                                                           Recheck Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                                           ->  Bitmap Index Scan on u_act_dis_type  (cost=0.00..8689.92 rows=538028 width=0)
                                                                                 Index Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                         ->  Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm  (cost=0.57..3.57 rows=1 width=16)
                                                               Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
                                                   ->  Hash  (cost=4622.16..4622.16 rows=607 width=41)
                                                         ->  Index Scan using account_security_group_fk_ind on security_group sg  (cost=0.43..4622.16 rows=607 width=41)
                                                               Index Cond: (account_id = <craig redacted>)
                                                               Filter: ((disabled = 0) AND (security_group_type_id = 2))
                                             ->  Parallel Hash  (cost=259796.42..259796.42 rows=66553 width=14)
                                                   ->  Parallel Index Scan using date_value_idx on custom_field_value cf  (cost=0.56..259796.42 rows=66553 width=14)
                                                         Index Cond: (custom_field_id = <craig redacted>)
(34 rows)


explain (analyze) with set max_parallel_workers_per_gather = 0:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4152596.58..4152596.63 rows=20 width=187) (actual time=10192.249..10192.258 rows=20 loops=1)
   ->  Sort  (cost=4152596.58..4153941.65 rows=538028 width=187) (actual time=10192.248..10192.255 rows=20 loops=1)
         Sort Key: u.last_name, u.first_name, u.users_id
         Sort Method: top-N heapsort  Memory: 32kB
         ->  WindowAgg  (cost=4138279.81..4138279.85 rows=538028 width=187) (actual time=9748.632..9958.924 rows=904292 loops=1)
               ->  GroupAggregate  (cost=4112723.52..4131554.50 rows=538028 width=179) (actual time=8482.695..9389.560 rows=904292 loops=1)
                     Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                     ->  Sort  (cost=4112723.52..4114068.59 rows=538028 width=140) (actual time=8482.679..8655.695 rows=1720872 loops=1)
                           Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
                           Sort Method: external merge  Disk: 199104kB
                           ->  Hash Left Join  (cost=602312.67..3984272.46 rows=538028 width=140) (actual time=1955.881..7537.783 rows=1720872 loops=1)
                                 Hash Cond: (u.users_id = cf.user_id)
                                 ->  Hash Left Join  (cost=340636.13..3721183.60 rows=538028 width=134) (actual time=1806.879..6920.376 rows=1720872 loops=1)
                                       Hash Cond: (sgm.security_group_id = sg.security_group_id)
                                       ->  Nested Loop Left Join  (cost=336006.39..3715141.53 rows=538028 width=117) (actual time=1804.650..6599.170 rows=1720872 loops=1)
                                             ->  Hash Left Join  (cost=336005.82..1788669.80 rows=538028 width=109) (actual time=1804.623..3537.213 rows=904292 loops=1)
                                                   Hash Cond: (u.user_image_id = ui.user_image_id)
                                                   ->  Index Scan using u_act_dis_type on users u  (cost=0.56..1384749.23 rows=538028 width=99) (actual time=0.033..1133.900 rows=904292 loops=1)
                                                         Index Cond: ((account_id =  <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                   ->  Hash  (cost=180313.89..180313.89 rows=8480189 width=18) (actual time=1804.516..1804.517 rows=8488571 loops=1)
                                                         Buckets: 131072  Batches: 128  Memory Usage: 3986kB
                                                         ->  Seq Scan on user_image ui  (cost=0.00..180313.89 rows=8480189 width=18) (actual time=0.011..753.277 rows=8488571 loops=1)
                                             ->  Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm  (cost=0.57..3.57 rows=1 width=16) (actual time=0.003..0.003 rows=2 loops=904292)
                                                   Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
                                                   Heap Fetches: 0
                                       ->  Hash  (cost=4622.16..4622.16 rows=607 width=41) (actual time=2.219..2.220 rows=795 loops=1)
                                             Buckets: 1024  Batches: 1  Memory Usage: 78kB
                                             ->  Index Scan using account_security_group_fk_ind on security_group sg  (cost=0.43..4622.16 rows=607 width=41) (actual time=0.937..2.121 rows=795 loops=1)
                                                   Index Cond: (account_id = <craig redacted>)
                                                   Filter: ((disabled = 0) AND (security_group_type_id = 2))
                                                   Rows Removed by Filter: 764
                                 ->  Hash  (cost=260262.29..260262.29 rows=113140 width=14) (actual time=148.930..148.931 rows=125986 loops=1)
                                       Buckets: 131072  Batches: 1  Memory Usage: 6931kB
                                       ->  Index Scan using date_value_idx on custom_field_value cf  (cost=0.56..260262.29 rows=113140 width=14) (actual time=0.021..132.508 rows=125986 loops=1)
                                             Index Cond: (custom_field_id =  <craig redacted>)
 Planning Time: 0.983 ms
 Execution Time: 10233.621 ms
(37 rows)

I asked someone else to try to build artificial data for this query. Maybe they will have a different take and succeed where I have not.

Thanks
Craig
On 10/7/24 18:42, Craig Milhiser wrote:
> 
>     On Oct 1, 2024 Andrei Lepikhov wrote
> 
>  > Can you provide an explain of this query?
> 
> Apologies for the delay. I have been travelling since Wednesday night. 
> Thanks for your help and time with this issue.
> 
> Below is the query, with specific values redacted. An explain with 
> max_parallel_workers_per_gather = 2 and explain analyze 
> max_parallel_workers_per_gather = 0.
I'm a bit confused: the thread subject says 'invalid DSA memory alloc
request size ...', but you write about an issue with the OOM killer. Those are
two different issues; which one do you have, exactly?

The OOM killer can be explained easily, because I see a huge string_agg
aggregate - workers can utilise memory more intensively. For now, the
explain of an Aggregate node doesn't show information about the actual
sort operation of each aggregate or its memory consumption.

-- 
regards, Andrei Lepikhov




On Oct 8 Andrei Lepikhov wrote
> I'm a bit confused: the thread subject says 'invalid DSA memory alloc
> request size ...', but you write about an issue with the OOM killer. Those are
> two different issues; which one do you have, exactly?


I started with the invalid DSA memory allocation error. I was asked to try the experimental patch above. Then, with the patch applied, I got an OOM, but only when running in parallel. As you will see below, there was an OOM, but I do not believe the query itself is at fault.

Thanks for the push on OOM. I should have run this test earlier.

v17.0 and out of the box Postgres configuration.

I ran a new test on an instance with 512 GiB of memory.  After I applied the patch, the Invalid DSA memory allocation message was not replicated.  Running max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used <1 GiB of memory.  With max_parallel_workers_per_gather = 2 the query used  ~170 GiB of memory, ~70 GB of temp files were written and the query ran for more than 1 hour until I ran out of disk space. 

With the patch, I moved from an invalid DSA memory allocation of ~2 GB to using ~170 GiB of RAM and 70+ GB of temp files, but only when using 2 parallel workers per gather.

The new test:

This morning I increased the machine size from 32 GiB to 512 GiB RAM.  

With the patch applied and max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds at steady state. While it was running I captured memory. I ran the query a few times earlier to get the buffers loaded.

                total        used        free      shared  buff/cache   available
Mem:           493Gi       3.5Gi       484Gi       142Mi       8.4Gi       489Gi
Swap:             0B          0B          0B

With the patch applied and max_parallel_workers_per_gather = 2, the query ran for more than 1 hour. During that time memory settled at:
               total        used        free      shared  buff/cache   available
Mem:           493Gi       178Gi       209Gi       1.9Gi       110Gi       314Gi
Swap:             0B          0B          0B

Then the machine ran out of disk space: ERROR:  could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device

I captured top as well during the run.
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4951 postgres  20   0   46.8g  45.5g   1.9g D   5.6   9.2   2:40.40 postgres
   4942 postgres  20   0   68.9g  65.6g   1.9g D   5.3  13.3   3:25.35 postgres
   4952 postgres  20   0   68.4g  65.2g   1.9g D   5.3  13.2   3:07.43 postgres

After rebooting:
 df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        266G  197G   70G  74% /

As you mentioned, there are string aggregations. I ran with parallel = 0 and did some analysis. The aggregations do not seem to be producing anything out of line.

select max(length(groups)), sum(length(groups)), max(length(groupnames)), sum(length(groupnames)) from milhiser_test;
 max |   sum   | max |   sum
-----+---------+-----+----------
 143 | 6557620 | 499 | 22790616
(1 row)

Perhaps this is a different problem from the "invalid DSA memory alloc" one. The patch might have addressed that problem and this is another issue. Going from < 1 GiB to ~170 GiB of memory and ~70 GB of temp files when moving from parallel = 0 to 2 suggests something is off.


Summary before this test:
Before the patch linked above, I was receiving "ERROR:  invalid DSA memory alloc request size 1879048192" when I ran the query with max_parallel_workers_per_gather = 2. 

Before the patch with max_parallel_workers_per_gather = 0 the query worked in ~10 seconds at steady state.

I applied the patch to v17.0 source, rebuilt, and passed tests.

With max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds at steady state and took < 1 GiB of memory. 

With max_parallel_workers_per_gather = 2, the machine ran out of memory. This was a 32 GiB machine; when running without parallelism, ~30 GiB remained free.

Thanks

On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 10/7/24 18:42, Craig Milhiser wrote:
>
>     On Oct 1, 2024 Andrei Lepikhov wrote
>
>  > Can you provide an explain of this query?
>
> Apologies for the delay. I have been travelling since Wednesday night.
> Thanks for your help and time with this issue.
>
> Below is the query, with specific values redacted. An explain with
> max_parallel_workers_per_gather = 2 and explain analyze
> max_parallel_workers_per_gather = 0.
I'm a bit confused: the thread subject says 'invalid DSA memory alloc
request size ...', but you write about an issue with the OOM killer. Those are
two different issues; which one do you have, exactly?

The OOM killer can be explained easily, because I see a huge string_agg
aggregate - workers can utilise memory more intensively. For now, the
explain of an Aggregate node doesn't show information about the actual
sort operation of each aggregate or its memory consumption.

--
regards, Andrei Lepikhov

On Thu, Oct 10, 2024 at 5:28 AM Craig Milhiser <craig@milhiser.com> wrote:
> Then the machine ran out of disk space: ERROR:  could not write to file
"base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0":No space left on device 

For that, I have a patch in the queue to unlink temporary files incrementally:

https://www.postgresql.org/message-id/flat/CA+hUKG+RGdvhAdVu5_LH3Ksee+kW-XkTP_nMxBL+Rmgp3Tjb_w@mail.gmail.com

That's just treating a symptom, though.  Things have already gone
quite wrong if we're repeatedly repartitioning our way up to 2 million
batches and only giving up there because of Andrei's patch.

I wonder if something could be wrong with Parallel Hash Right
Join, which we see in your plan.  That's new-ish, and I vaguely recall
another case where that seemed to be on the scene in a plan with a
high number of batches... hmm.  Definitely keen to see a reproducer
with synthetic data if you can come up with one...



On 10/9/24 23:28, Craig Milhiser wrote:
> On Oct 8 Andrei Lepikhov wrote
>  > I'm a bit confused: the thread subject says 'invalid DSA memory alloc
>  > request size ...', but you write about an issue with the OOM killer. Those are
>  > two different issues; which one do you have, exactly?
> 
> 
> I started with the invalid DSA memory allocation error. I was asked to
> try the experimental patch above. Then, with the patch applied, I got an
> OOM, but only when running in parallel. As you will see below, there was
> an OOM, but I do not believe the query itself is at fault.
So, I think the patch works, but you have found one more issue with the
same query. Awesome!

> I ran a new test on an instance with 512 GiB of memory.  After I applied 
> the patch, the Invalid DSA memory allocation message was not 
> replicated.  Running max_parallel_workers_per_gather = 0, the query took 
> ~9.5 seconds and used <1 GiB of memory.  With 
> max_parallel_workers_per_gather = 2 the query used  ~170 GiB of memory, 
> ~70 GB of temp files were written and the query ran for more than 1 hour 
> until I ran out of disk space.
It's fascinating. I have one user report like that, but they also didn't
provide any synthetic test. I think it is almost impossible to create
such a reproduction without at least a minimal understanding of what's
happening. I can imagine only a data skew or a logical bug in this part of the code.
But without direct perf and gdb touch, it is hard to resolve the issue
by just gazing into the code.
Additional actions can provide some food for thought:
1. If you remove the aggregates (STRING_AGG, count) from the select list,
will the problem remain? What about OFFSET 0?
2. Can you build extended statistics on account_id, disabled, user_type_id
and provide an explain (and explain analyse)? (See the sketch below.)
3. Can you use pg_query_state [1] (unfortunately, it needs a patch and
re-compilation) and show us intermediate execution state snapshots?
4. I see a duplicate clause in the query: SGM.OBJECT_ID = U.OBJECT_ID.
Why do you have it here? Can you remove it from the query?
5. One more wild guess: can you check how many NULLs column u.users_id
contains at the moment when the HashJoin evaluates the clause
(u.users_id = cf.user_id)? (Also sketched below.)

[1] https://github.com/postgrespro/pg_query_state
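
A hedged sketch of items 2 and 5, reusing the table and column names from the redacted query above (the statistics object name is invented, and the NULL counts are only an approximation of what the executor actually sees at that point in the plan):

-- Item 2: extended statistics on the correlated filter columns, then re-ANALYZE.
CREATE STATISTICS users_acct_dis_type_stx (ndistinct, dependencies, mcv)
    ON account_id, disabled, user_type_id FROM users;
ANALYZE users;

-- Item 5 (rough check): NULLs in the two columns of the hash clause.
SELECT
    (SELECT count(*) FROM users WHERE users_id IS NULL)                AS null_users_id,
    (SELECT count(*) FROM custom_field_value cf WHERE cf.user_id IS NULL) AS null_user_id;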

-- 
regards, Andrei Lepikhov




Thomas Munro wrote
> I wonder if something could be wrong with Parallel Hash Right Join ... Definitely keen to see a reproducer
> with synthetic data if you can come up with one

Andrei Lepikhov wrote
> I can imagine only a data skew or a logical bug in this part of the code.
> But without direct perf and gdb touch, it is hard to resolve the issue
> by just gazing into the code.

Both of you are correct. 

I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

There are comments in the script. Please let me know if you have any questions or if you cannot reproduce it. If you want me to file a report via the form, let me know.

Using Postgres v17 with the out-of-the-box configuration.

drop table if exists test_users;
create table test_users (account_id bigint not null, users_id bigint not null constraint test_users_pkey primary key, first_name varchar(105), last_name varchar(105), user_image_id bigint);

-- The account we are interested in; data numbers are negative to eliminate duplicates and help with debugging
insert into test_users (account_id, users_id, first_name, last_name, user_image_id)
    SELECT -1, -1 * i, md5(random()::text), md5(random()::text), case when random() < 0.95 then null else -1 * i end
    FROM generate_series(1, 925_000) AS t(i)
;

-- Make enough other records to get the skew that forces a Parallel Hash Right Join, where the query breaks
-- Change the "< 0.50" to "< 0.95" to get a skew that gives a Parallel Hash Left Join, where the query works
-- 0.50 makes a right join and breaks; 0.95 makes a left join and works
-- This changes how many users are in user_image which, relative to the number of users and accounts, is the key skew that I found
-- Data numbers are positive
insert into test_users(account_id, users_id, first_name, last_name, user_image_id)
    SELECT random(10, 50_000)::bigint, i, md5(random()::text), md5(random()::text), case when random() < 0.50 then null else i end
    FROM generate_series(1, 50_000_000) AS t(i)
;

create index user_img_fk_idx on test_users using btree (user_image_id);

drop table if exists test_user_image;
create table test_user_image(user_image_id bigint not null constraint test_user_image_pkey primary key);
insert into test_user_image(user_image_id) select user_image_id from test_users where user_image_id is not null;

ALTER TABLE test_users ADD CONSTRAINT users_user_image_fk FOREIGN KEY (user_image_id) REFERENCES test_user_image(user_image_id);

commit;
analyze test_users;
analyze test_user_image;

-- at 0 workers the query will work
set max_parallel_workers_per_gather = 0;

SELECT U.USERS_ID
    ,  U.FIRST_NAME
    ,  U.LAST_NAME
FROM test_USERS U
    LEFT JOIN test_USER_IMAGE UI
        ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID
       , UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME ASC
       , U.FIRST_NAME ASC
       , U.USERS_ID ASC
LIMIT 20
OFFSET 0
;

set max_parallel_workers_per_gather = 2;
-- Explain the above query. When it breaks, a Parallel Hash Right Join is executed.
-- When a Parallel Hash Left Join is executed, the query works. Switch between left and right by changing the skew as noted above.
-- When run with stock Postgres 17, the invalid DSA memory allocation error occurs, which is what started this thread.
-- When run with the patch that fixes the invalid DSA memory allocation, the OOM occurs.
-- I reproduced the invalid DSA memory allocation with AWS Aurora v16.2. Naturally I cannot try patches there. The above was recreated with stock Postgres v17 on a plain EC2 instance.

Thank you for your time

Craig



On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:
> I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

Thanks, repro'd here.  At first glance, it looks like it's trying to
load this distribution into a hash table and failing to handle the
skew as well as non-parallel hash:

postgres=# select user_image_id, count(*) from test_users where
account_id = -1 group by 1 order by 2 desc limit 5;
 user_image_id | count
---------------+--------
               | 878823  <-- choking on this?
       -924960 |      1
       -924934 |      1
       -924917 |      1
       -924971 |      1
(5 rows)

                           ->  Parallel Hash Right Join  (cost=1027177.72..1368758.97 rows=363544 width=82)
                                 Hash Cond: (ui.user_image_id = u.user_image_id)
                                 ->  Parallel Seq Scan on test_user_image ui  (cost=0.00..215192.79 rows=10436379 width=8)
                                 ->  Parallel Hash  (cost=1017662.42..1017662.42 rows=363544 width=82)
                                       ->  Parallel Seq Scan on test_users u  (cost=0.00..1017662.42 rows=363544 width=82)
                                             Filter: (account_id = '-1'::integer)


Getting coffee and looking more closely...



On 10/14/24 04:08, Thomas Munro wrote:
> On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:
>> I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.
> 
> Thanks, repro'd here.  At first glance, it looks like it's trying to
> load this distribution into a hash table and failing to handle the
> skew as well as non-parallel hash:
> Getting coffee and looking more closely...
Hmm, with a reproduction it is too easy to solve ;)
My explanation (correct me if I'm wrong):
OUTER JOINs allow NULLs into the hash table. At the same time, the hash
value for NULL is 0, so it goes to batch 0.
If batch 0 gets overfilled, the
ExecParallelHashIncreaseNumBatches routine attempts to increase the
number of batches - but nothing changes. The initial batch is still too
big, and the number of batches doubles up to the limit.
At the limit, parallel HashJoin stops this growth and (I didn't trace this
part, just a guess) allocates memory for 2097152 batches, which causes the OOM.
To support this chain of thought, you can see the simple example below, 
which triggers the issue:

DROP TABLE IF EXISTS test;
CREATE TABLE test (n int);
INSERT INTO test (n) SELECT NULL FROM generate_series(1,1E6);
INSERT INTO test (n) VALUES (1);
ANALYZE test;

SET enable_nestloop = 'off';
SET enable_mergejoin = 'off';

SET max_parallel_workers_per_gather = 2;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;
SET parallel_setup_cost = 0.001;
SET parallel_tuple_cost = 0.0001;

EXPLAIN (ANALYZE, VERBOSE, COSTS OFF)
SELECT t1.n FROM test t1 LEFT JOIN test t2 USING (n);

I think it is now much easier to find a proper solution.

-- 
regards, Andrei Lepikhov




Andrei Lepikhov <lepihov@gmail.com> writes:
> My explanation (correct me if I'm wrong):
> OUTER JOINs allow NULLs into the hash table. At the same time, the hash
> value for NULL is 0, so it goes to batch 0.
> If batch 0 gets overfilled, the
> ExecParallelHashIncreaseNumBatches routine attempts to increase the
> number of batches - but nothing changes. The initial batch is still too
> big, and the number of batches doubles up to the limit.

Interesting point.  If memory serves (I'm too tired to actually look)
the planner considers the statistical most-common-value when
estimating whether an unsplittable hash bucket is likely to be too
big.  It does *not* think about null values ... but it ought to.

However, this does not explain why PHJ would be more subject to
the problem than non-parallel HJ.

            regards, tom lane
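
For what it's worth, the per-column statistics do record the NULL fraction separately from the most-common-values list, so the information exists even if the bucket-size estimate currently ignores it; a quick sketch against Craig's synthetic test_users table from earlier in the thread:

-- null_frac is tracked alongside (but separately from) the MCV list that
-- the unsplittable-bucket estimate looks at.
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'test_users'
  AND attname = 'user_image_id';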



I applied the patch for the parallel hash and ran it against my production data. The query worked with parallel workers. That patch was applied on top of the earlier patch for the invalid DSA memory alloc that started this thread.

From my view as a user, these are both fixed and can be marked as such. We will wait for the patches to roll through the release cycles.

If you change the patch and need another test with real data please let me know. 

Thank you for your time and effort on these issues.  
On 10/16/24 16:19, Thomas Munro wrote:
> On Mon, Oct 14, 2024 at 10:16 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> See the attachment for a sketch of the solution.
> 
> Thanks Andrei, I mostly agree with your analysis, but I came up with a
> slightly different patch.  I think we should check for extreme skew if
> old_batch->space_exhausted (the parent partition).  Your sketch always
> does it for batch 0, which works for these examples but I don't think
> it's strictly correct: if batch 0 didn't run out of memory, it might
> falsely report extreme skew just because it had (say) 0 or 1 tuples.
Yeah, I misunderstood the meaning of the estimated_size variable. Your
solution is more universal. Also, I confirm that it passes my synthetic test.
It also raises an immediate question: what if we have too many
duplicates? Sometimes, in user complaints, I see examples where they,
analysing the database's logical consistency, pass through millions of
duplicates to find an unexpected value. Do we need a top memory
consumption limit here? I recall a thread on the mailing list with a
general approach to limiting backend memory consumption, but it
finished with no result.
The patch looks good, as does the commentary.

-- 
regards, Andrei Lepikhov




On Thu, Oct 17, 2024 at 9:12 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> Yeah, I misunderstood the meaning of the estimated_size variable. Your
> solution is more universal. Also, I confirm that it passes my synthetic test.
> It also raises an immediate question: what if we have too many
> duplicates? Sometimes, in user complaints, I see examples where they,
> analysing the database's logical consistency, pass through millions of
> duplicates to find an unexpected value. Do we need a top memory
> consumption limit here? I recall a thread on the mailing list with a
> general approach to limiting backend memory consumption, but it
> finished with no result.

It is a hard problem alright[1].

> The patch looks good, as does the commentary.

Thanks, I will go ahead and push this now.

[1] https://www.postgresql.org/message-id/flat/CAAKRu_aLMRHX6_y%3DK5i5wBMTMQvoPMO8DT3eyCziTHjsY11cVA%40mail.gmail.com



On 10/17/24 15:57, Thomas Munro wrote:
> On Thu, Oct 17, 2024 at 9:12 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> Yeah, I misunderstood the meaning of the estimated_size variable. Your
>> solution is more universal. Also, I confirm that it passes my synthetic test.
>> It also raises an immediate question: what if we have too many
>> duplicates? Sometimes, in user complaints, I see examples where they,
>> analysing the database's logical consistency, pass through millions of
>> duplicates to find an unexpected value. Do we need a top memory
>> consumption limit here? I recall a thread on the mailing list with a
>> general approach to limiting backend memory consumption, but it
>> finished with no result.
> 
> It is a hard problem alright[1].
> 
>> The patch looks good, as does the commentary.
> 
> Thanks, I will go ahead and push this now.
> 
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_aLMRHX6_y%3DK5i5wBMTMQvoPMO8DT3eyCziTHjsY11cVA%40mail.gmail.com
Thanks for the link.
BTW, why not use the current case and fix the problem with the 'invalid
DSA memory alloc request size 1811939328' itself?

-- 
regards, Andrei Lepikhov




On Thu, Oct 17, 2024 at 10:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> BTW, why not use the current case and fix the problem with the 'invalid
> DSA memory alloc request size 1811939328' itself?

I think your patch is good but if you don't mind I'd like to think
about how to generalise it a bit first, so that it applies to all
places where we choose nbatch, not just repartitioning.  Unfortunately
that's a bit circular so I'm still thinking about the tidiest way to
do it... might take a few days due to travel, and if I don't have
something soon I guess your patch is better than nothing (it might be
the most common way we finish up in that sort of trouble).

I'll also push that other patch that cleans up temporary files
aggressively soon (master only), and try to think about some simple
ways to avoid large nbatch values that contradict the goal of reducing
memory for non-skew cases at planning and execution time (probably
master only)...



On 18/10/2024 06:34, Thomas Munro wrote:
> On Thu, Oct 17, 2024 at 10:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> BTW, why not use the current case and fix the problem with the 'invalid
>> DSA memory alloc request size 1811939328' itself?
> 
> I think your patch is good but if you don't mind I'd like to think
> about how to generalise it a bit first, so that it applies to all
> places where we choose nbatch, not just repartitioning.  Unfortunately
> that's a bit circular so I'm still thinking about the tidiest way to
> do it... might take a few days due to travel, and if I don't have
> something soon I guess your patch is better than nothing (it might be
> the most common way we finish up in that sort of trouble).
Thanks. I thought about generalisation, but who knows how the caller 
wants to react in the case of unsuccessful allocation? Should we force 
the user to compare the size requested and allocated? Maybe. I'll wait 
for your solution.
> 
> I'll also push that other patch that cleans up temporary files
> aggressively soon (master only),
Have been waiting for this!
> and try to think about some simple
> ways to avoid large nbatch values that contradict the goal of reducing
> memory for non-skew cases at planning and execution time (probably
> master only)...
Hmm, cost-based maximum number of batches defined at the optimisation stage?

-- 
regards, Andrei Lepikhov