BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id 18349-83d33dd3d0c855c3@postgresql.org
Whole thread Raw
Responses Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18349
Logged by:          Alexey Ermakov
Email address:      alexey.ermakov@dataegret.com
PostgreSQL version: 16.2
Operating system:   Rocky Linux release 9.3 (Blue Onyx)
Description:

Hello! I found a problem with memory consumption of parallel worker when
postgres 16.2 runs complex analytical query.
After simplifying I was able to reproduce it with following query with just
2 joins:

2024-02-16 12:14:25.669 UTC [425424] ERROR:  XX000: invalid DSA memory alloc
request size 1811939328
2024-02-16 12:14:25.669 UTC [425424] LOCATION:  dsa_allocate_extended,
dsa.c:690
2024-02-16 12:14:25.669 UTC [425424] STATEMENT:  SELECT *
        FROM history h
             LEFT JOIN coupon cpn ON cpn.recloc = h.recloc AND cpn.version =
h.version
             LEFT JOIN free_text i2 ON i2.recloc = cpn.recloc AND i2.part =
1
          WHERE h.date_operation >= '2024-02-13' and h.date_operation <
'2024-02-14';
2024-02-16 12:14:25.669 UTC [425424] DEBUG:  00000: shmem_exit(1): 5
before_shmem_exit callbacks to make
2024-02-16 12:14:25.669 UTC [425424] LOCATION:  shmem_exit, ipc.c:240

Here is some memory/parallel workers- related settings:
max_parallel_workers = 12
max_parallel_workers_per_gather = 2
hash_mem_multiplier = 2
work_mem = '4MB'

When I increase work_mem to 16MB then query runs successfully (same plan):

 Gather  (cost=1082271.57..8357367.15 rows=26292632 width=345) (actual
time=50735.451..64000.782 rows=1325161 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4102227 read=7004436, temp read=1808967
written=1809044
   I/O Timings: shared read=37739.143, temp read=4809.969 write=9139.669
   ->  Parallel Hash Right Join  (cost=1081271.57..5727103.95 rows=10955263
width=345) (actual time=50716.477..61271.276 rows=441720 loops=3)
         Hash Cond: ((i2.recloc)::text = (cpn.recloc)::text)
         Buffers: shared hit=4102227 read=7004436, temp read=1808967
written=1809044
         I/O Timings: shared read=37739.143, temp read=4809.969
write=9139.669
         ->  Parallel Seq Scan on free_text i2  (cost=0.00..4160436.50
rows=74411061 width=63) (actual time=0.196..37944.130 rows=60240671
loops=3)
               Filter: (part = 1)
               Rows Removed by Filter: 160726692
               Buffers: shared hit=158869 read=7004436
               I/O Timings: shared read=37739.143
         ->  Parallel Hash  (cost=1077675.12..1077675.12 rows=220524
width=282) (actual time=1406.500..1406.503 rows=209531 loops=3)
               Buckets: 131072  Batches: 8  Memory Usage: 30496kB
               Buffers: shared hit=3943254, temp written=10316
               I/O Timings: temp write=58.043
               ->  Nested Loop Left Join  (cost=1.14..1077675.12 rows=220524
width=282) (actual time=0.048..1239.847 rows=209531 loops=3)
                     Buffers: shared hit=3943254
                     ->  Parallel Index Scan using history_index_date on
history h  (cost=0.57..42000.62 rows=156296 width=90) (actual
time=0.021..76.927 rows=198024 loops=3)
                           Index Cond: ((date_operation >= '2024-02-13
00:00:00'::timestamp without time zone) AND (date_operation < '2024-02-14
00:00:00'::timestamp without time zone))
                           Buffers: shared hit=260503
                     ->  Index Scan using coupon_index_rl on coupon cpn
(cost=0.57..6.62 rows=1 width=192) (actual time=0.004..0.005 rows=1
loops=594071)
                           Index Cond: ((recloc)::text = (h.recloc)::text)
                           Filter: (version = h.version)
                           Rows Removed by Filter: 2
                           Buffers: shared hit=3682751
 Planning:
   Buffers: shared hit=72
 Planning Time: 0.506 ms
 Execution Time: 64060.476 ms
(32 rows)


Interesting thing: after query fails it take some time (seconds) before psql
gives control back.
It turns out that postgresql written 1M+ (!) temporary files (most of them
are empty) and cleaning up takes some time.
With bigger work_mem it won't write so many files. Maybe memory issue
somehow related with number of files.

In PostgreSQL 16.2 there were fix for similar bug: [1],[2]. Could it be that
there is some issue with it ?

I wasn't able to make reproducible test case so far but may be given
information is enough...


[1]
https://github.com/postgres/postgres/commit/2a67b5a60ee68892bb028587ddc6de7650822480
[2]
https://www.postgresql.org/message-id/flat/16925-ec96d83529d0d629%40postgresql.org


Thanks,
Alexey Ermakov


pgsql-bugs by date:

Previous
From: David Rowley
Date:
Subject: Re: BUG #18348: Inconsistency with EXTRACT([field] from INTERVAL);
Next
From: Andrei Lepikhov
Date:
Subject: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker