Thread: PostgreSQL 16.3 Out Of Memory
Hello,
I have an out-of-memory problem after upgrading from PostgreSQL 12 to 16.3. I have identified one query that can reproduce the error on demand. Once the query starts, it eventually exhausts all RAM and swap until the OOM killer stops PostgreSQL.
The setup is as follows:
- One VMware VM, RHEL 8.9, 64 GB RAM, 16 vCPU, 32 GB swap. The database is used by a local Java application that takes up to 22 GB (hard limit). The application uses 2 DB connections (long running, the app uses a connection pool). The database has about 10 tables, with 3 large tables involved in the problem query:
relation | total_size
-------------------+------------
outputrequest | 217 GB
inputrequest | 149 GB
tran | 5041 MB
I have the following settings:
shared_buffers = '8192MB'
effective_cache_size = '24GB'
maintenance_work_mem = '2GB'
checkpoint_completion_target = '0.9'
wal_buffers = '16MB'
default_statistics_target = '100'
random_page_cost = '1.1'
effective_io_concurrency = '200'
work_mem = '104857kB'
min_wal_size = '1GB'
max_wal_size = '4GB'
max_worker_processes = '16'
max_parallel_workers_per_gather = '4'
max_parallel_workers = '16'
max_parallel_maintenance_workers = '4'
jit = 'off'
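One setting not listed above is hash_mem_multiplier; assuming it is still at its default, that is 2.0 on v16, so each hash node may use up to 2 x work_mem per process. A quick way to confirm the memory-related settings actually in effect (settings not set in postgresql.conf show source = 'default'):
SELECT name, setting, unit, source
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'hash_mem_multiplier',
                'maintenance_work_mem', 'max_parallel_workers_per_gather');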
Besides the OOM there is another (possibly related) problem: the PostgreSQL memory usage is significantly higher than shared_buffers plus the memory used per connection. On my system, with shared_buffers = 8 GB and work_mem ~ 100 MB, I see 20 GB of memory usage while the system is running just simple inserts (the usage might have grown due to previous queries).
I have found that if I set enable_parallel_hash = 'off', I still get high memory usage for the problem query (above 30 GB) but no OOM. With enable_parallel_hash = 'on' I always get OOM. For simplicity I've restricted the postgresql systemd service to cap memory at 36 GB (so that I don't have to wait for the 32 GB of swap to fill before the OOM). This is the memory usage I get while the query is running, just before systemd restarts postgres - 35.9 GB, just below the 36 GB limit (the query is running under EXPLAIN (ANALYZE, BUFFERS); the other 2 idle connections are from the application):
[postgres@ips3 ~]$ systemctl status postgresql-16; top -u postgres -n 1 -c
● postgresql-16.service - PostgreSQL 16 database server
Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-16.service.d
└─override.conf
Active: active (running) since Mon 2024-06-03 04:23:16 +08; 11h ago
Docs: https://www.postgresql.org/docs/16/static/
Process: 283703 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 283709 (postgres)
Tasks: 14 (limit: 408404)
Memory: 35.9G (limit: 36.0G)
CGroup: /system.slice/postgresql-16.service
├─283709 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
├─283710 postgres: logger
├─295499 postgres: checkpointer
├─295500 postgres: background writer
├─295506 postgres: walwriter
├─295507 postgres: autovacuum launcher
├─295508 postgres: logical replication launcher
├─295511 postgres: db1 db1 127.0.0.1(37720) idle
├─295946 postgres: postgres db1 [local] EXPLAIN
├─295947 postgres: parallel worker for PID 295946
├─295948 postgres: parallel worker for PID 295946
├─295949 postgres: parallel worker for PID 295946
├─295950 postgres: parallel worker for PID 295946
└─296037 postgres: db1 db1 127.0.0.1(41708) idle
top - 16:14:32 up 14 days, 4:42, 1 user, load average: 5.41, 3.58, 2.25
Tasks: 405 total, 6 running, 399 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 5.2 sy, 0.0 ni, 71.2 id, 21.3 wa, 0.7 hi, 0.4 si, 0.0 st
MiB Mem : 63873.4 total, 2162.0 free, 57177.8 used, 4533.6 buff/cache
MiB Swap: 32260.0 total, 3813.8 free, 28446.2 used. 4349.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
295947 postgres 20 0 21.2g 7.4g 654168 R 20.0 11.8 1:31.81 postgres: parallel worker for PID 295946
295949 postgres 20 0 21.3g 7.5g 656316 R 20.0 12.0 1:32.57 postgres: parallel worker for PID 295946
295950 postgres 20 0 21.2g 7.5g 654756 R 20.0 12.1 1:31.99 postgres: parallel worker for PID 295946
295946 postgres 20 0 21.2g 7.4g 655684 R 13.3 11.8 1:31.86 postgres: postgres db1 [local] EXPLAIN
295948 postgres 20 0 21.6g 6.0g 656024 R 13.3 9.6 1:37.91 postgres: parallel worker for PID 295946
283709 postgres 20 0 8960628 44144 43684 S 0.0 0.1 0:42.28 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
283710 postgres 20 0 355288 4876 4576 S 0.0 0.0 0:04.42 postgres: logger
295499 postgres 20 0 8960936 107376 106868 S 0.0 0.2 0:01.75 postgres: checkpointer
295500 postgres 20 0 8960764 7764 7364 S 0.0 0.0 0:00.08 postgres: background writer
295506 postgres 20 0 8960764 22008 21624 S 0.0 0.0 0:00.21 postgres: walwriter
295507 postgres 20 0 8962216 6960 6496 S 0.0 0.0 0:00.00 postgres: autovacuum launcher
295508 postgres 20 0 8962232 7436 7060 S 0.0 0.0 0:00.00 postgres: logical replication launcher
295511 postgres 20 0 8974684 133608 131104 S 0.0 0.2 0:23.90 postgres: db1 db1 127.0.0.1(37720) idle
295945 postgres 20 0 274168 6236 5524 S 0.0 0.0 0:00.00 psql -f test.sql db1
296037 postgres 20 0 8962736 13972 13684 S 0.0 0.0 0:00.00 postgres: db1 db1 127.0.0.1(41708) idle
Below is the execution plan that gives OOM:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=41247490.48..45605473.14 rows=36397008 width=66)
Workers Planned: 4
-> Sort (cost=41246490.42..41269238.55 rows=9099252 width=66)
Sort Key: t.msg_status DESC, t.input_sequence
-> Parallel Hash Right Join (cost=33197712.34..39496299.38 rows=9099252 width=66)
Hash Cond: (snd_tro.reply_input_sequence = t.input_sequence)
-> Parallel Seq Scan on outputrequest snd_tro (cost=0.00..5056151.17 rows=19783917 width=16)
-> Parallel Hash (cost=32856685.24..32856685.24 rows=6148088 width=58)
-> Parallel Hash Right Join (cost=26706855.23..32856685.24 rows=6148088 width=58)
Hash Cond: (rec_tro.input_sequence = t.input_sequence)
-> Parallel Seq Scan on outputrequest rec_tro (cost=0.00..5056151.17 rows=19783917 width=16)
-> Parallel Hash (cost=26492657.28..26492657.28 rows=4154076 width=50)
-> Parallel Hash Right Join (cost=20364128.48..26492657.28 rows=4154076 width=50)
Hash Cond: (rpl_snd_tro.reply_input_sequence = r.input_sequence)
-> Parallel Seq Scan on outputrequest rpl_snd_tro (cost=0.00..5056151.17 rows=19783917 width=16)
-> Parallel Hash (cost=20166158.53..20166158.53 rows=4154076 width=42)
-> Parallel Hash Right Join (cost=14048784.52..20166158.53 rows=4154076 width=42)
Hash Cond: (rpl_rec_tro.input_sequence = r.input_sequence)
-> Parallel Seq Scan on outputrequest rpl_rec_tro (cost=0.00..5056151.17 rows=19783917 width=16)
-> Parallel Hash (cost=13867042.57..13867042.57 rows=4154076 width=34)
-> Parallel Hash Right Join (cost=6730983.28..13867042.57 rows=4154076 width=34)
Hash Cond: (r.originalrequest_id = t.input_sequence)
-> Parallel Seq Scan on inputrequest r (cost=0.00..6465209.90 rows=13367390 width=16)
-> Parallel Hash (cost=6565465.33..6565465.33 rows=4154076 width=26)
-> Parallel Seq Scan on inputrequest t (cost=0.00..6565465.33 rows=4154076 width=26)
Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))
(26 rows)
This is the execution plan with enable_parallel_hash = 'off', which completes but still uses more than 30 GB of RAM:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=51681038.73..56036969.38 rows=36379870 width=66) (actual time=68534.468..72489.115 rows=14895689 loops=1)
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=212733952 read=3252691 dirtied=28327 written=35494, temp read=1350681 written=1350691
-> Sort (cost=51680038.67..51702776.09 rows=9094968 width=66) (actual time=68352.387..68953.103 rows=2979138 loops=5)
Sort Key: t.msg_status DESC, t.input_sequence
Sort Method: external merge Disk: 179448kB
Buffers: shared hit=212733952 read=3252691 dirtied=28327 written=35494, temp read=1350681 written=1350691
Worker 0: Sort Method: external merge Disk: 177528kB
Worker 1: Sort Method: external merge Disk: 178216kB
Worker 2: Sort Method: external merge Disk: 177272kB
Worker 3: Sort Method: external merge Disk: 176248kB
-> Hash Left Join (cost=8179839.27..49930701.19 rows=9094968 width=66) (actual time=25952.805..64809.594 rows=2979138 loops=5)
Hash Cond: (t.input_sequence = rec_tro.input_sequence)
Buffers: shared hit=212733888 read=3252691 dirtied=28327 written=35494, temp read=1239592 written=1239592
-> Nested Loop Left Join (cost=2.27..38995500.02 rows=6144595 width=58) (actual time=1.541..27287.533 rows=2979138 loops=5)
Buffers: shared hit=207879681 read=2037114 dirtied=1546 written=35418
-> Nested Loop Left Join (cost=1.70..30880205.10 rows=4151312 width=50) (actual time=1.236..21739.231 rows=2979138 loops=5)
Buffers: shared hit=140861726 read=2004019 dirtied=1546 written=34842
-> Nested Loop Left Join (cost=1.13..22788518.11 rows=4151312 width=42) (actual time=1.082..18110.990 rows=2979138 loops=5)
Buffers: shared hit=103757087 read=1971773 dirtied=1546 written=33891
-> Nested Loop Left Join (cost=0.56..14665384.33 rows=4151312 width=34) (actual time=0.718..14005.092 rows=2979138 loops=5)
Buffers: shared hit=66987152 read=1604817 dirtied=1546 written=20820
-> Parallel Seq Scan on inputrequest t (cost=0.00..6561097.71 rows=4151312 width=26) (actual time=0.291..4167.263 rows=2979138 loops=5)
Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))
Rows Removed by Filter: 7555153
Buffers: shared hit=693060 read=888771 dirtied=1546 written=395
-> Index Scan using inputrequest_originalrequest_id_idx on inputrequest r (cost=0.56..1.94 rows=1 width=16) (actual time=0.003..0.003 rows=0 loops=14895689)
Index Cond: (originalrequest_id = t.input_sequence)
Buffers: shared hit=66294092 read=716046 written=20425
-> Index Scan using outputrequest_input_sequence_idx on outputrequest rpl_rec_tro (cost=0.57..1.95 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=14895689)
Index Cond: (input_sequence = r.input_sequence)
Buffers: shared hit=36769935 read=366956 written=13071
-> Index Scan using outputrequest_reply_input_sequence_idx on outputrequest rpl_snd_tro (cost=0.57..1.94 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=14895689)
Index Cond: (reply_input_sequence = r.input_sequence)
Buffers: shared hit=37104639 read=32246 written=951
-> Index Scan using outputrequest_reply_input_sequence_idx on outputrequest snd_tro (cost=0.57..1.94 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=14895689)
Index Cond: (reply_input_sequence = t.input_sequence)
Buffers: shared hit=67017955 read=33095 written=576
-> Hash (cost=5646459.11..5646459.11 rows=79090711 width=16) (actual time=25930.971..25930.972 rows=52601291 loops=5)
Buckets: 4194304 Batches: 32 Memory Usage: 109766kB
Buffers: shared hit=4853863 read=1215577 dirtied=26781 written=76, temp written=1119615
-> Seq Scan on outputrequest rec_tro (cost=0.00..5646459.11 rows=79090711 width=16) (actual time=0.114..17424.298 rows=79099765 loops=5)
Buffers: shared hit=4853863 read=1215577 dirtied=26781 written=76
Planning:
Buffers: shared hit=328 read=12
Planning Time: 3.861 ms
Execution Time: 73323.032 ms
(48 rows)
After the query runs successfully, the PostgreSQL memory usage is:
● postgresql-16.service - PostgreSQL 16 database server
Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-16.service.d
└─override.conf
Active: active (running) since Mon 2024-06-03 04:23:16 +08; 13h ago
Docs: https://www.postgresql.org/docs/16/static/
Process: 283703 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 283709 (postgres)
Tasks: 9 (limit: 408404)
Memory: 31.3G (limit: 36.0G)
CGroup: /system.slice/postgresql-16.service
├─283709 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
├─283710 postgres: logger
├─296122 postgres: checkpointer
├─296123 postgres: background writer
├─296128 postgres: walwriter
├─296129 postgres: autovacuum launcher
├─296130 postgres: logical replication launcher
├─296133 postgres: db1 db1 127.0.0.1(52340) idle
└─297725 postgres: db1 db1 127.0.0.1(33216) idle
top - 18:09:44 up 14 days, 6:37, 1 user, load average: 0.01, 0.07, 0.08
Tasks: 394 total, 2 running, 392 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.7 sy, 0.0 ni, 98.5 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 63873.4 total, 3798.3 free, 22165.3 used, 37909.8 buff/cache
MiB Swap: 32260.0 total, 31045.3 free, 1214.7 used. 32600.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
297728 postgres 20 0 264520 4844 3808 R 6.2 0.0 0:00.02 top -u postgres -n 1 -c
283709 postgres 20 0 8960628 198604 197992 S 0.0 0.3 1:23.66 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
283710 postgres 20 0 355288 4808 4576 S 0.0 0.0 0:04.43 postgres: logger
294897 postgres 20 0 226848 5936 3296 S 0.0 0.0 0:00.18 -bash
296122 postgres 20 0 8960912 2.5g 2.5g S 0.0 4.0 0:03.75 postgres: checkpointer
296123 postgres 20 0 8960764 161884 161160 S 0.0 0.2 0:00.41 postgres: background writer
296128 postgres 20 0 8960764 22404 21748 S 0.0 0.0 0:00.49 postgres: walwriter
296129 postgres 20 0 8962216 7800 6872 S 0.0 0.0 0:00.00 postgres: autovacuum launcher
296130 postgres 20 0 8962232 7784 6884 S 0.0 0.0 0:00.00 postgres: logical replication launcher
296133 postgres 20 0 8964788 2.2g 2.2g S 0.0 3.6 1:29.51 postgres: db1 db1 127.0.0.1(52340) idle
297725 postgres 20 0 8962736 17620 16076 S 0.0 0.0 0:00.00 postgres: db1 db1 127.0.0.1(33216) idle
Does anyone else see higher than expected memory usage reported by systemctl status?
Do you have any idea how to further debug the problem?
Best Regards,
Radu
On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu <rradutiu@gmail.com> wrote:
Do you have any idea how to further debug the problem?
Putting aside the issue of non-reclaimed memory for now, can you show us the actual query? The explain analyze you provided shows it doing an awful lot of joins and then returning 14+ million rows to the client. Surely the client does not need that many rows?
You can probably also prevent OOM by lowering your work_mem and/or your max_parallel_workers_per_gather.
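For example, at the session level before running the query (the values here are only illustrative):
SET work_mem = '32MB';
SET max_parallel_workers_per_gather = 2;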
Cheers,
Greg
On Mon, Jun 3, 2024 at 9:12 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu <rradutiu@gmail.com> wrote:
Do you have any idea how to further debug the problem?
Putting aside the issue of non-reclaimed memory for now, can you show us the actual query? The explain analyze you provided shows it doing an awful lot of joins and then returning 14+ million rows to the client. Surely the client does not need that many rows?
And the query cost is really high. "Did you ANALYZE the instance after conversion?" was my first question.
It looks like I did not copy the list.
I did run VACUUM ANALYZE after the upgrade. I've even run it again now, before capturing a new explain plan, which is very similar (just the costs differ):
Gather Merge (cost=12336145.92..16111570.23 rows=31531596 width=66)
Workers Planned: 4
-> Sort (cost=12335145.86..12354853.11 rows=7882899 width=66)
Sort Key: t.msg_status DESC, t.input_sequence
-> Parallel Hash Right Join (cost=9376528.66..11242773.26 rows=7882899 width=66)
Hash Cond: (snd_tro.reply_input_sequence = t.input_sequence)
-> Parallel Seq Scan on tbl_outputrequest snd_tro (cost=0.00..1431919.45 rows=20057645 width=16)
-> Parallel Hash (cost=9254599.76..9254599.76 rows=5245992 width=58)
-> Parallel Hash Right Join (cost=7458636.44..9254599.76 rows=5245992 width=58)
Hash Cond: (rec_tro.input_sequence = t.input_sequence)
-> Parallel Seq Scan on tbl_outputrequest rec_tro (cost=0.00..1431919.45 rows=20057645 width=16)
-> Parallel Hash (cost=7380902.99..7380902.99 rows=3491156 width=50)
-> Parallel Hash Right Join (cost=5592677.17..7380902.99 rows=3491156 width=50)
Hash Cond: (rpl_snd_tro.reply_input_sequence = r.input_sequence)
-> Parallel Seq Scan on tbl_outputrequest rpl_snd_tro (cost=0.00..1431919.45 rows=20057645 width=16)
-> Parallel Hash (cost=5518353.72..5518353.72 rows=3491156 width=42)
-> Parallel Hash Right Join (cost=3729209.40..5518353.72 rows=3491156 width=42)
Hash Cond: (rpl_rec_tro.input_sequence = r.input_sequence)
-> Parallel Seq Scan on tbl_outputrequest rpl_rec_tro (cost=0.00..1431919.45 rows=20057645 width=16)
-> Parallel Hash (cost=3658294.95..3658294.95 rows=3491156 width=34)
-> Parallel Hash Right Join (cost=1883503.35..3658294.95 rows=3491156 width=34)
Hash Cond: (r.originalrequest_id = t.input_sequence)
-> Parallel Seq Scan on tbl_inputrequest r (cost=0.00..1739752.66 rows=13348166 width=16)
-> Parallel Hash (cost=1839863.91..1839863.91 rows=3491156 width=26)
-> Parallel Seq Scan on tbl_inputrequest t (cost=0.00..1839863.91 rows=3491156 width=26)
Filter: ((receive_time < '2024-05-17 00:00:00'::timestamp without time zone) AND (input_sequence < '202406020168279904'::bigint) AND ((msg_status)::text = ANY ('{COMPLETED,REJECTED}'::text[])))
The query cost is high and it returns a significant number of rows. However, it should not consume 64+ GB of RAM with the default enable_parallel_hash = 'on' when my shared_buffers is 8 GB. The temporary fix I've implemented to get the system working is a change in the application configuration so that the timestamp filter selects fewer rows (3x fewer), together with setting enable_parallel_hash = 'off'. PostgreSQL service memory usage grows and stays over 20 GB even with this setting. I'd like to find out exactly what causes the high memory usage, as we have other projects using PostgreSQL that are scheduled for upgrade from v12.
My test.sql looks like this (the application uses prepared statements; the two set operations are to revert, for the current session, the fixes already implemented):
prepare my_query (timestamp, bigint) as
  SELECT t.input_sequence, rec_tro.output_sequence, r.input_sequence,
         rpl_rec_tro.output_sequence, rpl_snd_tro.output_sequence,
         snd_tro.output_sequence, t.msg_type
    FROM inputrequest t
    LEFT JOIN outputrequest rec_tro     ON rec_tro.input_sequence = t.input_sequence
    LEFT JOIN inputrequest  r           ON r.originalRequest_id = t.input_sequence
    LEFT JOIN outputrequest rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
    LEFT JOIN outputrequest rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
    LEFT JOIN outputrequest snd_tro     ON snd_tro.reply_input_sequence = t.input_sequence
   WHERE t.receive_time < $1
     AND t.input_sequence < $2
     AND t.msg_status IN ('COMPLETED', 'REJECTED')
   ORDER BY t.msg_status DESC, t.input_sequence;
EXPLAIN EXECUTE my_query('2024-05-17 00:00:00', 202406020168279904);
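(The timed plan with buffer counts shown earlier was collected on the same prepared statement with something like the following; only the EXPLAIN options differ:)
EXPLAIN (ANALYZE, BUFFERS) EXECUTE my_query('2024-05-17 00:00:00', 202406020168279904);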
I have an explanation for what I suspected was a memory leak. It seems that systemd reports cached memory as well; the relevant figure is probably rss from /sys/fs/cgroup/memory/system.slice/postgresql-16.service/memory.stat. This still leaves the out-of-memory situation unexplained.
I've managed to get a dump of the memory contexts for the 5 processes (the EXPLAIN backend + 4 parallel workers) while the query was starting to use significant memory (>10 GB). I've attached it in case anyone familiar with PostgreSQL internals can spot something obviously wrong in the reported numbers.
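In case it helps anyone reproduce this: one way to capture such dumps from a separate (superuser) session while the query is running is pg_log_backend_memory_contexts(), which writes the memory contexts of the target backend to the server log. The PID below is just the leader from the earlier top output; the exact PIDs will differ per run:
-- log memory contexts of the leader and of all parallel workers to the server log
SELECT pid, backend_type, pg_log_backend_memory_contexts(pid)
  FROM pg_stat_activity
 WHERE pid = 295946                     -- leader backend running the EXPLAIN
    OR backend_type = 'parallel worker';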
Radu