Thread: Postgresql OOM
Hello all,
I have a query that forces an out of memory error, where the OS will kill the postgresql process.
The query plan (run immediately after a vacuum analyze) is at https://explain.depesz.com/s/ITQI#html .
PostgreSQL version 16.3, running on RHEL 8.9, 16 vCPU, 64 GB RAM, 32 GB swap
shared_buffers=8G
effective_cache_size=24G
maintenance_work_mem=2G
work_mem=104857kB
default_statistics_target = 100
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
jit=off
It looks like the excessive memory allocation is reported in HashSpillContext. I've attached the dump of the memory context for the 5 processes (query + 4 parallel workers) some time after query start. I also see a huge number of temporary files being created. For the time being I've set enable_parallel_hash = 'off' and the problem went away.
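For reference, a minimal sketch of how per-process memory context dumps like the attached ones can be captured from psql (assuming PostgreSQL 14+; the pid 12345 is a placeholder for the leader backend, not a value from this report):

    -- Write memory context dumps for the query's leader backend and its
    -- parallel workers to the server log (placeholder pid).
    SELECT pg_log_backend_memory_contexts(pid)
    FROM pg_stat_activity
    WHERE pid = 12345 OR leader_pid = 12345;

    -- The per-session form of the workaround mentioned above.
    SET enable_parallel_hash = off;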
I've seen a potentially similar problem reported in https://www.postgresql.org/message-id/flat/20230516200052.sbg6z4ghcmsas3wv%40liskov#f6059259c7c9251fb8c17f5793a2d427 .
Any idea on how to identify the problem? I can reproduce it on demand. Should I report it to pgsql-bugs?
Best regards,
Radu
On Thu, Jun 6, 2024 at 1:25 PM Radu Radutiu <rradutiu@gmail.com> wrote:
Hello all,
I have a query that forces an out of memory error, where the OS will kill the postgresql process.
The query plan (run immediately after a vacuum analyze) is at https://explain.depesz.com/s/ITQI#html .
...
Any idea on how to identify the problem? I can reproduce it on demand. Should I report it to pgsql-bugs?
Best regards,
Radu
I am not qualified to answer on the OOM issue, but why are you joining the same table (outputrequest) 4 times (using an identical join condition)?
This essentially does a cross join: if an input_sequence value has, say, 1000 related rows in outputrequest, you will be getting 1000^4 rows in the result set.
FROM inputrequest t
   LEFT JOIN outputrequest rec_tro ON rec_tro.input_sequence = t.input_sequence
   LEFT JOIN inputrequest r ON r.originalRequest_id = t.input_sequence
   LEFT JOIN outputrequest rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
   LEFT JOIN outputrequest rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
   LEFT JOIN outputrequest snd_tro ON snd_tro.reply_input_sequence = t.input_sequence
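To illustrate the multiplication described above, a toy sketch (hypothetical tables, not the schema from the thread) showing how repeated left joins on the same key blow up the row count:

    -- One parent key with 3 matching child rows: two left joins on the same
    -- condition already return 3 * 3 = 9 rows for that key.
    CREATE TEMP TABLE parent (id bigint);
    CREATE TEMP TABLE child (parent_id bigint);
    INSERT INTO parent VALUES (1);
    INSERT INTO child SELECT 1 FROM generate_series(1, 3);

    SELECT count(*)                      -- returns 9
    FROM parent p
    LEFT JOIN child a ON a.parent_id = p.id
    LEFT JOIN child b ON b.parent_id = p.id;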
The query itself runs fine in a reasonable time with enable_parallel_hash = 'off'. I see two problems - one is the wrong execution plan (right after running analyze), the second and the most important is the huge memory usage (far exceeding work_mem and shared buffers) leading to OOM.
See https://explain.depesz.com/s/yAqS for the explain plan with enable_parallel_hash = 'off'.
Radu Radutiu <rradutiu@gmail.com> writes:
> The query itself runs fine in a reasonable time with enable_parallel_hash =
> 'off'. I see two problems - one is the wrong execution plan (right after
> running analyze), the second and the most important is the huge memory
> usage (far exceeding work_mem and shared buffers) leading to OOM.
> See https://explain.depesz.com/s/yAqS for the explain plan
> with enable_parallel_hash = 'off'.

What it looks like to me is that the join key column has very skewed
statistics, such that a large majority of the tuples end up in the same
hash bucket (probably they even all have identical keys). I think the
memory growth is coming from the executor repeatedly splitting the
buckets in a vain attempt to separate those tuples into multiple buckets.

The planner should recognize this situation and avoid use of hash
join in such cases, but maybe the statistics aren't reflecting the
problem, or maybe there's something wrong with the logic specific
to parallel hash join. You've not really provided enough information
to diagnose why the poor choice of plan.

regards, tom lane
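For illustration, skew of that kind would normally show up in the planner statistics; a sketch of how to look for it, using the join-key columns from the query (not something posted in the thread):

    -- Heavy skew would appear as a most_common_freqs entry close to 1,
    -- or as a very small n_distinct.
    SELECT tablename, attname, n_distinct, null_frac,
           most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'outputrequest'
      AND attname IN ('input_sequence', 'reply_input_sequence');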
The planner should recognize this situation and avoid use of hash
join in such cases, but maybe the statistics aren't reflecting the
problem, or maybe there's something wrong with the logic specific
to parallel hash join. You've not really provided enough information
to diagnose why the poor choice of plan.
regards, tom lane

Thanks for looking into this. I'm not sure what information would be needed to
look at the choice of plan.

The statistics for the join conditions in the query would be:

 join_condition | min_count | max_count | avg_count
----------------+-----------+-----------+----------------------------
 snd_tro        |         0 |         0 | 0.000000000000000000000000
 rpl_rec_tro    |         0 |         2 |     0.99869222814474470477
 rec_tro        |         0 |         2 |     0.99869222814474470477
 rpl_snd_tro    |         0 |         0 | 0.000000000000000000000000
 r              |         0 |         1 |     0.49850916663490161653

The relevant columns for the tables are:

postgres=# \d inputrequest
Table "public.inputrequest"
Column | Type | Collation | Nullable | Default
--------------------------+-----------------------------+-----------+----------+---------
input_sequence | bigint | | not null |
msg_type | character varying(8) | | not null |
msg_content | text | | not null |
msg_reference | character varying(35) | | |
originalrequest_id | bigint | | |
receive_time | timestamp without time zone | | not null |
related_output_sequence | bigint | | |
msg_status | character varying(15) | | |
Indexes:
"inputrequest_pkey" PRIMARY KEY, btree (input_sequence)
"inputrequest_originalrequest_id_idx" btree (originalrequest_id)
postgres=# \d outputrequest
                             Table "public.outputrequest"
         Column         |            Type             | Collation | Nullable | Default
------------------------+-----------------------------+-----------+----------+---------
 output_sequence        | bigint                      |           | not null |
 input_sequence         | bigint                      |           |          |
 msg_type               | character varying(8)        |           |          |
 msg_content            | text                        |           | not null |
 msg_reference          | character varying(35)       |           |          |
 reply_input_sequence   | bigint                      |           |          |
 status                 | integer                     |           | not null |
 related_input_sequence | bigint                      |           |          |
Indexes:
    "outputrequest_pkey" PRIMARY KEY, btree (output_sequence)
    "outputrequest_input_sequence_idx" btree (input_sequence)
    "outputrequest_reply_input_sequence_idx" btree (reply_input_sequence)
I wonder if our choice of primary keys (input_sequence and output_sequence) has something to do with the skew in the hash bucket distribution. We use the following format: yyyymmdd????????xx, where ???????? is more or less a sequence and xx is the node generating the id, i.e. 01, 02, etc. (with only one or two values in the dataset).
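A quick way to see how many distinct trailing node ids the keys actually contain, sketched under the yyyymmdd????????xx format described above (not something run in the thread):

    -- Distribution of the trailing two-digit node id across the keys.
    SELECT right(input_sequence::text, 2) AS node_id, count(*)
    FROM inputrequest
    GROUP BY node_id
    ORDER BY count(*) DESC;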
I wonder if it would be difficult to have an upper limit on the private memory that can be allocated by one process (or by all processes, similar to Oracle's pga_aggregate_limit). I would rather have one query fail with an error message instead of postgres eating up all the memory and swap on the server.
Best regards,
Radu
Hi,

On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.

FWIW, it can be useful to configure the OS with strict memory overcommit. That
causes postgres to fail more gracefully, because the OOM killer won't be
invoked.

> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .

Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of
workers? It'd be useful to get some of the information about the actual
numbers of tuples etc.

Greetings,

Andres Freund
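A sketch of what that could look like in a session, reusing the prepared statement and parameters that appear later in the thread:

    -- Re-run the problematic query with fewer (or no) parallel workers for
    -- this session only, capturing actual row counts and buffer usage.
    SET max_parallel_workers_per_gather = 2;   -- or 0 to disable parallelism
    EXPLAIN (ANALYZE, BUFFERS)
    EXECUTE my_query('2024-05-18 00:00:00', 202406020168279904);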
Hi,

On 2024-06-06 13:58:24 +0100, Pantelis Theodosiou wrote:
> I am not qualified to answer on the OOM issue, but why are you joining the
> same table (outputrequest) 4 times (using an identical join condition)?

The conditions aren't actually the same:

  rpl_rec_tro.input_sequence = r.input_sequence
  rpl_snd_tro.reply_input_sequence = r.input_sequence
  snd_tro.reply_input_sequence = t.input_sequence

The first two join r.input_sequence to different columns; the third one also
uses reply_input_sequence, but joins to t, not r.

Greetings,

Andres Freund
On Fri, Jun 7, 2024 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.
FWIW, it can be useful to configure the OS with strict memory overcommit. That
causes postgres to fail more gracefully, because the OOM killer won't be
invoked.
> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .
Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of
workers? It'd be useful to get some of the information about the actual
numbers of tuples etc.
Hi,
I first tried giving more memory to the OS and mounting a tmpfs in pgsql_tmp. It didn't work; I got:
ERROR: invalid DSA memory alloc request size 1140850688
CONTEXT: parallel worker
CONTEXT: parallel worker
I've seen around 2 million temporary files created before the crash.
With work_mem at 100MB I was not able to get it to work with 2 parallel workers.
Next, I increased work_mem to 200MB and now (with extra memory and tmpfs) it finished: https://explain.depesz.com/s/NnRC
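As an aside, a sketch of how the temporary-file pressure mentioned above can be tracked while testing (these are cumulative counters, so compare values before and after a run):

    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
    FROM pg_stat_database
    WHERE datname = current_database();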
Radu
FWIW, it can be useful to configure the OS with strict memory overcommit. That
causes postgres to fail more gracefully, because the OOM killer won't be
invoked.
In the current setup the database is used as an embedded db, with the application sharing the same host as the database. Setting strict memory overcommit would affect the other application, but configuring LimitAS for the postgresql systemd service should work.
Does this mean that this query using more than 100x the configured work_mem is not considered a bug? Should I create a bug report?
Hi,

On 2024-06-07 21:42:58 +0300, Radu Radutiu wrote:
> On Fri, Jun 7, 2024 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> > > I have a query that forces an out of memory error, where the OS will kill
> > > the postgresql process.
> >
> > FWIW, it can be useful to configure the OS with strict memory overcommit. That
> > causes postgres to fail more gracefully, because the OOM killer won't be
> > invoked.
> >
> > > The query plan (run immediately after a vacuum analyze) is at
> > > https://explain.depesz.com/s/ITQI#html .
> >
> > Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of
> > workers? It'd be useful to get some of the information about the actual
> > numbers of tuples etc.
>
> I've tried first giving more memory to the OS and mounting a tmpfs
> in pgsql_tmp. It didn't work, I got
> ERROR: invalid DSA memory alloc request size 1140850688
> CONTEXT: parallel worker
> I've seen around 2 million temporary files created before the crash.
> With work_mem 100MB I was not able to get it to work with 2 parallel
> workers.
> Next, I've increased work_mem to 200MB and now (with extra memory and
> tmpfs) it finished: https://explain.depesz.com/s/NnRC

That's helpful, thanks!

One thing to note is that postgres' work_mem is confusing - it applies to
individual query execution nodes, not the whole query. Additionally, when you
use parallel workers, each of the parallel workers can use the "full"
work_mem, rather than work_mem being evenly distributed across workers.

Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely unexpected
(I'd say it's unreasonable if you're not a postgres dev, but that's a
different issue).

We can see each of the Hash nodes use ~1GB, which is due to
(1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.

In this specific query we probably could free the memory in the "lower" hash
join nodes once the node directly above has finished building, but we don't
have the logic for that today.

Of course that doesn't explain why the memory usage / temp file creation is so
extreme with a lower limit / fewer workers. There aren't any bad statistics
afaics, nor can I really see a potential for a significant skew, it looks to
me that the hashtables are all built on a single text field and that nearly
all the input rows are distinct.

Could you post the table definition (\d+) and the database definition
(\l). One random guess I have is that you ended up with a "non-deterministic"
collation for that column and that we end up with a bad hash due to that.

Greetings,

Andres Freund
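To make that arithmetic concrete, a sketch of the per-node budget under the settings from this run (the hash_mem_multiplier value shown is the v16 default and an assumption about this particular setup):

    -- Per parallel hash node, roughly:
    --   (1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB
    -- hash_mem_multiplier can raise that ceiling further, and a plan with
    -- several hash joins holds several such hashtables at the same time.
    SHOW work_mem;                          -- 200MB in this run
    SHOW max_parallel_workers_per_gather;   -- 4
    SHOW hash_mem_multiplier;               -- 2 (default in v16)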
Hi,
That's helpful, thanks!
One thing to note is that postgres' work_mem is confusing - it applies to
individual query execution nodes, not the whole query. Additionally, when you
use parallel workers, each of the parallel workers can use the "full"
work_mem, rather than work_mem being evenly distributed across workers.
Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely unexpected
(I'd say it's unreasonable if you're not a postgres dev, but that's a
different issue).
We can see each of the Hash nodes use ~1GB, which is due to
(1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.
In this specific query we probably could free the memory in the "lower" hash
join nodes once the node directly above has finished building, but we don't
have the logic for that today.
I would understand 1 GB, even 2 GB (considering hash_mem_multiplier), but the server ran out of memory with more than 64 GB.
Of course that doesn't explain why the memory usage / temp file creation is so
extreme with a lower limit / fewer workers. There aren't any bad statistics
afaics, nor can I really see a potential for a significant skew, it looks to
me that the hashtables are all built on a single text field and that nearly
all the input rows are distinct.
Could you post the table definition (\d+) and the database definition
(\l). One random guess I have is that you ended up with a "non-deterministic"
collation for that column and that we end up with a bad hash due to that.
I was able to eliminate the columns not involved in the query while still retaining the problematic behavior (that's the reason for the new table names):
postgres=# \d inreq
Table "public.inreq"
Column | Type | Collation | Nullable | Default
-------------------------+-----------------------------+-----------+----------+---------
input_sequence | bigint | | not null |
msg_type | character varying(8) | | |
originalrequest_id | bigint | | |
receive_time | timestamp without time zone | | |
related_output_sequence | bigint | | |
msg_status | character varying(15) | | |
Indexes:
"inreq_pkey" PRIMARY KEY, btree (input_sequence)
"inreq_originalrequest_id_idx" btree (originalrequest_id)
postgres=# \d outreq
Table "public.outreq"
Column | Type | Collation | Nullable | Default
------------------------+--------+-----------+----------+---------
output_sequence | bigint | | not null |
input_sequence | bigint | | |
reply_input_sequence | bigint | | |
related_input_sequence | bigint | | |
Indexes:
"outreq_pkey" PRIMARY KEY, btree (output_sequence)
"outreq_input_sequence_idx" btree (input_sequence)
"outreq_reply_input_sequence_idx" btree (reply_input_sequence)
Table "public.inreq"
Column | Type | Collation | Nullable | Default
-------------------------+-----------------------------+-----------+----------+---------
input_sequence | bigint | | not null |
msg_type | character varying(8) | | |
originalrequest_id | bigint | | |
receive_time | timestamp without time zone | | |
related_output_sequence | bigint | | |
msg_status | character varying(15) | | |
Indexes:
"inreq_pkey" PRIMARY KEY, btree (input_sequence)
"inreq_originalrequest_id_idx" btree (originalrequest_id)
postgres=# \d outreq
Table "public.outreq"
Column | Type | Collation | Nullable | Default
------------------------+--------+-----------+----------+---------
output_sequence | bigint | | not null |
input_sequence | bigint | | |
reply_input_sequence | bigint | | |
related_input_sequence | bigint | | |
Indexes:
"outreq_pkey" PRIMARY KEY, btree (output_sequence)
"outreq_input_sequence_idx" btree (input_sequence)
"outreq_reply_input_sequence_idx" btree (reply_input_sequence)
postgres=# SELECT datname, datcollate FROM pg_database WHERE datname = current_database();
 datname  | datcollate
----------+-------------
 postgres | en_US.UTF-8
(1 row)
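As a complementary check for the non-deterministic collation guess, a sketch not posted in the thread:

    -- List any non-deterministic collations defined in the database; an empty
    -- result would suggest the guess about a bad hash due to collation does
    -- not apply here.
    SELECT collname, collprovider, collisdeterministic
    FROM pg_collation
    WHERE NOT collisdeterministic;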
A dump of the two tables above can be found at https://drive.google.com/file/d/1ep1MYjNzlFaICL3GlPZaMdpOxRG6WCGz/view?usp=sharing (compressed size 1GB; size of database after import 14 GB ).
# prepare my_query (timestamp,bigint) as
  SELECT t.input_sequence, rec_tro.output_sequence, r.input_sequence,
         rpl_rec_tro.output_sequence, rpl_snd_tro.output_sequence,
         snd_tro.output_sequence, t.msg_type
  FROM inreq t
  LEFT JOIN outreq rec_tro     ON rec_tro.input_sequence = t.input_sequence
  LEFT JOIN inreq r            ON r.originalRequest_id = t.input_sequence
  LEFT JOIN outreq rpl_rec_tro ON rpl_rec_tro.input_sequence = r.input_sequence
  LEFT JOIN outreq rpl_snd_tro ON rpl_snd_tro.reply_input_sequence = r.input_sequence
  LEFT JOIN outreq snd_tro     ON snd_tro.reply_input_sequence = t.input_sequence
  WHERE t.receive_time < $1
    AND t.input_sequence < $2
    AND t.msg_status IN ('COMPLETED', 'REJECTED')
  ORDER BY t.msg_status DESC, t.input_sequence;
# EXPLAIN (ANALYZE, BUFFERS) EXECUTE my_query('2024-05-18 00:00:00', 202406020168279904);

Best Regards,
Radu