Re: Postgresql OOM - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Postgresql OOM
Msg-id: 20240612004920.3aadplpg7ars7lrm@awork3.anarazel.de
In response to: Re: Postgresql OOM (Radu Radutiu <rradutiu@gmail.com>)
List: pgsql-hackers
Hi,

On 2024-06-07 21:42:58 +0300, Radu Radutiu wrote:
> On Fri, Jun 7, 2024 at 7:59 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> > > I have a query that forces an out of memory error, where the OS will kill
> > > the postgresql process.
> >
> > FWIW, it can be useful to configure the OS with strict memory overcommit.
> > That
> > causes postgres to fail more gracefully, because the OOM killer won't be
> > invoked.
> >
> >
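(For reference, "strict memory overcommit" on Linux is controlled by the vm.overcommit sysctls; a sketch, with the ratio value being a tunable rather than a recommendation:)

```
# /etc/sysctl.conf fragment (sketch): strict overcommit accounting.
# Mode 2 makes allocations fail (malloc returns NULL) instead of
# triggering the OOM killer, so postgres can report an ERROR cleanly.
vm.overcommit_memory = 2
# Fraction of RAM (plus all swap) that may be committed; tune for
# your machine.
vm.overcommit_ratio = 80
```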
> > > The query plan (run immediately after a vacuum analyze) is at
> > > https://explain.depesz.com/s/ITQI#html .
> >
> > Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the
> > number of
> > workers? It'd be useful to get some of the information about the actual
> > numbers of tuples etc.

> I've tried first giving more memory to the OS and mounting a tmpfs
> in pgsql_tmp. It didn't work, I got
> ERROR:  invalid DSA memory alloc request size 1_140_850_688
> CONTEXT:  parallel worker
> I've seen around 2 million temporary files created before the crash.
> With work_mem 100MB I was not able to get it to work with 2 parallel
> workers.
> Next, I've increased work_mem to 200MB and now (with extra memory and
> tmpfs) it finished: https://explain.depesz.com/s/NnRC

That's helpful, thanks!

One thing to note is that postgres' work_mem is confusing: it applies to
individual query execution nodes, not to the whole query. Additionally, when
you use parallel workers, each parallel worker can use the "full" work_mem,
rather than work_mem being evenly distributed across the workers.
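(As a sketch, mirroring the settings from this thread:)

```sql
SET work_mem = '200MB';
SET max_parallel_workers_per_gather = 4;
-- A single parallel-aware Hash node may now grow to roughly
--   (1 leader + 4 workers) * 200MB = 1GB,
-- and every hashing/sorting node in the plan gets its own such
-- allowance, so a plan with several hash joins multiplies that again.
```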

Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely unexpected
(I'd say it's unreasonable if you're not a postgres dev, but that's a
different issue).

We can see that each of the Hash nodes uses ~1GB, which is due to
(1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.

In this specific query we probably could free the memory in the "lower" hash
join nodes once the node directly above has finished building, but we don't
have the logic for that today.


Of course that doesn't explain why the memory usage / temp file creation is so
extreme with a lower limit / fewer workers.  There aren't any bad statistics
afaics, nor can I really see the potential for a significant skew; it looks to
me like the hashtables are all built on a single text field and that nearly
all the input rows are distinct.


Could you post the table definition (\d+) and the database definition
(\l)? One random guess I have is that you ended up with a "non-deterministic"
collation for that column and that we end up with a bad hash due to that.
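(In case it helps, a sketch of how to check; `t` / `col` are placeholder
names for your table and the join column:)

```sql
-- Which collation does the join column use?
SELECT attname, attcollation::regcollation AS collation
FROM pg_attribute
WHERE attrelid = 't'::regclass AND attname = 'col';

-- Is a given collation deterministic? (column exists since PG 12)
SELECT collname, collisdeterministic
FROM pg_collation
WHERE collname = 'my_collation';  -- placeholder name

-- Database-level collation settings (what \l shows):
SELECT datname, datcollate, datctype FROM pg_database;
```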



Greetings,

Andres Freund
