Re: PostgreSQL 14.4 ERROR: out of memory issues - Mailing list pgsql-general

From Aleš Zelený
Subject Re: PostgreSQL 14.4 ERROR: out of memory issues
Date
Msg-id CAODqTUbZG+Fp-NfynbATfGf0y=xsi1XfwBCDQWgXWeAcVOFXZQ@mail.gmail.com
In response to Re: PostgreSQL 14.4 ERROR: out of memory issues  (Michael Paquier <michael@paquier.xyz>)
Responses Re: PostgreSQL 14.4 ERROR: out of memory issues  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: PostgreSQL 14.4 ERROR: out of memory issues  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-general

On Wed, Jul 13, 2022 at 2:20 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
> So far, it has happened three times (during a single week) since the 14.3 ->
> 14.4 upgrade; before 14.4 we hadn't suffered from such an issue.
>
> Questions:
> 1)  Can we safely downgrade from 14.4 to 14.3 by shutting down the instance
> and reinstalling the 14.3 PG packages (to prove that the issue disappears)?
> 2) What is the best way to diagnose what is the root cause?

Hmm.  14.4 has nothing in its release notes that would point to a
change in the vacuum or autovacuum's code paths:
https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4

There is nothing specific after a look at the changes, and I am not
seeing anything that would imply a change in memory context handling
either:
`git log --stat REL_14_3..REL_14_4`
`git diff REL_14_3..REL_14_4 -- *.c`

That said, you should be able to downgrade safely as there are no
changes in the WAL format or the like that would break things.  However,
the corruption issue caused by CONCURRENTLY is something you'd still
have to face.


Thanks, good to know. We can use that for a test case, since we already hit the CONCURRENTLY bug on 14.3.

> 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR:  out of memory
> 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:  Failed on request of size 152094068 in memory context "TopTransactionContext".
> 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:  automatic vacuum of table "prematch.replication.tab_queue_tmp"

This is the interesting part.  Do you happen to use logical
replication in a custom C++ plugin?

We are using logical replication to other instances (pgoutput) and decoderbufs.
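For reference, the decoding plugins actually in use can be cross-checked against the replication slots; a quick sketch ("prematch" being the database shown in \dx below):

    psql -d prematch -c "SELECT slot_name, plugin, active FROM pg_replication_slots;"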
 
> 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG:  could not fork worker process: Cannot allocate memory
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
>
> DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
> terminate called after throwing an instance of 'std::bad_alloc'
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
>   what():  std::bad_alloc
> 2022-07-08 14:54:23 CEST [4476]: [49-1] user=,db=,host=,app= LOG:  background worker "parallel worker" (PID 25251) was terminated by signal 6: Aborted
> 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:  terminating any other active server processes

Looks like something is going wrong in the memory handling of one of
your C++ extensions here.  If you can isolate an issue using a query
without any custom code, that would be a Postgres problem, but I think
that you are overlooking something in that custom code.

Here are the extensions installed in the database served by the cluster:
prematch=# \dx
                                            List of installed extensions
        Name        | Version |   Schema   |                              Description                              
--------------------+---------+------------+------------------------------------------------------------------------
 amcheck            | 1.3     | public     | functions for verifying relation integrity
 dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
 file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
 hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
 hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
 pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
 pg_buffercache     | 1.3     | public     | examine the shared buffer cache
 pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
 pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
 pgcrypto           | 1.3     | public     | cryptographic functions
 pgstattuple        | 1.5     | public     | show tuple-level statistics
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
 plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
 postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
 tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
(15 rows)

When we started experiencing these issues, based on the Prometheus node-exporter statistics, I began collecting /proc/<pid>/status for all PG instance processes in an infinite loop.
From this I can extract some data: it is the RssAnon memory that I can see growing, from several hundred kB to 1-2 GB, over a period that depends on the workload but is usually about 24 hours.
The sessions whose RssAnon memory grows belong to PHP client applications; with multiple such connections they can consume 10 GB of memory in total. It is the same application as before the upgrade (I've asked about deployed changes, but none was released close to the upgrade time).
Comparing with the Prometheus server data, RssAnon seems to directly follow the node_memory_Active_anon_bytes node-exporter metric; when the "suspected" sessions disconnect, node_memory_Active_anon_bytes drops.
I've checked the node_memory_Active_anon_bytes metric before the upgrade: with the same application workload it was pretty stable on PG 13.8 and later on 14.3, but immediately after the upgrade node_memory_Active_anon_bytes started growing, and once the commit limit was reached, the fork()/out-of-memory issues appeared.
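For illustration, the collection loop is roughly the following (a simplified sketch; the file names and the 60 s interval are only placeholders, the real script differs slightly):

    # sample RssAnon of every postgres process once a minute
    while true; do
        ts=$(date '+%F %T')
        for pid in $(pgrep -u postgres); do
            rss=$(awk '/^RssAnon:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
            echo "$ts $pid ${rss:-0} kB" >> /var/tmp/pg_rssanon.log
        done
        # the commit limit is worth recording as well, since that is what we eventually hit
        grep -E 'CommitLimit|Committed_AS' /proc/meminfo >> /var/tmp/pg_meminfo.log
        sleep 60
    done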

So for the application user running the PHP code, I've set log_min_duration_statement = 0, but no unexpected statements or arguments (some functions take an int array argument) were found in the logfile.
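For completeness, that was set per role, roughly like this ("app_user" stands for the real application role):

    psql -d prematch -c "ALTER ROLE app_user SET log_min_duration_statement = 0;"
    # applies to new sessions of that role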

Checking RssAnon from /proc/<pid>/status, I found some points where RssAnon usage grew very steeply within a minute, but no "suspicious" queries/arguments were found in the instance logfile.
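That check is essentially a per-PID delta over the collected samples, something like the following (the ~100 MB threshold is arbitrary; the log format matches the sketch above):

    # flag samples where a backend's RssAnon grew by more than ~100 MB since the previous sample
    sort -k3,3n -k1,2 /var/tmp/pg_rssanon.log | awk '
        $3 == pid && $4 - prev > 100*1024 { print $1, $2, "pid " pid " grew " ($4 - prev)/1024 " MB" }
        { pid = $3; prev = $4 }'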
 
Any hint on how to find the root cause would be appreciated, since so far I've failed to isolate the issue into a reproducible test case.
At least I hope that watching the RssAnon process memory is an appropriate metric; if not, let me know and I'll try to update the monitoring to get at the root cause.
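If it would help, next time a suspected session starts growing I can also dump its memory contexts into the server log; as far as I know pg_log_backend_memory_contexts() is available in 14 (the PID below is just a placeholder):

    psql -d prematch -c "SELECT pg_log_backend_memory_contexts(12345);"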

I can imagine a workaround with a regular reconnect of the client application..., but up to 14.3 it worked, so I'd like to fix the issue either on our application side or on the PG side if it is a PG problem.

Kind regards, Ales Zeleny

