Thread: Unexplained rapid growth in memory usage of idle backends
Hi,

We have an application that has a preforked worker pool architecture and opens a relatively large (70+) number of persistent connections to PostgreSQL 14, and have `max_connections` set to 200 accordingly.

This has worked well enough in many deployments, and the backends corresponding to these connections mostly sit idle. Occasionally they are hit with a query, and those workloads are almost entirely PL/pgSQL stored functions. These functions mostly just aggregate output from various tables, and occasionally apply some business logic; no temporary tables, no runaway recursive CTEs, nothing baroque.

Occasionally, about once every 2 months, we'll get a slow-motion implosion over a period of about 24 hours, where the resident memory size of some of the backends shoots up from a reasonable few hundred MB to several gigs -- and will keep growing, e.g.

    PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
 782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
 782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
 782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
 782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
 782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
 783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
 783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
 782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
 782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres

It would be only natural to expect a cornucopia of heavyweight queries there, but there aren't any.
`pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a short-lived query in _one_ of these, if I run it often enough.

The state of affairs deteriorates until either

1) the grim OOM reaper comes:

[19063737.017400] Out of memory: Killed process 536356 (postgres) total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB, UID:107 pgtables:42384kB oom_score_adj:0
[19063739.149563] oom_reaper: reaped process 536356 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB

or

2) the client application is restarted, closing the persistent connections and terminating their corresponding backends. This releases the memory back to the OS and all is well again.

For reference:

----- postgresql.auto.conf -----
listen_addresses = '*'
timezone = 'UTC'
max_connections = '200'
effective_cache_size = '2GB'
work_mem = '128MB'
wal_level = 'replica'
fsync = 'on'
synchronous_commit = 'off'
checkpoint_completion_target = '0.75'
checkpoint_warning = '15min'
autovacuum = 'on'
autovacuum_freeze_max_age = '200000000'
vacuum_freeze_min_age = '10000'
vacuum_freeze_table_age = '1000000000'
primary_conninfo = 'user=replication passfile=''/root/.pgpass'' channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
primary_slot_name = 'replication_db5'
log_min_duration_statement = '-1'
max_worker_processes = '4'
max_parallel_workers_per_gather = '4'
shared_buffers = '512MB'
--------------------------------

And:

----- free -----
              total        used        free      shared  buff/cache   available
Mem:       65837856     2623820    47176960      567988    16037076    61989160
Swap:       1000444           0     1000444
----------------

Lastly, I will reiterate that there is no evidence of a slow-brewing memory leak. Memory usage seems more or less steady-state in general, rising and falling in expected ways with peak and off-peak workloads.
Then, suddenly, some of the backends go into the aforementioned nosedive.

I have considered using a connection pooler to limit the overall memory footprint and blast radius, and while this might address the problem from an operational point of view, it does not really resolve the essential question: why is this happening in the first place, seemingly out of nowhere?

I will also say that while the client application does call a lot of stored functions, they are all rather circumscribed in scope, in support of real-time routing decisions. These are not expensive reporting queries as might be issued from an API or a user interface of some kind, for example. The client IPs on the problematic backends above correspond to the client application, not to any other connected clients.

I'm at an utter loss as to how to troubleshoot or prevent this. Any insight would be deeply appreciated!

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
> On 13/05/2023 00:12 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>
> We have an application that has a preforked worker pool architecture and
> opens a relatively large (70+) number of persistent connections to PostgreSQL
> 14, and have `max_connections` set to 200 accordingly.

Which pg14 minor version exactly?

> Occasionally, about once every 2 months, we'll get a slow-motion implosion
> over a period of about 24 hours, where the resident memory size of some of
> the backends shoots up from a reasonable few hundred MB to several gigs --
> and will keep growing, e.g.
>
>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres

The function pg_log_backend_memory_contexts(pid int) may give you some insights
into the memory allocation of specific backends.

https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE

--
Erik
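For illustration, a minimal invocation might look like this; the PID is taken from the top output above and the role name from the later log excerpts, both purely for the sake of example:

```sql
-- Must be run as a superuser; the dump goes to the server log, not the client.
SELECT pg_log_backend_memory_contexts(782930);

-- Or sweep every backend belonging to the application's role:
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE usename = 'eti_adm';
```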
> On May 12, 2023, at 7:38 PM, Erik Wienhold <ewie@ewie.name> wrote:
>
>> On 13/05/2023 00:12 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>>
>> We have an application that has a preforked worker pool architecture and
>> opens a relatively large (70+) number of persistent connections to PostgreSQL
>> 14, and have `max_connections` set to 200 accordingly.
>
> Which pg14 minor version exactly?

# dpkg -l postgresql-14
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version          Architecture Description
+++-==============-================-============-=========================================================
ii  postgresql-14  14.4-1.pgdg110+1 amd64        The World's Most Advanced Open Source Relational Database

# psql -V
psql (PostgreSQL) 14.4 (Debian 14.4-1.pgdg110+1)

>> Occasionally, about once every 2 months, we'll get a slow-motion implosion
>> over a period of about 24 hours, where the resident memory size of some of
>> the backends shoots up from a reasonable few hundred MB to several gigs --
>> and will keep growing, e.g.
>>
>>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
>> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
>> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres
>
> The function pg_log_backend_memory_contexts(pid int) may give you some insights
> into the memory allocation of specific backends.
>
> https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE

Ah, thanks!
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 3; CachedPlanQuery: 2048 total in 2 blocks; 552 free (0 chunks); 1496 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; 828 more child contexts containing 2714624 total in 2097 blocks; 899280 free (329 chunks); 1815344 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; WAL record construction: 49768 total in 2 blocks; 6360 free (0 chunks); 43408 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; MdSmgr: 8192 total in 1 blocks; 5640 free (1 chunks); 2552 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; LOCALLOCK hash: 16384 total in 2 blocks; 4592 free (2 chunks); 11792 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used

But what can I learn from this that might be of applied value?

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
> On 13/05/2023 01:56 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>
> > On May 12, 2023, at 7:38 PM, Erik Wienhold <ewie@ewie.name> wrote:
> >
> > The function pg_log_backend_memory_contexts(pid int) may give you some insights
> > into the memory allocation of specific backends.
> >
> > https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE
>
> Ah, thanks!
>
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 3; CachedPlanQuery: 2048 total in 2 blocks; 552 free (0 chunks); 1496 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; 828 more child contexts containing 2714624 total in 2097 blocks; 899280 free (329 chunks); 1815344 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; WAL record construction: 49768 total in 2 blocks; 6360 free (0 chunks); 43408 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; MdSmgr: 8192 total in 1 blocks; 5640 free (1 chunks); 2552 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; LOCALLOCK hash: 16384 total in 2 blocks; 4592 free (2 chunks); 11792 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used
>
> But what can I learn from this that might be of applied value?

Narrowing down the memory contexts with high allocation.
But I haven't used this function before and would have to look up the memory context names. I bet there are plenty of people on this list who can interpret the output.

But the log messages that you posted are not the entire output. The first line should read "logging memory contexts of PID <pid>".

--
Erik
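To help narrow things down, output like the above can be ranked mechanically. A rough sketch in Python — the regular expression assumes the PostgreSQL 14 log-line format shown above, and `rank_contexts` is a hypothetical helper name, not anything shipped with Postgres:

```python
import re

# Matches per-context lines from pg_log_backend_memory_contexts() output,
# e.g. "level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used"
LINE_RE = re.compile(
    r"level: (?P<level>\d+); (?P<name>.+?): (?P<total>\d+) total in \d+ blocks; "
    r"(?P<free>\d+) free \(\d+ chunks\); (?P<used>\d+) used"
)

def rank_contexts(log_lines):
    """Return (name, used_bytes, total_bytes) tuples, largest 'used' first."""
    rows = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            rows.append((m.group("name"), int(m.group("used")), int(m.group("total"))))
    return sorted(rows, key=lambda r: r[1], reverse=True)

sample = [
    "LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used",
    "LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used",
]
for name, used, total in rank_contexts(sample):
    print(f"{used:>10}  {name}")
```

Aggregate lines ("828 more child contexts containing ...") and the grand total deliberately do not match the pattern, so only named contexts are ranked.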
Alex Balashov <abalashov@evaristesys.com> writes:
> Occasionally, about once every 2 months, we'll get a slow-motion implosion
> over a period of about 24 hours, where the resident memory size of some of
> the backends shoots up from a reasonable few hundred MB to several gigs --
> and will keep growing, e.g.

I'm a little suspicious that the excess memory consumption is a
consequence of JIT-compiled code not getting recycled promptly.
Do you have JIT enabled, and if so, can you turn it off to see
whether the problem persists?

I don't have any concrete idea of why JIT code management would work
fine and then stop working fine, but we do have other reports of
memory leaks from that.

			regards, tom lane
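One low-risk way to test this theory is to check whether JIT is on and disable it; a sketch using standard PostgreSQL settings (the role name is illustrative):

```sql
SHOW jit;  -- enabled by default since PostgreSQL 12

-- Disable instance-wide; no restart required:
ALTER SYSTEM SET jit = off;
SELECT pg_reload_conf();

-- Or disable only for the application's role:
ALTER ROLE eti_adm SET jit = off;
```

Note that already-connected sessions pick up a reloaded setting, but a role-level setting only applies to new connections, which matters with long-lived persistent connections like these.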
What is old_snapshot_threshold set to?
We had an incident where runaway, stacked-up sessions were occurring, which resulted in the system becoming practically useless. Transactions were dropping and sessions were climbing. We used the system profiling tool "perf" along with top to trace the bottleneck down to spinlocks. The large number of spinlocks was bringing the system to its knees. The perf output also showed the "TransactionLimitedForOldSnapshot" function as a significant consumer of resources, which brought to mind a known issue with the parameter old_snapshot_threshold, which had a non-default value in our case. There have been reports in the Postgres community of old_snapshot_threshold bringing a system down completely under load. Have a look at that.
Also, this is really old but worth mentioning anyway: what are Transparent Huge Pages set to at the kernel level? I disable THP regardless of whether the old issues have been resolved, since it was a DB killer at one point.
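For reference, a quick way to check the current THP mode, assuming a Linux host exposing the standard sysfs interface:

```shell
# The bracketed value is the active mode, e.g. "always madvise [never]".
f=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$f" ]; then cat "$f"; else echo "THP sysfs interface not present"; fi
```

Making "never" persistent across reboots is distro-specific (kernel command line or a boot-time unit), so check your distribution's documentation for that part.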
JT
On Fri, May 12, 2023 at 3:12 PM Alex Balashov <abalashov@evaristesys.com> wrote:
> Hi,
>
> We have an application that has a preforked worker pool architecture and opens a relatively large (70+) number of persistent connections to PostgreSQL 14, and have `max_connections` set to 200 accordingly.
>
> This has worked well enough in many deployments, and the backends corresponding to these connections mostly sit idle. Occasionally they are hit with a query, and those workloads are almost entirely PL/pgSQL stored functions. These functions mostly just aggregate output from various tables, and occasionally apply some business logic; no temporary tables, no runaway recursive CTEs, nothing baroque.
>
> Occasionally, about once every 2 months, we'll get a slow-motion implosion over a period of about 24 hours, where the resident memory size of some of the backends shoots up from a reasonable few hundred MB to several gigs -- and will keep growing, e.g.
>
>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres
>
> It would be only natural to expect a cornucopia of heavyweight queries there, but there aren't any. `pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a short-lived query in _one_ of these, if I run it often enough.
>
> The state of affairs deteriorates until either
>
> 1) the grim OOM reaper comes:
>
> [19063737.017400] Out of memory: Killed process 536356 (postgres) total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB, UID:107 pgtables:42384kB oom_score_adj:0
> [19063739.149563] oom_reaper: reaped process 536356 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB
>
> or
>
> 2) the client application is restarted, closing the persistent connections and terminating their corresponding backends. This releases the memory back to the OS and all is well again.
>
> For reference:
>
> ----- postgresql.auto.conf -----
> listen_addresses = '*'
> timezone = 'UTC'
> max_connections = '200'
> effective_cache_size = '2GB'
> work_mem = '128MB'
> wal_level = 'replica'
> fsync = 'on'
> synchronous_commit = 'off'
> checkpoint_completion_target = '0.75'
> checkpoint_warning = '15min'
> autovacuum = 'on'
> autovacuum_freeze_max_age = '200000000'
> vacuum_freeze_min_age = '10000'
> vacuum_freeze_table_age = '1000000000'
> primary_conninfo = 'user=replication passfile=''/root/.pgpass'' channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
> promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
> primary_slot_name = 'replication_db5'
> log_min_duration_statement = '-1'
> max_worker_processes = '4'
> max_parallel_workers_per_gather = '4'
> shared_buffers = '512MB'
> --------------------------------
>
> And:
>
> ----- free -----
>               total        used        free      shared  buff/cache   available
> Mem:       65837856     2623820    47176960      567988    16037076    61989160
> Swap:       1000444           0     1000444
> ----------------
>
> Lastly, I will reiterate that there is no evidence of a slow-brewing memory leak. Memory usage seems more or less steady-state in general, rising and falling in expected ways with peak and off-peak workloads. Then, suddenly, some of the backends go into the aforementioned nosedive.
>
> I have considered using a connection pooler to limit the overall memory footprint and blast radius, and while this might address the problem from an operational point of view, it does not really resolve the essential question: why is this happening in the first place, seemingly out of nowhere?
>
> I will also say that while the client application does call a lot of stored functions, they are all rather circumscribed in scope, in support of real-time routing decisions. These are not expensive reporting queries as might be issued from an API or a user interface of some kind, for example. The client IPs on the problematic backends above correspond to the client application, not to any other connected clients.
>
> I'm at an utter loss as to how to troubleshoot or prevent this. Any insight would be deeply appreciated!
>
> -- Alex
>
> --
> Alex Balashov
> Principal Consultant
> Evariste Systems LLC
> Web: https://evaristesys.com
> Tel: +1-706-510-6800
Thanks,
Jorge Torralba
----------------------------
Note: This communication may contain privileged or other confidential information. If you are not the intended recipient, please do not print, copy, retransmit, disseminate or otherwise use the information. Please indicate to the sender that you have received this email in error and delete the copy you received. Thank You.
> On May 12, 2023, at 9:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alex Balashov <abalashov@evaristesys.com> writes:
>> Occasionally, about once every 2 months, we'll get a slow-motion implosion
>> over a period of about 24 hours, where the resident memory size of some of
>> the backends shoots up from a reasonable few hundred MB to several gigs --
>> and will keep growing, e.g.
>
> I'm a little suspicious that the excess memory consumption is a
> consequence of JIT-compiled code not getting recycled promptly.
> Do you have JIT enabled, and if so, can you turn it off to see
> whether the problem persists?
>
> I don't have any concrete idea of why JIT code management would work
> fine and then stop working fine, but we do have other reports of
> memory leaks from that.

Hi Tom, thanks for your response. JIT is enabled, but I'm not familiar with this feature and am not sure what we are doing that would touch JIT somehow.

Also, am I wrong to imagine that a memory leak would leak memory gradually over time, rather than being set off at some arbitrary point by the same exact workload that has been visited upon the database for the previous 1-2 months hitherto?

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
Hi,

> On May 13, 2023, at 11:09 AM, J T <jorge.torralba@gmail.com> wrote:
>
> What is old_snapshot_threshold set to?

It is set to the default -1.

> We had runaway, stacked-up sessions occurring, which resulted in the system
> becoming practically useless. Transactions were dropping and sessions were
> climbing. We used the system profiling tool "perf" along with top to trace
> the bottleneck down to spinlocks. The large number of spinlocks was bringing
> the system to its knees. The perf output also showed the
> "TransactionLimitedForOldSnapshot" function as a significant consumer of
> resources, which brought to mind a known issue with the parameter
> old_snapshot_threshold, which had a non-default value. There have been
> reports in the Postgres community of old_snapshot_threshold bringing a
> system down completely under load. Have a look at that.

The setting is the default, as per above. But moreover, the base workload is pretty low, while the problem appears suddenly, acutely and without warning despite no change (that we know of) in workload. It makes me think locking is probably not the problem, simply because that would manifest more "across the board".

> Also, this is really old but worth mentioning anyway: what are Transparent
> Huge Pages set to at the kernel level? I disable that regardless of whether
> it has been resolved, since it was a DB killer at one point.

THP is enabled. This system is a relatively stock Debian Buster install and isn't tuned in any special way. The workload simply doesn't require it; it's not that heavy. That's mostly why this issue is surprising; if the database were constantly under resource pressure, it would be more understandable.

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
On Fri, May 12, 2023 at 7:57 PM Alex Balashov <abalashov@evaristesys.com> wrote:
>> 782839 postgres 20 0 2999896 2.7g 551832 S 8.7 4.3 220:54.55 postgres
If you press 'c' within top, it will change the display to show the full process title, which will give you more info. Presumably that info would mostly be redundant with what you can already see in pg_stat_activity, but it is very nice to see the correlation between the CPU, the memory, and this other info. I usually configure my 'top' so that this is the default mode for it to start in.
> `pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a shortlived query in _one_ of these, if I run it often enough.
It is hard to square this with your "top" output. I can see how they could be idle while holding memory, but how could they be idle while exercising that much CPU currently, and having accumulated so much CPU usage? Are you checking pg_stat_activity as an unprivileged user, in which case 'state' will show up as NULL for other users' processes? Maybe you have more than one db server running, and the one accumulating the memory is not the one you are checking pg_stat_activity against? Or maybe you have malware running in the postgres account, and it is just masquerading as postgres but is not actually postgres.
Is the number of 'idle' processes you see in pg_stat_activity matching up with the number of postgres processes you see in 'top'?
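One way to make that comparison concrete — a sketch, to be run as a superuser so every backend's state is visible:

```sql
-- Backend states as the server sees them:
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

-- Compare the total against the OS view, e.g.:
--   ps -C postgres --no-headers | wc -l
-- (expect the OS count to be slightly higher: it includes the postmaster
-- and auxiliary processes such as the checkpointer and WAL writer)
```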
> [about memory context...]
>
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used
>
> But what can I learn from this that might be of applied value?
I would say there is nothing of value there. It looks like that is for the wrong backend. That is, one which was not using much memory at the time. Try again making sure to catch a process which is actually using a lot. Although it could be that the process is truly using a lot of memory but just outside of the PostgreSQL memory accounting system. That could be due to a custom extension, a postgres bug (or perhaps a JIT bug), or malware.
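To make sure the dump targets a backend that is actually bloated, one could pick the PID from the OS rather than from pg_stat_activity — a sketch assuming a procps-style ps:

```shell
# List the top resident-memory processes (RSS in KB, largest first); on the
# database host the bloated postgres backends should surface at the top, and
# the chosen PID can then be fed to pg_log_backend_memory_contexts().
ps -eo pid,rss,comm --sort=-rss | head -n 5
```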
Cheers,
Jeff