Thread: Unexplained rapid growth in memory usage of idle backends
Hi,

We have an application that has a preforked worker pool architecture and opens a relatively large (70+) number of persistent connections to PostgreSQL 14, and have `max_connections` set to 200 accordingly.

This has worked well enough in many deployments, and the backends corresponding to these connections mostly sit idle. Occasionally they are hit with a query, and those workloads are almost entirely PL/pgSQL stored functions. These functions mostly just aggregate output from various tables, and occasionally apply some business logic; no temporary tables, no runaway recursive CTEs, nothing baroque.

Occasionally, about once every 2 months, we'll get a slow-motion implosion over a period of about 24 hours, where the resident memory size of some of the backends shoots up from a reasonable few hundred MB to several gigs -- and will keep growing, e.g.

    PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
 782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
 782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
 782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
 782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
 782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
 783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
 783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
 782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
 782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres

It would be only natural to expect a cornucopia of heavyweight queries there, but there aren't any.
`pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a short-lived query in _one_ of these, if I run it often enough.

The state of affairs deteriorates until either

1) the grim OOM reaper comes:

[19063737.017400] Out of memory: Killed process 536356 (postgres) total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB, UID:107 pgtables:42384kB oom_score_adj:0
[19063739.149563] oom_reaper: reaped process 536356 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB

or

2) the client application is restarted, closing the persistent connections and terminating their corresponding backends. This releases the memory back to the OS and all is well again.

For reference:

----- postgresql.auto.conf -----
listen_addresses = '*'
timezone = 'UTC'
max_connections = '200'
effective_cache_size = '2GB'
work_mem = '128MB'
wal_level = 'replica'
fsync = 'on'
synchronous_commit = 'off'
checkpoint_completion_target = '0.75'
checkpoint_warning = '15min'
autovacuum = 'on'
autovacuum_freeze_max_age = '200000000'
vacuum_freeze_min_age = '10000'
vacuum_freeze_table_age = '1000000000'
primary_conninfo = 'user=replication passfile=''/root/.pgpass'' channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
primary_slot_name = 'replication_db5'
log_min_duration_statement = '-1'
max_worker_processes = '4'
max_parallel_workers_per_gather = '4'
shared_buffers = '512MB'
--------------------------------

And:

----- free -----
              total        used        free      shared  buff/cache   available
Mem:       65837856     2623820    47176960      567988    16037076    61989160
Swap:       1000444           0     1000444
----------------

Lastly, I will reiterate that there is no evidence of a slow-brewing memory leak. Memory usage seems more or less steady-state in general, rising and falling in expected ways with peak and off-peak workloads.
Then, suddenly, some of the backends go into the aforementioned nosedive.

I have considered using a connection pooler to limit the overall memory footprint and blast radius, and while this might address the problem from an operational point of view, it does not really resolve the essential question: why is this happening in the first place, seemingly out of nowhere?

I will also say that while the client application does call a lot of stored functions, they are all rather circumscribed in scope, in support of real-time routing decisions. These are not expensive reporting queries as might be issued from an API or a user interface of some kind, for example. The client IPs on the problematic backends above correspond to the client application, not to any other connected clients.

I'm at an utter loss as to how to troubleshoot or prevent this. Any insight would be deeply appreciated!

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
> On 13/05/2023 00:12 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>
> We have an application that has a preforked worker pool architecture and
> opens a relatively large (70+) number of persistent connections to PostgreSQL
> 14, and have `max_connections` set to 200 accordingly.

Which pg14 minor version exactly?

> Occasionally, about once every 2 months, we'll get a slow-motion implosion
> over a period of about 24 hours, where the resident memory size of some of
> the backends shoots up from a reasonable few hundred MB to several gigs --
> and will keep growing, e.g.
>
>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres

The function pg_log_backend_memory_contexts(pid int) may give you some insights
into the memory allocation of specific backends.

https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE

--
Erik
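For illustration, a minimal invocation might look like this; the PID is taken from the top output above and the role name from the later log excerpts, both purely for the sake of example:

```sql
-- Must be run as a superuser; the dump goes to the server log, not the client.
SELECT pg_log_backend_memory_contexts(782930);

-- Or sweep every backend belonging to the application's role:
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE usename = 'eti_adm';
```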
> On May 12, 2023, at 7:38 PM, Erik Wienhold <ewie@ewie.name> wrote:
>
>> On 13/05/2023 00:12 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>>
>> We have an application that has a preforked worker pool architecture and
>> opens a relatively large (70+) number of persistent connections to PostgreSQL
>> 14, and have `max_connections` set to 200 accordingly.
>
> Which pg14 minor version exactly?

# dpkg -l postgresql-14
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version          Architecture Description
+++-==============-================-============-=========================================================
ii  postgresql-14  14.4-1.pgdg110+1 amd64        The World's Most Advanced Open Source Relational Database

# psql -V
psql (PostgreSQL) 14.4 (Debian 14.4-1.pgdg110+1)

>> Occasionally, about once every 2 months, we'll get a slow-motion implosion
>> over a period of about 24 hours, where the resident memory size of some of
>> the backends shoots up from a reasonable few hundred MB to several gigs --
>> and will keep growing, e.g.
>>
>>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
>> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
>> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres
>
> The function pg_log_backend_memory_contexts(pid int) may give you some insights
> into the memory allocation of specific backends.
>
> https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE

Ah, thanks!
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 3; CachedPlanQuery: 2048 total in 2 blocks; 552 free (0 chunks); 1496 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; 828 more child contexts containing 2714624 total in 2097 blocks; 899280 free (329 chunks); 1815344 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; WAL record construction: 49768 total in 2 blocks; 6360 free (0 chunks); 43408 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; MdSmgr: 8192 total in 1 blocks; 5640 free (1 chunks); 2552 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; LOCALLOCK hash: 16384 total in 2 blocks; 4592 free (2 chunks); 11792 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used
2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used

But what can I learn from this that might be of applied value?

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
> On 13/05/2023 01:56 CEST Alex Balashov <abalashov@evaristesys.com> wrote:
>
> > On May 12, 2023, at 7:38 PM, Erik Wienhold <ewie@ewie.name> wrote:
> >
> > The function pg_log_backend_memory_contexts(pid int) may give you some insights
> > into the memory allocation of specific backends.
> >
> > https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL-TABLE
>
> Ah, thanks!
>
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 3; CachedPlanQuery: 2048 total in 2 blocks; 552 free (0 chunks); 1496 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; 828 more child contexts containing 2714624 total in 2097 blocks; 899280 free (329 chunks); 1815344 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; WAL record construction: 49768 total in 2 blocks; 6360 free (0 chunks); 43408 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; MdSmgr: 8192 total in 1 blocks; 5640 free (1 chunks); 2552 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; LOCALLOCK hash: 16384 total in 2 blocks; 4592 free (2 chunks); 11792 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used
>
> But what can I learn from this that might be of applied value?

Narrowing down the memory contexts with high allocation.
But I haven't used this function before and would have to look up the memory context names. I bet there are plenty of people on this list who can interpret the output.

But the log messages that you posted are not the entire output. The first line should read "logging memory contexts of PID <pid>".

--
Erik
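To help narrow things down, output like the above can be ranked mechanically. A rough sketch in Python — the regular expression assumes the PostgreSQL 14 log-line format shown above, and `rank_contexts` is a hypothetical helper name, not anything shipped with Postgres:

```python
import re

# Matches per-context lines from pg_log_backend_memory_contexts() output,
# e.g. "level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used"
LINE_RE = re.compile(
    r"level: (?P<level>\d+); (?P<name>.+?): (?P<total>\d+) total in \d+ blocks; "
    r"(?P<free>\d+) free \(\d+ chunks\); (?P<used>\d+) used"
)

def rank_contexts(log_lines):
    """Return (name, used_bytes, total_bytes) tuples, largest 'used' first."""
    rows = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            rows.append((m.group("name"), int(m.group("used")), int(m.group("total"))))
    return sorted(rows, key=lambda r: r[1], reverse=True)

sample = [
    "LOG:  level: 2; SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used",
    "LOG:  level: 1; Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used",
]
for name, used, total in rank_contexts(sample):
    print(f"{used:>10}  {name}")
```

Aggregate lines ("828 more child contexts containing ...") and the grand total deliberately do not match the pattern, so only named contexts are ranked.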
Alex Balashov <abalashov@evaristesys.com> writes:
> Occasionally, about once every 2 months, we'll get a slow-motion implosion
> over a period of about 24 hours, where the resident memory size of some of
> the backends shoots up from a reasonable few hundred MB to several gigs --
> and will keep growing, e.g.

I'm a little suspicious that the excess memory consumption is a
consequence of JIT-compiled code not getting recycled promptly.
Do you have JIT enabled, and if so, can you turn it off to see
whether the problem persists?

I don't have any concrete idea of why JIT code management would work
fine and then stop working fine, but we do have other reports of
memory leaks from that.

			regards, tom lane
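One low-risk way to test this theory is to check whether JIT is on and disable it; a sketch using standard PostgreSQL settings (the role name is illustrative):

```sql
SHOW jit;  -- enabled by default since PostgreSQL 12

-- Disable instance-wide; no restart required:
ALTER SYSTEM SET jit = off;
SELECT pg_reload_conf();

-- Or disable only for the application's role:
ALTER ROLE eti_adm SET jit = off;
```

Note that already-connected sessions pick up a reloaded setting, but a role-level setting only applies to new connections, which matters with long-lived persistent connections like these.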
What is old_snapshot_threshold set to?
We had an incident where runaway, stacked-up sessions were occurring, which resulted in the system becoming practically useless. Transactions were dropping and sessions were climbing. We used the system profiling tool "perf" along with top to trace the bottleneck down to spinlocks. The large number of spinlocks was bringing the system to its knees. The perf output also showed the "TransactionLimitedForOldSnapshot" function as a significant consumer of resources, which brought to mind a known issue with the parameter old_snapshot_threshold, which had a non-default value in our case. There have been reports in the Postgres community of old_snapshot_threshold bringing a system down completely under load. Have a look at that.
Also, this is really old but worth mentioning anyway: what are Transparent Huge Pages set to at the kernel level? I disable THP regardless of whether the old issues have been resolved, since it was a DB killer at one point.
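For reference, a quick way to check the current THP mode, assuming a Linux host exposing the standard sysfs interface:

```shell
# The bracketed value is the active mode, e.g. "always madvise [never]".
f=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$f" ]; then cat "$f"; else echo "THP sysfs interface not present"; fi
```

Making "never" persistent across reboots is distro-specific (kernel command line or a boot-time unit), so check your distribution's documentation for that part.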
JT
On Fri, May 12, 2023 at 3:12 PM Alex Balashov <abalashov@evaristesys.com> wrote:
> Hi,
>
> We have an application that has a preforked worker pool architecture and opens a relatively large (70+) number of persistent connections to PostgreSQL 14, and have `max_connections` set to 200 accordingly.
>
> This has worked well enough in many deployments, and the backends corresponding to these connections mostly sit idle. Occasionally they are hit with a query, and those workloads are almost entirely PL/pgSQL stored functions. These functions mostly just aggregate output from various tables, and occasionally apply some business logic; no temporary tables, no runaway recursive CTEs, nothing baroque.
>
> Occasionally, about once every 2 months, we'll get a slow-motion implosion over a period of about 24 hours, where the resident memory size of some of the backends shoots up from a reasonable few hundred MB to several gigs -- and will keep growing, e.g.
>
>     PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 3932733 postgres 20  0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>  782930 postgres 20  0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>  782839 postgres 20  0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
> 3932734 postgres 20  0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
> 3932735 postgres 20  0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>  782851 postgres 20  0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>  782855 postgres 20  0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>  782897 postgres 20  0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>  783106 postgres 20  0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>  783168 postgres 20  0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>  782843 postgres 20  0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>  782778 postgres 20  0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres
>
> It would be only natural to expect a cornucopia of heavyweight queries there, but there aren't any. `pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a short-lived query in _one_ of these, if I run it often enough.
>
> The state of affairs deteriorates until either
>
> 1) the grim OOM reaper comes:
>
> [19063737.017400] Out of memory: Killed process 536356 (postgres) total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB, UID:107 pgtables:42384kB oom_score_adj:0
> [19063739.149563] oom_reaper: reaped process 536356 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB
>
> or
>
> 2) the client application is restarted, closing the persistent connections and terminating their corresponding backends. This releases the memory back to the OS and all is well again.
>
> For reference:
>
> ----- postgresql.auto.conf -----
> listen_addresses = '*'
> timezone = 'UTC'
> max_connections = '200'
> effective_cache_size = '2GB'
> work_mem = '128MB'
> wal_level = 'replica'
> fsync = 'on'
> synchronous_commit = 'off'
> checkpoint_completion_target = '0.75'
> checkpoint_warning = '15min'
> autovacuum = 'on'
> autovacuum_freeze_max_age = '200000000'
> vacuum_freeze_min_age = '10000'
> vacuum_freeze_table_age = '1000000000'
> primary_conninfo = 'user=replication passfile=''/root/.pgpass'' channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
> promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
> primary_slot_name = 'replication_db5'
> log_min_duration_statement = '-1'
> max_worker_processes = '4'
> max_parallel_workers_per_gather = '4'
> shared_buffers = '512MB'
> --------------------------------
>
> And:
>
> ----- free -----
>               total        used        free      shared  buff/cache   available
> Mem:       65837856     2623820    47176960      567988    16037076    61989160
> Swap:       1000444           0     1000444
> ----------------
>
> Lastly, I will reiterate that there is no evidence of a slow-brewing memory leak. Memory usage seems more or less steady-state in general, rising and falling in expected ways with peak and off-peak workloads. Then, suddenly, some of the backends go into the aforementioned nosedive.
>
> I have considered using a connection pooler to limit the overall memory footprint and blast radius, and while this might address the problem from an operational point of view, it does not really resolve the essential question: why is this happening in the first place, seemingly out of nowhere?
>
> I will also say that while the client application does call a lot of stored functions, they are all rather circumscribed in scope, in support of real-time routing decisions. These are not expensive reporting queries as might be issued from an API or a user interface of some kind, for example. The client IPs on the problematic backends above correspond to the client application, not to any other connected clients.
>
> I'm at an utter loss as to how to troubleshoot or prevent this. Any insight would be deeply appreciated!
>
> -- Alex
>
> --
> Alex Balashov
> Principal Consultant
> Evariste Systems LLC
> Web: https://evaristesys.com
> Tel: +1-706-510-6800
Thanks,
Jorge Torralba
----------------------------
Note: This communication may contain privileged or other confidential information. If you are not the intended recipient, please do not print, copy, retransmit, disseminate or otherwise use the information. Please indicate to the sender that you have received this email in error and delete the copy you received. Thank You.
> On May 12, 2023, at 9:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alex Balashov <abalashov@evaristesys.com> writes:
>> Occasionally, about once every 2 months, we'll get a slow-motion implosion
>> over a period of about 24 hours, where the resident memory size of some of
>> the backends shoots up from a reasonable few hundred MB to several gigs --
>> and will keep growing, e.g.
>
> I'm a little suspicious that the excess memory consumption is a
> consequence of JIT-compiled code not getting recycled promptly.
> Do you have JIT enabled, and if so, can you turn it off to see
> whether the problem persists?
>
> I don't have any concrete idea of why JIT code management would work
> fine and then stop working fine, but we do have other reports of
> memory leaks from that.

Hi Tom, thanks for your response. JIT is enabled, but I'm not familiar with this feature and am not sure what we are doing that would touch JIT somehow.

Also, am I wrong to imagine that a memory leak would leak memory gradually over time, rather than being set off at some arbitrary point by the same exact workload that has been visited upon the database for the previous 1-2 months hitherto?

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
Hi,

> On May 13, 2023, at 11:09 AM, J T <jorge.torralba@gmail.com> wrote:
>
> What is old_snapshot_threshold set to?

It is set to the default -1.

> We had runaway, stacked-up sessions occurring, which resulted in the system
> becoming practically useless. Transactions were dropping and sessions were
> climbing. We used the system profiling tool "perf" along with top to trace
> the bottleneck down to spinlocks. The large number of spinlocks was bringing
> the system to its knees. The perf output also showed the
> "TransactionLimitedForOldSnapshot" function as a significant consumer of
> resources, which brought to mind a known issue with the parameter
> old_snapshot_threshold, which had a non-default value. There have been
> reports in the Postgres community of old_snapshot_threshold bringing a
> system down completely under load. Have a look at that.

The setting is the default, as per above. But moreover, the base workload is pretty low, while the problem appears suddenly, acutely and without warning despite no change (that we know of) in workload. It makes me think locking is probably not the problem, simply because that would manifest more "across the board".

> Also, this is really old but worth mentioning anyway: what are Transparent
> Huge Pages set to at the kernel level? I disable that regardless of whether
> it has been resolved, since it was a DB killer at one point.

THP is enabled. This system is a relatively stock Debian Buster install and isn't tuned in any special way. The workload simply doesn't require it; it's not that heavy. That's mostly why this issue is surprising; if the database were constantly under resource pressure, it would be more understandable.

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
On Fri, May 12, 2023 at 7:57 PM Alex Balashov <abalashov@evaristesys.com> wrote:
>> 782839 postgres 20 0 2999896 2.7g 551832 S 8.7 4.3 220:54.55 postgres
If you press 'c' within top, it will change the display to show the full process title, which will give you more info. Presumably that info would mostly be redundant with what you can already see in pg_stat_activity, but it is very nice to see the correlation between the CPU, the memory, and this other info. I usually configure my 'top' so that this is the default mode for it to start in.
> `pg_stat_activity` just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle') itself. Once in a while, I'll catch a shortlived query in _one_ of these, if I run it often enough.
It is hard to square this with your "top" output. I can see how they could be idle while holding memory, but how could they be idle while exercising that much CPU currently, and having accumulated so much CPU usage? Are you checking pg_stat_activity as an unprivileged user, in which case 'state' will show up as NULL for other users' processes? Maybe you have more than one db server running, and the one accumulating the memory is not the one you are checking pg_stat_activity against? Or maybe you have malware running in the postgres account, and it is just masquerading as postgres but is not actually postgres.
Is the number of 'idle' processes you see in pg_stat_activity matching up with the number of postgres processes you see in 'top'?
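One way to make that comparison concrete — a sketch, to be run as a superuser so every backend's state is visible:

```sql
-- Backend states as the server sees them:
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

-- Compare the total against the OS view, e.g.:
--   ps -C postgres --no-headers | wc -l
-- (expect the OS count to be slightly higher: it includes the postmaster
-- and auxiliary processes such as the checkpointer and WAL writer)
```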
> [about memory context...]
>
> 2023-05-12 23:53:23.293 UTC [3934125] eti_adm@eti LOG:  Grand total: 6417232 bytes in 2461 blocks; 2409920 free (468 chunks); 4007312 used
>
> But what can I learn from this that might be of applied value?
I would say there is nothing of value there. It looks like that is for the wrong backend. That is, one which was not using much memory at the time. Try again making sure to catch a process which is actually using a lot. Although it could be that the process is truly using a lot of memory but just outside of the PostgreSQL memory accounting system. That could be due to a custom extension, a postgres bug (or perhaps a JIT bug), or malware.
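To make sure the dump targets a backend that is actually bloated, one could pick the PID from the OS rather than from pg_stat_activity — a sketch assuming a procps-style ps:

```shell
# List the top resident-memory processes (RSS in KB, largest first); on the
# database host the bloated postgres backends should surface at the top, and
# the chosen PID can then be fed to pg_log_backend_memory_contexts().
ps -eo pid,rss,comm --sort=-rss | head -n 5
```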
Cheers,
Jeff