Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 - Mailing list pgsql-bugs
| From | Andres Freund |
|---|---|
| Subject | Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 |
| Date | |
| Msg-id | aYuLlzKTVLY9k1zB@alap3.anarazel.de |
| In response to | BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 (PG Bug reporting form <noreply@postgresql.org>) |
| Responses | Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 |
| List | pgsql-bugs |
Hi,

On 2026-02-10 15:28:38 +0000, PG Bug reporting form wrote:

> I recently migrated my cluster with 3 dedicated servers to a new cluster. I
> was running on PG12 and I am now on PG18.1.
> I noticed increasing memory usage on all 3 of my nodes, until at some
> point there is no memory left and Patroni crashes on the leader, leaving the
> cluster with no available primary.
> The cluster is a Data Warehouse type using TimescaleDB, ingesting approx. 1M
> time-series rows a day.
> It appears that the memory leak is affecting both the checkpointer and
> startup (WAL replay) processes in PostgreSQL 18.0 and 18.1.
> I never had such an issue on the old cluster with PG12, and the server's
> configuration and cluster usage are the same (except for the upgrade of PG).
>
> SYMPTOMS:
> - Checkpointer process grows to 5.6GB RSS after 24 hours
> - Startup process on replicas grows to 3.9GB RSS
> - Memory growth rate: approximately 160-200MB per hour
> - Eventually causes out-of-memory conditions
>
> CONFIGURATION:
> - PostgreSQL version: Initially 18.0, upgraded to 18.1 - same issue persists
> - Platform: Debian 13
> - TimescaleDB: 2.23.0
> - Deployment: 3-node Patroni cluster with streaming replication
> - WAL level: logical
> - Hot standby enabled
>
> SYSTEM RESOURCES:
> - RAM: 32GB
> - CPU: 12 cores, Intel(R) Xeon(R) E-2386G 3.50GHz
>
> KEY SETTINGS:
> - wal_level: logical
> - hot_standby: on
> - max_wal_senders: 20
> - max_replication_slots: 20
> - wal_keep_size: 1GB
> - shared_buffers: 8GB
>
> WAL STATISTICS (over 7 days):
> - Total WAL generated: 2.3TB (approximately 31GB/day)
> - Replication lag: 0 bytes (replicas are caught up)
> - No long-running transactions
>
> MEMORY STATE AFTER 24 HOURS:
> On primary:
> postgres checkpointer: 3.9GB RSS
>
> On replicas:
> postgres checkpointer: 5.6GB RSS
> postgres startup recovering: 3.9GB RSS <-- This is abnormal

RSS slowly increasing towards shared_buffers is normal if you're not using huge_pages.

The OS only counts pages in shared memory as part of a process's RSS once the page has been touched by that process. Over time the checkpointer touches more and more of shared_buffers, thereby increasing its RSS.

You can use "pmap -d -p $pid_of_process" to see how much of the RSS is actually shared memory.

To show this, here's ps output for a new backend:

```
USER       PID %CPU %MEM     VSZ   RSS TTY STAT START TIME COMMAND
andres 2544694  0.0  0.0 8719956 25744 ?   Ss   14:55 0:00 postgres: dev assert: andres postgres [local] idle
```

and then after reading in a 1.3GB relation:

```
andres 2544694  1.7  2.2 8720972 1403576 ? Ss  14:55 0:00 postgres: dev assert: andres postgres [local] idle
```

So you can see that RSS increased proportionally with the amount of data touched. Whereas pmap shows:

```
$ pmap -d -p 2544694 | tail -n 1
mapped: 8721924K    writeable/private: 5196K    shared: 8646284K
```

I think you would need to monitor the real memory usage of the various processes to know why you're OOMing. You can use pg_log_backend_memory_contexts() to get memory usage information for backend processes.

Greetings,

Andres Freund
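Instead of eyeballing pmap output, the same shared/private breakdown can be pulled from the per-process `smaps_rollup` file that Linux exposes under `/proc` (fields `Rss`, `Shared_Clean`, `Shared_Dirty`, `Private_Clean`, `Private_Dirty`, all in kB). A minimal sketch; the `split_rss` helper and the sample numbers below are illustrative, not from the thread:

```python
# Sketch: split a process's RSS into shared vs. private portions from the
# kB-valued fields in /proc/<pid>/smaps_rollup (Linux). If the shared part
# accounts for almost all of RSS, the growth is just shared_buffers pages
# being touched, not a per-process leak.

def split_rss(smaps_rollup_text: str) -> dict:
    """Return RSS broken down into shared and private kB."""
    fields = {}
    for line in smaps_rollup_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    shared = fields.get("Shared_Clean", 0) + fields.get("Shared_Dirty", 0)
    private = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    return {"rss_kb": fields.get("Rss", 0),
            "shared_kb": shared,
            "private_kb": private}

# Made-up sample resembling a backend that has touched most of an 8GB
# shared_buffers segment (totals chosen to match the pmap output above):
sample = """\
Rss:             8651480 kB
Shared_Clean:     123456 kB
Shared_Dirty:    8522828 kB
Private_Clean:       100 kB
Private_Dirty:      5096 kB
"""
print(split_rss(sample))
```

On a live system you would feed it `open(f"/proc/{pid}/smaps_rollup").read()` for the checkpointer's pid and watch `private_kb` over time; only growth there points at an actual leak.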