Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 - Mailing list pgsql-bugs

From: Andres Freund
Subject: Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18
Date:
Msg-id: aYuLlzKTVLY9k1zB@alap3.anarazel.de
In response to: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18
List: pgsql-bugs
Hi,

On 2026-02-10 15:28:38 +0000, PG Bug reporting form wrote:
> I recently migrated my cluster with 3 dedicated servers to a new cluster. I
> was running PG12 and I am now on PG18.1.
> I noticed increasing memory usage on all 3 of my nodes, until at some
> point there is no memory left and Patroni crashes on the leader, leaving the
> cluster with no available primary.
> The cluster is a Data Warehouse type using TimescaleDB, ingesting approx. 1M
> time-series records a day.
> It appears that the memory leak affects both the checkpointer and
> startup (WAL replay) processes in PostgreSQL 18.0 and 18.1.
> I never had such an issue on the old cluster with PG12, and the server
> configuration and cluster usage are the same (except for the upgrade of PG).
> 
> SYMPTOMS:
> - Checkpointer process grows to 5.6GB RSS after 24 hours
> - Startup process on replicas grows to 3.9GB RSS
> - Memory growth rate: approximately 160-200MB per hour
> - Eventually causes out-of-memory conditions
> 
> CONFIGURATION:
> - PostgreSQL version: Initially 18.0, upgraded to 18.1 - same issue persists
> - Platform: Debian 13
> - TimescaleDB: 2.23.0
> - Deployment: 3-node Patroni cluster with streaming replication
> - WAL level: logical
> - Hot standby enabled
> 
> SYSTEM RESOURCES:
> RAM: 32GB
> Proc: 12 core of Intel(R) Xeon(R) E-2386G 3.50GHz
> 
> KEY SETTINGS:
> - wal_level: logical
> - hot_standby: on
> - max_wal_senders: 20
> - max_replication_slots: 20
> - wal_keep_size: 1GB
> - shared_buffers: 8GB
> 
> WAL STATISTICS (over 7 days):
> - Total WAL generated: 2.3TB (approximately 31GB/day)
> - Replication lag: 0 bytes (replicas are caught up)
> - No long-running transactions
> 
> MEMORY STATE AFTER 24 HOURS:
> On primary:
>   postgres checkpointer: 3.9GB RSS
> 
> On replicas:
>   postgres checkpointer: 5.6GB RSS
>   postgres startup recovering: 3.9GB RSS  <-- This is abnormal

The RSS slowly increasing towards shared_buffers is normal if you're not using
huge_pages. The OS only counts pages in shared memory as part of RSS once a
page has been used in the process. Over time the checkpointer process touches
more and more of shared_buffers, thus increasing the RSS.

You can use "pmap -d -p $pid_of_process" to see how much of the RSS is
actually shared memory.

To show this, here's a PS for a new backend:

ps:
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
andres   2544694  0.0  0.0 8719956 25744 ?       Ss   14:55   0:00 postgres: dev assert: andres postgres [local] idle

and then after reading in a 1.3GB relation:

andres   2544694  1.7  2.2 8720972 1403576 ?     Ss   14:55   0:00 postgres: dev assert: andres postgres [local] idle

So you can see that RSS increased proportionally to the amount of touched
data.

Whereas with pmap:

pmap -d -p 2544694|tail -n 1
mapped: 8721924K    writeable/private: 5196K    shared: 8646284K


I think you would need to monitor the real memory usage of various processes
to know why you're OOMing.
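On Linux, one rough way to do that (an illustrative sketch, not from the original thread; requires kernel 4.14+ for smaps_rollup) is to sum the Private_* fields in /proc/<pid>/smaps_rollup, which exclude the shared-memory pages that inflate RSS:

```shell
# Sketch: report the private (non-shared) memory of a process.
# Defaults to the current shell's pid for demonstration; pass the
# checkpointer's pid as the first argument to inspect it instead.
pid="${1:-$$}"
awk '/^Private_(Clean|Dirty):/ {sum += $2}
     END {printf "%d kB private\n", sum}' "/proc/$pid/smaps_rollup"
```

A process whose private total (rather than its RSS) keeps growing is a much stronger leak candidate.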

You can use pg_log_backend_memory_contexts() to get the memory usage
information of backend processes.
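For example (a sketch; assumes superuser access and the ability to read the server log, where the dump is written — the pid lookup via pg_stat_activity is my addition, not from the report):

```shell
# Ask the checkpointer to dump its memory contexts to the server log.
# Note: targeting auxiliary processes this way depends on your major
# version; for a regular client backend you can always run e.g.
#   SELECT pg_log_backend_memory_contexts(pg_backend_pid());
psql -X -c "SELECT pg_log_backend_memory_contexts(pid)
              FROM pg_stat_activity
             WHERE backend_type = 'checkpointer';"
```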

Greetings,

Andres Freund


