BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18
| From | PG Bug reporting form |
|---|---|
| Subject | BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 |
| Msg-id | 19400-c889fc7fffc7c658@postgresql.org |
| Responses | Re: BUG #19400: Memory leak in checkpointer and startup processes on PostgreSQL 18 |
| List | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 19400
Logged by: Raphael
Email address: raphael@atmotrack.fr
PostgreSQL version: 18.1
Operating system: Debian 13
Description:
Hello,
I recently migrated my cluster of 3 dedicated servers to a new cluster. I
was running PG12 and am now on PG18.1.
I noticed increasing memory usage on all 3 of my nodes, until at some
point there is no memory left and Patroni crashes on the leader, leaving the
cluster with no available primary.
The cluster is a data warehouse using TimescaleDB, ingesting approx. 1M
time-series rows a day.
It appears that the memory leak is affecting both the checkpointer and
startup (WAL replay) processes in PostgreSQL 18.0 and 18.1.
I never had such an issue on the old cluster with PG12, and the server
configuration and cluster usage are the same (apart from the PG upgrade).
SYMPTOMS:
- Checkpointer process grows to 5.6GB RSS after 24 hours
- Startup process on replicas grows to 3.9GB RSS
- Memory growth rate: approximately 160-200MB per hour
- Eventually causes out-of-memory conditions
CONFIGURATION:
- PostgreSQL version: Initially 18.0, upgraded to 18.1 - same issue persists
- Platform: Debian 13
- TimescaleDB: 2.23.0
- Deployment: 3-node Patroni cluster with streaming replication
- WAL level: logical
- Hot standby enabled
SYSTEM RESOURCES:
RAM: 32GB
CPU: 12 cores (Intel(R) Xeon(R) E-2386G @ 3.50GHz)
KEY SETTINGS:
- wal_level: logical
- hot_standby: on
- max_wal_senders: 20
- max_replication_slots: 20
- wal_keep_size: 1GB
- shared_buffers: 8GB
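These values can be cross-checked against the catalog; a minimal sketch
(standard GUC names only, nothing cluster-specific assumed):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('wal_level', 'hot_standby', 'max_wal_senders',
               'max_replication_slots', 'wal_keep_size', 'shared_buffers');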
WAL STATISTICS (over 7 days):
- Total WAL generated: 2.3TB (approximately 31GB/day)
- Replication lag: 0 bytes (replicas are caught up)
- No long-running transactions
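For reference, the lag and open-transaction figures were checked with queries
along these lines (a sketch, not the exact monitoring queries; the 5-minute
threshold is arbitrary):

-- On the primary: replay lag per standby, in bytes
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Any transaction open longer than 5 minutes?
SELECT pid, now() - xact_start AS xact_age, state
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes';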
MEMORY STATE AFTER 24 HOURS:
On primary:
postgres checkpointer: 3.9GB RSS
On replicas:
postgres checkpointer: 5.6GB RSS
postgres startup recovering: 3.9GB RSS <-- This is abnormal
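The PIDs of these auxiliary processes show up in pg_stat_activity alongside
regular backends, e.g.:

SELECT pid, backend_type, backend_start
FROM pg_stat_activity
WHERE backend_type IN ('checkpointer', 'startup', 'background writer');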
TESTING PERFORMED:
- Disabled wal_log_hints: No effect
- Reduced timescaledb.max_background_workers from 16 to 4: No effect
- Increased checkpoint_timeout to 30min: No effect
- Increased max_wal_size to 4GB: No effect
- Checkpoint frequency is normal (5.99% requested vs timed)
- Upgraded from PostgreSQL 18.0 to 18.1: Same issue persists
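The requested-vs-timed figure above was derived from the checkpointer
statistics view (pg_stat_checkpointer, which on PG 17+ replaced the old
pg_stat_bgwriter checkpoint columns); roughly like this, though the exact
ratio definition may differ from what I used:

SELECT num_timed, num_requested,
       round(100.0 * num_requested / nullif(num_timed, 0), 2)
           AS pct_requested_vs_timed
FROM pg_stat_checkpointer;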
The startup process consuming 3.9GB on replicas is particularly concerning,
as this process should not accumulate memory during normal WAL replay.
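If it helps the investigation, PG 18 is supposed to be able to dump another
process's memory contexts, auxiliary processes included, via
pg_get_process_memory_contexts(). A sketch, assuming a
(pid, summary, timeout_secs) signature and output columns matching
pg_backend_memory_contexts; <startup_pid> is the PID of the "startup
recovering" process found via pg_stat_activity above:

-- <startup_pid>: PID of the startup (WAL replay) process
SELECT name, total_bytes, free_bytes
FROM pg_get_process_memory_contexts(<startup_pid>, false, 5.0)
ORDER BY total_bytes DESC
LIMIT 20;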
Here is some output 4 days after starting Postgres, on the primary:
# ps aux --sort=-%mem | head -20
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 38762 0.0 24.9 8648868 8182260 ? Ss Feb06 2:42 postgres: atmo_data: checkpointer
postgres 38763 0.0 8.3 8646992 2746188 ? Ss Feb06 1:08 postgres: atmo_data: background writer
postgres 38759 0.0 4.5 8646728 1501344 ? Ss Feb06 0:24 postgres: atmo_data: io worker 0
postgres 38760 0.0 2.9 8646728 976132 ? Ss Feb06 0:13 postgres: atmo_data: io worker 1
postgres 38761 0.0 2.3 8646728 755948 ? Ss Feb06 0:07 postgres: atmo_data: io worker 2
The same output on a replica:
# ps aux --sort=-%mem | head -20
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 2706473 0.0 25.2 8653612 8291200 ? Ss Feb06 5:09 postgres: atmo_data: checkpointer
postgres 2706475 0.2 23.8 8656016 7802324 ? Ss Feb06 14:41 postgres: atmo_data: startup recovering 0000001E0000024500000075
Replica:
=# SELECT
name,
pg_size_pretty(total_bytes) as total,
pg_size_pretty(used_bytes) as used,
pg_size_pretty(free_bytes) as free
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 30;
name | total | used | free
----------------------------------+------------+------------+------------
CacheMemoryContext | 512 kB | 483 kB | 29 kB
TopMemoryContext | 214 kB | 192 kB | 23 kB
MessageContext | 128 kB | 39 kB | 89 kB
Timezones | 102 kB | 99 kB | 2672 bytes
GUCMemoryContext | 64 kB | 55 kB | 9248 bytes
WAL record construction | 49 kB | 43 kB | 6400 bytes
ExecutorState | 48 kB | 39 kB | 9104 bytes
TupleSort main | 32 kB | 25 kB | 6848 bytes
ExprContext | 32 kB | 26 kB | 5832 bytes
TransactionAbortContext | 32 kB | 240 bytes | 32 kB
tuplestore tuples | 32 kB | 19 kB | 13 kB
GUC hash table | 32 kB | 24 kB | 8600 bytes
smgr relation table | 32 kB | 15 kB | 17 kB
Type information cache | 24 kB | 21 kB | 2672 bytes
Custom Scan Methods | 16 kB | 9664 bytes | 6720 bytes
PgStat Pending | 16 kB | 9112 bytes | 7272 bytes
pg_get_backend_memory_contexts | 16 kB | 10 kB | 5664 bytes
Relcache by OID | 16 kB | 12 kB | 3608 bytes
PgStat Shared Ref Hash | 9264 bytes | 8552 bytes | 712 bytes
Record information cache | 8192 bytes | 6552 bytes | 1640 bytes
RowDescriptionContext | 8192 bytes | 1272 bytes | 6920 bytes
Hypertable cache | 8192 bytes | 504 bytes | 7688 bytes
ErrorContext | 8192 bytes | 240 bytes | 7952 bytes
Rendezvous variable hash | 8192 bytes | 7576 bytes | 616 bytes
search_path processing cache | 8192 bytes | 2552 bytes | 5640 bytes
Cache pins | 8192 bytes | 240 bytes | 7952 bytes
fn telemetry local function hash | 8192 bytes | 6168 bytes | 2024 bytes
TopTransactionContext | 8192 bytes | 416 bytes | 7776 bytes
MdSmgr | 8192 bytes | 352 bytes | 7840 bytes
PrivateRefCount | 8192 bytes | 5520 bytes | 2672 bytes
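Note that pg_backend_memory_contexts only covers the backend running the
query, which is why the totals above are tiny; the bloated checkpointer and
startup processes are not visible through it. As an alternative to the PG 18
function mentioned above, pg_log_backend_memory_contexts() (present since
PG 14) writes a target process's context breakdown to the server log, though
I am not sure from which version auxiliary processes are accepted as targets:

-- Dumps the memory context breakdown of <pid> into the server log
SELECT pg_log_backend_memory_contexts(<pid>);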
My not-so-good workaround is to perform a daily restart of the cluster
members to avoid breaking Patroni on the primary.
Please let me know if you need any additional information to investigate
this issue.