BUG #19411: libpq 16.x exhibits a memory leak when connections are repeatedly created and destroyed - Mailing list pgsql-bugs
| From | PG Bug reporting form |
|---|---|
| Subject | BUG #19411: libpq 16.x exhibits a memory leak when connections are repeatedly created and destroyed |
| Msg-id | 19411-0440c8897a04b638@postgresql.org |
| List | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 19411
Logged by: Matt Carter
Email address: mcarter@twosigma.com
PostgreSQL version: 16.7
Operating system: Debian GNU/Linux 12 with kernel 6.12.51-ts1-amd64
Description:
This report is for the libpq client library for PostgreSQL.
I noticed the issue because my Apache Airflow task supervisor processes were
experiencing memory growth of ~600 KB/min. When running hundreds of tasks,
my 32 GB machine OOM-crashed every few hours. I traced the problem to when
I upgraded libpq from version 13 to 16. Varying the Python version and the
PostgreSQL server version didn't make a difference. When I updated to psycopg3,
the leak rate dropped dramatically.
Confirmed leaking:
libpq 16.0.4
libpq 16.0.7
Confirmed NOT leaking:
libpq 13.0.11
libpq 13.11
My environment for the below reproducer is:
Client libpq: 16.0.4 or 16.0.7 (runtime version)
Python: 3.10.16 or 3.11.11
psycopg2: 2.9.10
SQLAlchemy: 1.4.46
PostgreSQL Server: 13.11 OR 16.7 (both affected equally)
Platform: Linux, K8s containers
Reliable reproducer:
```python
#!/usr/bin/env python3
"""
Reproduces ~800 KB/min memory leak in libpq 16.x
MUST monitor with 'ps aux', NOT Python's resource module!
"""
import datetime, os, subprocess
from time import sleep

from sqlalchemy import Column, Integer, types, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.pool import NullPool

# Verify libpq version
import psycopg2
print(f"psycopg2 compiled against libpq: {psycopg2.__libpq_version__}")

# Check ACTUAL runtime libpq (critical!)
so_path = (os.path.dirname(psycopg2.__file__)
           + "/_psycopg.cpython-311-x86_64-linux-gnu.so")
result = subprocess.run(["ldd", so_path], capture_output=True, text=True)
for line in result.stdout.split('\n'):
    if 'libpq.so' in line:
        print(f"ACTUAL runtime libpq: {line.split('=>')[1].split('(')[0].strip()}")

print(f"\nMonitor this PID: {os.getpid()}")
print("Run in another terminal: watch -n 10 'ps -p <PID> -o pid,rss,vsz,cmd'")
print()

# Setup SQLAlchemy with NullPool (creates/destroys connections)
Base = declarative_base(metadata=None)

class TestTable(Base):
    __tablename__ = "test_leak_table"
    id = Column(Integer, primary_key=True)
    updated_at = Column(types.DateTime(timezone=True))

obj = TestTable()
engine = create_engine(
    "postgresql://your-server/your-database",
    poolclass=NullPool,  # Critical: no pooling, create/destroy each time
    encoding="utf-8",
)
Session = scoped_session(sessionmaker(
    autocommit=False,
    autoflush=False,
    bind=engine,
    expire_on_commit=False,
))
session = Session()
session.add(obj)

# Monitor memory using ps aux (CRITICAL - internal monitoring doesn't show leak!)
def get_rss_kb():
    """Get RSS from ps aux - DO NOT use resource.getrusage()!"""
    pid = os.getpid()
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True)
    for line in result.stdout.split('\n'):
        if str(pid) in line and 'python' in line:
            return int(line.split()[5])  # RSS is the 6th column of ps aux
    return None

print("Starting 10-minute test...")
print(f"{'Time(s)':<8} | {'Iteration':<10} | {'RSS(KB)':<10} | {'Growth(KB)':<10}")
print("-" * 55)

start_time = datetime.datetime.now()
baseline_rss = None
for i in range(1, 601):  # 600 iterations = 10 minutes
    # Update and commit (creates/destroys a connection with NullPool)
    obj.updated_at = datetime.datetime.now(datetime.timezone.utc)
    session.commit()
    # Report every 30 iterations
    if i % 30 == 0:
        rss = get_rss_kb()
        if rss:
            if i == 30:
                baseline_rss = rss
                growth = 0
            else:
                growth = rss - baseline_rss
            elapsed = (datetime.datetime.now() - start_time).total_seconds()
            print(f"{int(elapsed):<8} | {i:<10} | {rss:<10} | {growth:<10}",
                  flush=True)
    sleep(1)

# Final results
final_rss = get_rss_kb()
total_elapsed = (datetime.datetime.now() - start_time).total_seconds()
growth_total = final_rss - baseline_rss
leak_rate = growth_total / (total_elapsed / 60)
print()
print("=" * 60)
print(f"Baseline RSS (t=30s): {baseline_rss} KB")
print(f"Final RSS (t=600s):   {final_rss} KB")
print(f"Total Growth:         {growth_total} KB")
print(f"Leak Rate:            {leak_rate:.1f} KB/min")
print("=" * 60)
```
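As a side note on version reporting: `psycopg2.__libpq_version__` is an integer of the form major * 10000 + minor for PostgreSQL 10 and later, and the runtime library can also report its own version via libpq's `PQlibVersion()`, which avoids parsing `ldd` output. A small sketch (the ctypes part assumes a discoverable `libpq` shared library and is skipped otherwise):

```python
import ctypes, ctypes.util

def decode_libpq_version(v):
    # psycopg2.__libpq_version__ is e.g. 160004; for PG 10+ that means 16.4.
    major, minor = divmod(v, 10000)
    return f"{major}.{minor}"

print(decode_libpq_version(160004))  # -> 16.4
print(decode_libpq_version(130011))  # -> 13.11

# Ask the runtime libpq directly; PQlibVersion() has existed since libpq 9.1.
libname = ctypes.util.find_library("pq")
if libname:
    libpq = ctypes.CDLL(libname)
    print("runtime libpq:", decode_libpq_version(libpq.PQlibVersion()))
```

This makes it unambiguous whether the compile-time and runtime libpq versions agree.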
The DB set-up for this reproducer is simple:
```sql
CREATE TABLE test_leak_table (
id SERIAL PRIMARY KEY,
updated_at TIMESTAMP WITH TIME ZONE
);
INSERT INTO test_leak_table (updated_at) VALUES (NOW());
```
When I run this with libpq 13.x, I get:
- Baseline: ~41,000 KB
- Final: ~41,200 KB
- Growth: ~200 KB (stabilizes)
- Leak Rate: ~14 KB/min
But when I run it with libpq 16.x, I reproduce the leak behavior:
- Baseline: ~42,000 KB
- Final: ~52,000 KB
- Growth: ~10,000 KB (continuous linear growth)
- Leak Rate: ~800 KB/min
Notes:
1. PostgreSQL server version does NOT matter:
- PG 13.11: ~797 KB/min average
- PG 16.7: ~802 KB/min average
- Difference: <1% (within measurement error)
2. Leak is linear and continuous:
- Grows ~530 KB every 30 seconds
- No stabilization
- No sawtooth pattern
- Suggests unbounded memory accumulation
3. Leak is in C library memory. The leak is:
- Visible in `ps aux` RSS (operating-system view)
- NOT visible in Python's resource.getrusage() (application view)
- NOT any Python object (I checked with the Python VM's `gc.get_objects()`)
This indicates the leak is in libpq's C-level allocations, not the Python heap.
4. The leak requires connection churn. In particular, it needs:
- NullPool (no connection pooling)
- Repeated connection create/destroy
- UPDATE + COMMIT operations each iteration
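Regarding the measurement caveat in note 3: on Linux, RSS can also be sampled directly from `/proc` rather than scraping the full `ps aux` table, which sidesteps column-parsing mistakes. A Linux-only sketch (returns None where procfs is unavailable):

```python
import os

def get_vmrss_kb(pid="self"):
    # Read VmRSS from /proc/<pid>/status; the kernel reports the value in kB.
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):
        return None  # not Linux / no procfs
    with open(path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

print("VmRSS:", get_vmrss_kb(), "kB")
```

Like `ps` RSS, this is the operating-system view, so it shows the same growth that `resource.getrusage()` misses here.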
Note that the only psycopg2 version I tested with was 2.9.10, as it is the
only version available in my test environment.
Also, I only tested with SQLAlchemy version 1.4.46, as v2 would require a
rewrite due to an incompatible API.
Workarounds:
- Use libpq 13.x (43x improvement: 14-22 KB/min vs 800 KB/min).
- Use connection pooling to reduce connection churn.
- Use psycopg3.
- Recycle processes frequently.
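The "recycle processes" workaround can be sketched with nothing but the standard library: run batches of iterations in short-lived child processes, so whatever the C library leaks is returned to the OS when each child exits. `run_batch` below is a hypothetical placeholder standing in for the UPDATE/COMMIT loop from the reproducer:

```python
import multiprocessing as mp

def run_batch(n):
    # Hypothetical: perform n iterations of the UPDATE/COMMIT loop here.
    # Any memory libpq leaks in this process dies with the child.
    return None

def run_recycled(total, batch_size):
    # Split `total` iterations across short-lived worker processes.
    done = 0
    while done < total:
        n = min(batch_size, total - done)
        p = mp.Process(target=run_batch, args=(n,))
        p.start()
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"batch failed with exit code {p.exitcode}")
        done += n
    return done

if __name__ == "__main__":
    print(run_recycled(600, 100))
```

This caps the damage per process rather than fixing the leak, but it keeps long-running supervisors (like the Airflow case above) from growing without bound.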