Add logical_decoding_spill_limit to cap spill file disk usage per slot - Mailing list pgsql-hackers
| From | shawn wang |
|---|---|
| Subject | Add logical_decoding_spill_limit to cap spill file disk usage per slot |
| Date | |
| Msg-id | CA+T=_GU-vTxFqRwWJMR4Hz8YUXkpUv_q6Nm3CrTqHNbhCrS5BA@mail.gmail.com |
| List | pgsql-hackers |
Hi hackers,
== Motivation ==
We operate a fleet of PostgreSQL instances with logical replication. On several occasions, we have experienced production incidents in which logical decoding spill files (pg_replslot/<slot>/xid-*.spill) grew without bound, consuming tens of gigabytes and eventually filling up the data disk. This made the entire instance effectively read-only, impacting not just replication but all write workloads.
The typical scenario is a large transaction (e.g. bulk data load or a long-running DDL) combined with a subscriber that is either slow or temporarily disconnected. The reorder buffer exceeds logical_decoding_work_mem and starts spilling, but there is no upper bound on how much can be spilled. The only backstop today is the OS returning ENOSPC, at which point the damage is already done.
We looked for existing protections:
- max_slot_wal_keep_size: limits WAL retention, but does not affect spill files at all.
- logical_decoding_work_mem: controls *when* spilling starts, but not *how much* can be spilled.
- There is no existing GUC, patch, or commitfest entry that addresses spill file disk quota.
The "Report reorder buffer size" patch (CF #6053, by Ashutosh Bapat) improves observability of reorder buffer state, which is complementary — but observability alone cannot prevent disk-full incidents.
== Proposed solution ==
The attached patch adds a new GUC:
logical_decoding_spill_limit (integer, unit kB, default 0)
When set to a positive value, it limits the total size of on-disk spill files per replication slot. Key design points:
- Tracking: We add two new fields:
  - ReorderBuffer.spillBytesOnDisk: the current total on-disk spill size for this slot (unlike spillBytes, which is a cumulative statistics counter, this is a live gauge).
  - ReorderBufferTXN.serialized_size: the per-transaction on-disk size, so we can accurately decrement the global counter during cleanup.
- Increment: In ReorderBufferSerializeChange(), after a successful write(), both counters are incremented by the size written.
- Decrement: In ReorderBufferRestoreCleanup(), when spill files are unlinked, the global counter is decremented by the transaction's serialized_size.
- Enforcement: In ReorderBufferCheckMemoryLimit(), before calling ReorderBufferSerializeTXN(), we check: if (spillBytesOnDisk + txn->size > spill_limit) ereport(ERROR, ...). This check applies only on the spill-to-disk path, not on the streaming path, which involves no disk I/O.
- Behavior on limit exceeded: An ERROR is raised with ERRCODE_CONFIGURATION_LIMIT_EXCEEDED. The walsender exits, but the slot's restart_lsn and confirmed_flush are preserved. The subscriber can reconnect after the DBA:
- increases logical_decoding_spill_limit, or
- increases logical_decoding_work_mem (to reduce spilling), or
- switches to a streaming-capable output plugin (which avoids spilling entirely).
- Default 0 means unlimited, so the change is fully backward compatible.
== Why per-slot, not global? ==
Each ReorderBuffer instance lives in a single walsender process and corresponds to exactly one replication slot. A per-slot limit is:
- Lock-free (no shared memory coordination needed)
- Simple to reason about (each slot has its own budget)
- Sufficient to protect against disk-full (the DBA sets the limit based on available disk / number of slots)
A global (cross-slot) limit could be layered on top later if needed, but would require shared-memory counters with spinlock/atomic protection.
== Performance impact ==
- Hot path (in-memory change queuing): zero overhead.
- Spill path: one integer comparison before serialization, one integer addition after write() — negligible compared to the I/O cost.
- Cleanup path: one integer subtraction after unlink() — negligible.
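As a usage sketch, a DBA applying the remedies above might configure something like the following. The values are illustrative only, not recommendations; logical_decoding_spill_limit is the GUC this patch proposes.

```
# postgresql.conf (illustrative values)
logical_decoding_work_mem = '256MB'     # when spilling starts
logical_decoding_spill_limit = '10GB'   # proposed: per-slot cap on spill files (0 = unlimited)
```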
Looking forward to feedback.
Thanks,
Shawn.