[PATCH] Batched clock sweep to reduce cross-socket atomic contention - Mailing list pgsql-hackers
| From | Greg Burd |
|---|---|
| Subject | [PATCH] Batched clock sweep to reduce cross-socket atomic contention |
| Date | |
| Msg-id | 79629577-3ad8-4b1c-a469-ebc2cb4c5104@app.fastmail.com |
| List | pgsql-hackers |
Hello hackers,
A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior on some of the newer AWS bare-metal instance
types (r8i in particular, which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the process landed on a very
small change to freelist.c that I think is worth showing around. His patch is attached with some tweaks of my own.
Full disclosure: the exploration that led Jim to this patch idea was done with help from an AI assistant (Kiro); the
idea, the benchmarking, and the final shape of the patch are human-driven, but I wanted to be up front about how his
investigation started. Happy to discuss that separately if people want to.
The one-line summary: instead of advancing nextVictimBuffer one buffer at a time via pg_atomic_fetch_add_u32, each
backend claims a batch of 64 consecutive buffer IDs from the shared hand and then iterates them privately. Global
sweep order is preserved -- every buffer is still visited exactly once per complete pass -- but the atomic contention
on that one cache line drops by roughly the batch size.
Why this matters
----------------
On multi-socket boxes under eviction pressure, every backend that needs a victim buffer ends up CAS'ing the same cache
line. On a single socket, a locked RMW on that cache line stays warm in L1/L2 and completes in ~20ns. On 2+ sockets,
the line bounces over QPI/UPI at ~100-200ns per op, and with hundreds of backends running StrategyGetBuffer()
concurrently, the line ping-pongs constantly. It's a textbook NUMA scalability bottleneck, and once shared_buffers is
smaller than the working set and the sweep is running continuously, that single atomic is what you hit in a perf
profile (elevated bus-cycles, cache-misses on the cache line holding nextVictimBuffer).
Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas called it out in the "Adding basic NUMA
awareness" thread [1] -- so this isn't news to anyone who's been looking at this area. What I think is new is a fix
that's just this, without any of the surrounding architectural change.
The framing (credit to Jim): the clock hand is doing two jobs. It *coordinates* backends so they don't redundantly
decrement usage_count on the same buffers and so they eventually visit every buffer in the pool exactly once per pass.
It also *serializes* access to the counter. Coordination is the part we want. Serialization is the part that's killing
us on bigger NUMA boxes. Batching keeps the coordination and thins out the serialization.
How it works
------------
Two per-backend statics, MyBatchPos and MyBatchEnd. When a backend calls ClockSweepTick() and its local batch is
exhausted, it does a single fetch-add of CLOCK_SWEEP_BATCH_SIZE (64) against nextVictimBuffer and now owns that range.
Subsequent ticks just bump the local counter.
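To make the shape concrete, here is a minimal sketch of that fast path. The per-backend names come from the
description above; everything else (in particular ClockSweepWrap, sketched in the next section) is illustrative, and
the attached patch is the authoritative version:

    /* Per-backend batch state (names from the description above). */
    static uint32 MyBatchPos = 0;   /* next buffer id to hand out locally */
    static uint32 MyBatchEnd = 0;   /* one past the last id this backend owns */

    static inline uint32
    ClockSweepTick(void)
    {
        if (MyBatchPos >= MyBatchEnd)
        {
            /*
             * Local batch exhausted: claim ClockSweepBatchSize consecutive
             * buffer ids with a single fetch-add on the shared hand.
             */
            MyBatchPos = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,
                                                 ClockSweepBatchSize);
            MyBatchEnd = MyBatchPos + ClockSweepBatchSize;

            /* A batch whose start lands past the end of the pool triggers wrap handling. */
            if (MyBatchPos >= NBuffers)
                ClockSweepWrap();       /* see the wraparound sketch below */
        }

        /* Ticks within an owned batch are a plain local increment, no atomics. */
        return MyBatchPos++ % NBuffers;
    }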
Wraparound got a small rewrite. The original code had the backend that crossed NBuffers drive completePasses++ under
the spinlock via a CAS loop. With batching, multiple backends can each land a fetch-add that returns a value >=
NBuffers in the same pass, so the logic now is: whoever sees a start >= NBuffers takes the spinlock, re-reads the
counter, and if it's still out of range does a single CAS to wrap it and bumps completePasses. If somebody else
already wrapped, we just release and move on. StrategySyncStart() still sees a consistent (nextVictimBuffer,
completePasses) pair.
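Sketched in the same spirit (ClockSweepWrap is my placeholder name, not necessarily what the patch calls it), that
wrap path would be roughly:

    static void
    ClockSweepWrap(void)
    {
        uint32      expected;
        uint32      wrapped;

        /*
         * Our fetch-add returned a start >= NBuffers, so someone has to pull
         * the shared hand back into range and credit the completed pass.  Take
         * the spinlock so StrategySyncStart() still observes a consistent
         * (nextVictimBuffer, completePasses) pair.
         */
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

        expected = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
        if (expected >= NBuffers)
        {
            wrapped = expected % NBuffers;
            if (pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                               &expected, wrapped))
                StrategyControl->completePasses++;
        }
        /* else: another backend already wrapped it; nothing left to do. */

        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
    }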
The batch size is gated on whether we actually have multiple NUMA nodes. On a single-socket box the atomic is already
socket-local; batching just makes backends skip further ahead than they need to, so we fall back to batch size 1 --
which is bit-for-bit the original behavior. The guard:
    if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
        ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
    else
        ClockSweepBatchSize = 1;
Min() against NBuffers covers the small-shared_buffers corner so a batch never wraps the pool multiple times in one
claim.
Does batching mess up the meaning of usage_count?
--------------------------------------------------
Short answer: no. I want to walk through this because it was my first concern too, and I think it's the question that
will come up most on review.
The clock sweep's usage_count is an access-frequency approximation measured in units of *complete passes*. A buffer
with usage_count = N survives N passes without a re-pin. The semantic meaning lives at pass granularity, not at
individual-buffer granularity.
What batching changes: intra-pass temporal ordering. Without batching, with N backends sweeping, decrements are
interleaved -- backend A hits B[0], backend B hits B[1], backend C hits B[2]. With batching, backend A hits B[0..63]
in a tight local burst, then backend B hits B[64..127], etc. The 64-buffer chunks are decremented in bursts rather
than individually.
Why it doesn't matter:
1. Every buffer still gets decremented exactly once per complete
pass. The invariant the algorithm actually depends on is
untouched.
2. A buffer's survival window is the time between consecutive
passes. That's milliseconds to seconds under load. Whether
B[0] gets decremented 50us before or 50us after B[63] within
the same pass is below the resolution of anything usage_count
is trying to measure.
3. The bgwriter's feedback loop reads (nextVictimBuffer,
completePasses, numBufferAllocs) via StrategySyncStart() every
~200ms. nextVictimBuffer still advances at the same *total*
rate (64 per atomic op, but atomic ops happen 1/64 as often).
The position it reports can jitter by up to 64 buffers relative
to the one-at-a-time case, but BgBufferSync()'s smoothed
estimates operate over thousands of buffers per cycle, so the
jitter disappears into the averaging. numBufferAllocs still
increments once per allocation. strategy_delta,
smoothed_alloc, smoothed_density, reusable_buffers_est -- all
unaffected in any way I can see.
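To put a number on point 3, here is a back-of-the-envelope check -- not code from the patch. The 32GB figure matches
the benchmark configuration below; the per-cycle sweep progress is an assumed round number:

    /* Rough scale of the batching jitter relative to what BgBufferSync() measures. */
    #include <stdio.h>

    int
    main(void)
    {
        long long   nbuffers = 32LL * 1024 * 1024 * 1024 / 8192;   /* 32GB / 8kB = 4,194,304 */
        long long   jitter = 64;        /* worst-case positional skew from one batch */
        long long   cycle_delta = 5000; /* assumed sweep progress per ~200ms cycle */

        printf("jitter vs one full pass:   %.4f%%\n", 100.0 * jitter / nbuffers);
        printf("jitter vs one 200ms cycle: %.2f%%\n", 100.0 * jitter / cycle_delta);
        return 0;
    }

Even the per-cycle worst case is a ~1% one-shot error in a signal that is explicitly smoothed across cycles, which is
why I don't see how it could move the bgwriter's behavior.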
Table form, because it's easier to argue with:
Property | Unpatched | Batched
----------------------------------+----------------+----------------
Buffers visited per pass | NBuffers | NBuffers
Decrements per buffer per pass | 1 | 1
Eviction threshold | usage_count==0 | usage_count==0
Max survival (passes) | 6 | 6
Decrement ordering within a pass | interleaved | chunked
bgwriter allocation rate signal | accurate | accurate
Cross-socket atomic traffic | 1 per buffer | 1 per 64
There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with
MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the
global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at
once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the
implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But
"light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
doesn't show up where it would hurt.
There's also a small positive side-effect: cache locality. The backend that just touched BufferDescriptor[B[0]] has
the adjacent descriptors warm in L1/L2. Walking B[0..63] locally is cheaper than walking a striped interleaving where
each descriptor was last touched by a different core. I haven't tried to isolate this in perf, but it falls out
naturally.
Benchmarks
----------
Jim ran these; I'm still working on reproducing them locally and will post independent numbers in a follow-up. All
bare metal, Linux, huge pages enabled throughout (more on that below), postmaster pinned to node 0 with `numactl
--cpunodebind=0` because otherwise stock TPS varied from 31K to 40K depending on which node the postmaster happened to
land on at launch -- worth flagging for anyone trying to reproduce.
Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the working set always spills and the sweep is
hot.
r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):
pgbench RO:
Clients Stock Patched Delta
64 31,457 36,353 +16%
128 31,678 37,864 +20%
256 31,510 37,558 +19%
384 31,431 37,464 +19%
512 31,329 37,040 +18%
pgbench RW:
Clients Stock Patched Delta
64 7,685 7,713 0%
128 10,420 10,541 +1%
256 12,393 12,463 +1%
384 15,317 15,197 -1%
512 17,930 17,978 0%
m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
RO +19-20%, RW within noise.
c8i.metal-48xl (192 vCPUs, 1 socket):
Single-socket -> batch_size=1 -> original code path. No
behavioral change. (I double-checked this one specifically
because it's the sanity test for the gate.)
HammerDB TPC-C on m6i.metal (1000 warehouses):
VUs Stock Patched Delta
128 358,518 349,787 -2%
256 332,098 330,272 -1%
384 365,782 377,519 +3%
512 370,663 386,526 +4%
No TPC-C regression, which was the thing we were most worried about. An earlier attempt (per-socket partitioned sweep,
see below) was -13% on this same workload.
The general shape is: the scaling curve flattens later. Unpatched, TPS tops out around 128 clients and stays flat up
to 512 because backends are spending cycles waiting on the cache line rather than doing work. Patched, the curve keeps
rising past the point where unpatched plateaus.
Huge pages caveat: all of the above was run with huge pages on, on large-memory instances (the r8i.96xl has 3TB, so
Jim never considered running without them). We have not characterized the non-huge-pages case. That's on my list; I
don't expect it to change the conclusion, but I shouldn't speak for data I haven't collected.
Relationship to Tomas's NUMA series
-----------------------------------
Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned
freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't
think we should frame this patch as competing with that work.
One thing I found striking as I re-read the thread: in the benchmarks Tomas posted later in the series, *most of the
benefit comes from partitioning the clock sweep*, and the NUMA memory-placement layer on top sometimes runs slower
than partitioning alone. His own conclusion, quoted roughly: the benefit mostly comes from just partitioning the clock
sweep, and it's largely independent of the NUMA stuff; the NUMA partitioning is often slower.
That observation is the thing that makes me think batching is worth considering on its own. It's going after the same
bottleneck Tomas's partitioning addresses, but:
- without splitting global eviction visibility (which is where
cross-partition stealing gets complicated),
- without requiring NUMA-aware buffer placement (which has huge
page alignment, descriptor-partition-mid-page, and resize
complications that are still being worked out in that thread),
- without touching PGPROC or bgwriter.
What this patch does *not* do:
- place buffers on specific NUMA nodes
- partition the freelist
- touch PGPROC
- add new GUCs
- change bgwriter
What this patch *does* do:
- target exactly the clock-sweep contention that Tomas's
partitioning targets, and reduce it by ~64x, in ~30 lines.
If Tomas's series lands in full, this patch becomes redundant for its primary use case (though even within a
partitioned sweep, the per-partition atomic still benefits from batching, so it's arguably a useful primitive either
way). If Tomas's series lands incrementally over several cycles -- which the open items in that thread suggest is the
realistic path -- this gets us a real chunk of the multi-socket win now.
This patch is also orthogonal to my earlier thread about removing the freelist entirely [2], but given the proximity
to that code Jim agreed that I could propose/steward it here on the list for consideration.
Open questions / things I'd like feedback on
--------------------------------------------
- Batch size. 64 is a round number that worked well in testing, but
Nathan raised the reasonable point that on small shared_buffers
with high concurrency, a fixed 64 could be unfortunate. Options:
scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
with max_connections, keep it fixed but let operators tune it, or
make it a function of NUMA node count. I don't have a strong
opinion yet; the Min(batch, NBuffers) cap covers the "obviously
wrong" corner but doesn't speak to the "several hundred backends
on a few-MB shared_buffers" shape. Numbers/ideas/proposals welcome.
- NUMA detection. The gate uses pg_numa_init() /
pg_numa_get_max_node(). On systems where libnuma isn't available,
or where get_mempolicy is blocked (some container configurations),
we fall back to batch size 1. That's safe but it misses the
"single socket, many cores, still benefits from fewer atomics"
case. Might be worth a way to force-enable, or batching on all
systems with a smaller batch size when single-socket. I'd like to
measure before deciding.
- Eviction pattern on reads. Nathan also flagged that with batching,
the buffers a backend ends up pinning in one StrategyGetBuffer()
call will tend to be contiguous in buffer-id space rather than
scattered, which is a different allocation pattern than today.
The usage_count analysis above says this is benign, but if anyone
has an intuition for a workload where this would be observable
(e.g., something that cares about the mapping between buffer-id
and relation locality), I'd like to hear it.
- nextVictimBuffer wraparound. The current code has a mild overflow
concern papered over with "highly unlikely and wouldn't be
particularly harmful". With batching this is no worse than before,
but if we're already touching this function, it might be worth
thinking about whether to tighten it up in the same patch or a
follow-up.
- Should the non-NUMA value for this be derived from core counts that
imply L1/L2 cache layouts or simply default to 8 rather than 1 to
realize some benefit?
- Should there be a postgresql.conf setting for this that takes
precedence?
I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at the small-shared_buffers corner, and post perf
stat output showing the atomic/cache-miss deltas over the next few days. In the meantime, eyeballs and skepticism
welcome -- I would especially welcome comments from Andres, who's been in this code recently, and from Tomas, whose
series has the most overlap.
I realize that we're past feature freeze and working on release notes for v19, so the chances of merging this are slim
to none. I think this could be considered a "performance bug fix for NUMA systems" in this release, but that is
stretching it a bit. It is a big ask at this stage to land a change like this.
best.
-greg
[1] https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
[2] https://www.postgresql.org/message-id/f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me