Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention
| From | Andres Freund |
|---|---|
| Subject | Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention |
| Date | |
| Msg-id | nwri4oxfusruo7yog2rewbeiegs4rhclkjukn54mcumiilak4b@6chlhxypm5sq |
| In response to | [PATCH] Batched clock sweep to reduce cross-socket atomic contention ("Greg Burd" <greg@burd.me>) |
| Responses | Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention |
| List | pgsql-hackers |
Hi,

Thanks for looking into this.

On 2026-04-25 16:08:02 -0400, Greg Burd wrote:
> Does batching mess up the meaning of usage_count?
> --------------------------------------------------
>
> Short answer: no. I want to walk through this because it was my first
> concern too, and I think it's the question that will come up most on review.
>
> The clock sweep's usage_count is an access-frequency approximation measured
> in units of *complete passes*. A buffer with usage_count = N survives N
> passes without a re-pin. The semantic meaning lives at pass granularity,
> not at individual-buffer granularity.
>
> What batching changes: intra-pass temporal ordering. Without batching, with
> N backends sweeping, decrements are interleaved -- backend A hits B[0],
> backend B hits B[1], backend C hits B[2]. With batching, backend A hits
> B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
> 64-buffer chunks are decremented in bursts rather than individually.
>
> Why it doesn't matter:
>
> 1. Every buffer still gets decremented exactly once per complete
>    pass. The invariant the algorithm actually depends on is
>    untouched.
>
> 2. A buffer's survival window is the time between consecutive
>    passes. That's milliseconds to seconds under load. Whether
>    B[0] gets decremented 50us before or 50us after B[63] within
>    the same pass is below the resolution of anything usage_count
>    is trying to measure.

I don't think this is true, unfortunately. Sure, if you have a completely
uniform, IO intensive, workload it is, but that's not all there is.

If you have a bunch of connections that replace buffers at a low rate and a
bunch of connections that do so at a high rate, the batches "checked out" by
the "low rate" connections won't be processed soon. Thus the buffers in that
batch won't have their usagecount decremented and thus have a stronger
"protection" against replacement.

You can argue that may be OK, because it'd be unlikely that the next sweep
would assign the same buffers to a "low rate" backend again. But that'd be an
argument you'd have to make and validate.

I'm somewhat doubtful that batching that's independent of contention and
independent of the usage rate will work out all that well.

If you instead went for a partitioned sweep architecture, with balancing
between the different partitions, you don't have that issue. And you have a
building block for more numa awareness etc.

> There is one subtle difference worth naming. When a backend finds a victim
> at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63].
> The next time that backend needs a victim it resumes at B[6], not at
> wherever the global hand now points. So the backend drains its batch over
> multiple StrategyGetBuffer() calls rather than all at once. Under heavy
> load, where batches are consumed in microseconds, this is invisible. Under
> light load, the implication is that some buffers can sit with slightly
> stale usage_count for longer than they would have before. But "light load"
> means "the sweep is barely moving and nothing wants to evict anyway" -- so
> the effect doesn't show up where it would hurt.

As mentioned above, this assumes that the replacement rate is uniform between
backends, which I think is not uniformly true outside of benchmarks.
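Just so we're debating the same mechanics, the flow I'm assuming is roughly
the following -- a hand-waved sketch, where SWEEP_BATCH_SIZE, MyBatchNext and
BatchedClockSweepTick() are my shorthand, not necessarily what the patch
actually names things:

```c
/*
 * Sketch of a batched sweep tick, as if inside freelist.c. Assumes the
 * existing StrategyControl->nextVictimBuffer atomic; wrap and
 * completePasses accounting are elided.
 */
#define SWEEP_BATCH_SIZE 64

static uint32 MyBatchNext = 0;	/* per-backend: next buffer in our batch */
static uint32 MyBatchEnd = 0;	/* per-backend: end of our batch (exclusive) */

static inline uint32
BatchedClockSweepTick(void)
{
	if (MyBatchNext == MyBatchEnd)
	{
		/*
		 * One global atomic per 64 buffers instead of one per buffer --
		 * that's the contention win. But it also means these 64 buffers
		 * aren't examined or decremented until *this* backend gets around
		 * to them.
		 */
		uint32		start = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,
													SWEEP_BATCH_SIZE);

		MyBatchNext = start;
		MyBatchEnd = start + SWEEP_BATCH_SIZE;
	}

	return MyBatchNext++ % NBuffers;
}
```

The "low rate backend" problem falls straight out of this: nothing forces
MyBatchNext to reach MyBatchEnd on any particular schedule.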
> There's also a small positive side-effect: cache locality. The backend that
> just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
> L1/L2. A BufferDesc is 64 bytes.

With common cacheline sizes and stuff like adjacent cacheline prefetching
you'll have *maybe* 2 consecutive BufferDescs in L1/L2. Where it might help
more is the TLB.

> Benchmarks
> ----------
>
> Jim ran these; I'm still working on reproducing them locally and will post
> independent numbers in a follow-up. All bare metal, Linux, huge pages
> enabled throughout (more on that below), postmaster pinned to node 0 with
> `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
> depending on which node the postmaster happened to land on at launch --
> worth flagging for anyone trying to reproduce.

That's an odd one that I think you need to investigate separately.

> Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
> working set always spills and the sweep is hot.

Uhm, is this something worth optimizing substantially for? What you're
measuring here is basically the worst possible way of configuring a database,
with full double buffering and a lot of memory bandwidth dedicated to copying
buffers from one place to another. That's maybe a sane setup if you have a
lot of small databases that you can't configure individually, but that's not
the case when you run a reasonably large workload on a 384 vCPU setup.

I think to be really convincing you'd have to do this with actual IO involved
somewhere.

> Relationship to Tomas's NUMA series
> -----------------------------------
>
> Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer
> interleaving across nodes, partitioned freelists, partitioned clock sweep,
> PGPROC interleaving, and related pieces. I want to be careful here because
> I don't think we should frame this patch as competing with that work.
>
> One thing I found striking as I re-read the thread: in the benchmarks Tomas
> posted later in the series, *most of the benefit comes from partitioning the
> clock sweep*, and the NUMA memory-placement layer on top sometimes runs
> slower than partitioning alone. His own conclusion, quoted roughly: the
> benefit mostly comes from just partitioning the clock sweep, and it's
> largely independent of the NUMA stuff; the NUMA partitioning is often
> slower.

That was partially because he measured on something that didn't really have
significant NUMA effects though...

> That observation is the thing that makes me think batching is worth
> considering on its own. It's going after the same bottleneck Tomas's
> partitioning addresses, but:
>
> - without splitting global eviction visibility (which is where
>   cross-partition stealing gets complicated),

You *are* doing that tho.

> - without requiring NUMA-aware buffer placement (which has huge
>   page alignment, descriptor-partition-mid-page, and resize
>   complications that are still being worked out in that thread),

You can do the partitioned clock sweep without *any* of that.
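To make "partitioned sweep" slightly less hand-wavy, I mean something of this
shape -- all names invented here, and the balancing step, which is where the
actual work lies, is only gestured at:

```c
/*
 * Sketch of a partitioned clock sweep: one hand per partition, so the hot
 * atomic is partition-local. Invented names throughout.
 */
#define NUM_SWEEP_PARTITIONS 64		/* e.g. sized relative to core count */

typedef struct SweepPartition
{
	pg_atomic_uint32 hand;		/* clock hand within this partition */
	uint32		first_buffer;	/* partition covers first_buffer .. first_buffer + num_buffers - 1 */
	uint32		num_buffers;
} SweepPartition;

static SweepPartition SweepPartitions[NUM_SWEEP_PARTITIONS];

static inline uint32
PartitionedClockSweepTick(void)
{
	/* Home partition by backend; could instead be by NUMA node. */
	SweepPartition *part = &SweepPartitions[MyProcNumber % NUM_SWEEP_PARTITIONS];

	/* Contention is now spread across NUM_SWEEP_PARTITIONS atomics. */
	uint32		tick = pg_atomic_fetch_add_u32(&part->hand, 1);

	/*
	 * Omitted: balancing between partitions so pass rates stay roughly
	 * even (otherwise you reinvent the low-rate-backend problem at
	 * partition granularity), plus completePasses/wrap accounting.
	 */
	return part->first_buffer + tick % part->num_buffers;
}
```

Every buffer stays continuously visible as an eviction candidate -- no batch
is checked out and parked -- which is why this doesn't have the staleness
issue discussed above.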
Greetings,

Andres Freund