Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: Adding basic NUMA awareness
Date | 
Msg-id | ebc3f1ae-c09b-4125-9127-d0ef169680b2@vondra.me
In response to | Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
Responses | Re: Adding basic NUMA awareness
List | pgsql-hackers
On 7/18/25 18:46, Andres Freund wrote:
> Hi,
>
> On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
>> Here's a v2 of the patch series, with a couple changes:
>
> Not a deep look at the code, just a quick reply.
>
>
>> * I changed the freelist partitioning scheme a little bit, based on the
>> discussion in this thread. Instead of having a single "partition" per
>> NUMA node, there's not a minimum number of partitions (set to 4). So
>
> I assume s/not/now/?
>

Yes.

>
>> * There's now a patch partitioning clocksweep, using the same scheme as
>> the freelists.
>
> Nice!
>
>
>> I came to the conclusion it doesn't make much sense to partition these
>> things differently - I can't think of a reason why that would be
>> advantageous, and it makes it easier to reason about.
>
> Agreed.
>
>
>> The clocksweep partitioning is somewhat harder, because it affects
>> BgBufferSync() and related code. With the partitioning we now have
>> multiple "clock hands" for different ranges of buffers, and the clock
>> sweep needs to consider that. I modified BgBufferSync to simply loop
>> through the ClockSweep partitions, and do a small cleanup for each.
>
> That probably makes sense for now. It might need a bit of a larger
> adjustment at some point, but ...
>

I couldn't think of anything fundamentally better that isn't too
complex. I suspect we might want to use multiple bgwriters in the
future, and this scheme seems to be reasonably well suited for that
too.

I'm also thinking about having some sort of "unified" partitioning
scheme for all the places partitioning shared buffers. Right now each
of these places does it on its own, i.e. buff_init, freelist and
clocksweep all have their own code splitting NBuffers into partitions.
The splits should all align, because what would be the benefit if they
didn't? But having three variants of the same code seems a bit
pointless. I think buff_init should build a common definition of buffer
partitions, and the remaining parts should use that as the source of
truth ...
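To make that "unified partitioning" idea a bit more concrete, here is a minimal sketch of how a shared description of the partitions could be built once at buffer-initialization time and then reused by the freelist and clock-sweep code. It is only an illustration under assumptions from this thread (contiguous buffer ranges, a minimum number of partitions per NUMA node); none of the names below come from the actual patch.

```c
#include <stdlib.h>

/*
 * Hypothetical shared description of buffer partitions.  In PostgreSQL
 * this would be built during buffer initialization and live in shared
 * memory; the freelist and clock-sweep code would then map their
 * per-partition state 1:1 onto these ranges instead of each splitting
 * NBuffers on its own.  All names are invented for illustration.
 */
typedef struct BufferPartition
{
    int     numa_node;      /* NUMA node the partition belongs to */
    int     first_buffer;   /* first buffer ID in the partition */
    int     num_buffers;    /* number of buffers in the partition */
} BufferPartition;

/*
 * Split nbuffers into (numa_nodes * parts_per_node) contiguous ranges,
 * i.e. at least parts_per_node partitions per NUMA node (the v2 series
 * uses a minimum of 4).  The last partition absorbs any remainder.
 */
static BufferPartition *
build_buffer_partitions(int nbuffers, int numa_nodes, int parts_per_node,
                        int *nparts_out)
{
    int     nparts = numa_nodes * parts_per_node;
    int     chunk = nbuffers / nparts;
    BufferPartition *parts = malloc(nparts * sizeof(BufferPartition));

    for (int i = 0; i < nparts; i++)
    {
        parts[i].numa_node = i / parts_per_node;
        parts[i].first_buffer = i * chunk;
        parts[i].num_buffers =
            (i == nparts - 1) ? (nbuffers - i * chunk) : chunk;
    }

    *nparts_out = nparts;
    return parts;
}
```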
>> * This new freelist/clocksweep partitioning scheme is however harder to
>> disable. I now realize the GUC may quite do the trick, and there even is
>> not a GUC for the clocksweep. I need to think about this, but I'm not
>> even how feasible it'd be to have two separate GUCs (because of how
>> these two pieces are intertwined). For now if you want to test without
>> the partitioning, you need to skip the patch.
>
> I think it's totally fair to enable/disable them at the same time. They're so
> closely related, that I don't think it really makes sense to measure them
> separately.
>

Yeah, that's a fair point.

>
>> I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
>> the results are encouraging. For a read-only pgbench (2x shared buffers,
>> within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
>> bad considering the patch is more about consistency than raw throughput.
>
> Personally I think an 1.18x improvement on a relatively small NUMA machine is
> really rather awesome.
>

True, but I want to stress it's just one quick (& simple) test. Much
more testing is needed before I can make reliable claims.

>
>> For a read-write pgbench I however saw some strange drops/increases of
>> throughput. I suspect this might be due to some thinko in the clocksweep
>> partitioning, but I'll need to take a closer look.
>
> Was that with pinning etc enabled or not?
>

IIRC it was with everything enabled, except for numa_procs_pin (which
pins backends to a NUMA node). I found that to actually harm
performance in some of the tests (even just read-only ones), resulting
in uneven usage of cores and lower throughput.

>
>> From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas@vondra.me>
>> Date: Sun, 8 Jun 2025 18:53:12 +0200
>> Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
>>
>
>> @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
>> /*
>>  * Nothing on the freelist, so run the "clock sweep" algorithm
>>  *
>> - * XXX Should we also make this NUMA-aware, to only access buffers from
>> - * the same NUMA node? That'd probably mean we need to make the clock
>> - * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
>> - * subset of buffers. But that also means each process could "sweep" only
>> - * a fraction of buffers, even if the other buffers are better candidates
>> - * for eviction. Would that also mean we'd have multiple bgwriters, one
>> - * for each node, or would one bgwriter handle all of that?
>> + * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
>> + * buffers from a single partition, aligned with the NUMA node. That
>> + * means it only accesses buffers from the same NUMA node.
>> + *
>> + * XXX That also means each process "sweeps" only a fraction of buffers,
>> + * even if the other buffers are better candidates for eviction. Maybe
>> + * there should be some logic to "steal" buffers from other freelists
>> + * or other nodes?
>
> I think we *definitely* need "stealing" from other clock sweeps, whenever
> there's a meaningful imbalance between the different sweeps.
>
> I don't think we need to be overly precise about it, a small imbalance won't
> have that much of an effect. But clearly it doesn't make sense to say that one
> backend can only fill buffers in the current partition, that'd lead to massive
> performance issues in a lot of workloads.
>

Agreed.
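As a rough illustration of what such "stealing" could look like on the allocation path: a backend would normally use its "home" partition (the one matching its NUMA node), but switch to the least-used partition once the home one has consumed noticeably more than its fair share. Everything below is hypothetical (struct, field and function names, the ~25% slack), and scanning all partitions on every allocation is deliberately naive; the discussion that follows is about doing such checks only when a sweep wraps around, to keep them cheap.

```c
#include <stdint.h>

/* hypothetical per-partition clock-sweep state; not the patch's struct */
typedef struct ClockSweepPartition
{
    uint64_t    nallocated;     /* buffers allocated from this partition */
} ClockSweepPartition;

/*
 * Pick the clock-sweep partition to take a victim buffer from.  Stay on
 * the backend's "home" partition unless it is more than ~25% over its
 * fair share of allocations; in that case "steal" from whichever
 * partition is furthest behind.
 */
static int
choose_sweep_partition(const ClockSweepPartition *parts, int nparts, int home)
{
    uint64_t    total = 0;
    uint64_t    fair;
    int         best = home;

    for (int i = 0; i < nparts; i++)
        total += parts[i].nallocated;

    fair = total / nparts;

    /* small imbalances are fine, keep using the home partition */
    if (parts[home].nallocated <= fair + fair / 4)
        return home;

    /* otherwise pick the least-used partition */
    for (int i = 0; i < nparts; i++)
    {
        if (parts[i].nallocated < parts[best].nallocated)
            best = i;
    }

    return best;
}
```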
> The hardest thing probably is to make the logic for when to check foreign
> clock sweeps cheap enough.
>
> One way would be to do it whenever a sweep wraps around, that'd probably
> amortize the cost sufficiently, and I don't think it'd be too imprecise, as
> we'd have processed that set of buffers in a row without partitioning as
> well. But it'd probably be too coarse when determining for how long to use a
> foreign sweep instance. But we probably could address that by rechecking the
> balanace more frequently when using a foreign partition.
>

What do you mean by "it"? What would happen after a sweep wraps around?

> Another way would be to have bgwriter manage this. Whenever it detects that
> one ring is too far ahead, it could set a "avoid this partition" bit, which
> would trigger backends that natively use that partition to switch to foreign
> partitions that don't currently have that bit set. I suspect there's a
> problem with that approach though, I worry that the amount of time that
> bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
> much imbalance.
>

I'm afraid having hard "avoid" flags would lead to sudden and unexpected
changes in performance as we enable/disable partitions. I think a good
solution should "smooth it out" somehow, e.g. by not having a true/false
flag, but some sort of "preference" factor with values between (0.0, 1.0)
which says how much we should use that partition.

I was imagining something like this: say we know the number of buffers
allocated for each partition (in the last round), and we (or rather
BgBufferSync) calculate

    coefficient = 1.0 - (nallocated_partition / nallocated)

and then use that to "correct" which partition to allocate buffers from.

Or maybe just watch how far from the "fair share" we were in the last
interval, and gradually increase/decrease the "partition preference",
which would say how often we need to "steal" from other partitions.
E.g. we find nallocated_partition is 2x the fair share, i.e.

    nallocated_partition / (nallocated / nparts) = 2.0

Then we'd say that 25% of the time a backend should look at some other
partition, to "cut" the imbalance in half. And then repeat that in the
next cycle, etc.

So a process would look at its "home partition" by default, but it'd
"roll a dice" first, and if above the calculated probability it'd pick
some other partition instead (this would need to be done so that it
gets balanced overall).

If the bgwriter interval is too long, maybe the recalculation could be
triggered regularly after any of the clocksweeps wraps around, or after
some number of allocations, or something like that.

regards

--
Tomas Vondra
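A tiny sketch of that "fair share" arithmetic, reading the example above as: redirect just enough allocations away from an over-used partition to roughly halve its excess over the fair share each interval, so a partition at 2x its fair share is skipped 25% of the time. The function and variable names are invented for illustration; the coefficient formula mentioned first in the mail is a simpler alternative to this.

```c
#include <stdio.h>

/*
 * Illustrative only: derive the probability with which backends homed on
 * a partition should look at some other partition instead, given how many
 * buffers the partition allocated in the last interval versus the total.
 * Redirecting this fraction of allocations roughly halves the excess over
 * the fair share: at 2x the fair share the result is 0.25.
 */
static double
steal_probability(double nallocated_partition, double nallocated, int nparts)
{
    double      fair = nallocated / nparts;
    double      ratio = nallocated_partition / fair;

    if (ratio <= 1.0)
        return 0.0;             /* at or below fair share, no stealing */

    return (ratio - 1.0) / (2.0 * ratio);
}

int
main(void)
{
    /* the example from the mail: 2x the fair share -> 25% */
    printf("steal probability: %.2f\n",
           steal_probability(2000.0, 4000.0, 4));
    return 0;
}
```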