Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Adding basic NUMA awareness
Msg-id: mntwceou3ouc4usvktwutlbt6p3bqrzy73dw5nockzodhkud4g@7bchfsl3qpth
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Adding basic NUMA awareness
List: pgsql-hackers

Hi,

On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
> The patch does a much simpler thing - treat the weight as a "budget",
> i.e. number of buffers to allocate before proceeding to the "next"
> partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
> and then goes back to P1 in a round-robin way. The advantage is it can
> do away without a PRNG.

I think that's a good plan.


A few comments about the clock sweep patch:

- It'd be easier to review if BgBufferSync() weren't basically re-indented
  wholesale. Maybe you could instead move the relevant code to a helper
  function that's called by BgBufferSync() for each clock?
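
  Just to illustrate the shape I mean, a rough standalone sketch (untested;
  ClockSweep, BgBufferSyncOne, NUM_SWEEPS and the struct contents are
  placeholder names, not the patch's or bufmgr.c's identifiers, and
  WritebackContext is just stubbed so this compiles on its own):

#include <stdbool.h>
#include <stdio.h>

typedef struct WritebackContext WritebackContext;  /* stub for the bufmgr type */

typedef struct ClockSweep
{
    int     nextVictimBuffer;   /* stand-in for per-partition sweep state */
} ClockSweep;

#define NUM_SWEEPS 4
static ClockSweep sweeps[NUM_SWEEPS];

/* per-partition version of today's BgBufferSync() body; returns "can hibernate" */
static bool
BgBufferSyncOne(ClockSweep *sweep, WritebackContext *wb_context)
{
    /*
     * estimate upcoming allocations for this clock, scan ahead of its hand,
     * write out dirty buffers with zero usage count, ...
     */
    (void) sweep;
    (void) wb_context;
    return true;
}

bool
BgBufferSync(WritebackContext *wb_context)
{
    bool    hibernate = true;

    for (int i = 0; i < NUM_SWEEPS; i++)
        hibernate &= BgBufferSyncOne(&sweeps[i], wb_context);

    return hibernate;
}

int
main(void)
{
    printf("hibernate: %d\n", BgBufferSync(NULL));
    return 0;
}

  That way the pre-existing logic would show up as a mostly unchanged hunk in
  the diff.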

- I think choosing a clock sweep partition on every tick would likely show up
  in profiles for workloads that do a lot of buffer replacement, particularly
  if buffers in the workload often have a high usagecount (and thus more ticks
  are needed per replacement).  Given that your balancing approach "sticks"
  with a partition for a while, could we perhaps only choose the partition
  after exhausting that budget?
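
  To make sure we're talking about the same thing, here's a trivial
  single-threaded sketch of "only pick a partition once the budget is used
  up" (placeholder names, and it ignores all the atomics the real thing
  would need):

#include <stdio.h>

typedef struct
{
    int     weight;     /* budget derived from the partition's weight */
    int     remaining;  /* ticks left before moving to the next partition */
} SweepPartition;

static SweepPartition partitions[] = {{55, 55}, {45, 45}};
#define NUM_PARTITIONS 2
static int current = 0;

/* return the partition to tick; only switch once the budget is exhausted */
static int
choose_partition(void)
{
    if (partitions[current].remaining-- > 0)
        return current;

    /* budget exhausted: refill it and advance round-robin */
    partitions[current].remaining = partitions[current].weight;
    current = (current + 1) % NUM_PARTITIONS;
    partitions[current].remaining--;
    return current;
}

int
main(void)
{
    int     counts[NUM_PARTITIONS] = {0, 0};

    /* simulate clock ticks and check the 55/45 split from your example holds */
    for (int i = 0; i < 100000; i++)
        counts[choose_partition()]++;

    printf("P1=%d P2=%d\n", counts[0], counts[1]);
    return 0;
}

  I.e. a partition would only be chosen once per budget's worth of ticks,
  rather than on every tick.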

- I don't really understand what

> +    /*
> +     * Buffers that should have been allocated in this partition (but might
> +     * have been redirected to keep allocations balanced).
> +     */
> +    pg_atomic_uint32 numRequestedAllocs;
> +

  is intended for.

  Adding yet another atomic increment for every clock sweep tick seems rather
  expensive...


- I wonder whether the relatively low balancing budgets will be good enough.
  It's not hard to imagine that this frequent "partition choosing" will hurt
  in buffer-access-heavy workloads. But it's probably the right approach
  until we've actually measured it to be a problem.


- It'd be interesting to do some very simple evaluation, like a single
  pg_prewarm() of a relation that's close to the size of shared buffers, and
  verify that we don't end up evicting the buffers we just read in.  I think
  your approach should work, but verifying that...

  I wonder if we could make some of this into tests somehow. It's pretty easy
  to break this kind of thing and not notice, as everything just continues to
  work, just a tad slower.

Greetings,

Andres Freund


