
From: Tomas Vondra
Subject: Re: Adding basic NUMA awareness
Msg-id: a6209a67-6e12-4074-a466-37e3f0ead61f@vondra.me
In response to: Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On 8/12/25 16:24, Andres Freund wrote:
> Hi,
> 
> On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote:
>> Right. I don't think the current patch would crash - I can't test it,
>> but I don't see why it would crash. In the worst case it'd end up with
>> partitions that are not ideal. The question is more what would an ideal
>> partitioning for buffers and PGPROC look like. Any opinions?
>>
>> For PGPROC, it's simple - it doesn't make sense to allocate partitions
>> for nodes without CPUs.
>>
>> For buffers, it probably does not really matter if a node does not have
>> any CPUs. If a node does not have any CPUs, that does not mean we should
>> not put any buffers on it. After all, CXL will never have any CPUs (at
>> least I think that's the case), and not using it for shared buffers
>> would be a bit strange. Although, it could still be used for page cache.
> 
> For CXL memory to be really usable, I think we'd need nontrivial additional
> work. CXL memory has considerably higher latency and lower throughput. You'd
> *never* want things like BufferDescs or such on such nodes. And even the
> buffered data itself, you'd want to make sure that frequently used data,
> e.g. inner index pages, never end up on it.
> 

OK, let's keep that out of scope for these patches and assume we're
dealing only with local memory. CXL could still be used by the OS for
page cache, or whatever.

What does that mean for the patch, though? Does it need a way to
configure which nodes to use? I argued for leaving this to the
OS/numactl, so that we'd just use whatever is made available to
Postgres. But maybe we'll need something within Postgres after all?

FWIW there's work needed to actually inherit NUMA info from the OS.
Right now the patches just use all NUMA nodes, indexed by 0 ... (N-1)
etc. I like the "registry" concept I used for buffer/PGPROC partitions;
it made the patches much simpler. Maybe we should use something like
that for NUMA info too. That is, at startup build a record of the NUMA
layout, and use that as the source of truth everywhere (instead of
calling libnuma from all those places).
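
To make this a bit more concrete, here's a rough sketch of what I have
in mind. Entirely untested, and all the names (NumaRegistry,
numa_registry_init, ...) are made up:

#include <numa.h>
#include <stdbool.h>
#include <stdlib.h>

/* hypothetical per-node info, built once at startup */
typedef struct NumaNodeInfo
{
    int         node_id;        /* OS node number */
    int         num_cpus;       /* CPUs on the node (0 = memory-only) */
    long long   mem_size;       /* total memory on the node, in bytes */
} NumaNodeInfo;

typedef struct NumaRegistry
{
    int             num_nodes;
    int             num_nodes_with_cpus;    /* for PGPROC partitions */
    NumaNodeInfo   *nodes;
} NumaRegistry;

static NumaRegistry registry;

static bool
numa_registry_init(void)
{
    struct bitmask *cpus;
    int         nnodes;

    if (numa_available() < 0)
        return false;           /* no NUMA, behave as a single node */

    /* assumes nodes are numbered 0 .. (N-1), like the patches do now */
    nnodes = numa_num_configured_nodes();

    registry.num_nodes = nnodes;
    registry.nodes = calloc(nnodes, sizeof(NumaNodeInfo));

    cpus = numa_allocate_cpumask();
    for (int node = 0; node < nnodes; node++)
    {
        long long   freemem;

        registry.nodes[node].node_id = node;
        registry.nodes[node].mem_size = numa_node_size64(node, &freemem);

        if (numa_node_to_cpus(node, cpus) == 0)
            registry.nodes[node].num_cpus = numa_bitmask_weight(cpus);

        if (registry.nodes[node].num_cpus > 0)
            registry.num_nodes_with_cpus++;
    }
    numa_free_cpumask(cpus);

    return true;
}

The rest of the code would then only ever look at the registry, and
never touch libnuma directly.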

> Which leads to:
> 
>> Maybe it should be "tiered" a bit more?
> 
> Yes, for proper CXL support, we'd need a component that explicitly demotes and
> promotes pages from "real" memory to CXL memory and the other way round. The
> demotion is relatively easy, you'd probably just do it whenever you'd
> otherwise throw out a victim buffer. When to promote back is harder...
> 

Sounds very much like page cache (but that only works for buffered I/O).

> 
>> The patch differentiates only between partitions on "my" NUMA node vs. every
>> other partition. Maybe it should have more layers?
> 
> Given the relative unavailability of CXL memory systems, I think just not
> crashing is good enough for now...
> 

The lowest of bars ;-)

> 
>>>> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
>>>
>>> I don't immediately see any libnuma functions that would care?
>>>
>>
>> Not sure what "care" means here. I don't think it's necessarily broken,
>> it's more about the APIs not making the situation very clear (or
>> convenient).
> 
> What I mean is that I was looking through the libnuma functions and didn't see
> any that would be affected by having multiple "local" NUMA nodes. But:
> 

My question is a bit of the "reverse" of this. That is, how do we even
find out (with libnuma) that there are multiple local nodes?

> 
>> How do you determine nodes for a CPU, for example? The closest thing I
>> see is numa_node_of_cpu(), but that only returns a single node. Or how
>> would you determine the number of nodes with CPUs (so that we create
>> PGPROC partitions only for those)? I suppose that requires literally
>> walking all the nodes.
> 
> I didn't think of numa_node_of_cpu().
> 

Yeah. I think most of the libnuma API is designed around each CPU
belonging to a single NUMA node. I suppose we'd need to use
numa_node_to_cpus() to build this kind of information ourselves.
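
Something like this, I guess (again untested, and count_nodes_per_cpu
is a made-up name):

#include <numa.h>
#include <stdlib.h>

/*
 * Walk all nodes and count how many of them claim each CPU. On a
 * regular system every CPU is claimed by exactly one node; a count
 * above 1 is exactly the overlap numa_node_of_cpu() can't express,
 * because it returns a single node.
 */
static int *
count_nodes_per_cpu(int *ncpus_out)
{
    int         ncpus = numa_num_configured_cpus();
    int         nnodes = numa_num_configured_nodes();
    int        *claims = calloc(ncpus, sizeof(int));
    struct bitmask *cpus = numa_allocate_cpumask();

    for (int node = 0; node < nnodes; node++)
    {
        if (numa_node_to_cpus(node, cpus) < 0)
            continue;           /* skip nodes we can't query */

        for (int cpu = 0; cpu < ncpus; cpu++)
        {
            if (numa_bitmask_isbitset(cpus, cpu))
                claims[cpu]++;
        }
    }
    numa_free_cpumask(cpus);

    *ncpus_out = ncpus;
    return claims;
}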

> As long as numa_node_of_cpu() returns *something* I think it may be good
> enough. Nobody uses an RPi for high-throughput postgres workloads with a lot
> of memory. Slightly sub-optimal mappings should really not matter.
> 

I'm not really concerned about the rpi, or the performance on it. I
only use it as an example of a system with a "weird" NUMA layout.

> I'm kinda wondering if we should deal with such fake numa systems by detecting
> them and disabling our numa support.
> 

That'd be an option too, if we can identify such systems. We could do
that while building the "NUMA registry" I mentioned earlier.
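
For example, building on the count_nodes_per_cpu() sketch above, maybe
a check as simple as this would do (just a guess at a heuristic, not
sure it catches all the weird layouts):

static bool
numa_layout_is_sane(void)
{
    int         ncpus;
    int        *claims = count_nodes_per_cpu(&ncpus);
    bool        sane = true;

    for (int cpu = 0; cpu < ncpus; cpu++)
    {
        if (claims[cpu] > 1)
        {
            /* CPU in multiple nodes (like the rpi) => treat as fake */
            sane = false;
            break;
        }
    }

    free(claims);
    return sane;
}

If that returns false, we'd just not build any partitions and fall back
to the current non-NUMA behavior.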


regards

-- 
Tomas Vondra



