Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Adding basic NUMA awareness
Msg-id: a6209a67-6e12-4074-a466-37e3f0ead61f@vondra.me
In response to: Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 8/12/25 16:24, Andres Freund wrote:
> Hi,
>
> On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote:
>> Right. I don't think the current patch would crash - I can't test it,
>> but I don't see why it would crash. In the worst case it'd end up with
>> partitions that are not ideal. The question is more what would an ideal
>> partitioning for buffers and PGPROC look like. Any opinions?
>>
>> For PGPROC, it's simple - it doesn't make sense to allocate partitions
>> for nodes without CPUs.
>>
>> For buffers, it probably does not really matter if a node does not have
>> any CPUs. If a node does not have any CPUs, that does not mean we should
>> not put any buffers on it. After all, CXL will never have any CPUs (at
>> least I think that's the case), and not using it for shared buffers
>> would be a bit strange. Although, it could still be used for page cache.
>
> For CXL memory to be really usable, I think we'd need nontrivial additional
> work. CXL memory has considerably higher latency and lower throughput. You'd
> *never* want things like BufferDescs or such on such nodes. And even the
> buffered data itself, you'd want to make sure that frequently used data,
> e.g. inner index pages, never end up on it.
>

OK, let's keep that out of scope for these patches and assume we're
dealing only with local memory. CXL could still be used by the OS for
page cache, or whatever.

What does that mean for the patch, though? Does it need a way to
configure which nodes to use? I argued to leave this to the OS/numactl,
and we'd just use whatever is made available to Postgres. But maybe
we'll need something within Postgres after all?

FWIW there's work needed to actually inherit NUMA info from the OS.
Right now the patches just use all NUMA nodes, indexed by 0 ... (N-1)
etc. I like the "registry" concept I used for buffer/PGPROC partitions,
it made the patches much simpler. Maybe we should use something like
that for NUMA info too.
That is, at startup build a record of the NUMA layout, and use this as
the source of truth everywhere (instead of using libnuma from all those
places).

> Which leads to:
>
>> Maybe it should be "tiered" a bit more?
>
> Yes, for proper CXL support, we'd need a component that explicitly demotes and
> promotes pages from "real" memory to CXL memory and the other way round. The
> demotion is relatively easy, you'd probably just do it whenever you'd
> otherwise throw out a victim buffer. When to promote back is harder...
>

Sounds very much like page cache (but that only works for buffered I/O).

>> The patch differentiates only between partitions on "my" NUMA node vs. every
>> other partition. Maybe it should have more layers?
>
> Given the relative unavailability of CXL memory systems, I think just not
> crashing is good enough for now...
>

The lowest of bars ;-)

>>>> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
>>>
>>> I don't immediately see any libnuma functions that would care?
>>>
>>
>> Not sure what "care" means here. I don't think it's necessarily broken,
>> it's more about the APIs not making the situation very clear (or
>> convenient).
>
> What I mean is that I was looking through the libnuma functions and didn't see
> any that would be affected by having multiple "local" NUMA nodes. But:
>

My question is a bit of a "reverse" to this. That is, how do we even
find (with libnuma) that there are multiple local nodes?

>> How do you determine nodes for a CPU, for example? The closest thing I
>> see is numa_node_of_cpu(), but that only returns a single node. Or how
>> would you determine the number of nodes with CPUs (so that we create
>> PGPROC partitions only for those)? I suppose that requires literally
>> walking all the nodes.
>
> I didn't think of numa_node_of_cpu().
>

Yeah. I think most of the libnuma API is designed for each CPU belonging
to a single NUMA node.
I suppose we'd need to use numa_node_to_cpus() to build this kind of
information ourselves.

> As long as numa_node_of_cpu() returns *something* I think it may be good
> enough. Nobody uses an RPi for high-throughput postgres workloads with a lot
> of memory. Slightly sub-optimal mappings should really not matter.
>

I'm not really concerned about the RPi, or the performance on it. I only
use it as an example of a system with a "weird" NUMA layout.

> I'm kinda wondering if we should deal with such fake numa systems by detecting
> them and disabling our numa support.
>

That'd be an option too, if we can identify such systems. We could do
that while building the "NUMA registry" I mentioned earlier.


regards

--
Tomas Vondra