Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From:            Tomas Vondra
Subject:         Re: Adding basic NUMA awareness
Msg-id:          9c31717c-fe38-4824-9bb7-e4f0ac4f1991@vondra.me
In response to:  Re: Adding basic NUMA awareness (Alexey Makhmutov <a.makhmutov@postgrespro.ru>)
Responses:       Re: Adding basic NUMA awareness
List:            pgsql-hackers

On 10/13/25 01:58, Alexey Makhmutov wrote:
> Hi Tomas,
>
> Thank you very much for working on this problem and the entire line of
> patches prepared! I've tried to play with these patches a little and
> here are some of my observations and suggestions.
>
> In the current implementation we try to use all available NUMA nodes on
> the machine, however it's often useful to limit the database only to a
> set of specific nodes, so that other nodes can be used for other
> processes. In my testing I was trying to use one node out of four for
> the client program, so I wanted to limit the database to the remaining
> nodes. I use a systemd service with AllowedMemoryNodes/AllowedCPUs to
> start the cluster, so the obvious choice for me was to use the
> 'numa_get_membind' function instead of 'numa_num_configured_nodes' to
> get the list of usable nodes. However, it is much easier to work with
> logical nodes in the [0; n] range inside the PG code, so I've decided
> to add a mapping between 'logical nodes' (0-n in PG) and the set of
> physical nodes actually returned by 'numa_get_membind'. We may need to
> map numbers in both directions, so two translation tables are allocated
> and filled at the first usage of 'pg_numa' functions. It also seems to
> be a good idea to isolate all 'libnuma' calls inside 'pg_numa.c', so as
> to keep all 'numa_...' calls in it, and this also allows us to hide
> this mapping in static functions. Here is the patch which I've used to
> test this idea:
> https://github.com/Lerm/postgres/commit/9ec625c2bf564f5432375ec1d7ad02e4b2559161.
> This idea could probably be extended by adding some view to expose this
> mapping to the user (at least for testing purposes) and by allowing the
> mapping to be explicitly overridden with a GUC setting. With such a GUC
> setting we would be able to control PG memory usage on NUMA nodes
> without the need for systemd resource control or numactl parameters.
>

I've argued to keep this out of scope for v1, to keep it smaller and simpler. I'm not against adding that feature, though, if someone writes a patch to support it. I suppose the commit you linked is a step in that direction.

I agree we should isolate libnuma calls to pg_numa.{c,h}. I wasn't quite consistent when doing that.

> Next, I've noticed some problems related to the size alignment for the
> 'numa_tonode_memory' call in the 'pg_numa_move_to_node' function. The
> documentation for 'numa_tonode_memory' says that 'The size argument
> will be rounded up to a multiple of the system page size'. However,
> this does not work well with huge pages, as the alignment is performed
> for the default kernel page size (i.e. 4K in most cases). If the
> addr + size value (rounded to the default page size) does not cover the
> entire huge page, then such an invocation seems to be processed
> incorrectly and the allocation policy is not applied for subsequent
> accesses to pages in that segment. At least this was the behavior I
> observed on Debian 12 / 6.1.40 kernel (i.e. '/proc/<pid>/numa_maps'
> shows that the segment contains pages from wrong nodes).
>

I'm not sure I understand. Are you suggesting there's a bug in the patch, the kernel, or somewhere else?

There's definitely a possibility of confusion with huge pages, no doubt about that. The default "system page size" is 4KB, but we need to process whole huge pages. But this is exactly why (with huge pages) the code aligns everything to a huge page boundary, and sizes everything as a multiple of the huge page size. At least I think so. Maybe I remember wrong?
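To spell out what I mean by "aligns everything": the intent is roughly the following (a simplified, untested sketch, not the actual patch code - move_range_to_node, startptr and mem_page_size are made-up names; mem_page_size would be the huge page size when huge pages are in use, 4KB otherwise):

    #include "postgres.h"
    #include <numa.h>

    /*
     * Simplified sketch: assign a shared memory range to a NUMA node. The
     * range is expected to start at a memory page boundary, and the length
     * is rounded up to a multiple of the memory page size - with huge pages
     * that's the huge page size (e.g. 2MB), not the 4K system page size
     * that numa_tonode_memory() rounds to on its own.
     */
    static void
    move_range_to_node(void *startptr, Size len, Size mem_page_size, int node)
    {
        /* the caller is expected to hand us a page-aligned start address */
        Assert((uintptr_t) startptr % mem_page_size == 0);

        /* round the length up to whole (huge) pages */
        len = TYPEALIGN(mem_page_size, len);

        numa_tonode_memory(startptr, len, node);
    }

If the allocation/alignment happens this way, no huge page should end up only partially covered by the call.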
> There are two locations at which we could face such a situation in the
> current patches. The first is related to the buffers partitions
> mapping. With the current code we basically ensure that the combined
> size of all partitions for a single node is aligned to the (huge) page
> size (as the size is bound to the number of descriptors on one page),
> but an individual partition is not explicitly aligned to this size. So,
> we could get a situation in which a single page is split between
> adjacent partitions (e.g. 32GB of buffers split by 3 nodes). With the
> current code we will try to map each partition independently, which
> will result in unaligned calls to 'numa_tonode_memory', so the
> resulting mapping will be incorrect. We could either try to choose the
> size of each individual partition so that it is aligned to the desired
> page size, or map all the partitions for a single node using a single
> 'pg_numa_move_to_node' invocation. During testing I've used the second
> approach, so here is the change to implement such logic:
> https://github.com/Lerm/postgres/commit/ee8b3603afd6d89e67b755dadc8e4c25ffba88be.
>

Can you actually demonstrate this? The code does these two things:

* calculate min_node_buffers so that buffers/descriptors are a multiple of page size (either 4K or huge page)

* align buffers and descriptors to memory page

    TYPEALIGN(buffer_align, ...)

I believe this is sufficient to ensure nothing gets split / mapped incorrectly. Maybe this fails sometimes?

> The second location which could expose the same problem is related to
> the mapping of the PGPROC arrays in 'pgproc_partition_init': here we
> need to align the pointer at the end of the PGPROC partition. There
> also seem to be two additional problems with PGPROC partitioning: we
> need to account for additional padding pages in 'PGProcShmemSize'
> (using the same logic as with fplocks) and we should not call
> 'MemSet(ptr, 0, ...)' prior to the partition mapping call (otherwise it
> will be mapped to the current node). Here is a potential change which
> tries to address these problems:
> https://github.com/Lerm/postgres/commit/eaf12776f59ff150735d0f187595fc8ce3f0a872.
>

So you're saying pgproc_partition_init() should not do just this

    ptr = (char *) ptr + num_procs * sizeof(PGPROC);

but align the pointer to numa_page_size too? Sounds reasonable.

Yeah, PGProcShmemSize() should have added huge pages for each partition, just like FastPathLockShmemSize(). Seems like a bug.

I don't think the memset() is a problem. Yes, it might map it to the current node, but so what - the numa_tonode_memory() will just move it to the correct one.
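For the record, the pointer alignment fix discussed above would look roughly like this (an untested sketch wrapped in a hypothetical helper; num_procs and numa_page_size follow the patch's naming):

    #include "postgres.h"
    #include "storage/proc.h"

    /*
     * Hypothetical helper illustrating the suggested fix for
     * pgproc_partition_init(): after carving out one partition worth of
     * PGPROC entries, pad the pointer up to the next NUMA page boundary,
     * so the next partition starts on its own (huge) page. The padding
     * would also have to be accounted for in PGProcShmemSize().
     */
    static char *
    advance_pgproc_partition(char *ptr, int num_procs, Size numa_page_size)
    {
        /* skip over this partition's PGPROC entries */
        ptr = ptr + num_procs * sizeof(PGPROC);

        /* pad up to the next NUMA page boundary */
        return (char *) TYPEALIGN(numa_page_size, ptr);
    }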
> There are also some potential problems with the buffers distribution
> between nodes. I have a feeling that the current logic in
> 'buffer_partitions_prepare' does not work correctly if the number of
> buffers is enough to cover just a single partition per node, but the
> total number of nodes is below MIN_BUFFER_PARTITIONS (i.e. 2 or 3). In
> this case we will set 'numa_can_partition' to 'true', but will allocate
> 2 partitions per node (so, 4 or 6 partitions in total), while we can
> fill just 2 or 3 partitions, leaving the remaining partitions empty.
> This should violate the last assert check, as the last partition will
> get zero buffers in this case. Another issue is related to the usage of
> 1GB pages, as the minimal size for buffer partitions is limited by the
> minimal number of buffer descriptors in a single page. For 2MB pages
> this gives 2097152 / 64 * 8K = 256M as the minimal size for a
> partition, but for 1GB pages the minimal size is equal to
> 1GB / 64 * 8K = 128GB. So, if we assume 4 as the minimal number of
> partitions, then for 2MB pages we need just 1GB of shared_buffers to
> enable partitioning (which seems a perfectly fine minimal limit for
> most cases), but for 1GB pages we need to allocate at least 512GB to
> allow buffers partitioning. Certainly, 1GB pages are usually used on
> large machines with a large number of buffers allocated, but still it
> may be useful to allow configurations with a 32GB or 64GB buffer cache
> to use both 1GB pages and buffers partitioning at the same time.
> However, I don't see an easy way to achieve this with the current
> logic. We would either need to allow the usage of different page sizes
> here (i.e. 2MB for descriptors and 1GB for buffers) or combine both
> buffers and their descriptors in a single object (i.e. a 'buffer
> chunk', which covers enough buffers and their descriptors to fit into
> one or several memory pages), effectively replacing both the buffers
> and descriptors arrays with an array of such 'chunks'. The latter
> solution may also help with dynamic buffer cache resizing (as we may
> just add additional 'chunks' in this case) and also increase TLB hits
> with 1GB pages (as a descriptor and its buffer will likely be located
> in the same page). However, both these changes seem to be quite large.
>

I'll look at handling the case with shared_buffers being too small to allow partitioning. There might well be a bug. We should simply disable partitioning in such cases.

As for 1GB huge pages, I don't see a good way to support configurations with small shared buffers in these cases. To me it seems acceptable to say that if you want 1GB huge pages, you should have a lot of memory and shared buffers large enough.

I'm not against supporting such systems, if we can come up with a good partitioning scheme. When I tried to come up with a scheme like that, it always came with substantial complexity & cost. The main challenge was that it forced splitting the array of buffer descriptors, similarly to what the PGPROC partitioning does. And that made buffer access so much more complex / expensive that it seemed not worth it. I was worried about the impact on systems without NUMA partitioning.

> I've tried also to run some benchmarks on my server: I've got some
> improvement in 'pgbench/tpcb-like' results - about 8%, but only with
> backend pinning to the NUMA node (i.e. adjusting your previous pinning
> patch to the 'debug_numa' GUC:
> https://github.com/Lerm/postgres/commit/5942a3e12c7c501aa9febb63972a039e7ce00c20).
> For the 'select-only' scenario the gain is more substantial (about
> 15%), but these tests are tricky, as they are more sensitive to other
> server settings and the specific function layout in the compiled code,
> so they need more checks.
>

What kind of hardware was that? What/how many CPUs, NUMA nodes, how much memory, what storage?

FWIW the main purpose of these patches was not so much throughput improvement, but making the behavior more stable / consistent.


regards

--
Tomas Vondra