Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From: Alexey Makhmutov
Subject: Re: Adding basic NUMA awareness
Msg-id: bf95094a-77c2-46cf-913a-443f7419bc79@postgrespro.ru
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
Hi Tomas,

Thank you very much for working on this problem and for the entire line of patches you've prepared! I've tried to play with these patches a little, and here are some of my observations and suggestions.

In the current implementation we try to use all NUMA nodes available on the machine, but it's often useful to limit the database to a specific set of nodes, so that the other nodes can be used by other processes. In my testing I wanted to dedicate one node out of four to the client program, so I'd like to limit the database to the remaining nodes. I use a systemd service with AllowedMemoryNodes/AllowedCPUs to start the cluster, so the obvious choice for me was to use the 'numa_get_membind' function instead of 'numa_num_configured_nodes' to get the list of usable nodes. However, it is much easier to work with logical nodes in the [0; n] range inside the PG code, so I've decided to add a mapping between 'logical nodes' (0-n in PG) and the set of physical nodes actually returned by 'numa_get_membind'. We may need to map node numbers in both directions, so two translation tables are allocated and filled on the first use of the 'pg_numa' functions. It also seems to be a good idea to isolate all 'libnuma' calls inside 'pg_numa.c': keeping all 'numa_...' calls there additionally allows us to hide this mapping in static functions. Here is the patch I've used to test this idea: https://github.com/Lerm/postgres/commit/9ec625c2bf564f5432375ec1d7ad02e4b2559161. This idea could probably be extended by adding a view to expose the mapping to the user (at least for testing purposes) and by allowing the mapping to be explicitly overridden with a GUC setting. With such a GUC we would be able to control PG memory usage on NUMA nodes without needing systemd resource control or numactl parameters.

Next, I've noticed some problems related to the size alignment for the 'numa_tonode_memory' call in the 'pg_numa_move_to_node' function.
The documentation for 'numa_tonode_memory' says that 'The size argument will be rounded up to a multiple of the system page size'. However, this does not work well with huge pages, as the alignment is performed against the default kernel page size (i.e. 4K in most cases). If the addr + size value (rounded up to the default page size) does not cover an entire huge page, then such an invocation seems to be processed incorrectly and the allocation policy is not applied to subsequent accesses of pages in that segment. At least, this is the behavior I've observed on Debian 12 with a 6.1.40 kernel (i.e. '/proc/<pid>/numa_maps' shows that the segment contains pages from the wrong nodes).

There are two locations in the current patches where we could face this situation. The first is the buffer partition mapping. The current code basically ensures that the combined size of all partitions for a single node is aligned to the (huge) page size (as the size is bound to the number of descriptors on one page), but an individual partition is not explicitly aligned to this size. So we could end up with a single page split between adjacent partitions (e.g. 32GB of buffers split across 3 nodes). The current code tries to map each partition independently, which results in unaligned calls to 'numa_tonode_memory', so the resulting mapping is incorrect. We could either choose the size of each individual partition so that it is aligned to the desired page size, or map all the partitions for a single node with a single 'pg_numa_move_to_node' invocation. During testing I've used the second approach, so here is a change implementing that logic: https://github.com/Lerm/postgres/commit/ee8b3603afd6d89e67b755dadc8e4c25ffba88be.

The second location that could expose the same problem is the mapping of the PGPROC arrays in 'pgproc_partition_init': here we need to align the pointer to the end of the PGPROC partition.
There also seem to be two additional problems with PGPROC partitioning: we need to account for the additional padding pages in 'PGProcShmemSize' (using the same logic as with fplocks), and we should not call 'MemSet(ptr, 0, ...)' before the partition mapping call (otherwise the memory will be mapped to the current node). Here is a potential change that tries to address these problems: https://github.com/Lerm/postgres/commit/eaf12776f59ff150735d0f187595fc8ce3f0a872.

There are also some potential problems with the distribution of buffers between nodes. I have a feeling that the current logic in 'buffer_partitions_prepare' does not work correctly if the number of buffers is only enough to cover a single partition per node while the total number of nodes is below MIN_BUFFER_PARTITIONS (i.e. 2 or 3). In this case we will set 'numa_can_partition' to 'true' but allocate 2 partitions per node (so 4 or 6 partitions in total), while we can fill only 2 or 3 partitions, leaving the remaining ones empty. This should trip the last assert check, as the last partition will get zero buffers.

Another issue is related to the use of 1GB pages, as the minimal size of a buffer partition is limited by the minimal number of buffer descriptors in a single page. For 2MB pages this gives 2097152 / 64 * 8K = 256MB as the minimal partition size, but for 1GB pages the minimal size equals 1GB / 64 * 8K = 128GB. So, if we assume 4 as the minimal number of partitions, then with 2MB pages we need just 1GB of shared_buffers to enable partitioning (which seems a perfectly fine minimal limit for most cases), but with 1GB pages we need to allocate at least 512GB to allow buffer partitioning. Certainly, 1GB pages are usually used on large machines with a large number of buffers allocated, but it may still be useful to allow configurations with a 32GB or 64GB buffer cache to use both 1GB pages and buffer partitioning at the same time. However, I don't see an easy way to achieve this with the current logic.
We either need to allow the use of different page sizes here (i.e. 2MB for descriptors and 1GB for buffers), or combine buffers and their descriptors into a single object (i.e. a 'buffer chunk' covering enough buffers and their descriptors to fit into one or several memory pages), effectively replacing the separate buffer and descriptor arrays with an array of such 'chunks'. The latter solution may also help with dynamic buffer cache resizing (as we could just add more 'chunks') and could also increase TLB hit rates with 1GB pages (as a descriptor and its buffer would likely be located in the same page). However, both of these changes seem to be quite large.

I've also tried to run some benchmarks on my server: I got an improvement of about 8% in 'pgbench/tpcb-like' results, but only with backends pinned to a NUMA node (i.e. adjusting your previous pinning patch to the 'debug_numa' GUC: https://github.com/Lerm/postgres/commit/5942a3e12c7c501aa9febb63972a039e7ce00c20). For the 'select-only' scenario the gain is more substantial (about 15%), but these tests are tricky, as they are more sensitive to other server settings and to the layout of specific functions in the compiled code, so they need more checks.

Thank you again for sharing these patches!

Thanks,
Alexey