Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Alexey Makhmutov
Subject Re: Adding basic NUMA awareness
Date
Msg-id bf95094a-77c2-46cf-913a-443f7419bc79@postgrespro.ru
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
Hi Tomas,

Thank you very much for working on this problem and for the entire line of 
patches! I've played with these patches a little, and here are some of my 
observations and suggestions.

In the current implementation we try to use all available NUMA nodes on 
the machine, but it's often useful to limit the database to a specific 
set of nodes, so that the remaining nodes can be used by other 
processes. In my testing I wanted to use one node out of four for the 
client program, so I'd like to limit the database to the remaining 
nodes. I start the cluster from a systemd service with 
AllowedMemoryNodes/AllowedCPUs, so the obvious choice for me was to use 
the 'numa_get_membind' function instead of 'numa_num_configured_nodes' 
to get the list of usable nodes. However, it is much easier to work with 
logical nodes in the [0; n] range inside the PG code, so I've decided to 
add a mapping between 'logical nodes' (0-n in PG) and the set of 
physical nodes actually returned by 'numa_get_membind'. We may need to 
map numbers in both directions, so two translation tables are allocated 
and filled at the first usage of the 'pg_numa' functions. It also seems 
a good idea to isolate all 'libnuma' calls inside 'pg_numa.c', so that 
all 'numa_...' calls stay there and the mapping can be hidden in static 
functions. Here is the patch which I've used to test this idea: 
https://github.com/Lerm/postgres/commit/9ec625c2bf564f5432375ec1d7ad02e4b2559161. 
This idea could probably be extended by adding a view to expose the 
mapping to the user (at least for testing purposes) and by allowing the 
mapping to be overridden explicitly with a GUC setting. With such a GUC 
we would be able to control PG memory usage on NUMA nodes without 
needing systemd resource control or numactl parameters.

Next, I've noticed some problems related to size alignment for the 
'numa_tonode_memory' call in the 'pg_numa_move_to_node' function. The 
documentation for 'numa_tonode_memory' says that 'The size argument 
will be rounded up to a multiple of the system page size'. However, 
this does not work well with huge pages, as the alignment is performed 
for the default kernel page size (i.e. 4K in most cases). If the 
addr + size value (rounded to the default page size) does not cover the 
entire huge page, then such an invocation seems to be processed 
incorrectly and the allocation policy is not applied to subsequent 
accesses of pages in that segment. At least this was the behavior I 
observed on Debian 12 with a 6.1.40 kernel (i.e. '/proc/<pid>/numa_maps' 
shows that the segment contains pages from the wrong nodes).

There are two locations in the current patches where we could face this 
situation. The first is related to the buffer partition mapping. With 
the current code we basically ensure that the combined size of all 
partitions for a single node is aligned to the (huge) page size (as the 
size is bound to the number of descriptors on one page), but an 
individual partition is not explicitly aligned to this size. So we 
could get a situation in which a single page is split between adjacent 
partitions (e.g. 32GB of buffers split across 3 nodes). The current 
code will try to map each partition independently, which results in 
unaligned calls to 'numa_tonode_memory', so the resulting mapping will 
be incorrect. We could either choose the size of each individual 
partition so that it is aligned to the desired page size, or map all 
the partitions for a single node with a single 'pg_numa_move_to_node' 
invocation. During testing I used the second approach; here is a change 
implementing such logic: 
https://github.com/Lerm/postgres/commit/ee8b3603afd6d89e67b755dadc8e4c25ffba88be.

The second location which could expose the same problem is the mapping 
of the PGPROC arrays in 'pgproc_partition_init': here we need to align 
the pointer to the end of the PGPROC partition. There also seem to be 
two additional problems with PGPROC partitioning: we need to account 
for additional padding pages in 'PGProcShmemSize' (using the same logic 
as with fplocks), and we should not call 'MemSet(ptr, 0, ...)' prior to 
the partition mapping call (otherwise the memory will be mapped to the 
current node). Here is a potential change which tries to address these 
problems: 
https://github.com/Lerm/postgres/commit/eaf12776f59ff150735d0f187595fc8ce3f0a872.

There are also some potential problems with buffer distribution between 
nodes. I have a feeling that the current logic in 
'buffer_partitions_prepare' does not work correctly if the number of 
buffers is enough to cover just a single partition per node, but the 
total number of nodes is below MIN_BUFFER_PARTITIONS (i.e. 2 or 3). In 
this case we will set 'numa_can_partition' to 'true', but will allocate 
2 partitions per node (so 4 or 6 partitions in total), while we can 
fill just 2 or 3 partitions, leaving the remaining partitions empty. 
This should violate the last assert check, as the last partition will 
get zero buffers in this case. Another issue is related to the usage of 
1GB pages, as the minimal size of a buffer partition is limited by the 
minimal number of buffer descriptors in a single page. For 2MB pages 
this gives 2097152 / 64 * 8K = 256MB as the minimal partition size, but 
for 1GB pages the minimal size is equal to 1GB / 64 * 8K = 128GB. So, 
if we assume 4 as the minimal number of partitions, then for 2MB pages 
we need just 1GB of shared_buffers to enable partitioning (which seems 
a perfectly fine minimal limit for most cases), but for 1GB pages we 
need to allocate at least 512GB to allow buffer partitioning. 
Certainly, 1GB pages are usually used on large machines with a large 
number of buffers allocated, but it may still be useful to allow 
configurations with a 32GB or 64GB buffer cache to use both 1GB pages 
and buffer partitioning at the same time. However, I don't see an easy 
way to achieve this with the current logic. We either need to allow the 
usage of different page sizes here (i.e. 2MB for descriptors and 1GB 
for buffers) or combine buffers and their descriptors in a single 
object (i.e. a 'buffer chunk', which covers enough buffers and their 
descriptors to fit into one or several memory pages), effectively 
replacing both the buffers and descriptors arrays with an array of such 
'chunks'. The latter solution may also help with dynamic buffer cache 
resizing (as we could just add additional 'chunks' in this case) and 
also increase TLB hits with 1GB pages (as a descriptor and its buffer 
will likely be located in the same page). However, both these changes 
seem to be quite large.

I've also tried to run some benchmarks on my server: I got some 
improvement in 'pgbench/tpcb-like' results - about 8%, but only with 
backends pinned to a NUMA node (i.e. adjusting your previous pinning 
patch to the 'debug_numa' GUC: 
https://github.com/Lerm/postgres/commit/5942a3e12c7c501aa9febb63972a039e7ce00c20). 
For the 'select-only' scenario the gain is more substantial (about 
15%), but these tests are tricky, as they are more sensitive to other 
server settings and to the layout of specific functions in the compiled 
code, so they need more checks.

Thank you again for sharing these patches!

Thanks,
Alexey


