Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From: Alexey Makhmutov
Subject: Re: Adding basic NUMA awareness
Msg-id: 92e23c85-f646-4bab-b5e0-df30d8ddf4bd@postgrespro.ru
In response to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On 10/13/25 14:09, Tomas Vondra wrote:
> I'm not sure I understand. Are you suggesting there's a bug in the
> patch, the kernel, or somewhere else?

We need to ensure that both addr and (addr + size) are aligned to the
page size of the target mapping during the 'numa_tonode_memory'
invocation, otherwise it may produce unexpected results.

> But this is exactly why (with hugepages) the code aligns everything to
> huge page boundary, and sizes everything as a multiple of huge page.
> At least I think so. Maybe I remember wrong?

I assume there are places in the current patch which could perform such
unaligned mappings. See below for samples.

> Can you actually demonstrate this?

This issue is related to the calculation of partition size for buffer
descriptors when we have multiple partitions per node. Currently we
ensure that each node gets a number of buffers which fits into whole
memory pages, but if we have several partitions per node, then there is
no guarantee that the partition size will be properly aligned for the
descriptors. We can observe this problem only with multiple partitions
per node, so with MIN_BUFFER_PARTITIONS equal to 4 it can potentially
affect only configurations with 2 or 3 nodes.

Two examples here. First, let's assume we want shared_buffers set to
32GB with 3 NUMA nodes and 2MB pages. NBuffers will be 4,194,304,
min_node_buffers will be 32,768 and num_partitions_per_node will be 2
(so, 6 partitions in total). NBuffers/min_node_buffers = 128, so the
nearest multiplier of min_node_buffers which allows us to cover all
buffers with 3 nodes is 43 (42*3 = 126, 43*3 = 129). The
num_buffers_per_node is 43*min_node_buffers and it is aligned to the
page size, but we need to split it between two partitions, so each gets
21.5*min_node_buffers buffers. This still allows us to split the
buffers themselves on a page boundary, but the descriptor partitions
will be split right in the middle of a page.
Here is the log for such a configuration:

NUMA: buffers 4194304 partitions 6 num_nodes 3 per_node 2 buffers_per_node 1409024 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 704512 first 0 last 704511
NUMA: buffer 1 node 0 partition 1 buffers 704512 first 704512 last 1409023
NUMA: buffer 2 node 1 partition 0 buffers 704512 first 1409024 last 2113535
NUMA: buffer 3 node 1 partition 1 buffers 704512 first 2113536 last 2818047
NUMA: buffer 4 node 2 partition 0 buffers 688128 first 2818048 last 3506175
NUMA: buffer 5 node 2 partition 1 buffers 688128 first 3506176 last 4194303
NUMA: buffer_partitions_init: 0 => 0 buffers 704512 start 0x7ff7c8c00000 end 0x7ff920c00000 (size 5771362304)
NUMA: buffer_partitions_init: 0 => 0 descriptors 704512 start 0x7ff7b8a00000 end 0x7ff7bb500000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 704512 start 0x7ff920c00000 end 0x7ffa78c00000 (size 5771362304)
NUMA: buffer_partitions_init: 1 => 0 descriptors 704512 start 0x7ff7bb500000 end 0x7ff7be000000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 704512 start 0x7ffa78c00000 end 0x7ffbd0c00000 (size 5771362304)
NUMA: buffer_partitions_init: 2 => 1 descriptors 704512 start 0x7ff7be000000 end 0x7ff7c0b00000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 704512 start 0x7ffbd0c00000 end 0x7ffd28c00000 (size 5771362304)
NUMA: buffer_partitions_init: 3 => 1 descriptors 704512 start 0x7ff7c0b00000 end 0x7ff7c3600000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 4 => 2 buffers 688128 start 0x7ffd28c00000 end 0x7ffe78c00000 (size 5637144576)
NUMA: buffer_partitions_init: 4 => 2 descriptors 688128 start 0x7ff7c3600000 end 0x7ff7c6000000 (size 44040192)
NUMA: buffer_partitions_init: 5 => 2 buffers 688128 start 0x7ffe78c00000 end 0x7fffc8c00000 (size 5637144576)
NUMA: buffer_partitions_init: 5 => 2 descriptors 688128 start 0x7ff7c6000000 end 0x7ff7c8a00000 (size 44040192)

Another example: 2 nodes and 15872MB shared_buffers. Again,
NBuffers/min_node_buffers = 62, so num_buffers_per_node is
31*min_node_buffers, which gives each partition 15.5*min_node_buffers.
Here is the log output:

NUMA: buffers 2031616 partitions 4 num_nodes 2 per_node 2 buffers_per_node 1015808 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 507904 first 0 last 507903
NUMA: buffer 1 node 0 partition 1 buffers 507904 first 507904 last 1015807
NUMA: buffer 2 node 1 partition 0 buffers 507904 first 1015808 last 1523711
NUMA: buffer 3 node 1 partition 1 buffers 507904 first 1523712 last 2031615
NUMA: buffer_partitions_init: 0 => 0 buffers 507904 start 0x7ffbf9c00000 end 0x7ffcf1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 0 => 0 descriptors 507904 start 0x7ffbf1e00000 end 0x7ffbf3d00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 507904 start 0x7ffcf1c00000 end 0x7ffde9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 1 => 0 descriptors 507904 start 0x7ffbf3d00000 end 0x7ffbf5c00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 507904 start 0x7ffde9c00000 end 0x7ffee1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 2 => 1 descriptors 507904 start 0x7ffbf5c00000 end 0x7ffbf7b00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 507904 start 0x7ffee1c00000 end 0x7fffd9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 3 => 1 descriptors 507904 start 0x7ffbf7b00000 end 0x7ffbf9a00000 (size 32505856)
mbind: Invalid argument

> So you're saying pgproc_partition_init() should not do just this
>     ptr = (char *) ptr + num_procs * sizeof(PGPROC);
> but align the pointer to numa_page_size too? Sounds reasonable.

Yes, that's exactly my point, otherwise we could violate the alignment
rule for 'numa_tonode_memory'.
Here is an extraction from the log for a system with 2 nodes, 2000
max_connections and 2MB pages:

NUMA: pgproc backends 2056 num_nodes 2 per_node 1028
NUMA: pgproc_init_partition procs 0x7fffe7800000 endptr 0x7fffe78d2d20 num_procs 1028 node 0
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7a00000 endptr 0x7fffe7ad2d20 num_procs 1028 node 1
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7c00000 endptr 0x7fffe7c07cb0 num_procs 38 node -1
mbind: Invalid argument
mbind: Invalid argument

> I don't think the memset() is a problem. Yes, it might map it to the
> current node, but so what - the numa_tonode_memory() will just move it
> to the correct one.

Well, the 'numa_tonode_memory' call does not move pages to the target
node. It just sets the policy for the mapping, so the system will try
to provide a page from the correct node once we touch it. However, if a
page is already faulted in, then it won't be affected by this policy,
which is why it works faster compared to 'numa_move_pages'. As stated
in the libnuma documentation:

* numa_tonode_memory() put memory on a specific node. The constraints
  described for numa_interleave_memory() apply here too.
* numa_interleave_memory() interleaves size bytes of memory page by
  page from start on nodes specified in nodemask. <...> This is a lower
  level function to interleave allocated but not yet faulted in memory.
  Not yet faulted in means the memory is allocated using mmap(2) or
  shmat(2), but has not been accessed by the current process yet. <...>
  If the numa_set_strict() flag is true then the operation will cause a
  numa_error if there were already pages in the mapping that do not
  follow the policy.

I assume that for regular pages the kernel may rebalance memory in the
future (not immediately), but not for hugepages. So, we really don't
want to touch the memory area before we call 'numa_tonode_memory'.
This can be easily tested with a simple program:

#include <stdio.h>
#include <string.h>
#include <numa.h>
#include <sys/mman.h>
#include <linux/mman.h>

#define MAP_SIZE (2*1024*1024)

int main(int argc, char** argv)
{
    void* ptr1 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                      -1, 0);
    void* ptr2 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                      -1, 0);

    /* Fault first page */
    memset(ptr1, 1, MAP_SIZE);

    /* Move to node 1 */
    numa_tonode_memory(ptr1, MAP_SIZE, 1);
    numa_tonode_memory(ptr2, MAP_SIZE, 1);

    /* Fault second page */
    memset(ptr2, 1, MAP_SIZE);

    /* Wait */
    printf("ptr1=%p\nptr2=%p\nPress Enter to continue...\n", ptr1, ptr2);
    getchar();

    munmap(ptr2, MAP_SIZE);
    munmap(ptr1, MAP_SIZE);
    return 0;
}

Running it on the first node:

# gcc -o test_mem test_mem.c -lnuma
# taskset -c 0 ./test_mem
ptr1=7ffff7a00000
ptr2=7ffff7800000
Press Enter to continue...

From another terminal:

# grep huge /proc/`pgrep test_mem`/numa_maps
7ffff7800000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N1=1 kernelpagesize_kB=2048
7ffff7a00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N0=1 kernelpagesize_kB=2048

So, while the policy (bind:1) is set for both mappings, only the second
one (which was not touched before the 'numa_tonode_memory' invocation)
is actually located on node 1 rather than node 0.

> What kind of hardware was that? What/how many cpus, NUMA nodes, how
> much memory, what storage?

Of course, that's a valid question. I probably should not have
commented on the performance side without providing full data; I was
still trying to measure it and these were just preliminary runs. Sorry
for that.

Thanks,
Alexey