Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Alexey Makhmutov
Subject Re: Adding basic NUMA awareness
Date
Msg-id 92e23c85-f646-4bab-b5e0-df30d8ddf4bd@postgrespro.ru
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On 10/13/25 14:09, Tomas Vondra wrote:

 > I'm not sure I understand. Are you suggesting there's a bug in the 
patch, the kernel, or somewhere else?

We need to ensure that both addr and (addr + size) are aligned to the 
page size of the target mapping when invoking 'numa_tonode_memory', 
otherwise it may produce unexpected results.

 > But this is exactly why (with hugepages) the code aligns everything 
to huge page boundary, and sizes everything as a multiple of huge page. 
At least I think so. Maybe I remember wrong?

I assume there are places in the current patch that could perform such 
unaligned mappings. See below for examples.

 > Can you actually demonstrate this?

This issue is related to the calculation of partition size for buffer 
descriptors when we have multiple partitions per node. Currently we 
ensure that each node gets a number of buffers that fits into whole 
memory pages, but if we have several partitions per node, there is no 
guarantee that the partition size will be properly aligned for 
descriptors. This problem can appear only with multiple partitions per 
node, and with MIN_BUFFER_PARTITIONS equal to 4 it can potentially 
affect only configurations with 2 or 3 nodes.

Two examples here: first, let's assume we want shared_buffers set to 
32GB with 3 NUMA nodes and 2MB pages. NBuffers will be 4,194,304, 
min_node_buffers will be 32,768 and num_partitions_per_node will be 2 
(so, 6 partitions in total). NBuffers/min_node_buffers = 128, so the 
nearest multiplier of min_node_buffers that allows us to cover all 
buffers with 3 nodes is 43 (42*3 = 126, 43*3 = 129). The 
num_buffers_per_node is 43*min_node_buffers and it is aligned to the 
page size, but we need to split it between two partitions, so each gets 
21.5*min_node_buffers buffers. This still allows us to split the buffers 
themselves on a page boundary, but descriptor partitions will be split 
right in the middle of a page. Here is the log for such a configuration:
NUMA: buffers 4194304 partitions 6 num_nodes 3 per_node 2 
buffers_per_node 1409024 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 704512 first 0 last 704511
NUMA: buffer 1 node 0 partition 1 buffers 704512 first 704512 last 1409023
NUMA: buffer 2 node 1 partition 0 buffers 704512 first 1409024 last 2113535
NUMA: buffer 3 node 1 partition 1 buffers 704512 first 2113536 last 2818047
NUMA: buffer 4 node 2 partition 0 buffers 688128 first 2818048 last 3506175
NUMA: buffer 5 node 2 partition 1 buffers 688128 first 3506176 last 4194303
NUMA: buffer_partitions_init: 0 => 0 buffers 704512 start 0x7ff7c8c00000 
end 0x7ff920c00000 (size 5771362304)
NUMA: buffer_partitions_init: 0 => 0 descriptors 704512 start 
0x7ff7b8a00000 end 0x7ff7bb500000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 704512 start 0x7ff920c00000 
end 0x7ffa78c00000 (size 5771362304)
NUMA: buffer_partitions_init: 1 => 0 descriptors 704512 start 
0x7ff7bb500000 end 0x7ff7be000000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 704512 start 0x7ffa78c00000 
end 0x7ffbd0c00000 (size 5771362304)
NUMA: buffer_partitions_init: 2 => 1 descriptors 704512 start 
0x7ff7be000000 end 0x7ff7c0b00000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 704512 start 0x7ffbd0c00000 
end 0x7ffd28c00000 (size 5771362304)
NUMA: buffer_partitions_init: 3 => 1 descriptors 704512 start 
0x7ff7c0b00000 end 0x7ff7c3600000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 4 => 2 buffers 688128 start 0x7ffd28c00000 
end 0x7ffe78c00000 (size 5637144576)
NUMA: buffer_partitions_init: 4 => 2 descriptors 688128 start 
0x7ff7c3600000 end 0x7ff7c6000000 (size 44040192)
NUMA: buffer_partitions_init: 5 => 2 buffers 688128 start 0x7ffe78c00000 
end 0x7fffc8c00000 (size 5637144576)
NUMA: buffer_partitions_init: 5 => 2 descriptors 688128 start 
0x7ff7c6000000 end 0x7ff7c8a00000 (size 44040192)

Another example: 2 nodes and 15872MB shared_buffers. Again, 
NBuffers/min_node_buffers=62, so num_buffers_per_node is 
31*min_node_buffers, which gives each partition 15.5*min_node_buffers. 
Here is the log output:
NUMA: buffers 2031616 partitions 4 num_nodes 2 per_node 2 
buffers_per_node 1015808 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 507904 first 0 last 507903
NUMA: buffer 1 node 0 partition 1 buffers 507904 first 507904 last 1015807
NUMA: buffer 2 node 1 partition 0 buffers 507904 first 1015808 last 1523711
NUMA: buffer 3 node 1 partition 1 buffers 507904 first 1523712 last 2031615
NUMA: buffer_partitions_init: 0 => 0 buffers 507904 start 0x7ffbf9c00000 
end 0x7ffcf1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 0 => 0 descriptors 507904 start 
0x7ffbf1e00000 end 0x7ffbf3d00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 507904 start 0x7ffcf1c00000 
end 0x7ffde9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 1 => 0 descriptors 507904 start 
0x7ffbf3d00000 end 0x7ffbf5c00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 507904 start 0x7ffde9c00000 
end 0x7ffee1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 2 => 1 descriptors 507904 start 
0x7ffbf5c00000 end 0x7ffbf7b00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 507904 start 0x7ffee1c00000 
end 0x7fffd9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 3 => 1 descriptors 507904 start 
0x7ffbf7b00000 end 0x7ffbf9a00000 (size 32505856)
mbind: Invalid argument

 > So you're saying pgproc_partition_init() should not do just this
 > ptr = (char *) ptr + num_procs * sizeof(PGPROC);
 > but align the pointer to numa_page_size too? Sounds reasonable.

Yes, that's exactly my point; otherwise we could violate the alignment 
rule for 'numa_tonode_memory'. Here is an excerpt from the log for a 
system with 2 nodes, 2000 max_connections and 2MB pages:
NUMA: pgproc backends 2056 num_nodes 2 per_node 1028
NUMA: pgproc_init_partition procs 0x7fffe7800000 endptr 0x7fffe78d2d20 
num_procs 1028 node 0
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7a00000 endptr 0x7fffe7ad2d20 
num_procs 1028 node 1
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7c00000 endptr 0x7fffe7c07cb0 
num_procs 38 node -1
mbind: Invalid argument
mbind: Invalid argument

 > I don't think the memset() is a problem. Yes, it might map it to the 
current node, but so what - the numa_tonode_memory() will just move it 
to the correct one.

Well, the 'numa_tonode_memory' call does not move pages to the target 
node. It just sets the policy for the mapping, so the system will try to 
provide a page from the correct node once we touch it. However, if a 
page is already faulted in, it won't be affected by this policy; that's 
why this approach is faster than 'numa_move_pages'. As stated in the 
libnuma documentation:
* numa_tonode_memory() put memory on a specific node. The constraints 
described for numa_interleave_memory() apply here too.
* numa_interleave_memory()  interleaves  size  bytes of memory page by 
page from start on nodes specified in nodemask. <...> This is a lower 
level function to interleave allocated but not yet faulted in memory. 
Not yet faulted in means the memory is allocated using mmap(2) or 
shmat(2), but has not been accessed by  the current process yet. <...> 
If the numa_set_strict() flag is true then the operation will cause a 
numa_error if there were already pages in the mapping that do not follow 
the policy.

I assume that for regular pages the kernel may rebalance memory later 
(not immediately), but not for huge pages. So we really don't want to 
touch the memory area before calling 'numa_tonode_memory'.

This can be easily tested with a simple program:
#include <stdio.h>
#include <string.h>
#include <numa.h>
#include <sys/mman.h>
#include <linux/mman.h>

#define MAP_SIZE (2*1024*1024)

int main(int argc, char** argv) {
   void* ptr1 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
   void* ptr2 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);

   /* Fault first mapping before setting the policy */
   memset(ptr1, 1, MAP_SIZE);
   /* Bind both mappings to node 1 */
   numa_tonode_memory(ptr1, MAP_SIZE, 1);
   numa_tonode_memory(ptr2, MAP_SIZE, 1);
   /* Fault second mapping after setting the policy */
   memset(ptr2, 1, MAP_SIZE);

   /* Wait, so numa_maps can be inspected from another terminal */
   printf("ptr1=%p\nptr2=%p\nPress Enter to continue...\n", ptr1, ptr2);
   getchar();
   munmap(ptr2, MAP_SIZE);
   munmap(ptr1, MAP_SIZE);
   return 0;
}

Running it on the first node:
# gcc -o test_mem test_mem.c -lnuma
# taskset -c 0 ./test_mem
ptr1=7ffff7a00000
ptr2=7ffff7800000
Press Enter to continue...

From another terminal:
# grep huge /proc/`pgrep test_mem`/numa_maps
7ffff7800000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N1=1 
kernelpagesize_kB=2048
7ffff7a00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N0=1 
kernelpagesize_kB=2048

So, while the policy (bind:1) is set for both mappings, only the second 
one (which was not touched before the 'numa_tonode_memory' invocation) 
is actually located on node 1 rather than node 0.

 > What kind of hardware was that? What/how many cpus, NUMA nodes, how 
much memory, what storage?

Of course, that's a valid question. I probably should not have commented 
on the performance side without providing full data; these were just 
preliminary runs while I was still trying to measure it. Sorry for that.

Thanks,
Alexey


