Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | Z8CGAdTbRzKXo0dq@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
List | pgsql-hackers |
Hi, On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote: > On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > [..] > > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: > > > > Thanks for the updated version! > > > > FWIW, I had to do a few changes to get an error free compiling experience with > > autoconf/or meson and both with or without the libnuma configure option. > > > > Sharing here as .txt files: > > > > Also the pg_buffercache test fails without the libnuma configure option. Maybe > > some tests should depend of the libnuma configure option. > [..] > > Thank you so much for this Bertrand ! > > I've applied those , played a little bit with configure and meson and > reproduced the test error and fixed it by silencing that NOTICE in > tests. So v6 is attached even before I get a chance to start using > that CI. Still waiting for some input and tests regarding that earlier > touchpages attempt, docs are still missing... Thanks for the new version! I did some tests and it looks like it's giving correct results. I don't see -2 anymore and every backend reports correct distribution (with or without hp, with "small" or "large" shared buffer). A few random comments: === 1 + /* + * This is for gathering some NUMA statistics. We might be using + * various DB block sizes (4kB, 8kB , .. 32kB) that end up being + * allocated in various different OS memory pages sizes, so first we + * need to understand the OS memory page size before calling + * move_pages() + */ + os_page_size = pg_numa_get_pagesize(); + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; + pages_per_blk = (float) BLCKSZ / os_page_size; + + elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f", + os_page_count, os_page_size, pages_per_blk); + + os_page_ptrs = palloc(sizeof(void *) * os_page_count); + os_pages_status = palloc(sizeof(int) * os_page_count); + memset(os_page_ptrs, 0, sizeof(void *) * os_page_count); + + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page mapping code here + */ + memset(os_pages_status, 0xff, sizeof(int) * os_page_count); I think that if (query_numa) check should also wrap that entire section of code. === 2 + if (query_numa) + { + blk2page = (int) i * pages_per_blk; + j = 0; + do + { This check is done for every page. I wonder if it would not make sense to create a brand new function for pg_buffercache_numa and just let the current pg_buffercache_pages() as it is. That said it would be great to avoid code duplication as much a possible though, maybe using a shared populate_buffercache_entry() or such helper function? === 3 +#define ONE_GIGABYTE 1024*1024*1024 + if ((i * os_page_size) % ONE_GIGABYTE == 0) + CHECK_FOR_INTERRUPTS(); + } Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS() for every page instead? (I don't see with a 30GB shared buffer). I've the feeling that we could get rid of the "ONE_GIGABYTE" check. === 4 + pfree(os_page_ptrs); + pfree(os_pages_status); Not sure that's needed, we should be in a short-lived memory context here (ExprContext or such). === 5 +/* SQL SRF showing NUMA zones for allocated shared memory */ +Datum +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS) +{ That's a good idea. + for (i = 0; i < shm_ent_page_count; i++) + { + /* + * In order to get reliable results we also need to touch memory + * pages so that inquiry about NUMA zone doesn't return -2. + */ + volatile uint64 touch pg_attribute_unused(); + + page_ptrs[i] = (char *) ent->location + (i * os_page_size); + pg_numa_touch_mem_if_required(touch, page_ptrs[i]); That sounds right. Could we also avoid some code duplication with pg_get_shmem_allocations()? Also same remarks about pfree() and ONE_GIGABYTE as above. A few other things: ==== 6 +++ b/src/backend/storage/ipc/shmem.c @@ -73,6 +73,7 @@ #include "storage/shmem.h" #include "storage/spin.h" #include "utils/builtins.h" +#include "port/pg_numa.h" Not at the right position, should be between those 2: #include "miscadmin.h" #include "storage/lwlock.h" ==== 7 +/*------------------------------------------------------------------------- + * + * pg_numa.h + * Miscellaneous functions for bit-wise operations. description is not correct. Also the "Copyright (c) 2019-2025" might be "Copyright (c) 2025" instead. === 8 +++ b/src/port/pg_numa.c @@ -0,0 +1,150 @@ +/*------------------------------------------------------------------------- + * + * numa.c + * Basic NUMA portability routines s/numa.c/pg_numa.c/ ? === 9 +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -6,6 +6,7 @@ * contrib/pg_buffercache/pg_buffercache_pages.c *------------------------------------------------------------------------- */ +#include "pg_config.h" #include "postgres.h" Is this new include needed? #include "access/htup_details.h" @@ -13,10 +14,12 @@ #include "funcapi.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "port/pg_numa.h" +#include "storage/pg_shmem.h" not in the right order. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: