Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmw+EtapdJrageWrMhn35=cP2XZD4JK6ox_gpHnKsbG-Mg@mail.gmail.com Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
List | pgsql-hackers |
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for the new version! > > v10 is attached with most fixes after review and one new thing > introduced: pg_numa_available() for run-time decision inside tests > which was needed after simplifying code a little bit as you wanted. > I've also fixed -Dlibnuma=disabled as it was not properly implemented. > There are couple open items (minor/decision things), but most is fixed > or answered: > > > Some random comments on 0001: > > > > === 1 > > > > It does not compiles "alone". It's missing: > > > [..] > > +extern PGDLLIMPORT int huge_pages_status; > [..] > > -static int huge_pages_status = HUGE_PAGES_UNKNOWN; > > +int huge_pages_status = HUGE_PAGES_UNKNOWN; > > > > That come with 0002. So it has to be in 0001 instead. > > Ugh, good catch, I haven't thought about it in isolation, they are > separate to just ease review, but should be committed together. Fixed. > > > === 2 > > > > +else > > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > > +fi > > > > Maybe we should add the same test (checking for numa.h) for meson? > > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync > between meson and autoconf? > > > === 3 > > > > +# FIXME: filter-out / with/without with_libnuma? > > +LIBS += $(LIBNUMA_LIBS) > > > > It looks to me that we can remove those 2 lines. > > Done. > > > === 4 > > > > + Only supported on Linux. > > > > s/on Linux/on platforms for which the libnuma library is implemented/? > > > > I did a quick grep on "Only supported on" and it looks like that could be > > a more consistent wording. > > Fixed. > > > === 5 > > > > +#include "c.h" > > +#include "postgres.h" > > +#include "port/pg_numa.h" > > +#include "storage/pg_shmem.h" > > +#include <unistd.h> > > > > I had a closer look to other header files and it looks like it "should" be: > > > > #include "c.h" > > #include "postgres.h" > > > > #include <unistd.h> > > #ifdef WIN32 > > #include <windows.h> > > #endif > > > > #include "port/pg_numa.h" > > #include "storage/pg_shmem.h" > > > > And is "#include "c.h"" really needed? > > Fixed both. It seems to compile without c.h. > > > === 6 > > > > +/* FIXME not tested, might crash */ > > > > That's a bit scary. > > When you are in support for long enough it is becoming the norm ;) But > on serious note Andres wanted have numa error/warning handlers (used > by libnuma), but current code has no realistic code-path to hit it > from numa_available(3), numa_move_pages(3) or numa_max_node(3). The > situation won't change until one day in future (I hope!) we start > using some more advanced libnuma functionality for interleaving > memory, please see my earlier reply: > https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com > E.g. numa_available(3) is tiny wrapper , see > https://github.com/numactl/numactl/blob/master/libnuma.c#L872 > > For now, I've adjusted that FIXME to XXX, but still don't know we > could inject failure to see this triggered... > > > === 7 > > > > + * XXX: for now we issue just WARNING, but long-term that might depend on > > + * numa_set_strict() here > > > > s/here/here./ > > Done. > > > Did not look carefully at all the comments in 0001, 0002 and 0003 though. > > > > A few random comments regarding 0002: > > > > === 8 > > > > # create extension pg_buffercache; > > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > > RETURNS SETOF RECORD > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > > LANGUAGE C PARALLEL SAFE" > > extension script file "pg_buffercache--1.2.sql", near line 7 > > > > While that's ok if 0003 is applied. I think that each individual patch should > > "fully" work individually. > > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + > 0002 or 0001 + 0003, but here 0002 is complaining about lack of > pg_numa_query_pages() which is part of 0001 (?) Should I merge those > patches or keep them separate? > > > === 9 > > > > + do > > + { > > + if (os_page_ptrs[blk2page + j] == 0) > > > > blk2page + j will be repeated multiple times, better to store it in a local > > variable instead? > > Done. > > > === 10 > > > > + if (firstUseInBackend == true) > > > > > > if (firstUseInBackend) instead? > > Done everywhere. > > > === 11 > > > > + int j = 0, > > + blk2page = (int) i * pages_per_blk; > > > > I wonder if size_t is more appropriate for blk2page: > > > > size_t blk2page = (size_t)(i * pages_per_blk) maybe? > > Sure, done. > > > === 12 > > > > as we know that we'll iterate until pages_per_blk then would a for loop be more > > appropriate here, something like? > > > > " > > for (size_t j = 0; j < pages_per_blk; j++) > > { > > if (os_page_ptrs[blk2page + j] == 0) > > { > > " > > Sure. > > > === 13 > > > > + if (buf_state & BM_DIRTY) > > + fctx->record[i].isdirty = true; > > + else > > + fctx->record[i].isdirty = false; > > > > fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe? > > > > === 14 > > > > + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > > + fctx->record[i].isvalid = true; > > + else > > + fctx->record[i].isvalid = false; > > > > fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > > maybe? > > It is coming from the original pg_buffercache and it is less readable > to me, so I don't want to touch that too much because that might open > refactoring doors too wide. > > > === 15 > > > > +populate_buffercache_entry(int i, BufferCachePagesContext *fctx) > > > > I wonder if "i" should get a more descriptive name? > > Done, buffer_id. > > > === 16 > > > > s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent > > with pg_stat_io_build_tuples for example). > > > > I now realize that I did propose populate_buffercache_entry() up-thread, > > sorry for changing my mind. > > OK, I've tried to create a better name, but all my ideas were too > long, but pg_buffercache_build_tuple sounds nice, so lets use that. > > > === 17 > > > > + static bool firstUseInBackend = true; > > > > maybe we should give it a more descriptive name? > > I couldn't come up with anything that wouldn't look too long, so > instead I've added a comment explaining the meaning behind this > variable, hope that's good enough. > > > Also I need to think more about how firstUseInBackend is used, for example: > > > > ==== 17.1 > > > > would it be better to define it as a file scope variable? (at least that > > would avoid to pass it as an extra parameter in some functions). > > I have no strong opinion on this, but I have one doubt (for future): > isn't creating global variables making life harder for upcoming > multithreading guys ? > > > === 17.2 > > > > what happens if firstUseInBackend is set to false and later on the pages > > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is > > called again by a backend that already had set firstUseInBackend to false, > > would that provide accurate information? > > It is still the correct result. That "touching" (paging-in) is only > necessary probably to properly resolve PTEs as the fork() does not > seem to carry them over from parent: > > postgres=# select 'create table foo' || s || ' as select > generate_series(1, 100000);' from generate_series(1, 4) s; > postgres=# \gexec > SELECT 100000 > SELECT 100000 > SELECT 100000 > SELECT 100000 > postgres=# select numa_zone_id, count(*) from pg_buffercache_numa > group by numa_zone_id order by numa_zone_id; -- before: > numa_zone_id | count > --------------+--------- > 0 | 256 > 1 | 4131 > | 8384221 > postgres=# select pg_backend_pid(); > pg_backend_pid > ---------------- > 1451 > > -- now use another root (!) session to "migratepages(8)" from numactl > will also shift shm segment: > # migratepages 1451 1 3 > > -- while above will be in progress for a lot of time , but the outcome > is visible much faster in that backend (pg is functioning): > postgres=# select numa_zone_id, count(*) from pg_buffercache_numa > group by numa_zone_id order by numa_zone_id; > numa_zone_id | count > --------------+--------- > 0 | 256 > 3 | 4480 > | 8383872 > > So the above clearly shows that initial touching of shm is required, > but just once and it stays valid afterwards. > > BTW: migratepages(8) was stuck for 1-2 minutes there on > "__wait_rcu_gp" according to it's wchan, without any sign of activity > on the OS and then out of blue completed just fine, s_b=64GB,HP=on. > > > === 18 > > > > +Datum > > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > > +{ > > + FuncCallContext *funcctx; > > + BufferCachePagesContext *fctx; /* User function context. */ > > + bool query_numa = true; > > > > I don't think we need query_numa anymore in pg_buffercache_numa_pages(). > > > > I think that we can just ERROR (or NOTICE and return) here: > > > > + if (pg_numa_init() == -1) > > + { > > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable."); > > + query_numa = false; > > > > and fully get rid of query_numa. > > Right... fixed it here, but it made the tests blow up, so I've had to > find a way to conditionally launch tests based on NUMA availability > and that's how pg_numa_available() was born. It's in ipc/shmem.c > because I couldn't find a better place for it... > > > === 19 > > > > And then here: > > > > for (i = 0; i < NBuffers; i++) > > { > > populate_buffercache_entry(i, fctx); > > if (query_numa) > > pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend); > > } > > > > if (query_numa) > > { > > if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1) > > elog(ERROR, "failed NUMA pages inquiry: %m"); > > > > for (i = 0; i < NBuffers; i++) > > { > > int blk2page = (int) i * pages_per_blk; > > > > /* > > * Technically we can get errors too here and pass that to > > * user. Also we could somehow report single DB block spanning > > * more than one NUMA zone, but it should be rare. > > */ > > fctx->record[i].numa_zone_id = os_pages_status[blk2page]; > > } > > > > maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"? > > Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper > that prepares **os_page_ptrs, which is then used by > pg_numa_query_pages() and then we fill the data. But now after > removing query_numa , it reads smoother anyway. Can you please take a > look again on this, is this up to the project standards? > > > A few random comments regarding 0003: > > > > === 20 > > > > + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes > > > > s/pg_shmem_allocations/pg_shmem_numa_allocations/? > > Fixed. > > > === 21 > > > > + /* FIXME: should we release LWlock here ? */ > > + elog(ERROR, "failed NUMA pages inquiry status: %m"); > > > > There is no need, see src/backend/storage/lmgr/README. > > Thanks, fixed. > > > === 22 > > > > +#define MAX_NUMA_ZONES 32 /* FIXME? */ > > + Size zones[MAX_NUMA_ZONES]; > > > > could we rely on pg_numa_get_max_node() instead? > > Sure, done. > > > === 23 > > > > + if (s >= 0) > > + zones[s]++; > > > > should we also check that s is below a limit? > > STILL OPEN QUESTION: I'm not sure it would give us value to report > e.g. -2 on per shm entry/per numa node entry, would it? If it would we > could somehow overallocate that array and remember negative ones too. > > > === 24 > > > > Regarding how we make use of pg_numa_get_max_node(), are we sure there is > > no possible holes? I mean could a system have node 0,1 and 3 but not 2? > > I have never seen a hole in numbering as it is established during > boot, and the only way that could get close to adjusting it could be > making processor books (CPU and RAM together) offline. Even *if* > someone would be doing some preventive hardware maintenance, that > still wouldn't hurt, as we are just using the Id of the zone to > display it -- the max would be already higher. I mean, technically one > could use lsmem(1) (it mentions removable flag, after let's pages and > processes migrated away and from there) and then use chcpu(1) to > --disable CPUs on that zone (to prevent new ones coming there and > allocating new local memory there) and then offlining that memory > region via chmem(1) --disable. Even with all of that, that still > shouldn't cause issues for this code I think, because > `numa_max_node()` says it `returns the >> highest node number <<< > available on the current system.` > > > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I > > did a little in 0001 though) > > That was an attempt to say "Thank You", OK, I've aligned it that way, > so you are still referenced in 0001. I wouldn't find motivation to > work on this if you won't respond to those emails ;) > > There is one known issue, CI returned for numa.out test, that I need > to get bottom to (it does not reproduce for me locally) inside > numa.out/sql: > > SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > Hi, ok, so v11 should fix this last problem and make cfbot happy. -J.
Attachment
pgsql-hackers by date: