Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: Draft for basic NUMA observability
Msg-id: CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
In response to: Re: Draft for basic NUMA observability (Andres Freund <andres@anarazel.de>)
Responses: Re: Draft for basic NUMA observability
List: pgsql-hackers
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK, first sane version attached, with the new src/port/pg_numa.c boilerplate in 0001. Fixed some bugs too; there is one remaining optimization to be done (see the `static` question later). Docs/tests are still missing.

QQ: I'm still wondering whether there's a better way of exposing multiple PG shmem entries that point to the same page (think of something hot: PROCLOCK or ProcArray). Wouldn't it make sense (in some future thread/patch) to expose this info via an additional column (pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of an easy way of showing that potential NUMA auto-balancing could lead to NUMA TLB shootdowns (not that it is happening or being counted, just identifying it as a potential problem in the allocation layout). Or does that not make sense, given that we already have pg_shmem_allocations.{off|size} and could derive such info from it (after all, it is for devs?)?

postgres@postgres:1234 : 18843 # select name,off,off+allocated_size,allocated_size from pg_shmem_allocations order by off;
          name          |    off    | ?column?  | allocated_size
------------------------+-----------+-----------+----------------
[..]
 Proc Header            | 147114112 | 147114240 |            128
 Proc Array             | 147274752 | 147275392 |            640
 KnownAssignedXids      | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid | 147310848 | 147319808 |           8960
 Backend Status Array   | 147319808 | 147381248 |          61440

postgres@postgres:1234 : 18924 # select * from pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array', 'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. Proc Array ends and KnownAssignedXids starts right after it; both are hot, yet they sit on the same huge page and NUMA node.

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, since there were no comments about that madvise(MADV_POPULATE_READ) idea, I'm sticking to touching pointers.

> > If not then maybe add CHECK_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for the parent/postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see the attached file for more verbose results; in short:

patch(-touchpages)               hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on   INVALID RESULTS (-2)
patch(touchpages)                hugepages=off  CORRECT RESULT
patch(touchpages)                hugepages=on   CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on   INVALID RESULTS (-2)

IMHVO, the only other thing that could work here (but still page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are welcome; it might be kernel-version dependent.
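To make that concrete, here is a rough, untested sketch of what the MADV_POPULATE_READ route could look like (the helper name prefault_region and its arguments are made up for illustration, not taken from the attached patch). It still faults the pages in, so it wouldn't avoid the allocate-and-zero cost you mention; it would just batch the faults into a single syscall, with a fallback to the pointer-touching loop on kernels older than 5.14:

#include "postgres.h"

#include <sys/mman.h>

/*
 * Sketch only: prefault a shared memory range so that a later
 * numa_move_pages() status inquiry doesn't report -2 (-ENOENT) for pages
 * this backend has never touched.
 */
static void
prefault_region(char *start, size_t len, size_t os_page_size)
{
    size_t      off;

#ifdef MADV_POPULATE_READ
    /* One syscall populates the whole range (Linux 5.14+). */
    if (madvise(start, len, MADV_POPULATE_READ) == 0)
        return;
    /* e.g. EINVAL on an older kernel: fall through to manual touching */
#endif
    for (off = 0; off < len; off += os_page_size)
    {
        volatile uint64 touch pg_attribute_unused();

        /* Read one word per OS page to force the mapping to be present. */
        touch = *(volatile uint64 *) (start + off);
    }
}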
BTW: and yes, you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE during startup; in our case child backends apparently don't come up with the page attachments pre-faulted across fork().

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed; we'll see what cfbot says, I'm flying blind with all of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes. I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >     SANITIZER_FLAGS: -fsanitize=address
> >     PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +  # FIXME: use or not the libnuma?
> > +  # --with-libnuma \
> > +  #
> >   # Normally, the "relation segment" code basically has no coverage in our
> >   # tests, because we (quite reasonably) don't generate tables large
> >   # enough in tests. We've had plenty bugs that we didn't notice due the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache. I think
> CASCADE is rarely, if ever, ok to use in an extension script.

... it's just me cutting corners :^), fixed now.

[..]

> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
> for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >             else
> >                     fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +           blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks, I was looking for something like this! Is that +1 in v4 good?

> > +           j = 0;
> > +           do {
> > +                   /*
> > +                    * Many buffers can point to the same page, but we want to
> > +                    * query just the first address.
> > +                    *
> > +                    * In order to get reliable results we also need to touch memory pages
> > +                    * so that inquiry about NUMA zone doesn't return -2.
> > +                    */
> > +                   if(os_page_ptrs[blk2page+j] == 0) {
> > +                           volatile uint64 touch pg_attribute_unused();
> > +                           os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +                           touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +                   }
> > +                   j++;
> > +           } while(j < (int)pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even use a `static` has_this_run_earlier flag and perform this work only once, on the first call?

> > +#ifdef USE_LIBNUMA
> > +   if(query_numa) {
> > +           /* According to numa(3) it is required to initialize library even if that's no-op. */
> > +           if(numa_available() == -1) {
> > +                   pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                   elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
> > +           } else {
> > +                   /* Amortize the number of pages we need to query about */
> > +                   if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                           elog(ERROR, "failed NUMA pages inquiry status");
> > +                   }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer, as I have never hit that error path from numa_move_pages(); one gets invalid entries in *os_pages_status instead. BUT: most of this patch only uses libnuma calls that cannot fail, per their documented usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it); most of numactl still works then as euid != 0, but `numactl --hardware` at least gives "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied", and it's the same story with `numactl -C 678 date`. So unless we start making much heavier use of libnuma (not just for observability), there's probably no point in it right now (?). That said, we could just do a variadic elog() for it; I've put in some code (see the PS below), but no idea if it works fine...

[..]

> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.

[..]

> Why do this in every loop iteration?

Both fixed.

-J.
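PS: regarding overriding numa_error(), the "variadic elog()" code I mentioned above is roughly the following shape (untested sketch, the attached version may differ; it relies on numa(3) saying that an application may supply its own numa_error()/numa_warn(), and note that raising ERROR here would longjmp out of libnuma):

#include "postgres.h"

#include <numa.h>

/* Route libnuma's own warning/error reporting through elog(). */
void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    elog(WARNING, "libnuma: WARNING: %s", buf);
}

void
numa_error(char *where)
{
    elog(ERROR, "libnuma: ERROR: %s", where);
}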