From: Jakub Wartak
Subject: Re: Draft for basic NUMA observability
Msg-id: CAKZiRmz7FWPAJkU6_-5S+e_fd184SWYnLa888wTske15MD_12A@mail.gmail.com
In response to: Re: Draft for basic NUMA observability (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
Hi Bertrand,

TL;DR: the main problem seems to be choosing which way to page-fault the
shared memory before the backend uses numa_move_pages(), as the memory
mappings (fresh after fork()/CoW) do not seem to be ready for a
numa_move_pages() inquiry.

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > We probably would need to split it to some separate and new view
> > within the pg_buffercache extension, but that is going to be slow, yet
> > still provide valid results.
>
> Yup.

OK, so I've made that NUMA inquiry (now with that "volatile touch" to get
valid results for not-yet-used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem that somebody would
automatically run into this slow path when using pg_buffercache.

> > > But maybe we could use get_mempolicy() only on "valid" buffers i.e
> > > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> >
> > Different perspective: I wanted to use the same approach in the new
> > pg_shmemallocations_numa, but that won't cut it there. The other idea
> > that came to my mind is to issue move_pages() from the backend that
> > has already used all of those pages. That literally means one of the
> > below ideas:
> > 1. from somewhere like checkpointer / bgwriter?
> > 2. add touching memory on backend startup like always (sic!)
> > 3. or just attempt to read/touch the memory addr just before calling
> >    move_pages(). E.g. this last option is just two lines:
> >
> >     if(os_page_ptrs[blk2page+j] == 0) {
> > +       volatile uint64 touch pg_attribute_unused();
> >         os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> >             (os_page_size*j);
> > +       touch = *(uint64 *)os_page_ptrs[blk2page+j];
> >     }
> >
> > and it seems to work while still issuing much less syscalls with
> > move_pages() across backends, well at least here.
>
> One of the main issue I see with 1. and 2. is that we would not get accurate
> results should the kernel decides to migrate the pages. Indeed, the process
> doing the move_pages() call needs to have accessed the pages more recently
> than any kernel migrations to see accurate locations.

We never get a fully accurate state anyway, as memory migration between
zones might be happening while we query it. In theory we could add something
to e.g. checkpointer/bgwriter that would run the inquiry on demand and
report the result back somehow through shared memory (?), but I'm somewhat
afraid of that because, as stated at the end of this email, it might take
some time (well, we probably wouldn't need to "touch memory" then after all,
as all of it would already be active), and that's still an impact on those
background workers. Somehow I feel safer if that code is NOT part of a
bgworker.
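To make that "touch then inquire" approach easier to experiment with outside
of PostgreSQL, here is a rough self-contained sketch using plain libnuma
(build with -lnuma); query_region_nodes() is just a made-up name for
illustration and the error handling is minimal, but the patches do
essentially the same dance against os_page_ptrs[]:

    #include <numa.h>           /* numa_move_pages() */
    #include <numaif.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void
    query_region_nodes(char *base, size_t len)
    {
        size_t   os_page_size = sysconf(_SC_PAGESIZE);
        size_t   npages = (len + os_page_size - 1) / os_page_size;
        void   **pages = calloc(npages, sizeof(void *));
        int     *status = calloc(npages, sizeof(int));

        for (size_t i = 0; i < npages; i++)
        {
            /* read one word per OS page so the mapping is faulted in */
            volatile uint64_t touch;

            pages[i] = base + i * os_page_size;
            touch = *(volatile uint64_t *) pages[i];
            (void) touch;
        }

        /* nodes == NULL turns move_pages() into a "where does it live" query */
        if (numa_move_pages(0, npages, pages, NULL, status, 0) != 0)
            perror("numa_move_pages");
        else
            for (size_t i = 0; i < npages; i++)
                printf("page %zu -> node %d\n", i, status[i]);

        free(pages);
        free(status);
    }

Without the volatile read, the status[] entries for never-accessed pages come
back as -ENOENT instead of a node id, which is presumably the "complete
garbage" effect shown further below.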
> OTOH, one of the main issue that I see with 3. is that the monitoring could
> probably influence the kernel's decision to start pages migration (I'm not
> 100% sure but I could imagine it may influence the kernel's decision due to
> having to read/touch the pages).
>
> But I'm thinking: do we really need to know the page location of every
> single page? I think what we want to see is if the pages are "equally"
> distributed on all the nodes or are somehow "stuck" to one (or more) nodes.
> In that case what about using get_mempolicy() but on a subset of the buffer
> cache? (say every Nth buffer or contiguous chunks). We could create a new
> function that would accept a "sampling distance" as parameter for example,
> thoughts?

The way I envision it (and I think it is what Andres wanted, but I'm not
sure, as he has yet to comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for a single relation. Some DBA in the
wild could probably also query it from time to time to see how PG/the
kernel distributes memory. It seems to be more of a debugging and coding
aid for future NUMA optimizations than something to be queried constantly
by monitoring. I would even dare to say it could require --enable-debug (or
some other developer-only toggle), but apparently there's no need to hide
it like that if those are separate views.

Changes since the previous version:
0. rebase due to the recent OAuth commit introducing libcurl
1. cast to uint64 for NBuffers, as you found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations (or should we rather call it
   pg_shmem_numa_zones, or maybe just pg_shm_numa?)

If there is agreement that this is the way we want to have it (from the
backend and not from the checkpointer), here's what's left on the table to
be done:

a. Isn't there something quicker for touching / page-faulting memory? If
   not, then maybe add CHECK_FOR_INTERRUPTS() there? BTW, I've tried adding
   MAP_POPULATE to PG_MMAP_FLAGS, but that didn't help (it probably only
   works for the parent/postmaster). I've also tried MADV_POPULATE_READ
   (5.14+ kernels only) and that seems to work too:

+       rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+       if(rc != 0) {
+               elog(NOTICE, "madvice() failed");
+       }
[..]
-       volatile uint64 touch pg_attribute_unused();
        os_page_ptrs[i] = (char *)ent->location + (i * os_page_size);
-       touch = *(uint64 *)os_page_ptrs[i];

   With either the volatile memory touch or MADV_POPULATE_READ the results
   seem to be reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from pg_shmem_numa_allocations order by numa_size desc;
                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 | 134221824
 XLOG Ctl                                       |            0 |   4206592
 Buffer Descriptors                             |            0 |   1048576
 transaction                                    |            0 |    528384
 Checkpointer Data                              |            0 |    524288
 Checkpoint BufferIds                           |            0 |    327680
 Shared Memory Stats                            |            0 |    311296
[..]

   Without at least one of those two, a new backend reports complete
   garbage:

                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 |    995328
 Shared Memory Stats                            |            0 |    245760
 shmInvalBuffer                                 |            0 |     65536
 Buffer Descriptors                             |            0 |     65536
 Backend Status Array                           |            0 |     61440
 serializable                                   |            0 |     57344
[..]

b. Refactor the shared code so that it goes into src/port (but with
   Linux-only support so far).
c. Should we use a MemoryContext in pg_get_shmem_numa_allocations or not?
d. Fix tests, indent it, write docs, make cfbot happy.

As for the sampling: dunno, fine for me. As an optional argument? But
wouldn't it be better to find a way for it to actually be quick?
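For what it's worth, a rough standalone sketch of that sampling idea (again
plain libnuma outside of PostgreSQL, built with -lnuma; sample_region() and
sampling_distance are made-up names for illustration) could probe only every
Nth OS page with get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) and count hits per
node. If I read get_mempolicy(2) right, that call even allocates a
not-yet-touched page as if it had been read, which would side-step the
touching problem for the sampled pages:

    #include <numaif.h>         /* get_mempolicy(), MPOL_F_* */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_NUMA_NODES 64

    static void
    sample_region(char *base, size_t len, size_t sampling_distance)
    {
        size_t  os_page_size = sysconf(_SC_PAGESIZE);
        size_t  counts[MAX_NUMA_NODES];

        memset(counts, 0, sizeof(counts));

        /* probe only every sampling_distance-th OS page */
        for (size_t off = 0; off < len; off += sampling_distance * os_page_size)
        {
            int     node = -1;

            /* MPOL_F_NODE|MPOL_F_ADDR: report the node backing this address */
            if (get_mempolicy(&node, NULL, 0, base + off,
                              MPOL_F_NODE | MPOL_F_ADDR) == 0 &&
                node >= 0 && node < MAX_NUMA_NODES)
                counts[node]++;
        }

        for (int n = 0; n < MAX_NUMA_NODES; n++)
            if (counts[n] > 0)
                printf("node %d: %zu sampled pages\n", n, counts[n]);
    }

That obviously only estimates the distribution, but for spotting the
"everything sits on node 0" case it might be enough, and it would avoid the
full 23s walk shown below.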
OK, so here's a larger test, on 512GB with 8 NUMA nodes and s_b set to
128GB, started with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 Buffer Blocks                                  |            0 | 17179869184
 Buffer Blocks                                  |            1 | 17179869184
 Buffer Blocks                                  |            2 | 17179869184
 Buffer Blocks                                  |            3 | 17179869184
 Buffer Blocks                                  |            4 | 17179869184
 Buffer Blocks                                  |            5 | 17179869184
 Buffer Blocks                                  |            6 | 17179869184
 Buffer Blocks                                  |            7 | 17179869184
 Buffer IO Condition Variables                  |            0 |    33554432
 Buffer IO Condition Variables                  |            1 |    33554432
 Buffer IO Condition Variables                  |            2 |    33554432
[..]

but it takes 23s. Yes, it takes 23s just to gather that info with the memory
touch, but that's ~128GB of memory, and the backend is hardly responsive
while doing it (lack of C_F_I()). By default, without numactl's
interleave=all, you get a clear picture of the lack of NUMA awareness in the
PG shared segment (just as Andres presented, but now it is evident; it is
subject to NUMA autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 commit_timestamp                               |            0 |     2097152
 commit_timestamp                               |            1 |     6291456
 commit_timestamp                               |            2 |           0
 commit_timestamp                               |            3 |           0
 commit_timestamp                               |            4 |           0
[..]
 transaction                                    |            0 |    14680064
 transaction                                    |            1 |           0
 transaction                                    |            2 |           0
 transaction                                    |            3 |           0
 transaction                                    |            4 |     2097152
[..]

Somehow without interleave it is very quick, too.

-J.