Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmxQBcMM0e98e79OpK7_XxoLBmXfMYys-gC-iED9N+wMrQ@mail.gmail.com Whole thread Raw |
In response to | Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 4/6/25 14:57, Jakub Wartak wrote: > > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote: > >> > >> Hi, > > > > Hi Andres/Tomas, > > > > I've noticed that Tomas responded to this while writing this, so I'm > > attaching git-am patches based on his v25 (no squash) and there's only > > one new (last one contains fixes based on this review) + slight commit > > amendment to 0001. > > > > I'm not working on this at the moment. I may have a bit of time in the > evening, but more likely I'll get back to this on Monday. OK, tried to fix all outstanding issues (except maybe some tiny code refactors for beatification) > >> I just played around with this for a bit. As noted somewhere further down, > >> pg_buffercache_numa.page_num ends up wonky in different ways for the different > >> pages. > > > > I think page_num is under heavy work in progress... I'm still not sure > > is it worth exposing this (is it worth the hassle). If we scratch it > > it won't be perfect, but we have everything , otherwise we risk this > > feature as we are going into a feature freeze literally tomorrow. > > > > IMHO it's not difficult to change the definition of page_num this way, > it's pretty much a one line change. It's more a question of whether we > actually want to expose this. Bertrand noticed this first in https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal : - startptr = (char *) BufferGetBlock(1); + startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1)); With the above I'm also not getting wonky (-1) results anymore. The rest of reply assumes we are using this. > [snip] > > >>> + > >>> + if (firstNumaTouch) > >>> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); > >> > >> Over the patchseries the related code is duplicated. Seems like it'd be good > >> to put it into pg_numa instead? This seems like the thing that's good to > >> abstract away in one central spot. > > > > I hear you, but we are using those statistics for per-localized-code > > (shm.c's firstTouch != pgbuffercache.c's firstTouch). > > > > Yeah, I don't moving this is quite possible / useful. We could pass the > flag somewhere, but we still need to check & update it in local code. No idea how such code should be looking, but it would be more code? Are you thinking about some code still touching local &firstNumaTouch ? > >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num" > >> works quite right for me. > >> > >> This definition afaict is always 0 when using huge pages and just 0 and 1 for > >> 4k pages. But my understanding of numa_page is that it's the "id" of the numa > >> pages, which isn't that? > >> > >> With "adjust page_num" I get a number that starts at -1 and then increments > >> from there. More correct, but doesn't quite seem right either. > > > > Apparently handling this special case of splitted buffers edge-cases > > was Pandora box ;) Tomas what do we do about it? Does that has chance > > to get in before freeze? > > > > I don't think the split buffers are pandora box on their own, it's more > that it made us notice other issues / questions. I don't think handling > it is particularly complex - the most difficult part seems to be > figuring out how many rows we'll return, and mapping them to pages. > But that's not very difficult, IMO. OK, with a fresh week, and fresh mind and a different name (ospageid) it looks better to me now. > The bigger question is whether it's safe to do the TYPEALIGN_DOWN(), > which may return a pointer from before the first buffer. But I guess > that's OK, thanks to how memory is allocated - at least, that's what all > the move_pages() examples I found do, so unless those are all broken, > that's OK. I agree. This took some more time, but in case of a) pg_buffercache_numa and HP=off view we shouldn't access ptr below buffercache, because my understanding is that shm memory would be page aligned anyway as per BufferManagerShmemInit() which uses TYPEALGIN(PG_IO_ALIGN_SIZE) for it anyway. So when you query pg_buffercache_numa, one gets the following pages: # strace -fp 14364 -e move_pages strace: Process 14364 attached move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,... (gdb) print BufferBlocks $1 = 0x7f8bfe4b5000 "" while BufferBlocks actually starts there, so we are good without HP. With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs (gdb) print BufferBlocks => $1 = 0x7ff33c879000 so we are actually accessing earlier pointer (!), but Buffer Blocks is like 15-th structure there (and there's always going be something earlier to my understanding): select row_number() over (order by off),* from pg_shmem_allocations order by off asc limit 15; row_number | name | off | size | allocated_size ------------+------------------------+---------+-----------+---------------- [..] 13 | Shared MultiXact State | 5735936 | 1148 | 1152 14 | Buffer Descriptors | 5737088 | 1048576 | 1048576 15 | Buffer Blocks | 6785664 | 134221824 | 134221824 To me this is finding in itself, with HP shared_buffers being located on page with something else... b) in pg_shmem_allocations_numa view without HP we don't even get to as low as ShmemBase for move_pages(), but with HP=on we hit as low as ShmemBase but not lower, so we are good IMHO. Wouldn't buildfarm actually tell us this if that's bad as an insurance policy? > >>> + <tbody> > >>> + <row> > >>> + <entry role="catalog_table_entry"><para role="column_definition"> > >>> + <structfield>bufferid</structfield> <type>integer</type> > >>> + </para> > >>> + <para> > >>> + ID, in the range 1..<varname>shared_buffers</varname> > >>> + </para></entry> > >>> + </row> > >>> + > >>> + <row> > >>> + <entry role="catalog_table_entry"><para role="column_definition"> > >>> + <structfield>page_num</structfield> <type>int</type> > >>> + </para> > >>> + <para> > >>> + number of OS memory page for this buffer > >>> + </para></entry> > >>> + </row> > >> > >> "page_num" doesn't really seem very descriptive for "number of OS memory page > >> for this buffer". To me "number of" sounds like it's counting the number of > >> associated pages, but it's really just a "page id" or something like that. > >> > >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit? > > > > Tomas, are you good with rename? I think I've would also prefer os_page_id. > > > > Fine with me. Done, s/page_num/ospageid/g as whole pg_buffercache does not use "_" anywhere, so let's stick to that. Done that for nodeid too. > >>> + /* > >>> + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while > >>> + * the OS may have different memory page sizes. > >>> + * > >>> + * To correctly map between them, we need to: 1. Determine the OS memory > >>> + * page size 2. Calculate how many OS pages are used by all buffer blocks > >>> + * 3. Calculate how many OS pages are contained within each database > >>> + * block. > >>> + * > >>> + * This information is needed before calling move_pages() for NUMA memory > >>> + * node inquiry. > >>> + */ > >>> + os_page_size = pg_numa_get_pagesize(); > >>> + > >>> + /* > >>> + * Allocate memory for page pointers and status based on total shared > >>> + * memory size. This simplified approach allocates enough space for all > >>> + * pages in shared memory rather than calculating the exact requirements > >>> + * for each segment. > >>> + * > >>> + * XXX Isn't this wasteful? But there probably is one large segment of > >>> + * shared memory, much larger than the rest anyway. > >>> + */ > >>> + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; > >>> + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); > >>> + pages_status = palloc(sizeof(int) * shm_total_page_count); > >> > >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), > >> could more be moved to a shared helper function? > > > > -> Tomas? > > > > I'm not against that in principle, but when I tried it didn't quite help > that much. But maybe it's better with the current patch version. > > >>> + hash_seq_init(&hstat, ShmemIndex); > >>> + > >>> + /* output all allocated entries */ > >>> + memset(nulls, 0, sizeof(nulls)); > >>> + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) > >>> + { > >> > >> One thing that's interesting with using ShmemIndex is that this won't get > >> anonymous allocations. I think that's ok for now, it'd be complicated to > >> figure out the unmapped "regions". But I guess it' be good to mention it in > >> the ocs? > > > > Added. > > BTW: it's also in the function comment too. now. > > [..], but I'm sending it as fast as possible > > to avoid a clash with Tomas. > > > > Please keep working on this. I may hava a bit of time in the evening, > but in the worst case I'll merge it into your patch. So, attached is still patchset v25, still with one-off patches (please git am most, but just `patch -p1 < file` last two and just squash it if you are happy. I've intentionally not squash it to provide changelog). This LGTM. -J.
Attachment
- v25-0005-adjust-page-alignment.patch
- v25-0004-Introduce-pg_shmem_allocations_numa-view.patch
- v25-0001-Add-support-for-basic-NUMA-awareness.patch
- v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
- v25-0003-adjust-page_num.patch
- v25-0006-fixes-for-review-by-Andres.patch
- v25-0007-fix-remaining-outstanding-issues-from-Sunday.patch
pgsql-hackers by date: