Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From: Bertrand Drouvot
Subject: Re: Draft for basic NUMA observability
Date:
Msg-id: Z7bor4Iw54WOp9hW@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Draft for basic NUMA observability (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Bertrand,
>
> Thanks for playing with this!
>
> > Which makes me wonder if using numa_move_pages()/move_pages is the right
> > approach. Would be curious to know if you observe the same behavior though.
>
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> We probably would need to split it to some separate and new view
> within the pg_buffercache extension, but that is going to be slow, yet
> still provide valid results.

Yup.

> In the previous approach get_mempolicy() was allocating on 1st access,
> but it was slow not only because it was allocating but also because it
> was just 1 syscall per 1x addr (yikes!). I somehow struggle to imagine
> how e.g. scanning (really allocating) a 128GB buffer cache in the future
> won't cause issues - that's like 16-17mln (* 2) syscalls to be issued
> when not using move_pages(2).

Yeah, get_mempolicy() not working on a range is not great (see the
move_pages() batching sketch after this message).

> > But maybe we could use get_mempolicy() only on "valid" buffers i.e
> > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
>
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea
> that came to my mind is to issue move_pages() from the backend that
> has already used all of those pages. That literally means one of the
> ideas below:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages(). E.g. this last option is just two lines:
>
>     if(os_page_ptrs[blk2page+j] == 0) {
> +       volatile uint64 touch pg_attribute_unused();
>         os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
>             (os_page_size*j);
> +       touch = *(uint64 *)os_page_ptrs[blk2page+j];
>     }
>
> and it seems to work while still issuing far fewer syscalls with
> move_pages() across backends, well at least here.

One of the main issues I see with 1. and 2. is that we would not get
accurate results should the kernel decide to migrate the pages. Indeed,
the process doing the move_pages() call needs to have accessed the pages
more recently than any kernel migration to see accurate locations.

OTOH, the main issue that I see with 3. is that the monitoring itself
could influence the kernel's decision to start page migration (I'm not
100% sure, but I could imagine the extra reads/touches having that
effect).

But I'm thinking: do we really need to know the location of every single
page? I think what we want to see is whether the pages are "equally"
distributed across all the nodes or are somehow "stuck" to one (or more)
nodes. In that case, what about using get_mempolicy(), but only on a
subset of the buffer cache (say every Nth buffer, or contiguous chunks)?
We could create a new function that accepts a "sampling distance" as a
parameter, for example (see the sampling sketch after this message).
Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
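A minimal, self-contained sketch of the batched query mode of move_pages(2) discussed in the thread: passing nodes = NULL makes the call report, in a single syscall, the NUMA node of each page in the batch. The malloc'd region, NPAGES, and the file name are stand-ins for illustration only, not PostgreSQL code.

```c
/*
 * Sketch: batched page-location lookup with move_pages(2).
 * With nodes == NULL, move_pages() moves nothing and instead fills
 * "status" with the NUMA node of each page, or a negative errno
 * (e.g. -ENOENT for pages that were never touched/allocated, which is
 * why the thread discusses touching pages first).
 *
 * Build with: gcc numa_batch.c -o numa_batch -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 16				/* stand-in for the number of pages to inspect */

int
main(void)
{
	long		page_size = sysconf(_SC_PAGESIZE);
	char	   *region = malloc(NPAGES * page_size);
	void	   *pages[NPAGES];
	int			status[NPAGES];

	/* Touch the pages so they are actually allocated before querying. */
	memset(region, 0, NPAGES * page_size);

	for (int i = 0; i < NPAGES; i++)
		pages[i] = region + (long) i * page_size;

	/* One syscall for the whole batch, instead of one per address. */
	if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0)
	{
		perror("move_pages");
		return 1;
	}

	for (int i = 0; i < NPAGES; i++)
		printf("page %d -> node %d\n", i, status[i]);

	free(region);
	return 0;
}
```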
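And a sketch of the sampling idea from the last paragraph, under the assumption of a hypothetical SAMPLE_EVERY "sampling distance": get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR returns the node backing a single address, so probing every Nth page keeps the syscall count proportional to the sample size rather than the buffer cache size.

```c
/*
 * Sketch: sampled page-location lookup with get_mempolicy(2).
 * MPOL_F_NODE | MPOL_F_ADDR makes get_mempolicy() return the NUMA node
 * backing one address, i.e. one syscall per *sampled* page only.
 * NPAGES and SAMPLE_EVERY are stand-ins for the buffer cache size and
 * a "sampling distance" parameter.
 *
 * Build with: gcc numa_sample.c -o numa_sample -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES			1024	/* stand-in: region size, in OS pages */
#define SAMPLE_EVERY	64		/* stand-in: query every 64th page */
#define MAX_NODES		64		/* enough nodes for this sketch */

int
main(void)
{
	long		page_size = sysconf(_SC_PAGESIZE);
	char	   *region = malloc(NPAGES * page_size);
	long		counts[MAX_NODES] = {0};

	/* Touch the pages so they are allocated and have a location. */
	memset(region, 1, NPAGES * page_size);

	for (long i = 0; i < NPAGES; i += SAMPLE_EVERY)
	{
		int			node = -1;

		if (get_mempolicy(&node, NULL, 0, region + i * page_size,
						  MPOL_F_NODE | MPOL_F_ADDR) == 0 &&
			node >= 0 && node < MAX_NODES)
			counts[node]++;
	}

	/* A heavily skewed distribution would suggest pages "stuck" on a node. */
	for (int n = 0; n < MAX_NODES; n++)
		if (counts[n] > 0)
			printf("node %d: %ld sampled pages\n", n, counts[n]);

	free(region);
	return 0;
}
```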