Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From: Bertrand Drouvot
Subject: Re: Draft for basic NUMA observability
Date:
Msg-id: Z7bor4Iw54WOp9hW@ip-10-97-1-34.eu-west-3.compute.internal
In response to: Re: Draft for basic NUMA observability (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Bertrand,
>
> Thanks for playing with this!
>
> > Which makes me wonder if using numa_move_pages()/move_pages is the right
> > approach. Would be curious to know if you observe the same behavior though.
>
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> We probably would need to split it to some separate and new view
> within the pg_buffercache extension, but that is going to be slow, yet
> still provide valid results.

Yup.

> In the previous approach get_mempolicy() was allocating on 1st access,
> but it was slow not only because it was allocating but also because it
> was just 1 syscall per 1x addr (yikes!). I somehow struggle to imagine
> how e.g. scanning (really allocating) a 128GB buffer cache in the future
> won't cause issues - that's like 16-17mln (* 2) syscalls to be issued
> when not using move_pages(2).

Yeah, get_mempolicy() not working on a range is not great (see the
move_pages() batching sketch after this message).

> > But maybe we could use get_mempolicy() only on "valid" buffers i.e
> > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
>
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea
> that came to my mind is to issue move_pages() from the backend that
> has already used all of those pages. That literally means one of the
> ideas below:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages(). E.g. this last option is just two lines:
>
>     if(os_page_ptrs[blk2page+j] == 0) {
> +       volatile uint64 touch pg_attribute_unused();
>         os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
>             (os_page_size*j);
> +       touch = *(uint64 *)os_page_ptrs[blk2page+j];
>     }
>
> and it seems to work while still issuing far fewer syscalls with
> move_pages() across backends, well at least here.

One of the main issues I see with 1. and 2. is that we would not get
accurate results should the kernel decide to migrate the pages. Indeed,
the process doing the move_pages() call needs to have accessed the pages
more recently than any kernel migration to see accurate locations.

OTOH, the main issue that I see with 3. is that the monitoring itself
could influence the kernel's decision to start page migration (I'm not
100% sure, but I could imagine the extra reads/touches having that
effect).

But I'm thinking: do we really need to know the location of every single
page? I think what we want to see is whether the pages are "equally"
distributed across all the nodes or are somehow "stuck" to one (or more)
nodes. In that case, what about using get_mempolicy(), but only on a
subset of the buffer cache (say every Nth buffer, or contiguous chunks)?
We could create a new function that accepts a "sampling distance" as a
parameter, for example (see the sampling sketch after this message).
Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
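A minimal, self-contained sketch of the batched query mode of move_pages(2) discussed in the thread: passing nodes = NULL makes the call report, in a single syscall, the NUMA node of each page in the batch. The malloc'd region, NPAGES, and the file name are stand-ins for illustration only, not PostgreSQL code.

```c
/*
 * Sketch: batched page-location lookup with move_pages(2).
 * With nodes == NULL, move_pages() moves nothing and instead fills
 * "status" with the NUMA node of each page, or a negative errno
 * (e.g. -ENOENT for pages that were never touched/allocated, which is
 * why the thread discusses touching pages first).
 *
 * Build with: gcc numa_batch.c -o numa_batch -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 16				/* stand-in for the number of pages to inspect */

int
main(void)
{
	long		page_size = sysconf(_SC_PAGESIZE);
	char	   *region = malloc(NPAGES * page_size);
	void	   *pages[NPAGES];
	int			status[NPAGES];

	/* Touch the pages so they are actually allocated before querying. */
	memset(region, 0, NPAGES * page_size);

	for (int i = 0; i < NPAGES; i++)
		pages[i] = region + (long) i * page_size;

	/* One syscall for the whole batch, instead of one per address. */
	if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0)
	{
		perror("move_pages");
		return 1;
	}

	for (int i = 0; i < NPAGES; i++)
		printf("page %d -> node %d\n", i, status[i]);

	free(region);
	return 0;
}
```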
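And a sketch of the sampling idea from the last paragraph, under the assumption of a hypothetical SAMPLE_EVERY "sampling distance": get_mempolicy() with MPOL_F_NODE | MPOL_F_ADDR returns the node backing a single address, so probing every Nth page keeps the syscall count proportional to the sample size rather than the buffer cache size.

```c
/*
 * Sketch: sampled page-location lookup with get_mempolicy(2).
 * MPOL_F_NODE | MPOL_F_ADDR makes get_mempolicy() return the NUMA node
 * backing one address, i.e. one syscall per *sampled* page only.
 * NPAGES and SAMPLE_EVERY are stand-ins for the buffer cache size and
 * a "sampling distance" parameter.
 *
 * Build with: gcc numa_sample.c -o numa_sample -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES			1024	/* stand-in: region size, in OS pages */
#define SAMPLE_EVERY	64		/* stand-in: query every 64th page */
#define MAX_NODES		64		/* enough nodes for this sketch */

int
main(void)
{
	long		page_size = sysconf(_SC_PAGESIZE);
	char	   *region = malloc(NPAGES * page_size);
	long		counts[MAX_NODES] = {0};

	/* Touch the pages so they are allocated and have a location. */
	memset(region, 1, NPAGES * page_size);

	for (long i = 0; i < NPAGES; i += SAMPLE_EVERY)
	{
		int			node = -1;

		if (get_mempolicy(&node, NULL, 0, region + i * page_size,
						  MPOL_F_NODE | MPOL_F_ADDR) == 0 &&
			node >= 0 && node < MAX_NODES)
			counts[node]++;
	}

	/* A heavily skewed distribution would suggest pages "stuck" on a node. */
	for (int n = 0; n < MAX_NODES; n++)
		if (counts[n] > 0)
			printf("node %d: %ld sampled pages\n", n, counts[n]);

	free(region);
	return 0;
}
```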