Thread: Re: Draft for basic NUMA observability

Re: Draft for basic NUMA observability

Bertrand Drouvot

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time
> ago, I'm attaching the very brief (and potentially way too rushed)
> draft of the first step into NUMA observability on PostgreSQL that was
> based on his presentation [0]. It might be rough, but it is to get us
> started. The patches were not really even basically tested, they are
> more like input for discussion - rather than solid code - to shake out
> what should be the proper form of this.
> Right now it gives:
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by
> numa_zone_id;
> NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch!

Not doing a code review but sharing some experimentation.

First, I had to:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
                Size            os_page_size;
                void            **os_page_ptrs;
                int                     *os_pages_status;
-               int                     os_page_count;
+               uint64          os_page_count;


-               os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;

to make it work with non-tiny shared_buffers (otherwise the int multiplication overflows).
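
To illustrate why the cast matters, here is a minimal standalone sketch (not the patch itself). os_page_count_for() is a hypothetical helper name; it assumes a 32-bit int, as NBuffers and BLCKSZ are in PostgreSQL. With BLCKSZ = 8192 the int product already exceeds INT_MAX at 262144 buffers (2GB of shared_buffers), so the operand has to be widened before the multiplication:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of OS pages backing the buffer pool.
 *
 * nbuffers and blcksz are plain ints here, mirroring NBuffers and BLCKSZ.
 * Casting one operand to uint64_t first makes the whole product 64-bit,
 * so it cannot overflow for realistic shared_buffers sizes.
 */
uint64_t
os_page_count_for(int nbuffers, int blcksz, size_t os_page_size)
{
    return ((uint64_t) nbuffers * blcksz) / os_page_size;
}
```

With 5242880 buffers of 8kB and 4kB OS pages this gives 10485760, which matches the os_page_count reported in the NOTICE lines further down.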


When using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).
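
For reference, a rough Linux-only sketch of what "query page locations as seen from a given pid" can look like outside PostgreSQL. It is not the code of pg_buffercache_pages_from_pid(); query_page_nodes() and classify_page_status() are hypothetical names. It issues the raw move_pages syscall with a NULL nodes array, which makes the kernel report where each page is instead of moving anything. A status of -2 is -ENOENT, which the man page documents as "The page is not present":

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

enum page_state { PAGE_ON_NODE, PAGE_NOT_PRESENT, PAGE_OTHER_ERROR };

/*
 * Ask the kernel where "count" pages live, from the point of view of "pid"
 * (0 means the calling process). Because the nodes array is NULL, nothing is
 * moved; status[i] receives a node id (>= 0) or a negative errno per page.
 * Returns 0 on success, -1 with errno set on failure.
 */
long
query_page_nodes(pid_t pid, unsigned long count, void **pages, int *status)
{
    return syscall(SYS_move_pages, pid, count, pages, NULL, status, 0);
}

/* Map one status[] entry to a coarse state for reporting. */
enum page_state
classify_page_status(int status)
{
    if (status >= 0)
        return PAGE_ON_NODE;        /* status is the NUMA node id */
    if (status == -ENOENT)
        return PAGE_NOT_PRESENT;    /* shows up as -2 in the output */
    return PAGE_OTHER_ERROR;        /* e.g. -EFAULT, -EACCES */
}
```

Querying another process's pid is subject to the usual permission checks, so this may fail with EPERM depending on how it is run.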

Let me show the results:

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();

In session 2:

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session 1's pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

Results show:

- Correct NUMA distribution in session 1
- Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
  with the pid of session 1 as a parameter (the session that actually accessed the buffers)

Which makes me wonder whether using numa_move_pages()/move_pages() is the
right approach. I would be curious to know if you observe the same behavior.

The initial idea that you shared on Discord was to use get_mempolicy() but,
as Andres stated:

One annoying thing about get_mempolicy() is this:

If no page has yet been allocated for the specified address, get_mempolicy() will
allocate a page as if the thread had performed a read (load) access to that address,
and return the ID of the node where that page was allocated.

Forcing the allocation to happen inside a monitoring function is decidedly not great.

The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers, i.e.,
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)). Thoughts?


Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services:

Re: Draft for basic NUMA observability

Bertrand Drouvot

Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot wrote:
> Hi Bertrand,
> Thanks for playing with this!
> > Which makes me wonder if using numa_move_pages()/move_pages is the right approach.
> > Would be curious to know if you observe the same behavior though.
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> We probably would need to split it to some separate and new view
> within the pg_buffercache extension, but that is going to be slow, yet
> still provide valid results.


> In the previous approach that
> get_mempolicy() was allocating on 1st access, but it was slow not only
> because it was allocating but also because it was just 1 syscall per
> 1x addr (yikes!). I somehow struggle to imagine how e.g. scanning
> (really allocating) a 128GB buffer cache in future won't cause issues
> - that's like 16-17mln (* 2) syscalls to be issued when not using
> move_pages(2)

Yeah, get_mempolicy() not working on a range is not great.
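
To put numbers on that, a small back-of-the-envelope sketch using the figures quoted above (128GB of shared_buffers, 8kB blocks, 4kB OS pages). The function names are hypothetical and the batch size passed to move_pages() is an assumption of the sketch, not something the patch prescribes:

```c
#include <stdint.h>

/* Number of buffers in a buffer pool of the given size. */
uint64_t
buffer_count(uint64_t shared_buffers_bytes, uint32_t blcksz)
{
    return shared_buffers_bytes / blcksz;
}

/* get_mempolicy() handles one address per call: one syscall per OS page. */
uint64_t
per_page_syscalls(uint64_t shared_buffers_bytes, uint32_t os_page_size)
{
    return shared_buffers_bytes / os_page_size;
}

/* move_pages() takes an array of addresses: ceil(os_pages / batch) calls. */
uint64_t
batched_syscalls(uint64_t os_pages, uint64_t batch)
{
    return (os_pages + batch - 1) / batch;
}
```

For 128GB this is 16777216 buffers and 33554432 OS pages, hence the "16-17mln (* 2)" single-address calls, versus a handful of calls when the addresses are passed in large batches.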

> > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea
> that came to my mind is to issue move_pages() from the backend that
> has already used all of those pages. That literally means one of the
> below ideas:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages().  E.g. this last options is just two lines:
> if(os_page_ptrs[blk2page+j] == 0) {
> +    volatile uint64 touch pg_attribute_unused();
>     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> (os_page_size*j);
> +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> }
> and it seems to work while still issuing much less syscalls with
> move_pages() across backends, well at least here.

One of the main issues I see with 1. and 2. is that we would not get accurate
results should the kernel decide to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migration to see accurate locations.

OTOH, one of the main issues I see with 3. is that the monitoring itself could
influence the kernel's decision to start page migration (I'm not 100% sure, but
I could imagine that having to read/touch the pages has that effect).

But I'm thinking: do we really need to know the location of every single page?
I think what we want to see is whether the pages are "equally" distributed across
all the nodes or are somehow "stuck" to one (or more) nodes. In that case, what about
using get_mempolicy() but on a subset of the buffer cache (say every Nth buffer
or contiguous chunks)? We could create a new function that would accept a
"sampling distance" as a parameter, for example. Thoughts?

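A minimal sketch of the "sampling distance" idea, assuming a per-buffer lookup is available. Everything here is hypothetical: node_lookup_fn stands in for whatever would actually ask the kernel (e.g. a get_mempolicy() call on the buffer's address), MAX_NODES is an arbitrary bound for the sketch, and demo_lookup() only exists so the example can be run without a NUMA machine:

```c
#include <stddef.h>

#define MAX_NODES 64    /* arbitrary upper bound, for this sketch only */

/* Returns the NUMA node of a buffer, or a negative value if unknown. */
typedef int (*node_lookup_fn)(size_t buffer_id);

/*
 * Look at every "distance"-th buffer and count how many samples landed on
 * each node. Samples with an unknown node are visited but not counted.
 * Returns the number of buffers sampled.
 */
size_t
sample_node_distribution(size_t nbuffers, size_t distance,
                         node_lookup_fn lookup, size_t counts[MAX_NODES])
{
    size_t sampled = 0;

    if (distance == 0)
        distance = 1;   /* treat 0 as "look at every buffer" */

    for (size_t i = 0; i < MAX_NODES; i++)
        counts[i] = 0;

    for (size_t buf = 0; buf < nbuffers; buf += distance)
    {
        int node = lookup(buf);

        if (node >= 0 && node < MAX_NODES)
            counts[node]++;
        sampled++;
    }
    return sampled;
}

/* Stand-in lookup: first 1000 buffers on node 0, the rest on node 1. */
int
demo_lookup(size_t buffer_id)
{
    return buffer_id < 1000 ? 0 : 1;
}
```

With 2000 buffers and a distance of 100, this samples 20 buffers instead of 2000 and still shows the even split between the two nodes. A fixed stride can alias with any periodic placement pattern, so contiguous chunks or a randomized offset may be worth considering too.
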

Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: