
From: Jakub Wartak
Subject: Re: Draft for basic NUMA observability
Msg-id: CAKZiRmz7FWPAJkU6_-5S+e_fd184SWYnLa888wTske15MD_12A@mail.gmail.com
In response to: Re: Draft for basic NUMA observability (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Hi Bertrand,

TL;DR: the main problem seems to be choosing how to page-fault the
shared memory before the backend calls numa_move_pages(), as the memory
mappings (fresh after fork()/CoW) do not appear to be ready for a
numa_move_pages() inquiry.
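
For context, here's a minimal standalone sketch (not patch code; plain
libnuma, assuming <numa.h> and linking with -lnuma) of the inquiry
pattern and of why untouched pages are the problem: with nodes == NULL,
numa_move_pages() only reports where each page currently resides, and a
page this process has never faulted in comes back as -ENOENT in
status[]:

    /*
     * Sketch: query page locations of an anonymous mapping with
     * numa_move_pages().  Pages never faulted in by this process are
     * reported as -ENOENT, which is what a fresh post-fork backend sees
     * for shared memory it has not touched yet.
     * Build with: gcc -o numa_query numa_query.c -lnuma
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>

    int
    main(void)
    {
        size_t      os_page_size = sysconf(_SC_PAGESIZE);
        size_t      len = 4 * os_page_size;
        char       *base;
        void       *pages[4];
        int         status[4];

        if (numa_available() == -1)
        {
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }

        base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* fault in only the first two pages */
        memset(base, 0, 2 * os_page_size);

        for (int i = 0; i < 4; i++)
            pages[i] = base + i * os_page_size;

        /* nodes == NULL means "just report where each page lives" */
        if (numa_move_pages(0, 4, pages, NULL, status, 0) != 0)
            return 1;

        /* expect node ids for pages 0-1, -ENOENT (-2) for the rest */
        for (int i = 0; i < 4; i++)
            printf("page %d: status %d\n", i, status[i]);

        return 0;
    }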

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > We probably would need to split it to some separate and new view
> > within the pg_buffercache extension, but that is going to be slow, yet
> > still provide valid results.
>
> Yup.

OK, so I've made that NUMA inquiry (now with the "volatile touch" to
get valid results for not-yet-used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem of somebody
automatically running into that slow path when simply using
pg_buffercache.

> > > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> >
> > Different perspective: I wanted to use the same approach in the new
> > pg_shmemallocations_numa, but that won't cut it there. The other idea
> > that came to my mind is to issue move_pages() from the backend that
> > has already used all of those pages. That literally means one of the
> > below ideas:
> > 1. from somewhere like checkpointer / bgwriter?
> > 2. add touching memory on backend startup like always (sic!)
> > 3. or just attempt to read/touch memory addr just before calling
> > move_pages().  E.g. this last option is just two lines:
> >
> > if(os_page_ptrs[blk2page+j] == 0) {
> > +    volatile uint64 touch pg_attribute_unused();
> >     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > }
> >
> > and it seems to work while still issuing much less syscalls with
> > move_pages() across backends, well at least here.
>
> One of the main issues I see with 1. and 2. is that we would not get accurate
> results should the kernel decide to migrate the pages. Indeed, the process doing
> the move_pages() call needs to have accessed the pages more recently than any
> kernel migrations to see accurate locations.

We never get a fully accurate state anyway, as page migration between
zones might be happening while we query it. In theory we could add
something to e.g. checkpointer/bgwriter that would perform the inquiry
on demand and report the result back through shared memory (?), but I'm
somewhat afraid of that because, as stated at the end of this email, it
might take some time (we probably wouldn't need to "touch memory" then,
as all of it would already be active there), and that is still an
impact on those bgworkers. Somehow I feel safer if that code is NOT
part of a bgworker.

> OTOH, one of the main issues that I see with 3. is that the monitoring could
> probably influence the kernel's decision to start pages migration (I'm not 100%
> sure but I could imagine it may influence the kernel's decision due to having to
> read/touch the pages).
>
> But I'm thinking: do we really need to know the page location of every single page?
> I think what we want to see is if the pages are "equally" distributed on all
> the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
> using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
> or contiguous chunks). We could create a new function that would accept a
> "sampling distance" as parameter for example, thoughts?

The way I envision it (and I think it is what Andres wanted, though
I'm not sure and have yet to see him comment on all of this) is to give
PG devs a way to quickly spot NUMA imbalances, even for a single
relation. Some DBA in the wild could probably also query it from time
to time to see how PG/the kernel distributes memory. It seems to be
more of a debugging and coding aid for future NUMA optimizations than
something queried constantly by monitoring. I would even dare to say it
would require --enable-debug (or some other developer-only toggle), but
apparently there's no need to hide it like that if those are separate
views.

Changes since the previous version:
0. rebase due to the recent OAuth commit introducing libcurl
1. cast to uint64 for NBuffers, as you found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations. Or should we rather call it
pg_shmem_numa_zones or maybe just pg_shm_numa?

If there is agreement that this is the way we want to have it (from
the backend and not from the checkpointer), here's what's left on the
table to be done here:
a. isn't there something quicker for touching / page-faulting memory?
If not, then maybe add CHECK_FOR_INTERRUPTS() there? (See the sketch
after this list.) BTW, I've tried adding MAP_POPULATE to PG_MMAP_FLAGS,
but that didn't help (it probably only works for the parent/postmaster).
I've also tried MADV_POPULATE_READ (5.14+ kernels only) and that seems
to work too:

+       rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+       if(rc != 0) {
+               elog(NOTICE, "madvise() failed");
+       }
[..]
-                       volatile uint64 touch pg_attribute_unused();
                        os_page_ptrs[i] = (char *)ent->location + (i * os_page_size);
-                       touch = *(uint64 *)os_page_ptrs[i];

With the volatile memory touch or MADV_POPULATE_READ, the result seems
to be reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from pg_shmem_numa_allocations order by numa_size desc;
                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 | 134221824
 XLOG Ctl                                       |            0 |   4206592
 Buffer Descriptors                             |            0 |   1048576
 transaction                                    |            0 |    528384
 Checkpointer Data                              |            0 |    524288
 Checkpoint BufferIds                           |            0 |    327680
 Shared Memory Stats                            |            0 |    311296
[..]

Without at least one of those two, a new backend reports complete garbage:

                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 |    995328
 Shared Memory Stats                            |            0 |    245760
 shmInvalBuffer                                 |            0 |     65536
 Buffer Descriptors                             |            0 |     65536
 Backend Status Array                           |            0 |     61440
 serializable                                   |            0 |     57344
[..]

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy
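
If we stay with doing it from the backend, I imagine item a. could end
up looking roughly like the sketch below (hypothetical helper name
pg_numa_prefault_shmem(), untested; it assumes the patch context where
ShmemBase/ShmemSegHdr and the os_page_ptrs[] array are visible): try
MADV_POPULATE_READ once, fall back to the per-page volatile touch on
older kernels, and sprinkle CHECK_FOR_INTERRUPTS() so the loop stays
cancellable:

    #include "postgres.h"
    #include "miscadmin.h"      /* CHECK_FOR_INTERRUPTS() */
    #include <sys/mman.h>

    /* Hypothetical helper, sketch only. */
    static void
    pg_numa_prefault_shmem(void **os_page_ptrs, size_t os_page_count,
                           Size os_page_size)
    {
        bool        populated = false;

    #ifdef MADV_POPULATE_READ
        /* 5.14+ kernels: fault the whole segment in with one syscall */
        if (madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ) == 0)
            populated = true;
        else
            elog(DEBUG1, "madvise(MADV_POPULATE_READ) failed: %m");
    #endif

        for (size_t i = 0; i < os_page_count; i++)
        {
            os_page_ptrs[i] = (char *) ShmemBase + i * os_page_size;

            if (!populated)
            {
                /* fallback: touch every page so move_pages() sees it */
                volatile uint64 touch pg_attribute_unused();

                touch = *(volatile uint64 *) os_page_ptrs[i];
            }

            /* keep the (potentially long) loop responsive */
            if (i % 1024 == 0)
                CHECK_FOR_INTERRUPTS();
        }
    }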

As for the sampling, dunno, fine with me, perhaps as an optional
argument? But wouldn't it be better to find a way to actually make it
quick?
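
If we do go the sampling route, I'd imagine something roughly like this
(sketch only, requires <numa.h>/-lnuma; sampling_distance being the
parameter you mentioned, names hypothetical): only every Nth OS page
gets touched and handed to numa_move_pages(), so both the page faulting
and the syscall payload shrink by the same factor:

    /* Sketch only: inspect every Nth OS page of a shared memory region. */
    static int
    pg_numa_sample_region(char *base, Size total_size, Size os_page_size,
                          int sampling_distance, void **pages, int *status)
    {
        int         n = 0;

        for (Size off = 0; off < total_size;
             off += (Size) sampling_distance * os_page_size)
        {
            volatile uint64 touch pg_attribute_unused();

            pages[n] = base + off;
            touch = *(volatile uint64 *) pages[n];  /* fault the page in */
            n++;
        }

        /* nodes == NULL: just report where the sampled pages currently live */
        if (numa_move_pages(0, n, pages, NULL, status, 0) != 0)
            elog(WARNING, "numa_move_pages() failed: %m");

        return n;
    }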

OK, so here's a larger test, on 512GB with 8 NUMA nodes and s_b set to
128GB, started with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 Buffer Blocks                                  |            0 | 17179869184
 Buffer Blocks                                  |            1 | 17179869184
 Buffer Blocks                                  |            2 | 17179869184
 Buffer Blocks                                  |            3 | 17179869184
 Buffer Blocks                                  |            4 | 17179869184
 Buffer Blocks                                  |            5 | 17179869184
 Buffer Blocks                                  |            6 | 17179869184
 Buffer Blocks                                  |            7 | 17179869184
 Buffer IO Condition Variables                  |            0 |    33554432
 Buffer IO Condition Variables                  |            1 |    33554432
 Buffer IO Condition Variables                  |            2 |    33554432
[..]

but it takes 23s. Yes, it takes 23s just to gather that info with the
memory touch, but that's ~128GB of memory, and the backend is hardly
responsive meanwhile (lack of C_F_I()). By default, without numactl's
interleave=all, you get a clear picture of the lack of NUMA awareness
in the PG shared segment (just as Andres presented, but now it is
evident; well, it is subject to NUMA autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 commit_timestamp                               |            0 |     2097152
 commit_timestamp                               |            1 |     6291456
 commit_timestamp                               |            2 |           0
 commit_timestamp                               |            3 |           0
 commit_timestamp                               |            4 |           0
[..]
 transaction                                    |            0 |    14680064
 transaction                                    |            1 |           0
 transaction                                    |            2 |           0
 transaction                                    |            3 |           0
 transaction                                    |            4 |     2097152
[..]

Somehow without interleave it is very quick too.

-J.
