Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Draft for basic NUMA observability
Date
Msg-id CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
Whole thread Raw
In response to Re: Draft for basic NUMA observability  (Andres Freund <andres@anarazel.de>)
Responses Re: Draft for basic NUMA observability
List pgsql-hackers
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK first sane version attached with new src/port/pg_numa.c boilerplate
in 0001. Fixed some bugs too, there is one remaining optimization to
be done (see that `static` question later). Docs/tests are still
missing.

QQ: I'm still wondering if we there's better way of exposing  multiple
pg's shma entries pointing to the same page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via additional column
(pg_get_shmem_numa_allocations.shared_pages bool?) ? I'm thinking of
an easy way of showing that a potential NUMA auto balancing could lead
to TLB NUMA shootdowns (not that it is happening or counting , just
identifying it as a problem in allocation). Or that stuff doesn't make
sense as we already have pg_shm_allocations.{off|size} and we could
derive such info from it (after all it is for devs?)?

postgres@postgres:1234 : 18843 # select
name,off,off+allocated_size,allocated_size from pg_shmem_allocations
order by off;
                      name                      |    off    | ?column?
 | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 |
147114240 |            128
 Proc Array                                     | 147274752 |
147275392 |            640
 KnownAssignedXids                              | 147275392 |
147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 |
147319808 |           8960
 Backend Status Array                           | 147319808 |
147381248 |          61440

postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. ProcArray ends and right afterwards KnownAssignedXids start, both
are hot , but on the same HP and NUMA node

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, no comments about that madvise(MADV_POPULATE_READ), so I'm
sticking to pointers.

> > If not then maybe add CHECKS_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see attached file for more verbose results, but in short it is
like below:

patch(-touchpages)               hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on        INVALID RESULTS (-2)
patch(touchpages)                hugepages=off        CORRECT RESULT
patch(touchpages)                hugepages=on        CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on        INVALID RESULTS (-2)

IMHVO, the only other thing that could work here (but still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome, maybe it might be kernel version dependent.

BTW: and yes you can "feel" the timing impact of
MAP_SHARED|MAP_POPULATE during startup, it seems that for our case
child backends that don't come-up with pre-faulted page attachments
across fork() apparently.

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >          SANITIZER_FLAGS: -fsanitize=address
> >          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +      # FIXME: use or not the libnuma?
> > +      #      --with-libnuma \
> > +      #
> >        # Normally, the "relation segment" code basically has no coverage in our
> >        # tests, because we (quite reasonably) don't generate tables large
> >        # enough in tests. We've had plenty bugs that we didn't notice due
> > the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might turn in depend on pg_buffercache.  I think
> CASCADE is rarely, if ever, ok to use in an extension scripot.

... it's just me cutting corners :^), fixed now.

[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
> for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >                       else
> >                               fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +                     blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks I was looking for something like this! Is that +1 in v4 good?

> > +                     j = 0;
> > +                     do {
> > +                             /*
> > +                              * Many buffers can point to the same page, but we want to
> > +                              * query just first address.
> > +                              *
> > +                              * In order to get reliable results we also need to touch memory pages
> > +                              * so that inquiry about NUMA zone doesn't return -2.
> > +                              */
> > +                             if(os_page_ptrs[blk2page+j] == 0) {
> > +                                     volatile uint64 touch pg_attribute_unused();
> > +                                     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +                                     touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +                             }
> > +                             j++;
> > +                     } while(j < (int)pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

> > +#ifdef USE_LIBNUMA
> > +             if(query_numa) {
> > +                     /* According to numa(3) it is required to initialize library even if that's no-op. */
> > +                     if(numa_available() == -1) {
> > +                             pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                             elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> > +                     } else {
> > +                             /* Amortize the number of pages we need to query about */
> > +                             if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                                     elog(ERROR, "failed NUMA pages inquiry status");
> > +                             }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine...

[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in very loop iteration?

Both fixed.

-J.

Attachment

pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: Spinlock can be released twice in procsignal.c
Next
From: Maxim Orlov
Date:
Subject: Re: Proposal: Limitations of palloc inside checkpointer