Thread: Re: Draft for basic NUMA observability

Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time
> ago, I'm attaching the very brief (and potentially way too rushed)
> draft of the first step into NUMA observability on PostgreSQL that was
> based on his presentation [0]. It might be rough, but it is to get us
> started. The patches were not really even basically tested, they are
> more like input for discussion - rather than solid code - to shake out
> what should be the proper form of this.
> 
> Right now it gives:
> 
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by
> numa_zone_id;
> NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch!

Not doing a code review but sharing some experimentation.

First, I had to:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
                Size            os_page_size;
                void            **os_page_ptrs;
                int                     *os_pages_status;
-               int                     os_page_count;
+               uint64          os_page_count;

and

-               os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;

to make it work with non-tiny shared_buffers.
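
For illustration, a minimal standalone C sketch (with assumed example values, not the patch itself) of why that widening matters: with plain int arithmetic, NBuffers * BLCKSZ already overflows once shared_buffers exceeds 2GB of 8kB blocks, so the product has to be widened before the division:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int         nbuffers = 4194304;     /* e.g. shared_buffers = 32GB of 8kB blocks */
    int         blcksz = 8192;
    uint64_t    os_page_size = 4096;
    uint64_t    os_page_count;

    /* widen before multiplying; nbuffers * blcksz would not fit in an int */
    os_page_count = ((uint64_t) nbuffers * blcksz) / os_page_size;

    printf("os_page_count=%llu\n", (unsigned long long) os_page_count);
    return 0;
}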

Observations:

when using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).

Let me show the results:

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580

In session 2:

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session 1's pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

Results show:

- correct NUMA distribution in session 1
- correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
  with the pid of session 1 (the session that actually accessed the buffers) as a parameter

Which makes me wonder if using numa_move_pages()/move_pages is the
right approach. Would be curious to know if you observe the same behavior though.

The initial idea that you shared on discord was to use get_mempolicy() but
as Andres stated:

"
One annoying thing about get_mempolicy() is this:

If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread
had performed a read (load) access to that address, and return the ID of the node where that page was allocated.

Forcing the allocation to happen inside a monitoring function is decidedly not great.
"

The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers, i.e.
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
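
For reference, here is a small standalone sketch (example only, not the patch; build with -lnuma) of the get_mempolicy(MPOL_F_NODE | MPOL_F_ADDR) inquiry being discussed; note that, per the man page quoted above, it would allocate the page itself if it were not already present, which is why touching it first (or restricting it to valid buffers) matters:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    char       *buf = malloc(4096);
    int         node = -1;

    buf[0] = 1;                 /* fault the page in first */

    /* MPOL_F_NODE | MPOL_F_ADDR: return the NUMA node backing 'buf' in 'node' */
    if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("page is on NUMA node %d\n", node);
    else
        perror("get_mempolicy");

    free(buf);
    return 0;
}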

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> Hi Bertrand,
> 
> Thanks for playing with this!
> 
> > Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if
> > you observe the same behavior though.
> 
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> 
> We probably would need to split it to some separate and new view
> within the pg_buffercache extension, but that is going to be slow, yet
> still provide valid results.

Yup.

> In the previous approach, get_mempolicy() was allocating on 1st
> access, but it was slow not only because it was allocating but also
> because it was just 1 syscall per address (yikes!). I somehow struggle
> to imagine how e.g. scanning (really allocating) a 128GB buffer cache
> in the future won't cause issues - that's like 16-17 million (* 2)
> syscalls to be issued when not using move_pages(2)

Yeah, get_mempolicy() not working on a range is not great.

> > But maybe we could use get_mempolicy() only on "valid" buffers, i.e.
> > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> 
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea
> that came to my mind is to issue move_pages() from the backend that
> has already used all of those pages. That literally means one of the
> ideas below:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages().  E.g. this last option is just two lines:
> 
> if(os_page_ptrs[blk2page+j] == 0) {
> +    volatile uint64 touch pg_attribute_unused();
>     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> (os_page_size*j);
> +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> }
> 
> and it seems to work while still issuing much less syscalls with
> move_pages() across backends, well at least here.

One of the main issues I see with 1. and 2. is that we would not get accurate
results should the kernel decide to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migration to see accurate locations.

OTOH, one of the main issues that I see with 3. is that the monitoring could
probably influence the kernel's decision to start page migration (I'm not 100%
sure, but I could imagine it may influence the kernel's decision due to having to
read/touch the pages).

But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Jakub Wartak

Hi Bertrand,

TL;DR: the main problem seems to be choosing how to page-fault the
shared memory before the backend uses numa_move_pages(), as the memory
mappings (fresh after fork()/CoW) do not seem to be ready for a
numa_move_pages() inquiry.
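
To make that reproducible outside PostgreSQL, here is a standalone sketch (anonymous memory standing in for the shared segment; build with -lnuma): move_pages() reports -ENOENT (-2) in the status array for pages the calling process has not faulted in, and a node id only after the page has been touched:

#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
    size_t      len = 4096;
    void       *pages[1];
    int         status[1] = {-999};

    if (numa_available() == -1)
        return 1;

    pages[0] = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages[0] == MAP_FAILED)
        return 1;

    numa_move_pages(0, 1, pages, NULL, status, 0);
    printf("before touch: status=%d\n", status[0]);     /* typically -2 (-ENOENT) */

    *(volatile char *) pages[0] = 1;                    /* fault the page in */

    numa_move_pages(0, 1, pages, NULL, status, 0);
    printf("after touch:  status=%d\n", status[0]);     /* a NUMA node id */

    munmap(pages[0], len);
    return 0;
}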

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > We probably would need to split it to some separate and new view
> > within the pg_buffercache extension, but that is going to be slow, yet
> > still provide valid results.
>
> Yup.

OK, so I've moved that NUMA inquiry (now with the "volatile touch" to
get valid results for not-yet-accessed memory) into a new and separate
pg_buffercache_numa view. This avoids the problem of somebody
automatically running into this slow path when using pg_buffercache.

> > > But maybe we could use get_mempolicy() only on "valid" buffers, i.e.
> > > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> >
> > Different perspective: I wanted to use the same approach in the new
> > pg_shmemallocations_numa, but that won't cut it there. The other idea
> > that came to my mind is to issue move_pages() from the backend that
> > has already used all of those pages. That literally means one of the
> > ideas below:
> > 1. from somewhere like checkpointer / bgwriter?
> > 2. add touching memory on backend startup like always (sic!)
> > 3. or just attempt to read/touch memory addr just before calling
> > move_pages().  E.g. this last option is just two lines:
> >
> > if(os_page_ptrs[blk2page+j] == 0) {
> > +    volatile uint64 touch pg_attribute_unused();
> >     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> > (os_page_size*j);
> > +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > }
> >
> > and it seems to work while still issuing much less syscalls with
> > move_pages() across backends, well at least here.
>
> One of the main issue I see with 1. and 2. is that we would not get accurate
> results should the kernel decides to migrate the pages. Indeed, the process doing
> the move_pages() call needs to have accessed the pages more recently than any
> kernel migrations to see accurate locations.

We never get a fully accurate state, as NUMA page migration might be
happening as we query it, but in theory we could add something to
e.g. checkpointer/bgwriter that would run the inquiry on demand and
report it back through shared memory (?). I'm somewhat afraid of that,
though, because as stated at the end of this email it might take some
time (well, we probably wouldn't need to "touch memory" then after all,
as all of it is active), and that's still an impact on those bgworkers.
Somehow I feel safer if that code is NOT part of a bgworker.

> OTOH, one of the main issue that I see with 3. is that the monitoring could
> probably influence the kernel's decision to start pages migration (I'm not 100%
> sure but I could imagine it may influence the kernel's decision due to having to
> read/touch the pages).
>
> But I'm thinking: do we really need to know the page location of every single page?
> I think what we want to see is if the pages are "equally" distributed on all
> the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
> using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
> or contiguous chunks). We could create a new function that would accept a
> "sampling distance" as parameter for example, thoughts?

The way I envision it (and I think what Andres wanted; not sure, still
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for a single relation. Probably some
DBA in the wild could also query it from time to time to see how
PG/the kernel distributes memory. It seems to be more of a debugging
and coding aid for future NUMA optimizations than something queried
constantly by monitoring. I would even dare to say it would require
--enable-debug (or some other developer-only toggle), but apparently
there's no need to hide it like that if those are separate views.

Changes since previous version:
0. rebase due to the recent OAuth commit introducing libcurl
1. cast NBuffers to uint64 as you found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations. Or should we rather call it
pg_shmem_numa_zones, or maybe just pg_shm_numa?

If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory?
If not then maybe add CHECK_FOR_INTERRUPTS() there? BTW I've tried an
additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't help (it
probably only works for the parent/postmaster). I've also tried
MADV_POPULATE_READ (5.14+ kernels only) and that seems to work too:

+       rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+       if(rc != 0) {
+               elog(NOTICE, "madvise() failed");
+       }
[..]
-                       volatile uint64 touch pg_attribute_unused();
                        os_page_ptrs[i] = (char *)ent->location + (i * os_page_size);
-                       touch = *(uint64 *)os_page_ptrs[i];

with volatile touching memory or MADV_POPULATE_READ the result seems
to reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from
pg_shmem_numa_allocations  order by numa_size desc;
                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 | 134221824
 XLOG Ctl                                       |            0 |   4206592
 Buffer Descriptors                             |            0 |   1048576
 transaction                                    |            0 |    528384
 Checkpointer Data                              |            0 |    524288
 Checkpoint BufferIds                           |            0 |    327680
 Shared Memory Stats                            |            0 |    311296
[..]

without at least one of those two, new backend reports complete garbage:

                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 |    995328
 Shared Memory Stats                            |            0 |    245760
 shmInvalBuffer                                 |            0 |     65536
 Buffer Descriptors                             |            0 |     65536
 Backend Status Array                           |            0 |     61440
 serializable                                   |            0 |     57344
[..]

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy

As for the sampling, dunno, fine for me. As an optional argument? but
wouldn't it be better to find a way to actually for it to be quick?

OK, so here's larger test, on 512GB with 8x NUMA nodes and s_b set to
128GB with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 Buffer Blocks                                  |            0 | 17179869184
 Buffer Blocks                                  |            1 | 17179869184
 Buffer Blocks                                  |            2 | 17179869184
 Buffer Blocks                                  |            3 | 17179869184
 Buffer Blocks                                  |            4 | 17179869184
 Buffer Blocks                                  |            5 | 17179869184
 Buffer Blocks                                  |            6 | 17179869184
 Buffer Blocks                                  |            7 | 17179869184
 Buffer IO Condition Variables                  |            0 |    33554432
 Buffer IO Condition Variables                  |            1 |    33554432
 Buffer IO Condition Variables                  |            2 |    33554432
[..]

but it takes 23s. Yes, it takes 23s just to gather that info with the
memory touching, but that's ~128GB of memory and it is hardly responsive
(lack of C_F_I()). By default, without numactl's interleave=all, you get
a clear picture of the lack of NUMA awareness in the PG shared segment
(just as Andres presented, but now it is evident; well, it is subject to
NUMA autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 commit_timestamp                               |            0 |     2097152
 commit_timestamp                               |            1 |     6291456
 commit_timestamp                               |            2 |           0
 commit_timestamp                               |            3 |           0
 commit_timestamp                               |            4 |           0
[..]
 transaction                                    |            0 |    14680064
 transaction                                    |            1 |           0
 transaction                                    |            2 |           0
 transaction                                    |            3 |           0
 transaction                                    |            4 |     2097152
[..]

Somehow without interleave it is very quick too.

-J.

Attachment

Re: Draft for basic NUMA observability

From: Andres Freund

Hi,

On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
> On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > OTOH, one of the main issue that I see with 3. is that the monitoring could
> > probably influence the kernel's decision to start pages migration (I'm not 100%
> > sure but I could imagine it may influence the kernel's decision due to having to
> > read/touch the pages).
> >
> > But I'm thinking: do we really need to know the page location of every single page?
> > I think what we want to see is if the pages are "equally" distributed on all
> > the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
> > using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
> > or contiguous chunks). We could create a new function that would accept a
> > "sampling distance" as parameter for example, thoughts?
> 
> The way I envision it (and I think what Andres wanted, not sure, still
> yet to see him comment on all of this) is to give PG devs a way to
> quickly spot NUMA imbalances, even for single relation.

Yea. E.g. for some benchmark workloads, whether the root btree page is
on the same NUMA node as the workload or not makes a roughly 2x perf
difference. It's really hard to determine that today.


> If there would be agreement that this is the way we want to have it
> (from the backend and not from checkpointer), here's what's left on
> the table to be done here:

> a. isn't there something quicker for touching / page-faulting memory ?

If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...


> If not then maybe add CHECK_FOR_INTERRUPTS() there?

Should definitely be there.


> BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> help (it probably only works for parent//postmaster).

Yes, needs to be in postmaster.

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?


FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.




> b. refactor shared code so that it goes into src/port (but with
> Linux-only support so far)
> c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?

You mean a specific context instead of CurrentMemoryContext?


> diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> index 91b51142d2e..e3b7554d9e8 100644
> --- a/.cirrus.tasks.yml
> +++ b/.cirrus.tasks.yml
> @@ -436,6 +436,10 @@ task:
>          SANITIZER_FLAGS: -fsanitize=address
>          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
>  
> +
> +      # FIXME: use or not the libnuma?
> +      #      --with-libnuma \
> +      #
>        # Normally, the "relation segment" code basically has no coverage in our
>        # tests, because we (quite reasonably) don't generate tables large
>        # enough in tests. We've had plenty bugs that we didn't notice due
> the

I don't see why not.


> diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> new file mode 100644
> index 00000000000..e5b3d1f7dd2
> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,30 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the function.
> +DROP FUNCTION pg_buffercache_pages() CASCADE;

Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might in turn depend on pg_buffercache.  I think
CASCADE is rarely, if ever, ok to use in an extension script.


> +CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
> +RETURNS SETOF RECORD
> +AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
> +LANGUAGE C PARALLEL SAFE;
>
> +-- Create a view for convenient access.
> +CREATE OR REPLACE VIEW pg_buffercache AS
> +    SELECT P.* FROM pg_buffercache_pages(false) AS P
> +    (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +     relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +     pinning_backends int4);
> +
> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
> +    SELECT P.* FROM pg_buffercache_pages(true) AS P
> +    (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +     relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +     pinning_backends int4, numa_zone_id int4);
> +
> +-- Don't want these to be available to public.
> +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;

We grant pg_monitor SELECT ON pg_buffercache, I think we should do the same
for _numa?


> @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
>              else
>                  fctx->record[i].isvalid = false;
>  
> +#ifdef USE_LIBNUMA
> +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> +#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +            blk2page = (int) i * pages_per_blk;

BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.
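
(For reference, a one-line sketch of what that could look like, assuming i is the 0-based buffer index from the surrounding loop; Buffer numbers are 1-based, hence the +1:)

os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);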



> +            j = 0;
> +            do {
> +                /*
> +                 * Many buffers can point to the same page, but we want to
> +                 * query just first address.
> +                 *
> +                 * In order to get reliable results we also need to touch memory pages
> +                 * so that inquiry about NUMA zone doesn't return -2.
> +                 */
> +                if(os_page_ptrs[blk2page+j] == 0) {
> +                    volatile uint64 touch pg_attribute_unused();
> +                    os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> +                    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> +                }
> +                j++;
> +            } while(j < (int)pages_per_blk);
> +#endif
> +

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.
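
For illustration, a sketch (not the actual patch) of that deferred variant: query first, then touch and re-query only the pages that came back as -2 (-ENOENT):

numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0);

for (uint64 i = 0; i < os_page_count; i++)
{
    if (os_pages_status[i] == -ENOENT)      /* -2: page not faulted in yet */
    {
        volatile uint64 touch pg_attribute_unused();

        touch = *(uint64 *) os_page_ptrs[i];    /* fault it in */
        numa_move_pages(0, 1, &os_page_ptrs[i], NULL, &os_pages_status[i], 0);
    }
}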


> +#ifdef USE_LIBNUMA
> +        if(query_numa) {
> +            /* According to numa(3) it is required to initialize library even if that's no-op. */
> +            if(numa_available() == -1) {
> +                pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> +                elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> +            } else {
> +                /* Amortize the number of pages we need to query about */
> +                if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> +                    elog(ERROR, "failed NUMA pages inquiry status");
> +                }

I wonder if we ought to override numa_error() so we can display more useful
errors.


> +
> +    LWLockAcquire(ShmemIndexLock, LW_SHARED);

Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.


> +        os_page_ptrs = palloc(sizeof(void *) * os_page_count);
> +        os_pages_status = palloc(sizeof(int) * os_page_count);

Why do this in every loop iteration?


Greetings,

Andres Freund



Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).

That said, wouldn't it be too strong to impose a restriction that huge_pages
must be enabled?

Jakub, thanks for the new patch version! FWIW, I did not look closely at the
code yet (I just did the minor changes already shared to get valid results with
a non-tiny shared buffer size). I'll look closely at the code for sure once we
all agree on the design part of it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Jakub Wartak

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK, the first sane version is attached, with the new src/port/pg_numa.c
boilerplate in 0001. Fixed some bugs too; there is one remaining
optimization to be done (see the `static` question later). Docs/tests
are still missing.

QQ: I'm still wondering whether there's a better way of exposing multiple
PG shmem entries pointing to the same OS page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of
an easy way of showing that potential NUMA auto-balancing could lead
to NUMA TLB shootdowns (not that it is happening or being counted, just
identifying it as a problem in the allocation layout). Or maybe that stuff
doesn't make sense, as we already have pg_shmem_allocations.{off|size} and
we could derive such info from it (after all, it is for devs?)?

postgres@postgres:1234 : 18843 # select name,off,off+allocated_size,allocated_size
from pg_shmem_allocations order by off;
                      name                      |    off    | ?column?  | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 | 147114240 |            128
 Proc Array                                     | 147274752 | 147275392 |            640
 KnownAssignedXids                              | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 | 147319808 |           8960
 Backend Status Array                           | 147319808 | 147381248 |          61440

postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. Proc Array ends and KnownAssignedXids starts right afterwards; both
are hot, but they land on the same huge page and NUMA node.

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, no comments about that madvise(MADV_POPULATE_READ), so I'm
sticking to pointers.

> > If not then maybe add CHECK_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see attached file for more verbose results, but in short it is
like below:

patch(-touchpages)               hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on        INVALID RESULTS (-2)
patch(touchpages)                hugepages=off        CORRECT RESULT
patch(touchpages)                hugepages=on        CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on        INVALID RESULTS (-2)

IMHO, the only other thing that could work here (but still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome; maybe it is kernel version dependent.

BTW: and yes, you can "feel" the timing impact of
MAP_SHARED|MAP_POPULATE during startup; it seems that in our case
child backends apparently don't come up with pre-faulted page mappings
across fork().

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >          SANITIZER_FLAGS: -fsanitize=address
> >          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +      # FIXME: use or not the libnuma?
> > +      #      --with-libnuma \
> > +      #
> >        # Normally, the "relation segment" code basically has no coverage in our
> >        # tests, because we (quite reasonably) don't generate tables large
> >        # enough in tests. We've had plenty bugs that we didn't notice due
> > the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache.  I think
> CASCADE is rarely, if ever, ok to use in an extension script.

... it's just me cutting corners :^), fixed now.

[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT ON pg_buffercache, I think we should do the same
> for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >                       else
> >                               fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +                     blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks I was looking for something like this! Is that +1 in v4 good?

> > +                     j = 0;
> > +                     do {
> > +                             /*
> > +                              * Many buffers can point to the same page, but we want to
> > +                              * query just first address.
> > +                              *
> > +                              * In order to get reliable results we also need to touch memory pages
> > +                              * so that inquiry about NUMA zone doesn't return -2.
> > +                              */
> > +                             if(os_page_ptrs[blk2page+j] == 0) {
> > +                                     volatile uint64 touch pg_attribute_unused();
> > +                                     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +                                     touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +                             }
> > +                             j++;
> > +                     } while(j < (int)pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

> > +#ifdef USE_LIBNUMA
> > +             if(query_numa) {
> > +                     /* According to numa(3) it is required to initialize library even if that's no-op. */
> > +                     if(numa_available() == -1) {
> > +                             pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                             elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> > +                     } else {
> > +                             /* Amortize the number of pages we need to query about */
> > +                             if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                                     elog(ERROR, "failed NUMA pages inquiry status");
> > +                             }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer, as I never hit this error from
numa_move_pages(); one gets invalid values in *os_pages_status instead.
BUT!: most of our patch just uses libnuma calls that cannot fail. One
way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's
hard to unmount it); then most of the numactl stuff still works as
euid != 0, but numactl --hardware gets at least "libnuma: Warning:
Cannot parse distance information in sysfs: Permission denied", and
it's the same story with numactl -C 678 date. So unless we start much
heavier use of libnuma (not just for observability), there's like no
point in that right now (?). Contrary to that: we could just do a
variadic elog() for that; I've put in some code, but no idea if it
works fine...
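
For what it's worth, a minimal sketch of what such an override could look like (libnuma exposes numa_error() as a weak symbol, so a backend can provide its own definition; this is an illustration, not necessarily what the attached code does):

/* override libnuma's weak numa_error() hook so its errors go through elog() */
void
numa_error(char *where)
{
    elog(WARNING, "libnuma: error in \"%s\": %m", where);
}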

[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in every loop iteration?

Both fixed.

-J.

Attachment

Re: Draft for basic NUMA observability

From: Jakub Wartak

On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> > Does the issue with "new" backends seeing pages as not present exist both with
> > and without huge pages?
>
> That's a good point and from what I can see it's correct with huge pages being
> used (it means all processes see the same NUMA node assignment regardless of
> access patterns).

Hi Bertrand , please see that nearby thread. I've got quite the
opposite results. I need page-fault memory or I get invalid results
("-2"). What kernel version are you using ? (I've tried it on two
6.10.x series kernels , virtualized in both cases; one was EPYC [real
NUMA, but not VM so not real hardware]).

> That said, wouldn't that be too strong to impose a restriction that huge_pages
> must be enabled?
>
> Jakub, thanks for the new patch version! FWIW, I did not look closely to the
> code yet (just did the minor changes already shared to have valid result with non
> tiny shared buffer size). I'll look closely at the code for sure once we all agree
> on the design part of it.

Cool, I think we are pretty close actually, but others might have a
different perspective.

-J.



Re: Draft for basic NUMA observability

From: Andres Freund

Hi,

On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > FWIW, what you posted fails on CI:
> > https://cirrus-ci.com/task/5114213770723328
> >
> > Probably some ifdefs are missing. The sanity-check task configures with
> > minimal dependencies, which is why you're seeing this even on linux.
> 
> Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> of this CI stuff...

FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Greetings,

Andres Freund



Re: Draft for basic NUMA observability

From: Jakub Wartak

On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > > FWIW, what you posted fails on CI:
> > > https://cirrus-ci.com/task/5114213770723328
> > >
> > > Probably some ifdefs are missing. The sanity-check task configures with
> > > minimal dependencies, which is why you're seeing this even on linux.
> >
> > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> > of this CI stuff...
>
> FYI, you can enable CI on a github repo, to see results without posting to the
> list:
> https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Thanks, I'll take a look into it.

Meanwhile v5 is attached with slight changes to try to make cfbot happy:
1. fixed tests and added tiny copy-cat basic tests for
pg_buffercache_numa and pg_shm_numa_allocations views
2. win32 doesn't have sysconf()

No docs yet.

-J.

Attachment

Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi,

On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote:
> On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > > > FWIW, what you posted fails on CI:
> > > > https://cirrus-ci.com/task/5114213770723328
> > > >
> > > > Probably some ifdefs are missing. The sanity-check task configures with
> > > > minimal dependencies, which is why you're seeing this even on linux.
> > >
> > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> > > of this CI stuff...
> >
> > FYI, you can enable CI on a github repo, to see results without posting to the
> > list:
> > https://github.com/postgres/postgres/blob/master/src/tools/ci/README
> 
> Thanks, I'll take a look into it.
> 
> Meanwhile v5 is attached with slight changes to try to make cfbot happy:

Thanks for the updated version!

FWIW, I had to do a few changes to get an error free compiling experience with
autoconf/or meson and both with or without the libnuma configure option.

Sharing here as .txt files:

v5-0004-configure-changes.txt: changes in configure + add a test on numa.h
availability and a call to numa_available.

v5-0005-pg_numa.c-changes.txt: moving the <unistd.h> outside of USE_LIBNUMA
because the file is still using sysconf() in the non-NUMA code path. Also,
removed a ";" in "#endif;" in the non-NUMA code path.

v5-0006-meson.build-changes.txt.

Those apply on top of your v5.

Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend on the libnuma configure option.

Still have not looked closely at the code.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi Jakub,

On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote:
> On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> > > Does the issue with "new" backends seeing pages as not present exist both with
> > > and without huge pages?
> >
> > That's a good point and from what I can see it's correct with huge pages being
> > used (it means all processes see the same NUMA node assignment regardless of
> > access patterns).
> 
> Hi Bertrand , please see that nearby thread. I've got quite the
> opposite results. I need page-fault memory or I get invalid results
> ("-2"). What kernel version are you using ? (I've tried it on two
> 6.10.x series kernels , virtualized in both cases; one was EPYC [real
> NUMA, but not VM so not real hardware])

Thanks for sharing your numbers!

It looks like, with huge pages enabled, the shared_buffers size plays a role.

1. With hp, shared_buffers 4GB:

 huge_pages_status
-------------------
 on
(1 row)

 shared_buffers
----------------
 4GB
(1 row)

NOTICE:  os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id | count
--------------+--------
              | 507618
            0 |   1054
           -2 |  15616
(3 rows)

2. With hp, shared_buffers 23GB:

 huge_pages_status
-------------------
 on
(1 row)

 shared_buffers
----------------
 23GB
(1 row)

NOTICE:  os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id |  count
--------------+---------
              | 2997974
            0 |   16682
(2 rows)

3. no hp, shared_buffers 23GB:

 huge_pages_status
-------------------
 off
(1 row)

 shared_buffers
----------------
 23GB
(1 row)

ERROR:  extension "pg_buffercache" already exists
NOTICE:  os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 2997975
           -2 |   16482
            1 |     199
(3 rows)

Maybe the kernel is taking some decisions based on HugePages_Rsvd, I've
no idea. Anyway, there is little we can do, and the "touchpages" patch
seems to provide "accurate" results.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Jakub Wartak

On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]
> > Meanwhile v5 is attached with slight changes to try to make cfbot happy:
>
> Thanks for the updated version!
>
> FWIW, I had to do a few changes to get an error free compiling experience with
> autoconf/or meson and both with or without the libnuma configure option.
>
> Sharing here as .txt files:
>
> Also the pg_buffercache test fails without the libnuma configure option. Maybe
> some tests should depend of the libnuma configure option.
[..]

Thank you so much for this, Bertrand!

I've applied those, played a little bit with configure and meson,
reproduced the test error, and fixed it by silencing that NOTICE in the
tests. So v6 is attached even before I get a chance to start using
that CI. Still waiting for some input and tests regarding that earlier
touchpages attempt; docs are still missing...

-J.

Attachment

Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi,

On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote:
> On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> [..]
> > > Meanwhile v5 is attached with slight changes to try to make cfbot happy:
> >
> > Thanks for the updated version!
> >
> > FWIW, I had to do a few changes to get an error free compiling experience with
> > autoconf/or meson and both with or without the libnuma configure option.
> >
> > Sharing here as .txt files:
> >
> > Also the pg_buffercache test fails without the libnuma configure option. Maybe
> > some tests should depend of the libnuma configure option.
> [..]
> 
> Thank you so much for this Bertrand !
> 
> I've applied those , played a little bit with configure and meson and
> reproduced the test error and fixed it by silencing that NOTICE in
> tests. So v6 is attached even before I get a chance to start using
> that CI. Still waiting for some input and tests regarding that earlier
> touchpages attempt, docs are still missing...

Thanks for the new version!

I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).

A few random comments:

=== 1

+               /*
+                * This is for gathering some NUMA statistics. We might be using
+                * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+                * allocated in various different OS memory pages sizes, so first we
+                * need to understand the OS memory page size before calling
+                * move_pages()
+                */
+               os_page_size = pg_numa_get_pagesize();
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+               pages_per_blk = (float) BLCKSZ / os_page_size;
+
+               elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+                        os_page_count, os_page_size, pages_per_blk);
+
+               os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+               os_pages_status = palloc(sizeof(int) * os_page_count);
+               memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here
+                */
+               memset(os_pages_status, 0xff, sizeof(int) * os_page_count);

I think that the if (query_numa) check should also wrap that entire section of code.

=== 2

+                       if (query_numa)
+                       { 
+                          blk2page = (int) i * pages_per_blk;
+                          j = 0;
+                          do
+                          {

This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just leave the
current pg_buffercache_pages() as it is. That said, it would be great to avoid
code duplication as much as possible, maybe using a shared
populate_buffercache_entry() or similar helper function?
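
(Just to illustrate the shape I have in mind; the name and signature are hypothetical:)

/* hypothetical helper shared by pg_buffercache_pages() and a new
 * pg_buffercache_numa_pages(), filling one result row */
static void
populate_buffercache_entry(int buf_id, BufferCachePagesContext *fctx,
                           bool query_numa);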

=== 3

+#define ONE_GIGABYTE 1024*1024*1024
+                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
+                                                       CHECK_FOR_INTERRUPTS();
+                                       }

Did you observe a noticeable performance impact when calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see one with a 30GB shared buffer.) I have the
feeling that we could get rid of the "ONE_GIGABYTE" check.

=== 4

+               pfree(os_page_ptrs);
+               pfree(os_pages_status);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 5

+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{

That's a good idea.

+               for (i = 0; i < shm_ent_page_count; i++)
+               {
+                       /*
+                        * In order to get reliable results we also need to touch memory
+                        * pages so that inquiry about NUMA zone doesn't return -2.
+                        */
+                       volatile uint64 touch pg_attribute_unused();
+
+                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);

That sounds right.

Could we also avoid some code duplication with pg_get_shmem_allocations()?

Also same remarks about pfree() and ONE_GIGABYTE as above.

A few other things:

==== 6

+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "port/pg_numa.h"

Not at the right position, should be between those 2:

#include "miscadmin.h"
#include "storage/lwlock.h"

==== 7

+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *       Miscellaneous functions for bit-wise operations.

description is not correct. Also the "Copyright (c) 2019-2025" might be
"Copyright (c) 2025" instead.

=== 8

+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ *             Basic NUMA portability routines

s/numa.c/pg_numa.c/ ?

=== 9

+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
  *       contrib/pg_buffercache/pg_buffercache_pages.c
  *-------------------------------------------------------------------------
  */
+#include "pg_config.h"
 #include "postgres.h"

Is this new include needed?

 #include "access/htup_details.h"
@@ -13,10 +14,12 @@
 #include "funcapi.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"

not in the right order.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Bertrand Drouvot

Hi,

On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
> On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Why is this done before we even have gotten -2 back? Even if we need it, it
> > seems like we ought to defer this until necessary.
> 
> Not fixed yet: maybe we could even do a `static` with
> `has_this_run_earlier` and just perform this work only once during the
> first time?

Not sure I get your idea, could you share what the code would look like?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From: Jakub Wartak

Hi!

On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> I did some tests and it looks like it's giving correct results. I don't see -2
> anymore and every backend reports correct distribution (with or without hp,
> with "small" or "large" shared buffer).

Cool! Attached is v7, which is fully green on the Cirrus CI that Andres
recommended; we will see how cfbot reacts to this. BTW, docs are still
missing. When started properly with interleave=all, s_b=64GB,
huge pages and 4 NUMA nodes (1 socket with 4 CCDs), after a small pgbench run:

postgres=# select buffers_used, buffers_unused from pg_buffercache_summary();
 buffers_used | buffers_unused
--------------+----------------
       170853 |        8217755
(1 row)

postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group
by numa_zone_id order by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id |  count
--------------+---------
            0 |   42752
            1 |   42752
            2 |   42752
            3 |   42597
              | 8217755
Time: 5828.100 ms (00:05.828)
postgres=# select 3*42752+42597;
 ?column?
----------
   170853

postgres=# select * from pg_shmem_numa_allocations order by numa_size
desc limit 12;
DEBUG:  NUMA: page-faulting shared memory segments for proper NUMA readouts
        name        | numa_zone_id |  numa_size
--------------------+--------------+-------------
 Buffer Blocks      |            0 | 17179869184
 Buffer Blocks      |            1 | 17179869184
 Buffer Blocks      |            3 | 17179869184
 Buffer Blocks      |            2 | 17179869184
 Buffer Descriptors |            2 |   134217728
 Buffer Descriptors |            1 |   134217728
 Buffer Descriptors |            0 |   134217728
 Buffer Descriptors |            3 |   134217728
 Checkpointer Data  |            1 |    67108864
 Checkpointer Data  |            0 |    67108864
 Checkpointer Data  |            2 |    67108864
 Checkpointer Data  |            3 |    67108864

Time: 68.579 ms


> A few random comments:
>
> === 1
>
[..]
> I think that the if (query_numa) check should also wrap that entire section of code.

Done.


> === 2
>
> +                       if (query_numa)
> +                       {
> +                          blk2page = (int) i * pages_per_blk;
> +                          j = 0;
> +                          do
> +                          {
>
> This check is done for every page. I wonder if it would not make sense
> to create a brand new function for pg_buffercache_numa and just let the
> current pg_buffercache_pages() as it is. That said it would be great to avoid
> code duplication as much a possible though, maybe using a shared
> populate_buffercache_entry() or such helper function?

Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place; look at those TupleDescInitEntry() calls...
IMHO hardly anybody uses pg_buffercache, but maybe we could add unlikely()
there to hint the compiler towards a smaller routine and reduce the
complexity of that main routine (assuming the NUMA inquiry is going to be
rare)?

> === 3
>
> +#define ONE_GIGABYTE 1024*1024*1024
> +                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
> +                                                       CHECK_FOR_INTERRUPTS();
> +                                       }
>
> Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
> for every page instead? (I don't see with a 30GB shared buffer). I've the
> feeling that we could get rid of the "ONE_GIGABYTE" check.

You are right, and no, it was simply a premature optimization attempt on my
part; on closer look, CFI apparently already uses unlikely() and is really
cheap, so yeah, I've removed that.

> === 4
>
> +               pfree(os_page_ptrs);
> +               pfree(os_pages_status);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).

Yes, I wanted to have it just for illustrative and stylistic purposes,
but right, removed.

> === 5
>
> +/* SQL SRF showing NUMA zones for allocated shared memory */
> +Datum
> +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
> +{
[..]
>
> +               for (i = 0; i < shm_ent_page_count; i++)
> +               {
> +                       /*
> +                        * In order to get reliable results we also need to touch memory
> +                        * pages so that inquiry about NUMA zone doesn't return -2.
> +                        */
> +                       volatile uint64 touch pg_attribute_unused();
> +
> +                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
> +                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
>
> That sounds right.
>
> Could we also avoid some code duplication with pg_get_shmem_allocations()?

Not sure I understand: do you want to avoid code duplication between
pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(), or
between pg_get_shmem_numa_allocations() and
pg_buffercache_pages(query_numa = true)?

>
> Also same remarks about pfree() and ONE_GIGABYTE as above.

Fixed.

> A few other things:
>
> ==== 6
>
> +#include "port/pg_numa.h"
> Not at the right position, should be between those 2:
>
> #include "miscadmin.h"
> #include "storage/lwlock.h"

Fixed.


> ==== 7
>
> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *       Miscellaneous functions for bit-wise operations.
>
> description is not correct. Also the "Copyright (c) 2019-2025" might be
> "Copyright (c) 2025" instead.

Fixed.

> === 8
>
> +++ b/src/port/pg_numa.c
> @@ -0,0 +1,150 @@
> +/*-------------------------------------------------------------------------
> + *
> + * numa.c
> + *             Basic NUMA portability routines
>
> s/numa.c/pg_numa.c/ ?

Fixed.

> === 9
>
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -6,6 +6,7 @@
>   *       contrib/pg_buffercache/pg_buffercache_pages.c
>   *-------------------------------------------------------------------------
>   */
> +#include "pg_config.h"
>  #include "postgres.h"
>
> Is this new include needed?

Removed; I don't remember how it arrived here, it must have been some
artifact of earlier attempts.

>  #include "access/htup_details.h"
> @@ -13,10 +14,12 @@
>  #include "funcapi.h"
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
>
> not in the right order.

Fixed.

And also those from nearby message:

On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
> > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Why is this done before we even have gotten -2 back? Even if we need it, it
> > > seems like we ought to defer this until necessary.
> >
> > Not fixed yet: maybe we could even do a `static` with
> > `has_this_run_earlier` and just perform this work only once during the
> > first time?
>
> Not sure I get your idea, could you share what the code would look like?

Please see pg_buffercache_pages; I've just added a static bool firstUseInBackend:

postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
[..]
postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
[..]

Same was done to the pg_get_shmem_numa_allocations.
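
The idea boils down to something like the following (a simplified sketch
with a hypothetical helper name, not the exact hunk from the patch):

    /* per-backend flag: have we already page-faulted this shared memory? */
    static bool firstUseInBackend = true;

    static void
    touch_pages_once(void **os_page_ptrs, size_t os_page_count)
    {
        volatile uint64 touch pg_attribute_unused();

        if (!firstUseInBackend)
            return;

        /* touch every OS page once so move_pages() stops reporting -2 */
        for (size_t i = 0; i < os_page_count; i++)
            pg_numa_touch_mem_if_required(touch, os_page_ptrs[i]);

        firstUseInBackend = false;
    }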

Also CFbot/cirrus was getting:

> [11:01:05.027] In file included from ../../src/include/postgres.h:49,
> [11:01:05.027]                  from pg_buffercache_pages.c:10:
> [11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’:
> [11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3
has type ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=]

Fixed with %zu (for size_t) instead of %ld.
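
i.e. roughly this (a sketch of the adjusted elog() call, with the values
shown in the DEBUG lines above; Size is just size_t here):

    elog(DEBUG1, "NUMA: os_page_count=%zu os_page_size=%zu pages_per_blk=%f",
         (size_t) os_page_count, os_page_size, pages_per_blk);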

> Linux - Debian Bookworm - Autoconf got:
> [10:42:59.216] checking numa.h usability... no
> [10:42:59.268] checking numa.h presence... no
> [10:42:59.286] checking for numa.h... no
> [10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma

I've added libnuma1 to cirrus in a similar vein to libcurl to avoid this.

> [13:50:47.449] gcc -m32 @src/backend/postgres.rsp
> [13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format

I also got an error in the 32-bit build: the header is there, but
apparently libnuma only ships x86_64 libraries. Anyway, 32-bit (even with
PAE) and NUMA don't seem to make sense together, so I've added
-Dlibnuma=off for such builds.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote:
> Hi!
> 
> On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > I did some tests and it looks like it's giving correct results. I don't see -2
> > anymore and every backend reports correct distribution (with or without hp,
> > with "small" or "large" shared buffer).
> 
> Cool! Attached is v7

Thanks for the new version!

> > === 2
> >
> > +                       if (query_numa)
> > +                       {
> > +                          blk2page = (int) i * pages_per_blk;
> > +                          j = 0;
> > +                          do
> > +                          {
> >
> > This check is done for every page. I wonder if it would not make sense
> > to create a brand new function for pg_buffercache_numa and just let the
> > current pg_buffercache_pages() as it is. That said it would be great to avoid
> > code duplication as much a possible though, maybe using a shared
> > populate_buffercache_entry() or such helper function?
> 
> Well, I've made query_numa a parameter there simply to avoid that code
> duplication in the first place, look at those TupleDescInitEntry()...

Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.
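
Something along these lines is what I have in mind (a very rough sketch
just to show the shape, reusing the existing buffer-header locking; not
meant as the actual patch):

    static void
    populate_buffercache_entry(int buffer_id, BufferCachePagesContext *fctx)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer_id);
        uint32      buf_state = LockBufHdr(bufHdr);

        /* copy the per-buffer fields shared by both SRFs */
        fctx->record[buffer_id].isdirty = (buf_state & BM_DIRTY) != 0;
        fctx->record[buffer_id].isvalid = (buf_state & BM_VALID) != 0 &&
            (buf_state & BM_TAG_VALID) != 0;
        /* ... remaining fields exactly as pg_buffercache_pages() fills them ... */

        UnlockBufHdr(bufHdr, buf_state);
    }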

> IMHO rarely anybody uses pg_buffercache, but we could add unlikely()

I think unlikely() should be used for optimization based on code path likelihood,
not based on how often users might use a feature.

> > === 5
> >
> > Could we also avoid some code duplication with pg_get_shmem_allocations()?
> 
> Not sure I understand: do you want to avoid code duplication between
> pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(), or
> between pg_get_shmem_numa_allocations() and
> pg_buffercache_pages(query_numa = true)?

I meant to say avoid code duplication between pg_get_shmem_allocations() and
pg_get_shmem_numa_allocations(). It might be possible to create a shared 
function for them too. That said, it looks like the savings (if any) would
not be that much, so maybe just forget about it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > Cool! Attached is v7
> Thanks for the new version!

... and another one: 7b ;)

> > > === 2
[..]
> > Well, I've made query_numa a parameter there simply to avoid that code
> > duplication in the first place, look at those TupleDescInitEntry()...
>
> Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> or such function: to put the "duplicated" code in it and then use this
> shared function in pg_buffercache_pages() and in the new numa related one.

OK, so hastily attempted that in 7b , I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..

> > IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
>
> I think unlikely() should be used for optimization based on code path likelihood,
> not based on how often users might use a feature.

In 7b I've removed the unlikely(). For a moment I was thinking that you
were concerned about the loop over NBuffers being as optimized as it can
be, and that this was the reason for splitting the routines.

> > > === 5
[..]
> I meant to say avoid code duplication between pg_get_shmem_allocations() and
> pg_get_shmem_numa_allocations(). It might be possible to create a shared
> function for them too. That said, it looks like the savings (if any) would
> not be that much, so maybe just forget about it.

 Yeah, OK, so let's leave it at that.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>Hi,

> > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > or such function: to put the "duplicated" code in it and then use this
> > shared function in pg_buffercache_pages() and in the new numa related one.
>
> OK, so hastily attempted that in 7b , I had to do a larger refactor
> there to avoid code duplication between those two. I don't know which
> attempt is better though (7 vs 7b)..
>

I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id

-J.

Attachment

Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> Hi,
> On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >Hi,
>
> > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > or such function: to put the "duplicated" code in it and then use this
> > > shared function in pg_buffercache_pages() and in the new numa related one.
> >
> > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > there to avoid code duplication between those two. I don't know which
> > attempt is better though (7 vs 7b)..
> >
>
> I'm attaching basically the earlier stuff (v7b) as v8 with the
> following minor changes:
> - docs are included
> - changed int8 to int4 in one function definition for numa_zone_id

.. and v9 attached because cfbot partially complained about
.cirrus.tasks.yml being adjusted recently (it seems master is hot
these days).

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote:
> On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > Hi,
> > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > >Hi,
> >
> > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > > or such function: to put the "duplicated" code in it and then use this
> > > > shared function in pg_buffercache_pages() and in the new numa related one.
> > >
> > > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > > there to avoid code duplication between those two. I don't know which
> > > attempt is better though (7 vs 7b)..
> > >
> >
> > I'm attaching basically the earlier stuff (v7b) as v8 with the
> > following minor changes:
> > - docs are included
> > - changed int8 to int4 in one function definition for numa_zone_id
> 
> .. and v9 attached because cfbot partially complained about
> .cirrus.tasks.yml being adjusted recently (it seems master is hot
> these days).
> 

Thanks for the new version!

Some random comments on 0001:

=== 1

It does not compiles "alone". It's missing:

+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader  /* standard header for all Postgres shmem */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;

and

--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int  ssl_renegotiation_limit;
  */
 int            huge_pages = HUGE_PAGES_TRY;
 int            huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int            huge_pages_status = HUGE_PAGES_UNKNOWN;

That come with 0002. So it has to be in 0001 instead.

=== 2

+else
+  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi

Maybe we should add the same test (checking for numa.h) for meson?

=== 3

+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)

It looks to me that we can remove those 2 lines.

=== 4

+         Only supported on Linux.

s/on Linux/on platforms for which the libnuma library is implemented/?

I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.

=== 5

+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>

I had a closer look to other header files and it looks like it "should" be:

#include "c.h"
#include "postgres.h"

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#include "port/pg_numa.h"
#include "storage/pg_shmem.h"

And is "#include "c.h"" really needed?

=== 6

+/* FIXME not tested, might crash */

That's a bit scary.

=== 7

+        * XXX: for now we issue just WARNING, but long-term that might depend on
+        * numa_set_strict() here

s/here/here./

Did not look carefully at all the comments in 0001, 0002 and 0003 though.

A few random comments regarding 0002:

=== 8

# create extension pg_buffercache;
ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages
 
CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7

While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.

=== 9

+       do
+       {
+               if (os_page_ptrs[blk2page + j] == 0)

blk2page + j will be repeated multiple times, better to store it in a local
variable instead?

=== 10

+                       if (firstUseInBackend == true)


if (firstUseInBackend) instead?

=== 11

+       int                     j = 0,
+                               blk2page = (int) i * pages_per_blk;

I wonder if size_t is more appropriate for blk2page:

size_t blk2page = (size_t)(i * pages_per_blk) maybe?

=== 12

as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?

"
for (size_t j = 0; j < pages_per_blk; j++)
{
    if (os_page_ptrs[blk2page + j] == 0)
    {
"

=== 13

+       if (buf_state & BM_DIRTY)
+               fctx->record[i].isdirty = true;
+       else
+               fctx->record[i].isdirty = false;

  fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?

=== 14

+       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+               fctx->record[i].isvalid = true;
+       else
+               fctx->record[i].isvalid = false;

fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?

=== 15

+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)

I wonder if "i" should get a more descriptive name?

=== 16

s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).

I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.

=== 17

+       static bool firstUseInBackend = true;

maybe we should give it a more descriptive name?

Also I need to think more about how firstUseInBackend is used, for example:

   ==== 17.1

   would it be better to define it as a file scope variable? (at least that
   would avoid to pass it as an extra parameter in some functions).

   === 17.2 

   what happens if firstUseInBackend is set to false and later on the pages
   are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
   called again by a backend that already had set firstUseInBackend to false,
   would that provide accurate information?

=== 18

+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+       FuncCallContext *funcctx;
+       BufferCachePagesContext *fctx;  /* User function context. */
+       bool            query_numa = true;

I don't think we need query_numa anymore in pg_buffercache_numa_pages().

I think that we can just ERROR (or NOTICE and return) here:

+               if (pg_numa_init() == -1)
+               {
+                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMA data might be unavailable.");
 
+                       query_numa = false;

and fully get rid of query_numa.

=== 19

And then here:

        for (i = 0; i < NBuffers; i++)
        {
            populate_buffercache_entry(i, fctx);
            if (query_numa)
                pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
        }

        if (query_numa)
        {
            if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
                elog(ERROR, "failed NUMA pages inquiry: %m");

            for (i = 0; i < NBuffers; i++)
            {
                int         blk2page = (int) i * pages_per_blk;

                /*
                 * Technically we can get errors too here and pass that to
                 * user. Also we could somehow report single DB block spanning
                 * more than one NUMA zone, but it should be rare.
                 */
                fctx->record[i].numa_zone_id = os_pages_status[blk2page];
            }

maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

A few random comments regarding 0003:

=== 20

+  <indexterm zone="view-pg-shmem-numa-allocations">
+   <primary>pg_shmem_numa_allocations</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes

s/pg_shmem_allocations/pg_shmem_numa_allocations/?

=== 21

+  /* FIXME: should we release LWlock here ? */
+  elog(ERROR, "failed NUMA pages inquiry status: %m");

There is no need, see src/backend/storage/lmgr/README.

=== 22

+#define MAX_NUMA_ZONES 32              /* FIXME? */
+               Size            zones[MAX_NUMA_ZONES];

could we rely on pg_numa_get_max_node() instead?

=== 23

+                       if (s >= 0)
+                               zones[s]++;

should we also check that s is below a limit?

=== 24

Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?


Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I 
did a little in 0001 though)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Thanks for the new version!

v10 is attached with most fixes after review and one new thing
introduced: pg_numa_available() for run-time decision inside tests
which was needed after simplifying code a little bit as you wanted.
I've also fixed -Dlibnuma=disabled as it was not properly implemented.
There are couple open items (minor/decision things), but most is fixed
or answered:

> Some random comments on 0001:
>
> === 1
>
> It does not compiles "alone". It's missing:
>
[..]
> +extern PGDLLIMPORT int huge_pages_status;
[..]
> -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> +int            huge_pages_status = HUGE_PAGES_UNKNOWN;
>
> That come with 0002. So it has to be in 0001 instead.

Ugh, good catch, I haven't thought about it in isolation, they are
separate to just ease review, but should be committed together. Fixed.

> === 2
>
> +else
> +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> +fi
>
> Maybe we should add the same test (checking for numa.h) for meson?

TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?

> === 3
>
> +# FIXME: filter-out / with/without with_libnuma?
> +LIBS += $(LIBNUMA_LIBS)
>
> It looks to me that we can remove those 2 lines.

Done.

> === 4
>
> +         Only supported on Linux.
>
> s/on Linux/on platforms for which the libnuma library is implemented/?
>
> I did a quick grep on "Only supported on" and it looks like that could be
> a more consistent wording.

Fixed.

> === 5
>
> +#include "c.h"
> +#include "postgres.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
> +#include <unistd.h>
>
> I had a closer look to other header files and it looks like it "should" be:
>
> #include "c.h"
> #include "postgres.h"
>
> #include <unistd.h>
> #ifdef WIN32
> #include <windows.h>
> #endif
>
> #include "port/pg_numa.h"
> #include "storage/pg_shmem.h"
>
> And is "#include "c.h"" really needed?

Fixed both. It seems to compile without c.h.

> === 6
>
> +/* FIXME not tested, might crash */
>
> That's a bit scary.

When you have been in support long enough it becomes the norm ;) But on
a serious note, Andres wanted to have numa error/warning handlers (used
by libnuma), but the current code has no realistic code path to hit them
from numa_available(3), numa_move_pages(3) or numa_max_node(3). The
situation won't change until one day in the future (I hope!) we start
using some more advanced libnuma functionality for interleaving
memory; please see my earlier reply:
https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
E.g. numa_available(3) is a tiny wrapper, see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872

For now, I've adjusted that FIXME to XXX, but I still don't know how we
could inject a failure to see this triggered...
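
Just to make the intent concrete, the handlers themselves can stay very
thin; a minimal sketch of forwarding them into elog() could look like
this (only an illustration, and whether WARNING is strict enough is
exactly the open question):

    void
    numa_warn(int num, char *fmt,...)
    {
        va_list     ap;
        char        buf[1024];

        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        /* libnuma calls this override instead of its built-in handler */
        elog(WARNING, "libnuma: WARNING: %s", buf);
    }

    void
    numa_error(char *where)
    {
        elog(WARNING, "libnuma: ERROR: %s", where);
    }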

> === 7
>
> +        * XXX: for now we issue just WARNING, but long-term that might depend on
> +        * numa_set_strict() here
>
> s/here/here./

Done.

> Did not look carefully at all the comments in 0001, 0002 and 0003 though.
>
> A few random comments regarding 0002:
>
> === 8
>
> # create extension pg_buffercache;
> ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages 
> CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE"
> extension script file "pg_buffercache--1.2.sql", near line 7
>
> While that's ok if 0003 is applied. I think that each individual patch should
> "fully" work individually.

STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?

> === 9
>
> +       do
> +       {
> +               if (os_page_ptrs[blk2page + j] == 0)
>
> blk2page + j will be repeated multiple times, better to store it in a local
> variable instead?

Done.

> === 10
>
> +                       if (firstUseInBackend == true)
>
>
> if (firstUseInBackend) instead?

Done everywhere.

> === 11
>
> +       int                     j = 0,
> +                               blk2page = (int) i * pages_per_blk;
>
> I wonder if size_t is more appropriate for blk2page:
>
> size_t blk2page = (size_t)(i * pages_per_blk) maybe?

Sure, done.

> === 12
>
> as we know that we'll iterate until pages_per_blk then would a for loop be more
> appropriate here, something like?
>
> "
> for (size_t j = 0; j < pages_per_blk; j++)
> {
>     if (os_page_ptrs[blk2page + j] == 0)
>     {
> "

Sure.

> === 13
>
> +       if (buf_state & BM_DIRTY)
> +               fctx->record[i].isdirty = true;
> +       else
> +               fctx->record[i].isdirty = false;
>
>   fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
>
> === 14
>
> +       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> +               fctx->record[i].isvalid = true;
> +       else
> +               fctx->record[i].isvalid = false;
>
> fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> maybe?

It is coming from the original pg_buffercache and it is less readable
to me, so I don't want to touch that too much because that might open
refactoring doors too wide.

> === 15
>
> +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
>
> I wonder if "i" should get a more descriptive name?

Done, buffer_id.

> === 16
>
> s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> with pg_stat_io_build_tuples for example).
>
> I now realize that I did propose populate_buffercache_entry() up-thread,
> sorry for changing my mind.

OK, I've tried to come up with a better name, but all my ideas were too
long; pg_buffercache_build_tuple sounds nice, so let's use that.

> === 17
>
> +       static bool firstUseInBackend = true;
>
> maybe we should give it a more descriptive name?

I couldn't come up with anything that wouldn't look too long, so
instead I've added a comment explaining the meaning behind this
variable, hope that's good enough.

> Also I need to think more about how firstUseInBackend is used, for example:
>
>    ==== 17.1
>
>    would it be better to define it as a file scope variable? (at least that
>    would avoid to pass it as an extra parameter in some functions).

I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?

>    === 17.2
>
>    what happens if firstUseInBackend is set to false and later on the pages
>    are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
>    called again by a backend that already had set firstUseInBackend to false,
>    would that provide accurate information?

It is still the correct result. That "touching" (paging-in) is probably
only necessary to properly resolve PTEs, as fork() does not seem to
carry them over from the parent:

postgres=# select 'create table foo' || s || ' as select
generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id; -- before:
 numa_zone_id |  count
--------------+---------
            0 |     256
            1 |    4131
              | 8384221
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           1451

-- now, from another root (!) session, run migratepages(8) from numactl,
which will also shift the shm segment:
# migratepages 1451 1 3

-- the above stays in progress for quite some time, but the outcome
becomes visible much sooner in that backend (pg keeps functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id;
 numa_zone_id |  count
--------------+---------
            0 |     256
            3 |    4480
              | 8383872

So the above clearly shows that initial touching of shm is required,
but just once and it stays valid afterwards.
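
(For completeness: the "touch" itself is just a volatile read wide enough
to fault the page in; roughly the following, guessing at the macro's
shape from how it is called in the patch:)

    #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
        ro_volatile_var = *(volatile uint64 *) (ptr)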

BTW: migratepages(8) was stuck for 1-2 minutes there on
"__wait_rcu_gp" according to its wchan, without any sign of activity
on the OS, and then out of the blue completed just fine; s_b=64GB, HP=on.

> === 18
>
> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> +       FuncCallContext *funcctx;
> +       BufferCachePagesContext *fctx;  /* User function context. */
> +       bool            query_numa = true;
>
> I don't think we need query_numa anymore in pg_buffercache_numa_pages().
>
> I think that we can just ERROR (or NOTICE and return) here:
>
> +               if (pg_numa_init() == -1)
> +               {
> +                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMA data might be unavailable.");
> +                       query_numa = false;
>
> and fully get rid of query_numa.

Right... fixed it here, but it made the tests blow up, so I had to
find a way to conditionally launch tests based on NUMA availability,
and that's how pg_numa_available() was born. It's in ipc/shmem.c
because I couldn't find a better place for it...

> === 19
>
> And then here:
>
>         for (i = 0; i < NBuffers; i++)
>         {
>             populate_buffercache_entry(i, fctx);
>             if (query_numa)
>                 pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
>         }
>
>         if (query_numa)
>         {
>             if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
>                 elog(ERROR, "failed NUMA pages inquiry: %m");
>
>             for (i = 0; i < NBuffers; i++)
>             {
>                 int         blk2page = (int) i * pages_per_blk;
>
>                 /*
>                  * Technically we can get errors too here and pass that to
>                  * user. Also we could somehow report single DB block spanning
>                  * more than one NUMA zone, but it should be rare.
>                  */
>                 fctx->record[i].numa_zone_id = os_pages_status[blk2page];
>             }
>
> maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
that prepares **os_page_ptrs, which is then used by
pg_numa_query_pages(), after which we fill in the data. But now, after
removing query_numa, it reads smoother anyway. Can you please take a
look again on this, is this up to the project standards?
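
To spell it out, the resulting flow is roughly the following (a condensed
sketch with abbreviated parameter lists, not the literal v10 hunk):

    /* 1. build each tuple and record where its OS pages live */
    for (i = 0; i < NBuffers; i++)
    {
        pg_buffercache_build_tuple(i, fctx);
        pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
                                         os_page_ptrs);
    }

    /* 2. a single move_pages() call resolves the NUMA node of every page */
    if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
        elog(ERROR, "failed NUMA pages inquiry: %m");

    /* 3. copy the per-block status into the result */
    for (i = 0; i < NBuffers; i++)
        fctx->record[i].numa_zone_id = os_pages_status[(size_t) (i * pages_per_blk)];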

> A few random comments regarding 0003:
>
> === 20
>
> +   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
>
> s/pg_shmem_allocations/pg_shmem_numa_allocations/?

Fixed.

> === 21
>
> +  /* FIXME: should we release LWlock here ? */
> +  elog(ERROR, "failed NUMA pages inquiry status: %m");
>
> There is no need, see src/backend/storage/lmgr/README.

Thanks, fixed.

> === 22
>
> +#define MAX_NUMA_ZONES 32              /* FIXME? */
> +               Size            zones[MAX_NUMA_ZONES];
>
> could we rely on pg_numa_get_max_node() instead?

Sure, done.

> === 23
>
> +                       if (s >= 0)
> +                               zones[s]++;
>
> should we also check that s is below a limit?

STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.
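
For illustration, combining this with the pg_numa_get_max_node() change
above, the counting loop in pg_get_shmem_numa_allocations() now looks
roughly like this (a sketch of the idea, variable names approximate):

    int         max_node = pg_numa_get_max_node();
    Size       *zones = palloc0((max_node + 1) * sizeof(Size));

    for (i = 0; i < shm_ent_page_count; i++)
    {
        int         s = os_pages_status[i];

        /*
         * Count pages per node; skip negative (error) statuses and anything
         * unexpectedly above the highest known node number.
         */
        if (s >= 0 && s <= max_node)
            zones[s]++;
    }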

> === 24
>
> Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> no possible holes? I mean could a system have node 0,1 and 3 but not 2?

I have never seen a hole in numbering as it is established during
boot, and the only way that could get close to adjusting it could be
making processor books (CPU and RAM together) offline. Even *if*
someone were doing some preventive hardware maintenance, that still
wouldn't hurt, as we are just using the id of the zone to
display it -- the max would already be higher. I mean, technically one
could use lsmem(1) (it mentions a removable flag, after, say, pages and
processes have been migrated away from there), then use chcpu(1) to
--disable the CPUs on that zone (to prevent new ones landing there and
allocating new local memory), and then offline that memory
region via chmem(1) --disable. Even with all of that, that still
shouldn't cause issues for this code I think, because
`numa_max_node()` says it `returns the >> highest node number <<<
available on the current system.`

> Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> did a little in 0001 though)

That was an attempt to say "Thank You", but OK, I've aligned it that way,
so you are still referenced in 0001. I wouldn't find the motivation to
work on this if you didn't respond to those emails ;)

There is one known issue that CI returned for the numa test, which I
need to get to the bottom of (it does not reproduce for me locally), in
numa.out/sql:

SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90


-J.

Attachment

Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
[..]
> There is one known issue that CI returned for the numa test, which I
> need to get to the bottom of (it does not reproduce for me locally), in
> numa.out/sql:
>
> SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
>

Hi, ok, so v11 should fix this last problem and make cfbot happy.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote:
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Thanks for the new version!
> 
> v10 is attached with most fixes after review and one new thing
> introduced: pg_numa_available() for run-time decision inside tests
> which was needed after simplifying code a little bit as you wanted.
> I've also fixed -Dlibnuma=disabled as it was not properly implemented.
> There are couple open items (minor/decision things), but most is fixed
> or answered:

Thanks for the new version!

> > === 2
> >
> > +else
> > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
> 
> TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> between meson and autoconf?

Yeah, idea was to have both in sync. 

> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages
 
> > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's ok if 0003 is applied. I think that each individual patch should
> > "fully" work individually.
> 
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> patches or keep them separate?

Applying 0001 + 0002 only does produce the issue above. I think that each
sub-patch once applied should pass make check-world, that means:

0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass

> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> >    ==== 17.1
> >
> >    would it be better to define it as a file scope variable? (at least that
> >    would avoid to pass it as an extra parameter in some functions).
> 
> I have no strong opinion on this, but I have one doubt (for future):
> isn't creating global variables making life harder for upcoming
> multithreading guys ?

That would be "just" one more. I think it's better to use it to avoid the "current"
code using "useless" function parameters.

> >    === 17.2
> >
> >    what happens if firstUseInBackend is set to false and later on the pages
> >    are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> >    called again by a backend that already had set firstUseInBackend to false,
> >    would that provide accurate information?
> 
> It is still the correct result. That "touching" (paging-in) is probably
> only necessary to properly resolve PTEs, as fork() does not seem to
> carry them over from the parent:
> 
> So the above clearly shows that initial touching of shm is required,
> but just once and it stays valid afterwards.

Great, thanks for the demo and the explanation!

> > === 19
> >
> 
> Can you please take a look again on this

Sure, will do.

> > === 23
> >
> > +                       if (s >= 0)
> > +                               zones[s]++;
> >
> > should we also check that s is below a limit?
> 
> STILL OPEN QUESTION: I'm not sure it would give us value to report
> e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> could somehow overallocate that array and remember negative ones too.

I meant to say, ensure that it is below the max node number.

> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> > no possible holes? I mean could a system have node 0,1 and 3 but not 2?
> 
> I have never seen a hole in numbering as it is established during
> boot, and the only way that could get close to adjusting it could be
> making processor books (CPU and RAM together) offline.

Yeah probably.

> Even with all of that, that still
> shouldn't cause issues for this code I think, because
> `numa_max_node()` says it `returns the >> highest node number <<<
> available on the current system.`

Yeah, I agree, thanks for clearing my doubts on it.

> > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> > did a little in 0001 though)
> 
> That was an attempt to say "Thank You",

Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-)

I'll have a look at v11 soon.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Thu, Mar 13, 2025 at 02:15:14PM +0000, Bertrand Drouvot wrote:
> > > === 19
> > >
> > 
> > Can you please take a look again on this
> 
> Sure, will do.


> I'll have a look at v11 soon.


About 0001:

=== 1

git am produces:

.git/rebase-apply/patch:378: new blank line at EOF.
+
.git/rebase-apply/patch:411: new blank line at EOF.
+
warning: 2 lines add whitespace errors.

=== 2

+  Returns true if a <acronym>NUMA</acronym> support is available.

What about "Returns true if the server has been compiled with NUMA support"?

=== 3

+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+       if(pg_numa_init() == -1)
+               PG_RETURN_BOOL(false);
+       PG_RETURN_BOOL(true);
+}

What about PG_RETURN_BOOL(pg_numa_init() != -1)?

Also I wonder if pg_numa.c would not be a better place for it.

=== 4

+{ oid => '5102', descr => 'Is NUMA compilation available?',
+  proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+  proargtypes => '', prosrc => 'pg_numa_available' },
+

Not sure that 5102 is a good choice.

src/include/catalog/unused_oids is telling me:

Best practice is to start with a random choice in the range 8000-9999.
Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)

So maybe use 9685 instead?

=== 5

@@ -12477,3 +12481,4 @@
   prosrc => 'gist_stratnum_common' },

 ]
+

garbage?

=== 6

Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
and src/port/pg_numa.c.

=== 7

+Size
+pg_numa_get_pagesize(void)
+{
+       Size os_page_size = sysconf(_SC_PAGESIZE);
+       if (huge_pages_status == HUGE_PAGES_ON)
+                GetHugePageSize(&os_page_size, NULL);
+       return os_page_size;
+}

I think that's more usual to:

Size
pg_numa_get_pagesize(void)
{
       Size os_page_size = sysconf(_SC_PAGESIZE);

       if (huge_pages_status == HUGE_PAGES_ON)
                GetHugePageSize(&os_page_size, NULL);

       return os_page_size;
}

I think that makes sense to check huge_pages_status as you did.

=== 8

+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);

I wonder if that would make sense to add a comment mentioning
github.com/numactl/numactl/blob/master/libnuma.c here.


I still need to look at 0002 and 0003.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

Thank you very much for the review! I'm answering both reviews in
one go and the result is the attached v12; it seems it all should be
solved now:

> > > === 2
> > >
> > > +else
> > > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > > +fi
> > >
> > > Maybe we should add the same test (checking for numa.h) for meson?
> >
> > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> > between meson and autoconf?
>
> Yeah, idea was to have both in sync.

Added.

> > > === 8
> > >
> > > # create extension pg_buffercache;
> > > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages 
> > > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > > RETURNS SETOF RECORD
> > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > > LANGUAGE C PARALLEL SAFE"
> > > extension script file "pg_buffercache--1.2.sql", near line 7
> > >
> > > While that's ok if 0003 is applied. I think that each individual patch should
> > > "fully" work individually.
> >
> > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> > 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> > pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> > patches or keep them separate?
>
> Applying 0001 + 0002 only does produce the issue above. I think that each
> sub-patch once applied should pass make check-world, that means:
>
> 0001 : should pass
> 0001 + 0002 : should pass
> 0001 + 0002 + 0003 : should pass

OK, I've retested v11 for all three of them. It worked fine (I think
in v10 I had moved one function to 0001, but pg_numa_query_pages(), per
the error message above, was always in 0001).

> > > Also I need to think more about how firstUseInBackend is used, for example:
> > >
> > >    ==== 17.1
> > >
> > >    would it be better to define it as a file scope variable? (at least that
> > >    would avoid to pass it as an extra parameter in some functions).
> >
> > I have no strong opinion on this, but I have one doubt (for future):
> > isn't creating global variables making life harder for upcoming
> > multithreading guys ?
>
> That would be "just" one more. I think it's better to use it to avoid the "current"
> code using "useless" function parameters.

Done.

> > > === 23
> > >
> > > +                       if (s >= 0)
> > > +                               zones[s]++;
> > >
> > > should we also check that s is below a limit?
> >
> > STILL OPEN QUESTION: I'm not sure it would give us value to report
> > e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> > could somehow overallocate that array and remember negative ones too.
>
> I meant to say, ensure that it is below the max node number.

Done, though I doubt the kernel would return a value higher than
numa_max_node(), but who knows. Additional defense for this array is
now in place.
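
(For the record, the added defense is along these lines, reusing the names from
the hunk quoted above; the upper-bound variable is an assumed name:)

    int     max_node = pg_numa_get_max_node();

    /* s is the per-page status returned by pg_numa_query_pages() */
    if (s >= 0 && s <= max_node)
        zones[s]++;
    /* negative statuses (e.g. -2 for unmapped) and out-of-range values are skipped */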

> SECOND REVIEW//v11-0001 review
>
> === 1
>
> git am produces:
>
> .git/rebase-apply/patch:378: new blank line at EOF.
> +
> .git/rebase-apply/patch:411: new blank line at EOF.
> +
> warning: 2 lines add whitespace errors.

Should be gone, but in at least one case (0003/numa.out) we need to
keep an empty line at EOF, because otherwise the expected tests don't
pass (even though numa.sql itself doesn't end with one).

> === 2
>
> +  Returns true if a <acronym>NUMA</acronym> support is available.
>
> What about "Returns true if the server has been compiled with NUMA support"?

Done.

> === 3
>
> +Datum
> +pg_numa_available(PG_FUNCTION_ARGS)
> +{
> +       if(pg_numa_init() == -1)
> +               PG_RETURN_BOOL(false);
> +       PG_RETURN_BOOL(true);
> +}
>
> What about PG_RETURN_BOOL(pg_numa_init() != -1)?
>
> Also I wonder if pg_numa.c would not be a better place for it.

Both done.

> === 4
>
> +{ oid => '5102', descr => 'Is NUMA compilation available?',
> +  proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
> +  proargtypes => '', prosrc => 'pg_numa_available' },
> +
>
> Not sure that 5102 is a good choice.
>
> src/include/catalog/unused_oids is telling me:
>
> Best practice is to start with a random choice in the range 8000-9999.
> Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)
>
> So maybe use 9685 instead?

It's because I had 5101 there earlier (for the pg_shm_numa_allocs
view), but I've aligned both (5102@0001 and 5101@0003) to 968[56].

> === 5
>
> @@ -12477,3 +12481,4 @@
>    prosrc => 'gist_stratnum_common' },
>
>  ]
> +
>
> garbage?

Yea, fixed.

> === 6
>
> Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
> and src/port/pg_numa.c.

Fixed.

> === 7
>
> +Size
> +pg_numa_get_pagesize(void)
> +{
> +       Size os_page_size = sysconf(_SC_PAGESIZE);
> +       if (huge_pages_status == HUGE_PAGES_ON)
> +                GetHugePageSize(&os_page_size, NULL);
> +       return os_page_size;
> +}
>
> I think that's more usual to:
>
> Size
> pg_numa_get_pagesize(void)
> {
>        Size os_page_size = sysconf(_SC_PAGESIZE);
>
>        if (huge_pages_status == HUGE_PAGES_ON)
>                 GetHugePageSize(&os_page_size, NULL);
>
>        return os_page_size;
> }
>
> I think that makes sense to check huge_pages_status as you did.

Done.

> === 8
>
> +extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
> +extern void numa_error(char *where);
>
> I wonder if that would make sense to add a comment mentioning
> github.com/numactl/numactl/blob/master/libnuma.c here.

Sure thing, added.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> Hi,
> 
> Thank you very much for the review! I'm answering both reviews in
> one go; the result is the attached v12, which should resolve it all
> now:

Thanks for v12! 

I'll review 0001 and 0003 later, but want to share what I've done for 0002.

I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
on top of v11 0002 (I just rebased it a bit so that it now applies on top of 
v12 0002).

The main changes are:

=== 1

Some doc wording changes.

=== 2

Some comments, pgindent and friends changes.

=== 3

         relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
-        pinning_backends int4, numa_zone_id int4);
+        pinning_backends int4, zone_id int4);

I changed numa_zone_id to zone_id as we are already in "numa" related functions
and/or views.

=== 4

-                       CHECK_FOR_INTERRUPTS();
                }
+
+               CHECK_FOR_INTERRUPTS();

I think that it's better to put the CFI outside of the if condition, so that it
can be called for every page.
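
(The intent, as a rough sketch - the if-condition is just a stand-in for
whatever the patch actually guards there:)

    for (i = 0; i < NBuffers; i++)
    {
        if (firstNumaTouch)
        {
            /* conditional per-page preparation work from the patch goes here */
        }

        /* outside the if-block, so it is reached for every page */
        CHECK_FOR_INTERRUPTS();
    }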

=== 5

-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 6

-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 7

-               int                     os_page_count = 0;
+               uint64          os_page_count = 0;

I think that's better.

=== 8

But then, here, we need to change its format to %lu and cast to (unsigned long)
to make the CI CompilerWarnings happy.

That's fine because we don't provide NUMA support for 32 bits anyway.

-               elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
-                               os_page_count, os_page_size, pages_per_blk);
+               elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+                        (unsigned long) os_page_count, os_page_size, pages_per_blk);

=== 9

-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;

Looks better to me but still not 100% convinced by the name.

=== 10

 static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)

as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are
a lot of examples in the code that use "FunctionCallInfo" for non
fmgr-compatible functions.

and also:

=== 11

I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages().

But I'm having a hard time finding a better approach given the fact that
pg_numa_query_pages() needs all the pointers "prepared" before it can be called...

Those 2 loops are probably the best approach, unless someone has a better idea.
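
(For anyone following along, the shape forced by the move_pages() API is roughly
the following sketch; startptr is a simplification for the start of the buffer
pool, and pg_numa_query_pages() is assumed to take (pid, count, pages, status)
like move_pages() does:)

    /* Loop 1: prepare one pointer per OS page; they must all exist up front */
    for (i = 0; i < os_page_count; i++)
        os_page_ptrs[i] = (char *) startptr + i * os_page_size;

    /* one call resolves the NUMA node of every page at once */
    pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status);

    /* Loop 2: walk NBuffers again and fold the per-page statuses into the tuples */
    for (i = 0; i < NBuffers; i++)
        pg_buffercache_build_tuple(i, fctx);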

=== 12

Upthread you asked "Can you please take a look again on this, is this up to the
project standards?"

Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?

What do you think? The comments, doc and code changes are just proposals and are
fully open to discussion.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you very much for the review! I'm answering both reviews in
> > one go; the result is the attached v12, which should resolve it all
> > now:
>
> Thanks for v12!
>
> I'll review 0001 and 0003 later, but want to share what I've done for 0002.
>
> I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
> on top of v11 0002 (I just rebased it a bit so that it now applies on top of
> v12 0002).

Hey Bertrand,

all LGTM (good ideas), so here's v13 attached with all of that applied
(rebased, tested). BTW: I'm sending this to keep the cfbot happy, as it
still tried to apply that .patch (on my side it shows as .patch, not .txt).

> === 9
>
> -static bool firstUseInBackend = true;
> +static bool firstNumaTouch = true;
>
> Looks better to me but still not 100% convinced by the name.

IMHO, Yes, it looks much better.

> === 10
>
>  static BufferCachePagesContext *
> -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
> +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
>
> as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are
> a lot of examples in the code that use "FunctionCallInfo" for non
> fmgr-compatible functions.

Cool, thanks.

> and also:
>
> === 11
>
> I don't like the fact that we iterate 2 times over NBuffers in
> pg_buffercache_numa_pages().
>
> But I'm having a hard time finding a better approach given the fact that
> pg_numa_query_pages() needs all the pointers "prepared" before it can be called...
>
> Those 2 loops are probably the best approach, unless someone has a better idea.

IMHO, it doesn't hurt and I've also not been able to think of any better idea.

> === 12
>
> Upthread you asked "Can you please take a look again on this, is this up to the
> project standards?"
>
> Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?

Yes, this was about an earlier doubt, question "19", regarding
reviewing the code after removal of the `query_numa` variable. It is the
same code as the 2 loops; IMHO it is good now.

> What do you think? The comments, doc and code changes are just proposals and are
> fully open to discussion.

They are great, thank you!

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Mon, Mar 17, 2025 at 08:28:46AM +0100, Jakub Wartak wrote:
> On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
> > on top of v11 0002 (I just rebased it a bit so that it now applies on top of
> > v12 0002).
> 
> Hey Bertrand,
> 
> all LGTM (good ideas), so here's v13 attached with applied all of that
> (rebased, tested).

Thanks for v13! 

Looking at 0003:

=== 1

+      <entry>NUMA mappings for shared memory allocations</entry>

s/NUMA mappings/NUMA node mappings/ maybe?

=== 2

+  <para>
+   The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
+   assigned allocations made from the server's main shared memory segment.

What about?

"
shows how shared memory allocations in the server's main shared memory segment
are distributed across NUMA nodes" ?

=== 3

+       <structfield>numa_zone_id</structfield> <type>int4</type>

s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
0002.

BTW, I wonder if "node_id" would be better (to match the descriptions...).
If so, would also need to be done in 0002.

=== 4

+      ID of NUMA node

<acronym>NUMA</acronym> node ID? (to be consistent with 0002).

=== 5

+static bool firstUseInBackend = true;

Let's use firstNumaTouch to be consistent with 0002.

=== 6

+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be
unavailable.");;

There are 2 ";" + I think that we should use the same wording as in
pg_buffercache_numa_pages().

=== 7

What about using ERROR instead? (like in pg_buffercache_numa_pages())

=== 8

+       /*
+        * This is for gathering some NUMA statistics. We might be using various
+        * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+        * various different OS memory pages sizes, so first we need to understand
+        * the OS memory page size before calling move_pages()
+        */
+       os_page_size = pg_numa_get_pagesize();

Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling 
pg_numa_get_pagesize()?

=== 9

+       max_zones = pg_numa_get_max_node();

I think we are mixing "zone" and "node". I think we should standardize on one
and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
vote for node, but zone is fine too if you prefer.

=== 10

+       /*
+        * Preallocate memory all at once without going into details which shared
+        * memory segment is the biggest (technically min s_b can be as low as
+        * 16xBLCKSZ)
+        */

What about? 

"
Allocate memory for page pointers and status based on total shared memory size.
This simplified approach allocates enough space for all pages in shared memory
rather than calculating the exact requirements for each segment.
" instead?

=== 11

+       int                     shm_total_page_count,
+                               shm_ent_page_count,

I think those 2 should be uint64.

=== 12

+       /*
+        * XXX: We are ignoring in NUMA version reporting of the following regions
+        * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+        * allocated but not counted via the shmem index 2. output as-of-yet
+        * unused shared memory
+        */

why XXX?

what about?

"
We are ignoring the following memory regions (as compared to
pg_get_shmem_allocations())....

=== 13

+       page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+       memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);

maybe we could use palloc0() here?
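
(i.e. something like:)

    /* palloc0() zeroes the allocation, so the explicit memset() is no longer needed */
    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);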

=== 14

and I realize that we could probably use it in 0002 for os_page_ptrs.

=== 15

I think there is still some multi-lines comments that are missing a period. I
probably also missed some in 0002 during the previous review. I think that's
worth another check.

=== 16

+    * In order to get reliable results we also need to touch memory
+    * pages so that inquiry about NUMA zone doesn't return -2.
+    */

maybe use the same wording as in 0002?

=== 17

The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
but (as for 0002) it looks like we can not avoid it (or at least I don't see
a way to avoid it).

I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
updated.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Thanks for v13!

Rebased, with fixes included, in the attached v14 (it passes CI too):

> Looking at 0003:
>
> === 1
>
> +      <entry>NUMA mappings for shared memory allocations</entry>
>
> s/NUMA mappings/NUMA node mappings/ maybe?

Done.

> === 2
>
> +  <para>
> +   The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
> +   assigned allocations made from the server's main shared memory segment.
>
> What about?
>
> "
> shows how shared memory allocations in the server's main shared memory segment
> are distributed across NUMA nodes" ?

Done.

> === 3
>
> +       <structfield>numa_zone_id</structfield> <type>int4</type>
>
> s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
> 0002.
>
> BTW, I wonder if "node_id" would be better (to match the descriptions...).
> If so, would also need to be done in 0002.

Somewhat of a duplicate; please see the answer to #9.

> === 4
>
> +      ID of NUMA node
>
> <acronym>NUMA</acronym> node ID? (to be consistent with 0002).
>
> === 5
>
> +static bool firstUseInBackend = true;
>
> Let's use firstNumaTouch to be consistent with 0002.

Done.

> === 6
>
> + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be
unavailable.");;
>
> There are 2 ";" + I think that we should use the same wording as in
> pg_buffercache_numa_pages().
>
> === 7
>
> What about using ERROR instead? (like in pg_buffercache_numa_pages())

Both are synced now.

> === 8
>
> +       /*
> +        * This is for gathering some NUMA statistics. We might be using various
> +        * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
> +        * various different OS memory pages sizes, so first we need to understand
> +        * the OS memory page size before calling move_pages()
> +        */
> +       os_page_size = pg_numa_get_pagesize();
>
> Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling
> pg_numa_get_pagesize()?

Done, improved the style of the comment there and synced the
pg_buffercache one with the shmem.c one.

> === 9
>
> +       max_zones = pg_numa_get_max_node();
>
> I think we are mixing "zone" and "node". I think we should standardize on one
> and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> vote for node, but zone is fine too if you prefer.

Given that numa(7) does not use the "zone" keyword at all and both
/proc/zoneinfo and /proc/pagetypeinfo show that NUMA nodes are split
into zones, we can conclude that a "zone" is simply a subdivision within
a NUMA node's memory (an internal kernel thing). Examples are ZONE_DMA,
ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
internal information about zones), therefore we should stay away from
using "zone" within the patch at all, as we are just fetching NUMA
node info. My bad, it's a terminology error on my side from the start -
I probably saw "zone" info in some command output back when we
had that workshop, started using it, and somehow it propagated
through the patchset up to this day... I've adjusted it all and
settled on the "numa_node_id" column name.

> === 10
>
> +       /*
> +        * Preallocate memory all at once without going into details which shared
> +        * memory segment is the biggest (technically min s_b can be as low as
> +        * 16xBLCKSZ)
> +        */
>
> What about?
>
> "
> Allocate memory for page pointers and status based on total shared memory size.
> This simplified approach allocates enough space for all pages in shared memory
> rather than calculating the exact requirements for each segment.
> " instead?

Done.

> === 11
>
> +       int                     shm_total_page_count,
> +                               shm_ent_page_count,
>
> I think those 2 should be uint64.

Right...

> === 12
>
> +       /*
> +        * XXX: We are ignoring in NUMA version reporting of the following regions
> +        * (compare to pg_get_shmem_allocations() case): 1. output shared memory
> +        * allocated but not counted via the shmem index 2. output as-of-yet
> +        * unused shared memory
> +        */
>
> why XXX?
>
> what about?
>
> "
> We are ignoring the following memory regions (as compared to
> pg_get_shmem_allocations())....

Fixed, it was apparently a leftover from when I was thinking about
whether we should still report it.

> === 13
>
> +       page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
> +       memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
>
> maybe we could use palloc0() here?

Of course!++

> === 14
>
> and I realize that we could probably use it in 0002 for os_page_ptrs.

Of course!++

> === 15
>
> I think there is still some multi-lines comments that are missing a period. I
> probably also missed some in 0002 during the previous review. I think that's
> worth another check.

Please do such a check, I've tried pgindent on all .c files, but I'm
apparently blind to such issues. BTW, if the patch has anything left that
pgindent would fix, is that not picked up by CI but picked up by the
buildfarm??

> === 16
>
> +    * In order to get reliable results we also need to touch memory
> +    * pages so that inquiry about NUMA zone doesn't return -2.
> +    */
>
> maybe use the same wording as in 0002?

But 0002 used:

  "In order to get reliable results we also need to touch memory pages, so that
  inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated
  pages)"

or are you looking at something different?

> === 17
>
> The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> but (as for 0002) it looks like we can not avoid it (or at least I don't see
> a way to avoid it).

Hm, it's literally debug code. Why would we care so much if it is 2
loops rather than 1? (as stated earlier we need to pack ptrs and then
analyze it)

> I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> updated.

Cool, thanks in advance.


-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
> On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Thanks for v13!
> 
> Rebased, with fixes included, in the attached v14 (it passes CI too):

Thanks!

> > === 9
> >
> > +       max_zones = pg_numa_get_max_node();
> >
> > I think we are mixing "zone" and "node". I think we should standardize on one
> > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> > vote for node, but zone is fine too if you prefer.
> 
> Given that numa(7) does not use the "zone" keyword at all and both
> /proc/zoneinfo and /proc/pagetypeinfo show that NUMA nodes are split
> into zones, we can conclude that a "zone" is simply a subdivision within
> a NUMA node's memory (an internal kernel thing). Examples are ZONE_DMA,
> ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
> internal information about zones), therefore we should stay away from
> using "zone" within the patch at all, as we are just fetching NUMA
> node info. My bad, it's a terminology error on my side from the start -
> I probably saw "zone" info in some command output back when we
> had that workshop, started using it, and somehow it propagated
> through the patchset up to this day...

Thanks for the explanation.

> I've adjusted it all and settled on the "numa_node_id" column name.

Yeah, I can see, things like:

+  <para>
+   The definitions of the columns exposed are identical to the
+   <structname>pg_buffercache</structname> view, except that this one includes
+   one additional <structfield>numa_node_id</structfield> column as defined in
+   <xref linkend="pgbuffercache-numa-columns"/>.

and like:

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>numa_size</structfield> <type>int4</type>
+      </para>

I think that you re-introduced the "numa_" in the column(s) name that we
previously got rid of (or agreed to get rid of).

I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.

I think "node_id", "size" as column(s) name should be enough.

Or maybe re-adding "numa_" was intentional?

> > === 15
> >
> > I think there is still some multi-lines comments that are missing a period. I
> > probably also missed some in 0002 during the previous review. I think that's
> > worth another check.
> 
> Please do such a check,

Found much more:


+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)

is missing the period

and

+       /*
+        * Switch context when allocating stuff to be used in later calls
+        */

should be as before, meaning on current HEAD:

/* Switch context when allocating stuff to be used in later calls */

and

+       /*
+        * Return to original context when allocating transient memory
+        */

should be as before, meaning on current HEAD:

/* Return to original context when allocating transient memory */

and

+       /*
+        * Note if the buffer is valid, and has storage created
+        */

should be as before, meaning on current HEAD:

/* Note if the buffer is valid, and has storage created */

and

+               /*
+                * unused for v1.0 callers, but the array is always long enough
+                */


should be as before, meaning on current HEAD:

/* unused for v1.0 callers, but the array is always long enough */

and

+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings

is missing the period

and

+                * To correctly map between them, we need to: - Determine the OS
+                * memory page size - Calculate how many OS pages are used by all
+                * buffer blocks - Calculate how many OS pages are contained within
+                * each database block

is missing the period (2 times as this comment appears in 0002 and 0003)

and

+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here


is missing the period (2 times as this comment appears in 0002 and 0003)

and

+       /*
+        * We are ignoring the following memory regions (as compared to
+        * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+        * counted via the shmem index 2. output as-of-yet unused shared memory

is missing the period

> I've tried pgindent on all .c files, but I'm
> apparently blind to such issues.

I don't think pgindent would report a missing period.

> BTW, if the patch has anything left that
> pgindent would fix, is that not picked up by CI but picked up by the
> buildfarm??

I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.

> > === 16
> >
> > +    * In order to get reliable results we also need to touch memory
> > +    * pages so that inquiry about NUMA zone doesn't return -2.
> > +    */
> >
> > maybe use the same wording as in 0002?
> 
> But 0002 used:
> 
>   "In order to get reliable results we also need to touch memory pages, so that
>   inquiry about NUMA zone doesn't return -2 (which indicates
> unmapped/unallocated
>   pages)"
> 
> or are you looking at something different?

Nope, I meant to say that it could make sense to have the exact same comment.

> > === 17
> >
> > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> > but (as for 0002) it looks like we can not avoid it (or at least I don't see
> > a way to avoid it).
> 
> Hm, it's literally debug code. Why would we care so much if it is 2
> loops rather than 1? (as stated earlier we need to pack ptrs and then
> analyze it)

Yeah, but if we could just loop one time I'm pretty sure we'd have done that.

> > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> > updated.

> Cool, thanks in advance.

0001 looks to be in good shape from my point of view.

For 0002:

=== 1

I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.

Maybe something like:

0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above

We did it that way in c2a50ac678e and ff7c40d7fd6 for example.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi! v15 attached, rebased, CI-tested, all fixes incorporated.

> > I've adjusted it all and settled on "numa_node_id" column name.
>
[...]
> I think that we can get rid of the "numa_" stuff in column(s) name as
> the column(s) are part of "numa" relation views/output anyway.
[...]

Done, you are probably right (it was done to keep consistency between
those two views); I'm just not that strongly attached to naming things.

> > Please do such a check,
>
> Found much more:
>
[.. 9 issues with missing dots at the end of sentences in comments +
fixes to comment structure in relation to HEAD..]

All fixed.

> > BTW, if the patch has anything left that
> > pgindent would fix, is that not picked up by CI but picked up by the
> > buildfarm??
>
> I think it has to be done manually before each commit and that this is anyway
> done at least once per release cycle.

OK, thanks for clarification.

[..]
> >
> > But 0002 used:
> >
> >   "In order to get reliable results we also need to touch memory pages, so that
> >   inquiry about NUMA zone doesn't return -2 (which indicates
> > unmapped/unallocated
> >   pages)"
> >
> > or are you looking at something different?
>
> Nope, I meant to say that it could make sense to have the exact same comment.

Synced those two.

[..]
>
> 0001 looks to be in good shape from my point of view.

Cool!

> For 0002:
>
> === 1
>
> I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
> deserve their own patch. That would ease the review for the "real" numa stuff.
>

Done, 0001+0002 alone passes the meson test.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi! v15 attached, rebased, CI-tested, all fixes incorporated.

This needs to be rebased after 8eadd5c73c.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Mar 27, 2025 at 12:31 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi! v15 attached, rebased, CI-tested, all fixes incorporated.
>
> This needs to be rebased after 8eadd5c73c.

Hi Nazir, thanks for spotting! I hadn't connected the dots between AIO
going in and my libnuma dependency blowing up... Attached is a rebased
v16 that passed my CI run.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Álvaro Herrera
Date:
Hello

I think you should remove numa_warn() and numa_error() from 0001.
AFAICS they are dead code (even with all your patches applied), and
furthermore would get you in trouble regarding memory allocation because
src/port is not allowed to use palloc et al.  If you wanted to keep them
you'd have to have them in src/common, but looking at the rest of the
code in that patch, ISTM src/port is the right place for it.  If in the
future you discover that you do need numa_warn(), you can create a
src/common/ file for it then.

Is pg_buffercache really the best place for these NUMA introspection
routines?  I'm not saying that it isn't, maybe we're okay with that
(particularly if we can avoid duplicated code), but it seems a bit weird
to me.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo."                 (Augusto Pinochet a una corte de justicia)



Re: Draft for basic NUMA observability

From
Andres Freund
Date:
Hi,

On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
>    setup_additional_packages_script: |
> -    #apt-get update
> -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> +    apt-get update
> +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> +      libnuma1 \
> +      libnuma-dev

I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.

> +#
> +# libnuma
> +#
> +AC_MSG_CHECKING([whether to build with libnuma support])
> +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],

Most other dependencies say "build with libxyz ..."


> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *      Basic NUMA portability routines
> + *
> + *
> + * Copyright (c) 2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *     src/include/port/pg_numa.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NUMA_H
> +#define PG_NUMA_H
> +
> +#include "c.h"
> +#include "postgres.h"

Headers should never include either of those headers.  Nor should .c files
include both.


> @@ -200,6 +200,8 @@ pgxs_empty = [
>  
>    'ICU_LIBS',
>  
> +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> +
>    'LIBURING_CFLAGS', 'LIBURING_LIBS',
>  ]

Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?


> +Size
> +pg_numa_get_pagesize(void)
> +{
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +
> +    return os_page_size;
> +}

Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?


> +#ifndef FRONTEND
> +/*
> + * XXX: not really tested as there is no way to trigger this in our
> + * current usage of libnuma.
> + *
> + * The libnuma built-in code can be seen here:
> + * https://github.com/numactl/numactl/blob/master/libnuma.c
> + *
> + */
> +void
> +numa_warn(int num, char *fmt,...)
> +{
> +    va_list        ap;
> +    int            olde = errno;
> +    int            needed;
> +    StringInfoData msg;
> +
> +    initStringInfo(&msg);
> +
> +    va_start(ap, fmt);
> +    needed = appendStringInfoVA(&msg, fmt, ap);
> +    va_end(ap);
> +    if (needed > 0)
> +    {
> +        enlargeStringInfo(&msg, needed);
> +        va_start(ap, fmt);
> +        appendStringInfoVA(&msg, fmt, ap);
> +        va_end(ap);
> +    }
> +
> +    ereport(WARNING,
> +            (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
> +             errmsg_internal("libnuma: WARNING: %s", msg.data)));

I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.
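
(A minimal sketch of that, with the rest of numa_warn() kept as quoted above:)

    HOLD_INTERRUPTS();

    ereport(WARNING,
            (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
             errmsg_internal("libnuma: WARNING: %s", msg.data)));

    RESUME_INTERRUPTS();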

> +Size
> +pg_numa_get_pagesize(void)
> +{
> +#ifndef WIN32
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +#else
> +    Size        os_page_size;
> +    SYSTEM_INFO sysinfo;
> +
> +    GetSystemInfo(&sysinfo);
> +    os_page_size = sysinfo.dwPageSize;
> +#endif
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +    return os_page_size;
> +}


I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.
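
(Roughly, i.e. a single definition with just the page-size probe ifdef'ed - a
sketch, not the literal patch:)

Size
pg_numa_get_pagesize(void)
{
    Size        os_page_size;

#ifdef WIN32
    SYSTEM_INFO sysinfo;

    GetSystemInfo(&sysinfo);
    os_page_size = sysinfo.dwPageSize;
#else
    os_page_size = sysconf(_SC_PAGESIZE);
#endif

    if (huge_pages_status == HUGE_PAGES_ON)
        GetHugePageSize(&os_page_size, NULL);

    return os_page_size;
}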


> From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Wed, 19 Mar 2025 09:34:56 +0100
> Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
>  primary function to separate functions.
> 
> This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> and get_buffercache_tuple() that help fill result tuplestores based on the
> buffercache contents. This will be used in a follow-up commit that implements
> NUMA observability in pg_buffercache.
> 
> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
>  1 file changed, 178 insertions(+), 139 deletions(-)
> 
> diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> index 62602af1775..86e0b8afe01 100644
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -14,7 +14,6 @@
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
>  
> -
>  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM    8

Independent change.


>  #define NUM_BUFFERCACHE_PAGES_ELEM    9
>  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
>  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
>  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
>  
> -Datum
> -pg_buffercache_pages(PG_FUNCTION_ARGS)
> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> + * memory mapping inquiries to display NUMA node information for each buffer.
> + */

If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?



> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * Build buffer cache information for a single buffer.
> + */
> +static void
> +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> +{

This isn't really building a tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.


> +    BufferDesc *bufHdr;
> +    uint32        buf_state;
> +
> +    bufHdr = GetBufferDescriptor(record_id);
> +    /* Lock each buffer header before inspecting. */
> +    buf_state = LockBufHdr(bufHdr);
> +
> +    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> +    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> +    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> +    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> +    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> +    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> +    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> +    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);

As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.



> +static Datum
> +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> +{
> +    Datum        values[NUM_BUFFERCACHE_PAGES_ELEM];
> +    bool        nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> +    HeapTuple    tuple;
> +
> +    values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> +    nulls[0] = false;
> +
> +    /*
> +     * Set all fields except the bufferid to null if the buffer is unused or
> +     * not valid.
> +     */
> +    if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> +        fctx->record[record_id].isvalid == false)
> +    {
> +        nulls[1] = true;
> +        nulls[2] = true;
> +        nulls[3] = true;
> +        nulls[4] = true;
> +        nulls[5] = true;
> +        nulls[6] = true;
> +        nulls[7] = true;
> +
> +        /* unused for v1.0 callers, but the array is always long enough */
> +        nulls[8] = true;

I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.


> +    }
> +    else
> +    {
> +        values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> +        nulls[1] = false;
> +        values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> +        nulls[2] = false;
> +        values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> +        nulls[3] = false;
> +        values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> +        nulls[4] = false;
> +        values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> +        nulls[5] = false;
> +        values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> +        nulls[6] = false;
> +        values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> +        nulls[7] = false;

Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable.  Unfortunately that'd probably
be best done in one more commit ahead of the rest of this one...


> @@ -0,0 +1,28 @@
> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif

You could avoid the need for an alternative output file if you instead made
the queries do something like
  SELECT NOT pg_numa_available() OR count(*) ...



> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,42 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the new function.
> +DROP VIEW pg_buffercache;
> +DROP FUNCTION pg_buffercache_pages();

I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.


(Sorry, ran out of time / energy here, I had originally just wanted to comment
on the apt-get thing in the tests)


Greetings,

Andres Freund



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Hello

Good morning :)

> I think you should remove numa_warn() and numa_error() from 0001.
> AFAICS they are dead code (even with all your patches applied), and
> furthermore would get you in trouble regarding memory allocation because
> src/port is not allowed to use palloc et al.  If you wanted to keep them
> you'd have to have them in src/common, but looking at the rest of the
> code in that patch, ISTM src/port is the right place for it.  If in the
> future you discover that you do need numa_warn(), you can create a
> src/common/ file for it then.

Understood, trimmed it out from the patch. I'm also going to respond to
Andres' review within minutes, and I'll post a new version (v17) there.

> Is pg_buffercache really the best place for these NUMA introspection
> routines?  I'm not saying that it isn't, maybe we're okay with that
> (particularly if we can avoid duplicated code), but it seems a bit weird
> to me.

I think it is, because, as I understand it, Andres wanted to have
observability per single database *page*, and to avoid code duplication
we are just putting it there (it's a natural fit). Imagine looking up an
8kB root btree memory page being hit hard from CPUs on other NUMA
nodes (this just gives the ability to see that, but you could of course
also aggregate to get e.g. the NUMA node balance for a single relation
and so on).

-J.



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

Hi Andres,

> On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
> >    setup_additional_packages_script: |
> > -    #apt-get update
> > -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> > +    apt-get update
> > +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> > +      libnuma1 \
> > +      libnuma-dev
>
> I think libnuma is installed on the relevant platforms, so you shouldn't need
> to install it manually.

Fixed. Right, you mentioned this earlier, I just didn't know when it went online.

> > +#
> > +# libnuma
> > +#
> > +AC_MSG_CHECKING([whether to build with libnuma support])
> > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
>
> Most other dependencies say "build with libxyz ..."

Done.

> > + * pg_numa.h
[..]
> > +#include "c.h"
> > +#include "postgres.h"
>
> Headers should never include either of those headers.  Nor should .c files
> include both.

Fixed. Huh, I've found the explanation:
https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us

> > @@ -200,6 +200,8 @@ pgxs_empty = [
> >
> >    'ICU_LIBS',
> >
> > +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> > +
> >    'LIBURING_CFLAGS', 'LIBURING_LIBS',
> >  ]
>
> Maybe I am missing something, but are you actually defining and using those
> LIBNUMA_* vars anywhere?

OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)`
in configure.ac, which would set those *FLAGS. I'm a little bit lost on
how to guarantee that meson stays in sync with autoconf as per the
project requirements - I'm trying to use past commits as a reference,
but I could still get something wrong here (especially in
src/Makefile.global.in)

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> Should this have a comment or an assertion that it can only be used after
> shared memory startup? Because before that huge_pages_status won't be
> meaningful?

Added both.

> > +#ifndef FRONTEND
> > +/*
> > + * XXX: not really tested as there is no way to trigger this in our
> > + * current usage of libnuma.
> > + *
> > + * The libnuma built-in code can be seen here:
> > + * https://github.com/numactl/numactl/blob/master/libnuma.c
> > + *
> > + */
> > +void
> > +numa_warn(int num, char *fmt,...)
[..]
>
> I think you would at least have to hold interrupts across this, as
> ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
> out of libnuma in case an interrupt has arrived.

At Álvaro's request I've removed it, as that code is simply
unreachable/untestable, but point taken - I'm planning to re-add this
with interrupts held in the future, when we start using proper
numa_interleave() one day. Anyway, please let me know if you still want
to keep it as dead code. BTW, for context, why this is dead code is
explained in the latter part of the [1] message (TL;DR: unless we use
pinning/numa_interleave/local_alloc() we probably never reach those
warning/error handlers).

"Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine..."

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> I would probably implement this once, outside of the big ifdef, with one more
> ifdef inside, given that you're sharing the same implementation.

Done.

> > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> > From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> > Date: Wed, 19 Mar 2025 09:34:56 +0100
> > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
> >  primary function to separate functions.
> >
> > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> > and get_buffercache_tuple() that help fill result tuplestores based on the
> > buffercache contents. This will be used in a follow-up commit that implements
> > NUMA observability in pg_buffercache.
> >
> > Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> > ---
> >  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
> >  1 file changed, 178 insertions(+), 139 deletions(-)
> >
> > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> > index 62602af1775..86e0b8afe01 100644
> > --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> > @@ -14,7 +14,6 @@
> >  #include "storage/buf_internals.h"
> >  #include "storage/bufmgr.h"
> >
> > -
> >  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM       8
>
> Independent change.

Fixed.

> >  #define NUM_BUFFERCACHE_PAGES_ELEM   9
> >  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
> >  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
> >  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
> >
> > -Datum
> > -pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> > + * memory mapping inquiries to display NUMA node information for each buffer.
> > + */
>
> If it's a helper routine it's probably not identical to
> pg_buffercache_numa_pages()?

Of course not, fixed by removing that comment.

> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * Build buffer cache information for a single buffer.
> > + */
> > +static void
> > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
>
> This isn't really building a tuple? Seems somewhat confusing, because
> get_buffercache_tuple() does actually build one.

s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g, unless
someone wants to come up with a better name.

> > +     BufferDesc *bufHdr;
> > +     uint32          buf_state;
> > +
> > +     bufHdr = GetBufferDescriptor(record_id);
> > +     /* Lock each buffer header before inspecting. */
> > +     buf_state = LockBufHdr(bufHdr);
> > +
> > +     fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> > +     fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> > +     fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> > +     fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> > +     fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> > +     fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> > +     fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> > +     fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
>
> As above, I think this would be more readable if you put
> fctx->record[record_id] into a local var.

Done.
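
(For the record, the refactor ends up roughly like this; BufferCachePagesRec is
the existing per-record struct name assumed here:)

    BufferCachePagesRec *bufRecord = &fctx->record[record_id];
    BufferDesc *bufHdr;
    uint32      buf_state;

    bufHdr = GetBufferDescriptor(record_id);
    buf_state = LockBufHdr(bufHdr);

    bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
    bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
    bufRecord->reltablespace = bufHdr->tag.spcOid;
    bufRecord->reldatabase = bufHdr->tag.dbOid;
    bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
    bufRecord->blocknum = bufHdr->tag.blockNum;
    bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
    bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
    /* isdirty/isvalid handling and UnlockBufHdr() follow as before */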

> > +static Datum
> > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
> > +     Datum           values[NUM_BUFFERCACHE_PAGES_ELEM];
> > +     bool            nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> > +     HeapTuple       tuple;
> > +
> > +     values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> > +     nulls[0] = false;
> > +
> > +     /*
> > +      * Set all fields except the bufferid to null if the buffer is unused or
> > +      * not valid.
> > +      */
> > +     if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> > +             fctx->record[record_id].isvalid == false)
> > +     {
> > +             nulls[1] = true;
> > +             nulls[2] = true;
> > +             nulls[3] = true;
> > +             nulls[4] = true;
> > +             nulls[5] = true;
> > +             nulls[6] = true;
> > +             nulls[7] = true;
> > +
> > +             /* unused for v1.0 callers, but the array is always long enough */
> > +             nulls[8] = true;
>
> I'd probably just memset the entire nulls array to either true or false,
> instead of doing it one-by-one.

Done.

> > +     }
> > +     else
> > +     {
> > +             values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> > +             nulls[1] = false;
> > +             values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> > +             nulls[2] = false;
> > +             values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> > +             nulls[3] = false;
> > +             values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> > +             nulls[4] = false;
> > +             values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> > +             nulls[5] = false;
> > +             values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> > +             nulls[6] = false;
> > +             values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> > +             nulls[7] = false;
>
> Seems like it would end up a lot more readable if you put
> fctx->record[record_id] into a local variable.  Unfortunately that'd probably
> be best done in one more commit ahead of the rest of this one...

Done, I've put those refactoring changes into the commit already
dedicated to the refactor. For the record, Bertrand also asked for
something like this, but I was somehow afraid to touch Tom's code.

> > @@ -0,0 +1,28 @@
> > +SELECT NOT(pg_numa_available()) AS skip_test \gset
> > +\if :skip_test
> > +\quit
> > +\endif
>
> You could avoid the need for an alternative output file if you instead made
> the queries do something like
>   SELECT NOT pg_numa_available() OR count(*) ...

ITEM REMAINING: Is this for the future, or can it stay like that? I
don't have a strong opinion on this, but I've already wasted lots of
cycles discovering that one can have those ".1" alternative expected
result files.

> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,42 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the new function.
> > +DROP VIEW pg_buffercache;
> > +DROP FUNCTION pg_buffercache_pages();
>
> I don't think we can just drop a view in the upgrade script. That will fail if
> anybody created a view depending on pg_buffercache.

Ugh, fixed, thanks. That must have been some leftover (we later do
CREATE OR REPLACE those anyway).

> (Sorry, ran out of time / energy here, i had originally just wanted to comment
> on the apt-get thing in the tests)

Thanks! AIO intensifies ... :)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
> On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > +Size
> > > +pg_numa_get_pagesize(void)
> [..]
> >
> > Should this have a comment or an assertion that it can only be used after
> > shared memory startup? Because before that huge_pages_status won't be
> > meaningful?
> 
> Added both.

Thanks for the updated version!

+       Assert(IsUnderPostmaster);

I wonder if that would make more sense to add an assertion on huge_pages_status
and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
CreateSharedMemoryAndSemaphores()).
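
(i.e. at the top of pg_numa_get_pagesize(), something along these lines - a
sketch, keeping both checks for illustration:)

    /* shared memory must already be set up, otherwise the status is meaningless */
    Assert(IsUnderPostmaster);
    Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);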

=== About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch

Once applied I can see mention to pg_buffercache_numa_pages() while it
only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch.

I think pg_buffercache_numa_pages() should not be mentioned before it's actually
implemented.

=== 1

+               bufRecord->isvalid == false)
        {
                int                     i;

-               funcctx = SRF_FIRSTCALL_INIT();
-
-               /* Switch context when allocating stuff to be used in later calls */
-               oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-               /* Create a user function context for cross-call persistence */
-               fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+               for (i = 1; i <= 9; i++)
+                       nulls[i] = true; 

"i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
is increased to 10).

In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).

That could also make sense to remove the loop and use memset() that way:

"
memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
"

instead. It's done that way in some other places (hbafuncs.c for example).

=== 2

-               if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
-                       TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
-                                                          INT4OID, -1, 0);

+       if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+               TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+                                                  INT4OID, -1, 0);

I think we should not change the "expected_tupledesc->natts" check here until
v17-0003 is applied.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Hi,

Hi Bertrand, happy to see you back, thanks for review and here's v18
attached (an ideal fit for PG18 ;))

> On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
> > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > > +Size
> > > > +pg_numa_get_pagesize(void)
> > [..]
> > >
> > > Should this have a comment or an assertion that it can only be used after
> > > shared memory startup? Because before that huge_pages_status won't be
> > > meaningful?
> >
> > Added both.
>
> Thanks for the updated version!
>
> +       Assert(IsUnderPostmaster);
>
> I wonder if it would make more sense to add an assertion on huge_pages_status
> and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
> CreateSharedMemoryAndSemaphores()).

Ok, let's have both just in case (this status is initialized to
_UNKNOWN by default, so I hope you didn't have GetConfigOption() in
mind, as that would need guc.h in port?)

> === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch
[..]
> I think pg_buffercache_numa_pages() should not be mentioned before it's actually
> implemented.

Right, fixed.

> === 1
>
[..]
> "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
> is increased to 10).
>
> In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).
>
> That could also make sense to remove the loop and use memset() that way:
>
> "
> memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
> "
>
> instead. It's done that way in some other places (hbafuncs.c for example).

Ouch, good catch.

> === 2
>
> -               if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
> -                       TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
> -                                                          INT4OID, -1, 0);
>
> +       if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
> +               TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
> +                                                  INT4OID, -1, 0);
>
> I think we should not change the "expected_tupledesc->natts" check here until
> v17-0003 is applied.

Right, I've moved that into 0003 where it belongs, and now 0002 has
not a single NUMA reference. I've thrown 0001+0002 alone onto CI and it
passed too.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi Jakub,

On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
> On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Hi,
> 
> Hi Bertrand, happy to see you back, thanks for review and here's v18
> attached (an ideal fit for PG18 ;))

Thanks for the new version!

=== About v18-0002

It looks in a good shape to me. The helper's name might still be debatable
though.

I just have 2 comments:

=== 1

+       if (bufRecord->blocknum == InvalidBlockNumber ||
+               bufRecord->isvalid == false)

It seems to me that this check could now fit in one line.

=== 2

+       {
                SRF_RETURN_DONE(funcctx);
+       }

Extra parentheses are not needed.

=== About v18-0003

=== 3

I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
only the necessary changes when updating from 1.5. It means that it should "only"
create the new objects (views and functions in our case) that come in v18-0003 
and grant appropriate privs.

Also it should mention:

"
\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
"
and not:

"
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
"

The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.

=== 4

+               for (i = 0; i < NBuffers; i++)
+               {
+                       int                     blk2page = (int) i * pages_per_blk;
+

I think that should be:

int blk2page = (int) (i * pages_per_blk);

=== About v18-0004

=== 5

When running:

select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocations s
where c.name = s.name;

I can see:

- pg_shmem_numa_allocations reporting a lot of times the same size
- pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size

Do you observe the same?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
Hi,

I've spent a bit of time reviewing this. In general I haven't found
anything I'd call a bug, but here's a couple comments for v18 ... Most
of this is in separate "review" commits, with a couple exceptions.

1) Please update the commit messages, with proper formatting, etc. I
tried to do that in the attached v19, but please go through that, add
relevant details, update list of reviewers, etc. The subject should not
be overly long, etc.

2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
use just "libnuma support" similar to other libraries.

3) I don't think we need pg_numa.h to have this:

extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);

AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
would it be necessary here? It's enough to have a prototype in .c file.

4) Improved .sgml to have acronym/productname in a couple places.

5) I don't think the comment for pg_buffercache_init_entries() is very
useful. That it's helper for pg_buffercache_pages() tells me nothing
about how to use it, what the arguments are, etc.

6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
comment too. I mean, there's no info about what the arguments are, which
arguments are input or output, etc. And it only discussed one option
(block page < memory page), but what happens in the other case? The
formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
saying it's wrong).

However, it seems rather suspicious that pages_per_blk is calculated as
float, and pg_buffercache_numa_prepare_ptrs() then does this:

    for (size_t j = 0; j < pages_per_blk; j++)
    { ... }

I mean, isn't this vulnerable to rounding errors, which might trigger
some weird behavior? If not, it'd be good to add a comment why this is
fine, it confuses me a lot. I personally would probably prefer doing
just integer arithmetic here.


7) This para in the docs seems self-contradictory:

 <para>
  The <function>pg_buffercache_numa_pages()</function> provides the same
information
  as <function>pg_buffercache_pages()</function> but is slower because
it also
  provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
  The <structname>pg_buffercache_numa</structname> view wraps the
function for
  convenient use.
 </para>

I mean, "provides the same information, but is slower because it
provides different information" is strange. I understand the point, but
maybe rephrase somehow?

8) Why is pg_numa_available() marked as volatile? Should not change in a
running cluster, no?

9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
let's not do that.

   <sect2 id="pgbuffercache-pg-buffercache_numa">

10) I think it'd be good to mention/explain why move_pages is used
instead of get_mempolicy - more efficient with batching, etc. This would
be useful both in the commit message and before the move_pages call (and
in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
pointers like this etc.).

11) This could use UINT64_FORMAT, instead of a cast:

    elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
pages_per_blk=%.2f",
        (unsigned long) os_page_count, os_page_size, pages_per_blk);


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,
>
> On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
> > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > > Hi,
> >
> > Hi Bertrand, happy to see you back, thanks for review and here's v18
> > attached (an ideal fit for PG18 ;))
>
> Thanks for the new version!
>

Hi Bertrand,

I'll be attaching v20 when responding to the follow-up review by Tomas
to avoid double-sending this in the list, but review findings from
here are fixed in v20.

> === About v18-0002
>
> It looks in a good shape to me. The helper's name might still be debatable
> though.
>
> I just have 2 comments:
>
> === 1
>
> +       if (bufRecord->blocknum == InvalidBlockNumber ||
> +               bufRecord->isvalid == false)
>
> It seems to me that this check could now fit in one line.

OK.

> === 2
>
> +       {
>                 SRF_RETURN_DONE(funcctx);
> +       }
>
> Extra parentheses are not needed.

This also should be fixed in v20.

>
> === About v18-0003
>
> === 3
>
> I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
> only the necessary changes when updating from 1.5. It means that it should "only"
> create the new objects (views and functions in our case) that come in v18-0003
> and grant appropriate privs.
>
> Also it should mention:
>
> "
> \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
> "
> and not:
>
> "
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> "
>
> The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.

Hm, good find, it might be a leftover from the first attempts where we
were aiming just to add a numa_node_id column to pg_buffercache_pages()
(rather than adding a new view); all hopefully fixed.

> === 4
>
> +               for (i = 0; i < NBuffers; i++)
> +               {
> +                       int                     blk2page = (int) i * pages_per_blk;
> +
>
> I think that should be:
>
> int blk2page = (int) (i * pages_per_blk);

OK, but I still fail to grasp why pg_indent doesn't fix this stuff on
its own... I believe the original indent would fix this on its own?

> === About v18-0004
>
> === 5
>
> When running:
>
> select c.name, c.size as num_size, s.size as shmem_size
> from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
> pg_shmem_allocations s
> where c.name = s.name;
>
> I can see:
>
> - pg_shmem_numa_allocations reporting a lot of times the same size
> - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
>
> Do you observe the same?
>

Yes, it is actually by design: pg_shmem_numa_allocations.size is a sum of
page sizes, not the size of the struct, e.g. with "order by 3 desc":

                      name                      | num_size  | shmem_size
------------------------------------------------+-----------+------------
 WaitEventCustomCounterData                     |      4096 |          8
 Archiver Data                                  |      4096 |          8
 SerialControlData                              |      4096 |         16

I was even wondering if it would make sense to output the "shm pointer
address" (or at least the offset) there, to see which shm structures are
on the same page, but we already have that in pg_shm_allocations.

-J.



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I've spent a bit of time reviewing this. In general I haven't found
> anything I'd call a bug, but here's a couple comments for v18 ... Most
> of this is in separate "review" commits, with a couple exceptions.

Hi, thank you very much for the help on this; I did not anticipate
this patch growing organically like that...
I've squashed those review findings into v20 and provided answers for
the "review:".

> 1) Please update the commit messages, with proper formatting, etc. I
> tried to do that in the attached v19, but please go through that, add
> relevant details, update list of reviewers, etc. The subject should not
> be overly long, etc.

Fixed by you.

> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
> use just "libnuma support" similar to other libraries.

Fixed by you.

> 3) I don't think we need pg_numa.h to have this:
>
> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>
> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
> would it be necessary here? It's enough to have a prototype in .c file.

Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.

> 4) Improved .sgml to have acronym/productname in a couple places.

Great.

> 5) I don't think the comment for pg_buffercache_init_entries() is very
> useful. That it's helper for pg_buffercache_pages() tells me nothing
> about how to use it, what the arguments are, etc.

I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much
documentation(?)

> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
> comment too. I mean, there's no info about what the arguments are, which
> arguments are input or output, etc. And it only discussed one option
> (block page < memory page), but what happens in the other case? The
> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
> saying it's wrong).
>
> However, it seems rather suspicious that pages_per_blk is calculated as
> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>
>     for (size_t j = 0; j < pages_per_blk; j++)
>     { ... }
>
> I mean, isn't this vulnerable to rounding errors, which might trigger
> some weird behavior? If not, it'd be good to add a comment why this is
> fine, it confuses me a lot. I personally would probably prefer doing
> just integer arithmetic here.

Please bear with me: If you set client_min_messages to debug1 and then
pg_buffercache_numa will dump:
a) without HP,  DEBUG:  NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG:  NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906

so we need to be agile to support two cases as you mention (BLCKSZ >
PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
4kB..1GB, thus we can get in that float the following sample values:
BLCKSZ pagesize
2kB    4kB                    = 0.5
2kB    2048kb                 = .0009765625
2kB    1024*1024kb # 1GB      = .0000019073486328125 # worst-case?
8kB    4kB                    = 2
8kB    2048kb                 = .003906250 # example from above (x86_64, 2M HP)
8kB    1024*1024kb # 1GB      = .00000762939453
32kB   4kB                    = 8
32kB   2048kb                 = .0156250
32kB   1024*1024kb # 1GB      = .000030517578125

So that loop:
    for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and launches in both cases. I've somehow failed to
come up with integer-based math and generic code for this (without
special cases, which seem to be a no-go here?). So, that loop then will:
a) launch many times to support BLCKSZ > pagesize, that is when single
DB block spans multiple memory pages
b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
example above)

The loop touches && stores addresses into os_page_ptrs[] as input to
one big move_pages(2) query. So we basically ask for all memory pages
backing NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to
Buffers; if there is a problem anywhere, it would be here.

So let's say we have s_b=4TB (it won't work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824,
which multiplied by 2 exceeds INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)

Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0 .

OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
    float f = 2097152 * 0.003906250;
under clang -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either:
- we somehow rewrite all of the core arithmetic here to integers?
- or we simply go with doubles just to be sure? I went with doubles in
v20; the explanatory comments are not there yet.
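
To make the above more concrete, the pointer-prep step has roughly this
shape (just a sketch, not the exact patch code; shmem_base is only a
stand-in here for wherever the buffer pool actually starts):

    /*
     * For every buffer, record the address of the OS page(s) it lives on,
     * so a single move_pages(2) call can report the NUMA node of each one.
     */
    for (size_t i = 0; i < (size_t) NBuffers; i++)
    {
        char   *buf = shmem_base + i * BLCKSZ;
        size_t  blk2page = (size_t) (i * pages_per_blk);

        /* the size_t-vs-float loop condition discussed above */
        for (size_t j = 0; j < pages_per_blk; j++)
            os_page_ptrs[blk2page + j] = buf + j * os_page_size;
    }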

> 7) This para in the docs seems self-contradictory:
>
>  <para>
>   The <function>pg_buffercache_numa_pages()</function> provides the same
> information
>   as <function>pg_buffercache_pages()</function> but is slower because
> it also
>   provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
>   The <structname>pg_buffercache_numa</structname> view wraps the
> function for
>   convenient use.
>  </para>
>
> I mean, "provides the same information, but is slower because it
> provides different information" is strange. I understand the point, but
> maybe rephrase somehow?

Oh my... yes, now it looks way better.

> 8) Why is pg_numa_available() marked as volatile? Should not change in a
> running cluster, no?

No it shouldn't, good find, made it 's'table.

> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
> let's not do that.
>
>    <sect2 id="pgbuffercache-pg-buffercache_numa">

Fixed.

> 10) I think it'd be good to mention/explain why move_pages is used
> instead of get_mempolicy - more efficient with batching, etc. This would
> be useful both in the commit message and before the move_pages call

Ok, added in 0001.
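
For the archives, the batched query itself is essentially this (a sketch,
not the exact patch code; os_page_count, os_page_ptrs and os_pages_status
are the variables discussed above):

    #include <numaif.h>

    /*
     * With nodes == NULL, move_pages(2) moves nothing; it only fills
     * status[] with the NUMA node id of each page, or a negative errno
     * (e.g. -ENOENT for pages that are not mapped in).
     */
    if (move_pages(0 /* current process */, os_page_count, os_page_ptrs,
                   NULL, os_pages_status, 0) < 0)
        elog(ERROR, "move_pages() failed: %m");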

> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
> pointers like this etc.).

Added reference to that earlier new comment here too in 0003.

> 11) This could use UINT64_FORMAT, instead of a cast:
>
>     elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
> pages_per_blk=%.2f",
>         (unsigned long) os_page_count, os_page_size, pages_per_blk);

Done.
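
i.e. it now reads roughly like this (sketch):

    elog(DEBUG1, "NUMA: os_page_count=" UINT64_FORMAT
         " os_page_size=%zu pages_per_blk=%.2f",
         os_page_count, os_page_size, pages_per_blk);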

12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"

OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.

13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"

OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1

Should we add such functionality?


-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi Jakub,

On Wed, Apr 02, 2025 at 04:45:53PM +0200, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > === 4
> >
> > +               for (i = 0; i < NBuffers; i++)
> > +               {
> > +                       int                     blk2page = (int) i * pages_per_blk;
> > +
> >
> > I think that should be:
> >
> > int blk2page = (int) (i * pages_per_blk);
> 
> > OK, but I still fail to grasp why pg_indent doesn't fix this stuff on
> > its own... I believe the original indent would fix this on its own?

My comment was not about indentation but about the fact that I think the
cast is not in the right place. I think it's the result of the multiplication
that we want to cast (the cast operator has higher precedence than the
multiplication operator).

> >
> > select c.name, c.size as num_size, s.size as shmem_size
> > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
> > pg_shmem_allocations s
> > where c.name = s.name;
> >
> > I can see:
> >
> > - pg_shmem_numa_allocations reporting a lot of times the same size
> > - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
> >
> > Do you observe the same?
> >
> 
> Yes, it is actually by design: pg_shmem_numa_allocations.size is a sum of
> page sizes, not the size of the struct,

Ok, but then does it make sense to see some num_size < shmem_size?

postgres=# select c.name, c.size as num_size, s.size as shmem_size
 from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
 pg_shmem_allocations s
 where c.name = s.name and s.size > c.size;
     name      | num_size  | shmem_size
---------------+-----------+------------
 XLOG Ctl      |   4194304 |    4208200
 Buffer Blocks | 134217728 |  134221824
 AioHandleIOV  |   2097152 |    2850816
(3 rows)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:

On 4/2/25 16:46, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've spent a bit of time reviewing this. In general I haven't found
>> anything I'd call a bug, but here's a couple comments for v18 ... Most
>> of this is in separate "review" commits, with a couple exceptions.
> 
> Hi, thank you very much for the help on this; I did not anticipate
> this patch growing organically like that...
> I've squashed those review findings into v20 and provided answers for
> the "review:".
> 

Thanks.

>> 1) Please update the commit messages, with proper formatting, etc. I
>> tried to do that in the attached v19, but please go through that, add
>> relevant details, update list of reviewers, etc. The subject should not
>> be overly long, etc.
> 
> Fixed by you.
> 

OK, so you agree the commit messages are complete / correct?

>> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
>> use just "libnuma support" similar to other libraries.
> 
> Fixed by you.
> 

OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.

>> 3) I don't think we need pg_numa.h to have this:
>>
>> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>>
>> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
>> would it be necessary here? It's enough to have a prototype in .c file.
> 
> Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
> 
>> 4) Improved .sgml to have acronym/productname in a couple places.
> 
> Great.
> 
>> 5) I don't think the comment for pg_buffercache_init_entries() is very
>> useful. That it's helper for pg_buffercache_pages() tells me nothing
>> about how to use it, what the arguments are, etc.
> 
> I've added an explanation (in 0003 though), so that this is covered.
> I've always assumed that 'static' functions don't need that much
> documentation(?)
> 

I think that's mostly true - the (non-static) functions that are part of
the API for a module need better / more detailed docs. But that doesn't
mean static functions shouldn't have at least basic docs (unless the
function is trivial / obvious, but I don't think that's the case here).

If I happen to work on a pg_buffercache patch in a couple months, I'll
still need to understand what the function does. The fact that I'm
working in the same file won't save me ...

I'm not saying this needs a detailed docs, but "helper for X" adds very
little information - I can easily see where it's called from, right?

>> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
>> comment too. I mean, there's no info about what the arguments are, which
>> arguments are input or output, etc. And it only discussed one option
>> (block page < memory page), but what happens in the other case? The
>> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
>> saying it's wrong).
>>
>> However, it seems rather suspicious that pages_per_blk is calculated as
>> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>>
>>     for (size_t j = 0; j < pages_per_blk; j++)
>>     { ... }
>>
>> I mean, isn't this vulnerable to rounding errors, which might trigger
>> some weird behavior? If not, it'd be good to add a comment why this is
>> fine, it confuses me a lot. I personally would probably prefer doing
>> just integer arithmetic here.
> 
> Please bear with me: If you set client_min_messages to debug1 and then
> pg_buffercache_numa will dump:
> a) without HP,  DEBUG:  NUMA: os_page_count=32768 os_page_size=4096
> pages_per_blk=2.00
> b) with HP (2M) DEBUG:  NUMA: os_page_count=64 os_page_size=2097152
> pages_per_blk=0.003906
> 
> so we need to be agile to support two cases as you mention (BLCKSZ >
> PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
> 4kB..1GB, thus we can get in that float the following sample values:
> BLCKSZ pagesize
> 2kB    4kB                    = 0.5
> 2kB    2048kb                 = .0009765625
> 2kB    1024*1024kb # 1GB      = .0000019073486328125 # worst-case?
> 8kB    4kB                    = 2
> 8kB    2048kb                 = .003906250 # example from above (x86_64, 2M HP)
> 8kB    1024*1024kb # 1GB      = .00000762939453
> 32kB   4kB                    = 8
> 32kB   2048kb                 = .0156250
> 32kB   1024*1024kb # 1GB      = .000030517578125
> 
> So that loop:
>     for (size_t j = 0; j < pages_per_blk; j++)
> is quite generic and launches in both cases. I've somehow failed to
> come up with integer-based math and generic code for this (without
> special cases, which seem to be a no-go here?). So, that loop then will:
> a) launch many times to support BLCKSZ > pagesize, that is when single
> DB block spans multiple memory pages
> b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
> example above)
> 

Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ...

I agree we don't want special cases for each possible combination of
page sizes (I'm not sure we even know all the combinations). What I was
thinking about is two branches, one for (block >= page) and another for
(block < page). AFAICK both values have to be 2^k, so this would
guarantee we have either (block/page) or (page/block) as integer.

I wonder if you could even just calculate both, and have one loop that
deals with both.
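
Roughly this shape is what I have in mind (only a sketch, with made-up
variable names; since both sizes are powers of two, at most one of the
two ratios is greater than one):

    Size    buffers_per_page = 1;
    Size    pages_per_buffer = 1;

    if (os_page_size >= BLCKSZ)
        buffers_per_page = os_page_size / BLCKSZ;   /* e.g. 2MB pages, 8kB blocks */
    else
        pages_per_buffer = BLCKSZ / os_page_size;   /* e.g. 4kB pages, 8kB blocks */

    /* one pointer per buffer is enough when several buffers share a page */
    Size    query_count = (buffers_per_page > 1)
        ? (Size) NBuffers
        : (Size) NBuffers * pages_per_buffer;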

> The loop touches && stores addresses into os_page_ptrs[] as input to
> one big move_pages(2) query. So we basically ask for all memory pages
> backing NBuffers. Once we get our NUMA information we then use blk2page =
> up_to_NBuffers * pages_per_blk to resolve memory pointers back to
> Buffers; if there is a problem anywhere, it would be here.
> 

IMHO this is the information I'd expect in the function comment.

> So let's say we have s_b=4TB (it won't work for sure for other reasons,
> let's assume we have it), let's also assume we have no huge
> pages (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824,
> which multiplied by 2 exceeds INT_MAX (integer overflow bug), so I think
> that int is not big enough there in pg_buffercache_numa_pages() (it
> should be "size_t blk2page" there as in
> pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)
> 
> Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
> NBuffers=2097152 * 0.003906250 = 8192.0 .
> 
> OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
>     float f = 2097152 * 0.003906250;
> under clang -Weverything I got "implicit conversion increases
> floating-point precision: 'float' to 'double'", so either:
> - we somehow rewrite all of the core arithmetic here to integers?
> - or we simply go with doubles just to be sure? I went with doubles in
> v20; the explanatory comments are not there yet.
> 

When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other data type. I mean that it works with non-floating point
values. It could be int64, Size or whatever is large enough to not
overflow. I really don't see how changing stuff to double makes this
easier to understand.

>> 7) This para in the docs seems self-contradictory:
>>
>>  <para>
>>   The <function>pg_buffercache_numa_pages()</function> provides the same
>> information
>>   as <function>pg_buffercache_pages()</function> but is slower because
>> it also
>>   provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
>>   The <structname>pg_buffercache_numa</structname> view wraps the
>> function for
>>   convenient use.
>>  </para>
>>
>> I mean, "provides the same information, but is slower because it
>> provides different information" is strange. I understand the point, but
>> maybe rephrase somehow?
> 
> Oh my... yes, now it looks way better.
> 
>> 8) Why is pg_numa_available() marked as volatile? Should not change in a
>> running cluster, no?
> 
> No it shouldn't, good find, made it 's'table.
> 
>> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
>> let's not do that.
>>
>>    <sect2 id="pgbuffercache-pg-buffercache_numa">
> 
> Fixed.
> 
>> 10) I think it'd be good to mention/explain why move_pages is used
>> instead of get_mempolicy - more efficient with batching, etc. This would
>> be useful both in the commit message and before the move_pages call
> 
> Ok, added in 0001.
> 
>> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
>> pointers like this etc.).
> 
> Added reference to that earlier new comment here too in 0003.
> 

Will take a look in the evening.

>> 11) This could use UINT64_FORMAT, instead of a cast:
>>
>>     elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
>> pages_per_blk=%.2f",
>>         (unsigned long) os_page_count, os_page_size, pages_per_blk);
> 
> Done.
> 
> 12) You have also raised "why not pg_shm_allocations_numa" instead of
> "pg_shm_numa_allocations"
> 
> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> naming things in general), I can change if you want.
> 

Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?

> 13) In the patch: "review: What if we get multiple pages per buffer
> (the default). Could we get multiple nodes per buffer?"
> 
> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> output multiple rows per single buffer (with "page_no") then we could
> get this:
> buffer1:..:page0:numanodeID1
> buffer1:..:page1:numanodeID2
> buffer2:..:page0:numanodeID1
> 
> Should we add such functionality?
> 


When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.

I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
rest of the fields are for the whole buffer; it'd be wrong to duplicate
that for each page.

I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.


regards

-- 
Tomas Vondra




Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Wed, Apr 2, 2025 at 5:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,

Hi Bertrand,

> > OK, but I still fail to grasp why pg_indent doesn't fix this stuff on
> > its own... I believe the original indent would fix this on its own?
>
> My comment was not about indentation but about the fact that I think the
> cast is not in the right place. I think it's the result of the multiplication
> that we want to cast (the cast operator has higher precedence than the
> multiplication operator).

Oh! I've missed that, but v21 got a rewrite (still not polished) just
to show it can be done without float points as Tomas requested.

[..]
> Ok, but then does it make sense to see some num_size < shmem_size?
>
> postgres=# select c.name, c.size as num_size, s.size as shmem_size
>  from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
>  pg_shmem_allocations s
>  where c.name = s.name and s.size > c.size;
>      name      | num_size  | shmem_size
> ---------------+-----------+------------
>  XLOG Ctl      |   4194304 |    4208200
>  Buffer Blocks | 134217728 |  134221824
>  AioHandleIOV  |   2097152 |    2850816

This was a real bug, fixed in v21: the ent->allocated_size was not
properly page aligned. Thanks for the attention to detail.
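
The gist of it is just rounding the size we scan up to a whole number of
OS pages, roughly (a sketch, variable names approximate):

    Size    total_len = TYPEALIGN(os_page_size, ent->allocated_size);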

-J.



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> OK, so you agree the commit messages are complete / correct?

Yes.

> OK. FWIW if you disagree with some of my proposed changes, feel free to
> push back. I'm sure some may be more a matter of personal preference.

No, it's all fine. I will probably have lots of questions about
setting up a proper development environment that takes care of style by
itself, but that's for another day.

> [..floats..]
> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
> confusing and difficult to reason about ...
>
> I agree we don't want special cases for each possible combination of
> page sizes (I'm not sure we even know all the combinations). What I was
> thinking about is two branches, one for (block >= page) and another for
> (block < page). AFAICK both values have to be 2^k, so this would
> guarantee we have either (block/page) or (page/block) as integer.
>
> I wonder if you could even just calculate both, and have one loop that
> deals with both.
>
[..]
> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
> or any other data type. I mean that it works with non-floating point
> values. It could be int64, Size or whatever is large enough to not
> overflow. I really don't see how changing stuff to double makes this
> easier to understand.

I hear you, attached v21 / 0003 is free of float/double arithmetics
and uses non-float point values. It should be more readable too with
those comments. I have not put it into its own function, because now
it fits the whole screen, so hopefully one can follow visually. Please
let me know if that code solves the doubts or feel free to reformat
it. That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.

> > 12) You have also raised "why not pg_shm_allocations_numa" instead of
> > "pg_shm_numa_allocations"
> >
> > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> > naming things in general), I can change if you want.
> >
>
> Me neither. I wonder if there's some precedent when adding similar
> variants for other catalogs ... can you check? I've been thinking about
> pg_stats and pg_stats_ext, but maybe there's a better example?

Hm, it seems we always go with suffix "_somethingnew":

* pg_stat_database -> pg_stat_database_conflicts
* pg_stat_subscription -> pg_stat_subscription_stats
* even here: pg_buffercache -> pg_buffercache_numa

@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...

> > 13) In the patch: "review: What if we get multiple pages per buffer
> > (the default). Could we get multiple nodes per buffer?"
> >
> > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> > output multiple rows per single buffer (with "page_no") then we could
> > get this:
> > buffer1:..:page0:numanodeID1
> > buffer1:..:page1:numanodeID2
> > buffer2:..:page0:numanodeID1
> >
> > Should we add such functionality?
>
> When you say "today no" does that mean we know all pages will be on the
> same node, or that there may be pages from different nodes and we can't
> display that? That'd not be great, IMHO.
>
> I'm not a huge fan of returning multiple rows per buffer, with one row
> per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
> rest of the fields are for the whole buffer; it'd be wrong to duplicate
> that for each page.

OPEN_QUESTION: With v21 we have all the information available, we are
just unable to display this in pg_buffercache_numa right now. We could
trim the view so that it has 3 columns (and the user needs to JOIN to
pg_buffercache for more details like the relation oid), but then what
was the whole refactor (0002) for if we would just return bufferId like
below:

buffer1:page0:numanodeID1
buffer1:page1:numanodeID2
buffer2:page0:numanodeID1
buffer2:page1:numanodeID1

There's also the problem that reading/joining could be inconsistent
and even slower.

> I wonder if we should have a bitmap of nodes for the buffer (but then
> what if there are multiple pages from the same node?), or maybe just an
> array of nodes, with one element per page.

AFAIR this was discussed back at the end of January, and the
conclusion - on Discord - was more or less that everything sucks
(bitmaps, the BIT() datatype, arrays, ...) from either the
implementation or the user side, but apparently arrays [] would suck
the least on the implementation side. So we could probably do something
like the following, with up to node_max_nodes() elements:
buffer1:..:{0, 2, 0, 0}
buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
buffer3:..:{0, 0, 0, 2}

Other idea is JSON or even simple string with numa_node_id<->count:
buffer1:..:"1=2"
buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
buffer3:..:"3=2"

I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...

-J.

Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
> > OK, so you agree the commit messages are complete / correct?
> 
> Yes.

Not 100% sure on my side. 

=== v21-0002

Says:

"
This introduces three new functions:

    - pg_buffercache_init_entries
    - pg_buffercache_build_tuple
    - get_buffercache_tuple
"

While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
is).

About v21-0002:

=== 1

I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).

About v21-0003:

=== 2

> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values.

+               if (buffers_per_page > 1)
+                       os_page_query_count = NBuffers;
+               else
+                       os_page_query_count = NBuffers * pages_per_buffer;

yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.

=== 3

+                       if (buffers_per_page > 1)
+                       {

As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:

"
if (buffers_per_page > 1) {
    /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
    for (i = 0; i < NBuffers; i++) {
.
.
.
.
} else {
    /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
    for (i = 0; i < NBuffers; i++) {
.
.
.
"

=== 4

> That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.

Yeah I think that it can be simply removed then.

=== 5

> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...

I think that pg_shm_allocations_numa() is better (given the examples you just
shared).

=== 6

> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19? Also maybe we should just focus
till v21-0003 (and discard v21-0004 for 18).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
>> OK, so you agree the commit messages are complete / correct?
> 
> Yes.
> 
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
> 
> No, it's all fine. I will probably have lots of questions about
> setting proper env for development that cares itself about style, but
> that's for another day.
> 
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
> 
> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
> 

IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
> 
> Hm, it seems we always go with suffix "_somethingnew":
> 
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
> 
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
> 

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
>> rest of the fields are for the whole buffer; it'd be wrong to duplicate
>> that for each page.
> 
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and the user needs to JOIN to
> pg_buffercache for more details like the relation oid), but then what
> was the whole refactor (0002) for if we would just return bufferId like
> below:
> 
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
> 
> There's also the problem that reading/joining could be inconsistent
> and even slower.
> 

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

>> I wonder if we should have a bitmap of nodes for the buffer (but then
>> what if there are multiple pages from the same node?), or maybe just an
>> array of nodes, with one element per page.
> 
> AFAIR this was discussed back at the end of January, and the
> conclusion - on Discord - was more or less that everything sucks
> (bitmaps, the BIT() datatype, arrays, ...) from either the
> implementation or the user side, but apparently arrays [] would suck
> the least on the implementation side. So we could probably do something
> like the following, with up to node_max_nodes() elements:
> buffer1:..:{0, 2, 0, 0}
> buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
> buffer3:..:{0, 0, 0, 2}
> 
> Other idea is JSON or even simple string with numa_node_id<->count:
> buffer1:..:"1=2"
> buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
> buffer3:..:"3=2"
> 
> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

I'm -1 on JSON, I don't see how that would solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches; it should be easy to convert between them using
unnest() and array_agg().
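
For example, assuming hypothetical columns bufferid plus an int[] column
numa_nodes (ordered by page number) on one side, and bufferid /
os_page_num / numa_node on the other:

    -- array form -> one row per OS page
    SELECT bufferid, u.ord - 1 AS os_page_num, u.numa_node
      FROM pg_buffercache_numa,
           unnest(numa_nodes) WITH ORDINALITY AS u(numa_node, ord);

    -- one row per OS page -> array form
    SELECT bufferid, array_agg(numa_node ORDER BY os_page_num) AS numa_nodes
      FROM pg_buffercache_numa
     GROUP BY bufferid;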

Attached is v22, with some minor review comments:

1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?

2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).

3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.
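
I mean something as simple as (a sketch; names are approximate):

    if (node_id < 0 || node_id > max_nodes)
        elog(ERROR, "invalid NUMA node id %d (valid range is 0..%d)",
             node_id, max_nodes);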


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
On 4/3/25 10:23, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
>> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi Tomas,
>>
>>> OK, so you agree the commit messages are complete / correct?
>>
>> Yes.
> 
> Not 100% sure on my side. 
> 
> === v21-0002
> 
> Says:
> 
> "
> This introduces three new functions:
> 
>     - pg_buffercache_init_entries
>     - pg_buffercache_build_tuple
>     - get_buffercache_tuple
> "
> 
> While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
> is).
> 

Ah, OK. Jakub, can you correct (and double-check) this in the next
version of the patch?

> About v21-0002:
> 
> === 1
> 
> I can see that the pg_buffercache_init_entries() helper comments are added in
> v21-0003 but I think that it would be better to add them in v21-0002
> (where the helper is actually created).
> 

+1 to that

> About v21-0003:
> 
> === 2
> 
>> I hear you, attached v21 / 0003 is free of float/double arithmetics
>> and uses non-float point values.
> 
> +               if (buffers_per_page > 1)
> +                       os_page_query_count = NBuffers;
> +               else
> +                       os_page_query_count = NBuffers * pages_per_buffer;
> 
> yeah, that's more elegant. I think that it properly handles the relationships
> between buffer and page sizes without relying on float arithmetic.
> 

In the review I just submitted, I changed this to

    os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);

but maybe it's less clear. Feel free to undo my change.

> === 3
> 
> +                       if (buffers_per_page > 1)
> +                       {
> 
> As buffers_per_page does not change, I think I'd put this check outside of the
> for (i = 0; i < NBuffers; i++) loop, something like:
> 
> "
> if (buffers_per_page > 1) {
>     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> .
> } else {
>     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
>     for (i = 0; i < NBuffers; i++) {
> .
> .

I don't know. It's a matter of opinion, but I find the current code
fairly understandable. Maybe if it meaningfully reduced the code
nesting, but even with the extra branch we'd still need the for loop.
I'm not against doing this differently, but I'd have to see how that
looks. Until then I think it's fine to have the code as is.

> .
> "
> 
> === 4
> 
>> That _numa_prepare_ptrs() is unused and will need to be removed,
>> but we can still move some code there if necessary.
> 
> Yeah I think that it can be simply removed then.
> 

I'm not particularly attached to having the _ptrs() function, but why
couldn't it build the os_page_ptrs array as before?

> === 5
> 
>> @Bertrand: do you have anything against pg_shm_allocations_numa
>> instead of pg_shm_numa_allocations? I don't mind changing it...
> 
> I think that pg_shm_allocations_numa() is better (given the examples you just
> shared).
> 
> === 6
> 
>> I find all of those non-user friendly and I'm afraid I won't be able
>> to pull that alone in time...
> 
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19? Also maybe we should just focus
> till v21-0003 (and discard v21-0004 for 18).
> 

IMHO it's not enough to paper this over by mentioning it in the docs.

I'm a bit confused about which patch you suggest to leave out. I think
0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
We've been discussing how to represent nodes for buffers in 0003.

I understand it'd require a bit of coding, but AFAICS adding an array
would be fairly trivial amount of code. Something like

  values[i++]
    = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));

would do, I think. But if we decide to do the 3-column view, that'd be
even simpler, I think. AFAICS it means we could mostly ditch the
pg_buffercache refactoring in patch 0002, and 0003 would get much
simpler too I think.

If Jakub doesn't have time to work on this, I can take a stab at it
later today.


regards

-- 
Tomas Vondra




Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,

Hi Bertrand,

> On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
[..]

> === v21-0002
> While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
> is).

Fixed

> About v21-0002:
>
> === 1
>
> I can see that the pg_buffercache_init_entries() helper comments are added in
> v21-0003 but I think that it would be better to add them in v21-0002
> (where the helper is actually created).

Moved

> About v21-0003:
>
> === 2
>
> > I hear you, attached v21 / 0003 is free of float/double arithmetics
> > and uses non-float point values.
>
> +               if (buffers_per_page > 1)
> +                       os_page_query_count = NBuffers;
> +               else
> +                       os_page_query_count = NBuffers * pages_per_buffer;
>
> yeah, that's more elegant. I think that it properly handles the relationships
> between buffer and page sizes without relying on float arithmetic.

Cool

> === 3
>
> +                       if (buffers_per_page > 1)
> +                       {
>
> As buffers_per_page does not change, I think I'd put this check outside of the
> for (i = 0; i < NBuffers; i++) loop, something like:
>
> "
> if (buffers_per_page > 1) {
>     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> .
> } else {
>     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> "

Done.

> === 4
>
> > That _numa_prepare_ptrs() is unused and will need to be removed,
> > but we can still move some code there if necessary.
>
> Yeah I think that it can be simply removed then.

Removed.

> === 5
>
> > @Bertrand: do you have anything against pg_shm_allocations_numa
> > instead of pg_shm_numa_allocations? I don't mind changing it...
>
> I think that pg_shm_allocations_numa() is better (given the examples you just
> shared).

OK, let's go with this name then (in v22).

> === 6
>
> > I find all of those non-user friendly and I'm afraid I won't be able
> > to pull that alone in time...
>
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19?

Right, we could also document it as a limitation. I would be happy to
leave it as-is, since it must be a rare condition, but Tomas stated he's
not.

> Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).

Do you mean discard pg_buffercache_numa (0002+0003) and instead go
with pg_shm_allocations_numa (0004) ?

BTW: I've noticed that Tomas responded with his v22 to this after I had
solved all of the above in my own v22, so I'll post v23 soon and then
let's continue there.


-J.



Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi Jakub,

On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> Right, we could also document it as a limitation. I would be happy to
> leave it as-is, since it must be a rare condition, but Tomas stated he's
> not.
> 
> > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
> 
> Do you mean discard pg_buffercache_numa (0002+0003) and instead go
> with pg_shm_allocations_numa (0004) ?

No, I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas is
confident enough to also include 0004 in addition, that's fine too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra <tomas@vondra.me> wrote:

> Ah, OK. Jakub, can you correct (and double-check) this in the next
> version of the patch?

Done.

> > About v21-0002:
> >
> > === 1
> >
> > I can see that the pg_buffercache_init_entries() helper comments are added in
> > v21-0003 but I think that it would be better to add them in v21-0002
> > (where the helper is actually created).
> >
>
> +1 to that

Done.

> > About v21-0003:
> >
> > === 2
> >
> >> I hear you, attached v21 / 0003 is free of float/double arithmetics
> >> and uses non-float point values.
> >
> > +               if (buffers_per_page > 1)
> > +                       os_page_query_count = NBuffers;
> > +               else
> > +                       os_page_query_count = NBuffers * pages_per_buffer;
> >
> > yeah, that's more elegant. I think that it properly handles the relationships
> > between buffer and page sizes without relying on float arithmetic.
> >
>
> In the review I just submitted, I changed this to
>
>     os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
>
> but maybe it's less clear. Feel free to undo my change.

Cool, thanks, will send shortly v23 with this applied.

>
> > === 3
> >
> > +                       if (buffers_per_page > 1)
> > +                       {
> >
> > As buffers_per_page does not change, I think I'd put this check outside of the
> > for (i = 0; i < NBuffers; i++) loop, something like:
> >
> > "
> > if (buffers_per_page > 1) {
> >     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
> >     for (i = 0; i < NBuffers; i++) {
> > .
> > .
> > .
> > .
> > } else {
> >     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
> >     for (i = 0; i < NBuffers; i++) {
> > .
> > .
>
> I don't know. It's a matter of opinion, but I find the current code
> fairly understandable. Maybe if it meaningfully reduced the code
> nesting, but even with the extra branch we'd still need the for loop.
> I'm not against doing this differently, but I'd have to see how that
> looks. Until then I think it's fine to have the code as is.

v23 (coming shortly) will have Bertrand's idea incorporated. No hard
feelings, but it's kind of painful to switch back and forth like that ;)

> > === 4
> >
> >> That _numa_prepare_ptrs() is unused and will need to be removed,
> >> but we can still move some code there if necessary.
> >
> > Yeah I think that it can be simply removed then.
> >
>
> I'm not particularly attached on having the _ptrs() function, but why
> couldn't it build the os_page_ptrs array as before?

I've removed it in v23; to me the code just didn't flow...

> > === 5
> >
> >> @Bertrand: do you have anything against pg_shm_allocations_numa
> >> instead of pg_shm_numa_allocations? I don't mind changing it...
> >
> > I think that pg_shm_allocations_numa() is better (given the examples you just
> > shared).

Done.

> > === 6
> >
> >> I find all of those non-user friendly and I'm afraid I won't be able
> >> to pull that alone in time...
> >
> > Maybe we could add a few words in the doc to mention the "multiple nodes per
> > buffer" case? And try to improve it for say 19? Also maybe we should just focus
> > till v21-0003 (and discard v21-0004 for 18).
> >
>
> IMHO it's not enough to paper this over by mentioning it in the docs.

OK

> I'm a bit confused about which patch you suggest to leave out. I think
> 0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
> We've been discussing how to represent nodes for buffers in 0003.

IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but
we settled on arrays, right?

> I understand it'd require a bit of coding, but AFAICS adding an array
> would be fairly trivial amount of code. Something like
>
>   values[i++]
>     = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));
>
> would do, I think. But if we decide to do the 3-column view, that'd be
> even simpler, I think. AFAICS it means we could mostly ditch the
> pg_buffercache refactoring in patch 0002, and 0003 would get much
> simpler too I think.
>
> If Jakub doesn't have time to work on this, I can take a stab at it
> later today.

I won't be able to even start on this today, so if you have cycles for
it, please do so...
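
FWIW, here is roughly how I read the construct_array_builtin() suggestion
above -- an untested sketch, with illustrative names (nodes, nodecnt,
colno, os_pages_status), just to make sure we mean the same thing:

    /* collect the NUMA node of every OS page backing buffer i */
    Datum       nodes[64];      /* plenty for any sane BLCKSZ/os_page_size */
    int         nodecnt = 0;

    for (j = 0; j < Max(1, pages_per_buffer); j++)
        nodes[nodecnt++] =
            Int32GetDatum(os_pages_status[i * Max(1, pages_per_buffer) + j]);

    /* emit the whole list as a single int4[] column */
    values[colno++] =
        PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));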

-J.



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:

> On 4/3/25 09:01, Jakub Wartak wrote:
> > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

Here's v23 attached (I had to rework it because you sent v22 just the
moment I wanted to send mine). Changes include:
- your review should be incorporated
- name change for the shm view
- pg_buffercache_numa refactored a little to keep the if outside the
loops (as Bertrand wanted initially)

So let's continue in this subthread.

> I think a view with just 3 columns would be a good solution. It's what
> pg_shmem_allocations_numa already does, so it'd be consistent with that
> part too.
>
> I'm not too worried about the cost of the extra join - it's going to be
> a couple dozen milliseconds at worst, I guess, and that's negligible in
> the bigger scheme of things (e.g. compared to how long the move_pages is
> expected to take). Also, it's not like having everything in the same
> view is free - people would have to do some sort of post-processing, and
> that has a cost too.
>
> So unless someone can demonstrate a use case where this would matter,
> I'd not worry about it too much.

OK - just 3 cols for pg_buffercache_numa is fine for me; it's just that
I don't have cycles left today and probably lack the skills (I've never
dealt with arrays so far), so it would be slow to get it right... but I
can pick up anything tomorrow morning.

> I'm -1 on JSON, I don't see how would that solve anything better than
> e.g. a regular array, and it's going to be harder to work with. So if we
> don't want to go with the 3-column view proposed earlier, I'd stick to a
> simple array. I don't think there's a huge difference between those two
> approaches, it should be easy to convert between those approaches using
> unnest() and array_agg().
>
> Attached is v22, with some minor review comments:
>
> 1) I suggested we should just use "libnuma support" in configure,
> instead of talking about "NUMA awareness support", and AFAICS you
> agreed. But I still see the old text in configure ... is that
> intentional or a bit of forgotten text?

It was my bad - too many rebases and mishaps with git voodoo...

> 2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
> and simplified a couple lines (but that's a matter of preference).

Great, thanks.

> 3) I don't think it's correct for pg_get_shmem_numa_allocations to just
> silently ignore nodes outside the valid range. I suggest we simply do
> elog(ERROR), as it's an internal error we don't expect to happen.

It's in v23 too.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
On 4/3/25 15:12, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
>> ...
>>
>> So unless someone can demonstrate a use case where this would matter,
>> I'd not worry about it too much.
> 
> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> it's just that I don't have cycles left today and probably lack skills
> (i've never dealt with arrays so far) thus it would be slow to get it
> right... but I can pick up anything tomorrow morning.
> 

OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.

It does these changes:

1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.

2) Reworks pg_buffercache_numa to return just three columns: bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.

3) It now builds an array of records, with one record per buffer/page.

4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:

   os_page_count = NBuffers * Max(1, pages_per_buffer);

and then

  for (i = 0; i < NBuffers; i++)
  {
      for (j = 0; j < Max(1, pages_per_buffer); j++)
      {
          ..
      }
  }

and everything just works fine, I think.
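
To spell that out, the loop body I have in mind is roughly the following
(just a sketch -- values/nulls/os_pages_status/tupstore are placeholder
names, the real thing is in the attached patch):

  for (i = 0; i < NBuffers; i++)
  {
      for (j = 0; j < Max(1, pages_per_buffer); j++)
      {
          int     idx = i * Max(1, pages_per_buffer) + j;

          values[0] = Int32GetDatum(i + 1);                /* bufferid */
          values[1] = Int32GetDatum(j);                    /* page_num */
          values[2] = Int32GetDatum(os_pages_status[idx]); /* node_id  */

          tuplestore_putvalues(tupstore, tupdesc, values, nulls);
      }
  }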


Opinions? I personally find this much cleaner / easier to understand.


regards

-- 
Tomas Vondra
Attachment

Re: Draft for basic NUMA observability

From
Bertrand Drouvot
Date:
Hi,

On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> On 4/3/25 15:12, Jakub Wartak wrote:
> > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> > 
> >> ...
> >>
> >> So unless someone can demonstrate a use case where this would matter,
> >> I'd not worry about it too much.
> > 
> > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > it's just that I don't have cycles left today and probably lack skills
> > (i've never dealt with arrays so far) thus it would be slow to get it
> > right... but I can pick up anything tomorrow morning.
> > 
> 
> OK, I took a stab at reworking/simplifying this the way I proposed.
> Here's v24 - needs more polishing, but hopefully enough to show what I
> had in mind.
> 
> It does these changes:
> 
> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> is not "extending" the existing one.

I think that makes sense. One would just need to join on the pg_buffercache
view to get more information about the buffer if needed.

The pg_buffercache_numa_pages() doc needs an update though as I don't think that
"+  The <function>pg_buffercache_numa_pages()</function> provides the same
information as <function>pg_buffercache_pages()</function>" is still true.

> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> page_num and node_id. page_num is a sequence starting from 0 for each
> buffer.

+1 on the idea

> 3) It now builds an array of records, with one record per buffer/page.
> 
> 4) I realized we don't really need to worry about buffers_per_page very
> much, except for logging/debugging. There's always "at least one page"
> per buffer, even if an incomplete one, so we can do this:
> 
>    os_page_count = NBuffers * Max(1, pages_per_buffer);
> 
> and then
> 
>   for (i = 0; i < NBuffers; i++)
>   {
>       for (j = 0; j < Max(1, pages_per_buffer); j++)

That's a nice simplification as we always need to take care of at least one page
per buffer.

> and everything just works fine, I think.

I think the same.

> Opinions? I personally find this much cleaner / easier to understand.

I agree that's easier to understand and that it looks correct.

A few random comments:

=== 1

It looks like firstNumaTouch is not set to false anymore.

=== 2

+               pfree(os_page_status);
+               pfree(os_page_ptrs);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 3

+       ro_volatile_var = *(uint64 *)ptr

space missing before "ptr"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> > On 4/3/25 15:12, Jakub Wartak wrote:
> > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> > >
> > >> ...
> > >>
> > >> So unless someone can demonstrate a use case where this would matter,
> > >> I'd not worry about it too much.
> > >
> > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > > it's just that I don't have cycles left today and probably lack skills
> > > (i've never dealt with arrays so far) thus it would be slow to get it
> > > right... but I can pick up anything tomorrow morning.
> > >
> >
> > OK, I took a stab at reworking/simplifying this the way I proposed.
> > Here's v24 - needs more polishing, but hopefully enough to show what I
> > had in mind.
> >
> > It does these changes:
> >
> > 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> > is not "extending" the existing one.
>
> I think that makes sense. One would just need to join on the pg_buffercache
> view to get more information about the buffer if needed.
>
> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
> information as <function>pg_buffercache_pages()</function>" is still true.
>
> > 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> > page_num and node_id. page_num is a sequence starting from 0 for each
> > buffer.
>
> +1 on the idea
>
> > 3) It now builds an array of records, with one record per buffer/page.
> >
> > 4) I realized we don't really need to worry about buffers_per_page very
> > much, except for logging/debugging. There's always "at least one page"
> > per buffer, even if an incomplete one, so we can do this:
> >
> >    os_page_count = NBuffers * Max(1, pages_per_buffer);
> >
> > and then
> >
> >   for (i = 0; i < NBuffers; i++)
> >   {
> >       for (j = 0; j < Max(1, pages_per_buffer); j++)
>
> That's a nice simplification as we always need to take care of at least one page
> per buffer.
>
> > and everything just works fine, I think.
>
> I think the same.
>
> > Opinions? I personally find this much cleaner / easier to understand.
>
> I agree that's easier to understand and that that looks correct.
>
> A few random comments:
>
> === 1
>
> It looks like that firstNumaTouch is not set to false anymore.
>
> === 2
>
> +               pfree(os_page_status);
> +               pfree(os_page_ptrs);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).
>
> === 3
>
> +       ro_volatile_var = *(uint64 *)ptr
>
> space missing before "ptr"?
>

Adding my feedback, as I noticed that Bertrand already provided a review.

Right, the code is now simple, and that Max() is brilliant. I've
attached some review findings as a .txt.

0001 100% LGTM
0002 doc fix + pgindent + Tomas, you should take Authored-by yourself
there for sure, I couldn't have pulled this off alone in time! So big thanks!
0003 fixes the elog UINT64_FORMAT for mingw32 (a little bit funny to
have NUMA on mingw32... :))

When started with interleave=all on serious hardware, I'm getting the
following (in ~5s, for s_b=64GB) from pg_buffercache_numa:

     node_id |  count
    ---------+---------
           3 | 2097152
           0 | 2097152
           2 | 2097152
           1 | 2097152

so this is a valid result (2097152 buffers per node * 4 NUMA nodes *
8192 bytes per buffer / 1024/1024/1024 = 64GB)

Also, with pgbench -i -s 20, after ~8s:
    select c.relname, n.node_id, count(*) from pg_buffercache_numa n
    join pg_buffercache b on (b.bufferid = n.bufferid)
    join pg_class c on (c.relfilenode = b.relfilenode)
    group by c.relname, n.node_id order by count(*) desc;
                        relname                    | node_id | count
    -----------------------------------------------+---------+-------
     pgbench_accounts                              |       2 |  8217
     pgbench_accounts                              |       0 |  8190
     pgbench_accounts                              |       3 |  8189
     pgbench_accounts                              |       1 |  8187
     pg_statistic                                  |       2 |    32
     pg_operator                                   |       2 |    14
     pg_depend                                     |       3 |    14
     [..]

pg_shm_allocations_numa also looks good.
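
For the record, this is the kind of sanity check I ran against it (the
column names here are my assumption, so adjust as needed):

    select numa_node, pg_size_pretty(sum(size))
    from pg_shm_allocations_numa
    group by numa_node
    order by numa_node;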

I think v24+tiny fixes is good enough to go in.

-J.

Attachment

Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:

On 4/4/25 09:35, Jakub Wartak wrote:
> On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Hi,
>>
>> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
>>> On 4/3/25 15:12, Jakub Wartak wrote:
>>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>>
>>>>> ...
>>>>>
>>>>> So unless someone can demonstrate a use case where this would matter,
>>>>> I'd not worry about it too much.
>>>>
>>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
>>>> it's just that I don't have cycles left today and probably lack skills
>>>> (i've never dealt with arrays so far) thus it would be slow to get it
>>>> right... but I can pick up anything tomorrow morning.
>>>>
>>>
>>> OK, I took a stab at reworking/simplifying this the way I proposed.
>>> Here's v24 - needs more polishing, but hopefully enough to show what I
>>> had in mind.
>>>
>>> It does these changes:
>>>
>>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
>>> is not "extending" the existing one.
>>
>> I think that makes sense. One would just need to join on the pg_buffercache
>> view to get more information about the buffer if needed.
>>
>> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
>> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
>> information as <function>pg_buffercache_pages()</function>" is still true.
>>
>>> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
>>> page_num and node_id. page_num is a sequence starting from 0 for each
>>> buffer.
>>
>> +1 on the idea
>>
>>> 3) It now builds an array of records, with one record per buffer/page.
>>>
>>> 4) I realized we don't really need to worry about buffers_per_page very
>>> much, except for logging/debugging. There's always "at least one page"
>>> per buffer, even if an incomplete one, so we can do this:
>>>
>>>    os_page_count = NBuffers * Max(1, pages_per_buffer);
>>>
>>> and then
>>>
>>>   for (i = 0; i < NBuffers; i++)
>>>   {
>>>       for (j = 0; j < Max(1, pages_per_buffer); j++)
>>
>> That's a nice simplification as we always need to take care of at least one page
>> per buffer.
>>
>>> and everything just works fine, I think.
>>
>> I think the same.
>>
>>> Opinions? I personally find this much cleaner / easier to understand.
>>
>> I agree that's easier to understand and that that looks correct.
>>
>> A few random comments:
>>
>> === 1
>>
>> It looks like that firstNumaTouch is not set to false anymore.
>>
>> === 2
>>
>> +               pfree(os_page_status);
>> +               pfree(os_page_ptrs);
>>
>> Not sure that's needed, we should be in a short-lived memory context here
>> (ExprContext or such).
>>
>> === 3
>>
>> +       ro_volatile_var = *(uint64 *)ptr
>>
>> space missing before "ptr"?
>>
> 
> +my feedback as I've noticed that Bertrand already provided a review.
> 
> Right, the code is now simple , and that Max() is brilliant. I've
> attached some review findings as .txt
> 
> 0001 100%LGTM
> 0002 doc fix + pgident + Tomas, you should take Authored-by yourself
> there for sure, I couldn't pull this off alone in time! So big thanks!
> 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have
> NUMA on ming32...:))
> 

OK

> When started with interleave=all on serious hardware, I'm getting (~5s
> for s_b=64GB) from pg_buffercache_numa
> 
>      node_id |  count
>     ---------+---------
>            3 | 2097152
>            0 | 2097152
>            2 | 2097152
>            1 | 2097152
> 
> so this is valid result (2097152*4 numa nodes*8192 buffer
> size/1024/1024/1024 = 64GB)
> 
> Also with pgbench -i -s 20 , after ~8s:
>     select c.relname, n.node_id, count(*) from pg_buffercache_numa n
>     join pg_buffercache b on (b.bufferid = n.bufferid)
>     join pg_class c on (c.relfilenode = b.relfilenode)
>     group by c.relname, n.node_id order by count(*) desc;
>                         relname                    | node_id | count
>     -----------------------------------------------+---------+-------
>      pgbench_accounts                              |       2 |  8217
>      pgbench_accounts                              |       0 |  8190
>      pgbench_accounts                              |       3 |  8189
>      pgbench_accounts                              |       1 |  8187
>      pg_statistic                                  |       2 |    32
>      pg_operator                                   |       2 |    14
>      pg_depend                                     |       3 |    14
>      [..]
> 
> pg_shm_allocations_numa also looks good.
> 
> I think v24+tiny fixes is good enough to go in.
> 

OK.

Do you have any suggestions regarding the column names in the new view?
I'm not sure I like node_id and page_num.


regards

-- 
Tomas Vondra




Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
On 4/4/25 08:50, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
>> On 4/3/25 15:12, Jakub Wartak wrote:
>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>>> ...
>>>>
>>>> So unless someone can demonstrate a use case where this would matter,
>>>> I'd not worry about it too much.
>>>
>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
>>> it's just that I don't have cycles left today and probably lack skills
>>> (i've never dealt with arrays so far) thus it would be slow to get it
>>> right... but I can pick up anything tomorrow morning.
>>>
>>
>> OK, I took a stab at reworking/simplifying this the way I proposed.
>> Here's v24 - needs more polishing, but hopefully enough to show what I
>> had in mind.
>>
>> It does these changes:
>>
>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
>> is not "extending" the existing one.
> 
> I think that makes sense. One would just need to join on the pg_buffercache
> view to get more information about the buffer if needed.
> 
> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
> information as <function>pg_buffercache_pages()</function>" is still true.
> 

Right, thanks for checking the docs.

>> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
>> page_num and node_id. page_num is a sequence starting from 0 for each
>> buffer.
> 
> +1 on the idea
> 
>> 3) It now builds an array of records, with one record per buffer/page.
>>
>> 4) I realized we don't really need to worry about buffers_per_page very
>> much, except for logging/debugging. There's always "at least one page"
>> per buffer, even if an incomplete one, so we can do this:
>>
>>    os_page_count = NBuffers * Max(1, pages_per_buffer);
>>
>> and then
>>
>>   for (i = 0; i < NBuffers; i++)
>>   {
>>       for (j = 0; j < Max(1, pages_per_buffer); j++)
> 
> That's a nice simplification as we always need to take care of at least one page
> per buffer.
> 

OK. I think I'll consider moving some of this code "building" the
entries into a separate function, to keep the main function easier to
understand.

>> and everything just works fine, I think.
> 
> I think the same.
> 
>> Opinions? I personally find this much cleaner / easier to understand.
> 
> I agree that's easier to understand and that that looks correct.
> 
> A few random comments:
> 
> === 1
> 
> It looks like that firstNumaTouch is not set to false anymore.
> 

Damn, my mistake.

> === 2
> 
> +               pfree(os_page_status);
> +               pfree(os_page_ptrs);
> 
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).
> 

Yeah, maybe. It's not allocated in the multi-call context, but I wasn't
sure. Will check.

> === 3
> 
> +       ro_volatile_var = *(uint64 *)ptr
> 
> space missing before "ptr"?
> 

Interesting that pgindent didn't tweak this.


regards

-- 
Tomas Vondra




Re: Draft for basic NUMA observability

From
Jakub Wartak
Date:
On Fri, Apr 4, 2025 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> Do you have any suggestions regarding the column names in the new view?
> I'm not sure I like node_id and page_num.

They actually look good to me. We've discussed earlier dropping
s/numa_//g from the column names (after all, the view names contain it
already), so they are fine in this regard.
There's also the question of consistency with (bufferid, page_num,
node_id) -- maybe we should just drop the "_" and that's it?
Well, I would even possibly consider page_num -> ospagenumber, but
that's ugly.

-J.



Re: Draft for basic NUMA observability

From
Tomas Vondra
Date:
OK,

here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc. I think 0001 and 0002 are fine; I have a
couple of minor questions about 0003.

0002
----
- Adds the new types to typedefs.list, to make pgindent happy.
- Improves comment for pg_buffercache_numa_pages
- Minor formatting tweaks.

- I was wondering if maybe we should have some "global ID" of the memory
page, so that with large memory pages it's clear which buffers are on
the same memory page. Right now page_num starts from 0 for each buffer,
but it should not be very hard to have a global counter. Opinions?
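
Something along these lines is what I mean, i.e. deriving the page number
from the buffer's offset within shared buffers instead of restarting at 0
for each buffer (sketch only; startptr here stands for the page-aligned
start of the buffer pool):

   char   *buffptr = (char *) BufferGetBlock(i + 1);
   int64   global_page_num
       = ((buffptr - startptr) + (int64) j * os_page_size) / os_page_size;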


0003
----
- Minor formatting tweaks, comment improvements.
- Isn't this comment a bit confusing / misleading?

    /* Get number of OS aligned pages */

AFAICS the point is to adjust the allocated_size to be a multiple of
os_page_size, to get "all" memory pages the segment uses. But that's not
what I understand by "aligned page" (which is about where the page is
expected to start). Or did I get this wrong? See the sketch below for
how I read it.

- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment, but I'm
also not quite sure I understand what "output shared memory" is ...
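
Re the "aligned pages" comment above, what I think the code is doing is
roughly this (sketch, names approximate; TYPEALIGN rounds up to a
multiple of the alignment value):

   /*
    * Round the segment size up to a whole number of OS pages, so that
    * move_pages() is asked about every page the segment touches.
    */
   Size        total_len = TYPEALIGN(os_page_size, allocated_size);
   int         page_count = total_len / os_page_size;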


regards

-- 
Tomas Vondra

Attachment