Thread: Re: Draft for basic NUMA observability

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

13 February, 18:28:47

Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time
> ago, I'm attaching the very brief (and potentially way too rushed)
> draft of the first step into NUMA observability on PostgreSQL that was
> based on his presentation [0]. It might be rough, but it is to get us
> started. The patches were not really even basically tested, they are
> more like input for discussion - rather than solid code - to shake out
> what should be the proper form of this.
> 
> Right now it gives:
> 
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by
> numa_zone_id;
> NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch!

Not doing a code review but sharing some experimentation.

First, I had to:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
                Size            os_page_size;
                void            **os_page_ptrs;
                int                     *os_pages_status;
-               int                     os_page_count;
+               uint64          os_page_count;

and

-               os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;

to make it work with non tiny shared_buffers.

Observations:

when using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).

Let me show the results:

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580

In session 2:

postgres=#  select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

                             ^
postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session's 1 pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

Results show:

Correct NUMA distribution in session 1
Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
with the pid of session 1 as a parameter (the session that actually accessed the buffers)

Which makes me wondering if using numa_move_pages()/move_pages is the
right approach. Would be curious to know if you observe the same behavior though.

The initial idea that you shared on discord was to use get_mempolicy() but
as Andres stated:

"
One annoying thing about get_mempolicy() is this:

If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as  if  the  thread
       had performed a read (load) access to that address, and return the ID of the node where that page was
allocated.

Forcing the allocation to happen inside a monitoring function is decidedly not great.
"

The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers i.e 
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

20 February, 11:32:47

Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> Hi Bertrand,
> 
> Thanks for playing with this!
> 
> > Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you
observethe same behavior though.

> 
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> 
> We probably would need to split it to some separate and new view
> within the pg_buffercache extension, but that is going to be slow, yet
> still provide valid results.

Yup.

> In the previous approach that
> get_mempolicy() was allocating on 1st access, but it was slow not only
> because it was allocating but also because it was just 1 syscall per
> 1x addr (yikes!). I somehow struggle to imagine how e.g. scanning
> (really allocating) a 128GB buffer cache in future won't cause issues
> - that's like 16-17mln (* 2) syscalls to be issued when not using
> move_pages(2)

Yeah, get_mempolicy() not working on a range is not great.

> > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state &
BM_TAG_VALID)),thoughts?

> 
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea
> that came to my mind is to issue move_pages() from the backend that
> has already used all of those pages. That literally mean on of the
> below ideas:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages().  E.g. this last options is just two lines:
> 
> if(os_page_ptrs[blk2page+j] == 0) {
> +    volatile uint64 touch pg_attribute_unused();
>     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> (os_page_size*j);
> +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> }
> 
> and it seems to work while still issuing much less syscalls with
> move_pages() across backends, well at least here.

One of the main issue I see with 1. and 2. is that we would not get accurate
results should the kernel decides to migrate the pages. Indeed, the process doing
the move_pages() call needs to have accessed the pages more recently than any
kernel migrations to see accurate locations.

OTOH, one of the main issue that I see with 3. is that the monitoring could
probably influence the kernel's decision to start pages migration (I'm not 100%
sure but I could imagine it may influence the kernel's decision due to having to
read/touch the pages). 

But I'm thinking: do we really need to know the page location of every single page?
I think what we want to see is if the pages are "equally" distributed on all
the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
or contiguous chunks). We could create a new function that would accept a
"sampling distance" as parameter for example, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

24 February, 14:57:16

Hi Bertrand,

TL;DR; the main problem seems choosing which way to page-fault the
shared memory before the backend is going to use numa_move_pages() as
the memory mappings (fresh after fork()/CoW) seem to be not ready for
numa_move_pages() inquiry.

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > We probably would need to split it to some separate and new view
> > within the pg_buffercache extension, but that is going to be slow, yet
> > still provide valid results.
>
> Yup.

OK so I've made that NUMA inquiry (now with that "volatile touch" to
get valid results for not used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem that somebody would
automatically run into this slow path when using pg_buffercache.

> > > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state &
BM_TAG_VALID)),thoughts? 
> >
> > Different perspective: I wanted to use the same approach in the new
> > pg_shmemallocations_numa, but that won't cut it there. The other idea
> > that came to my mind is to issue move_pages() from the backend that
> > has already used all of those pages. That literally mean on of the
> > below ideas:
> > 1. from somewhere like checkpointer / bgwriter?
> > 2. add touching memory on backend startup like always (sic!)
> > 3. or just attempt to read/touch memory addr just before calling
> > move_pages().  E.g. this last options is just two lines:
> >
> > if(os_page_ptrs[blk2page+j] == 0) {
> > +    volatile uint64 touch pg_attribute_unused();
> >     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
> > (os_page_size*j);
> > +    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > }
> >
> > and it seems to work while still issuing much less syscalls with
> > move_pages() across backends, well at least here.
>
> One of the main issue I see with 1. and 2. is that we would not get accurate
> results should the kernel decides to migrate the pages. Indeed, the process doing
> the move_pages() call needs to have accessed the pages more recently than any
> kernel migrations to see accurate locations.

We never get fully accurate state as the zone memory migration might
be happening as we query it, but in theory we could add something to
e.g. checkpointer/bgwriter that would inquiry it on demand and report
it back somewhat through shared memory (?), but I'm somehow afraid
because as stated at the end of email, it might take some time (well
we probably wouldn't need to "touch memory" then after all, as all of
it is active), but that's still impact to those bgworkers. Somehow I
feel safer if that code is NOT part of bgworker.

> OTOH, one of the main issue that I see with 3. is that the monitoring could
> probably influence the kernel's decision to start pages migration (I'm not 100%
> sure but I could imagine it may influence the kernel's decision due to having to
> read/touch the pages).
>
> But I'm thinking: do we really need to know the page location of every single page?
> I think what we want to see is if the pages are "equally" distributed on all
> the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
> using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
> or contiguous chunks). We could create a new function that would accept a
> "sampling distance" as parameter for example, thoughts?

The way I envision it (and I think what Andres wanted, not sure, still
yet to see him comment on all of this) is to give PG devs a way to
quickly spot NUMA imbalances, even for single relation. Probably some
DBA in the wild could also query it to see how PG/kernel distributes
memory from time to time. It seems to be more debugging and coding aid
for future NUMA optimizations, rather than being used by some
monitoring constantly. I would even dare to say it would require
--enable-debug (or some other developer-only toggle), but apparently
there's no need to hide it like that if those are separate views.

Changes since previous version:
0. rebase due the recent OAuth commit introducing libcurl
1. cast uint64 for NBuffers as You found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations Or should we rather call it
pg_shmem_numa_zones or maybe just pg_shm_numa ?

If there would be agreement that this is the way we want to have it
(from the backend and not from checkpointer), here's what's left on
the table to be done here:
a. isn't there something quicker for touching / page-faulting memory ?
If not then maybe add CHECKS_FOR_INTERRUPTS() there? BTW I've tried
additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't help (it
probably only works for parent//postmaster). I've also tried
MADV_POPULATE_READ (5.14+ kernels only) and that seems to work too:

+       rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+       if(rc != 0) {
+               elog(NOTICE, "madvice() failed");
+       }
[..]
-                       volatile uint64 touch pg_attribute_unused();
                        os_page_ptrs[i] = (char *)ent->location + (i *
os_page_size);
-                       touch = *(uint64 *)os_page_ptrs[i];

with volatile touching memory or MADV_POPULATE_READ the result seems
to reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from
pg_shmem_numa_allocations  order by numa_size desc;
                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 | 134221824
 XLOG Ctl                                       |            0 |   4206592
 Buffer Descriptors                             |            0 |   1048576
 transaction                                    |            0 |    528384
 Checkpointer Data                              |            0 |    524288
 Checkpoint BufferIds                           |            0 |    327680
 Shared Memory Stats                            |            0 |    311296
[..]

without at least one of those two, new backend reports complete garbage:

                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 |    995328
 Shared Memory Stats                            |            0 |    245760
 shmInvalBuffer                                 |            0 |     65536
 Buffer Descriptors                             |            0 |     65536
 Backend Status Array                           |            0 |     61440
 serializable                                   |            0 |     57344
[..]

b. refactor shared code so that it goes into src/port (but with
Linux-only support so far)
c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy

As for the sampling, dunno, fine for me. As an optional argument? but
wouldn't it be better to find a way to actually for it to be quick?

OK, so here's larger test, on 512GB with 8x NUMA nodes and s_b set to
128GB with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 Buffer Blocks                                  |            0 | 17179869184
 Buffer Blocks                                  |            1 | 17179869184
 Buffer Blocks                                  |            2 | 17179869184
 Buffer Blocks                                  |            3 | 17179869184
 Buffer Blocks                                  |            4 | 17179869184
 Buffer Blocks                                  |            5 | 17179869184
 Buffer Blocks                                  |            6 | 17179869184
 Buffer Blocks                                  |            7 | 17179869184
 Buffer IO Condition Variables                  |            0 |    33554432
 Buffer IO Condition Variables                  |            1 |    33554432
 Buffer IO Condition Variables                  |            2 |    33554432
[..]

but it takes 23s. Yes it takes 23s to just gather that info with
memory touch, but that's ~128GB of memory and is hardly responsible
(lack of C_F_I()). By default without numactl's interleave=all, you
get clear picture of lack of NUMA awareness in PG shared segment (just
as Andres presented, but now it is evident; well it is subject to
autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 commit_timestamp                               |            0 |     2097152
 commit_timestamp                               |            1 |     6291456
 commit_timestamp                               |            2 |           0
 commit_timestamp                               |            3 |           0
 commit_timestamp                               |            4 |           0
[..]
 transaction                                    |            0 |    14680064
 transaction                                    |            1 |           0
 transaction                                    |            2 |           0
 transaction                                    |            3 |           0
 transaction                                    |            4 |     2097152
[..]

Somehow without interleave it is very quick too.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

24 February, 17:06:20

Hi,

On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
> On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > OTOH, one of the main issue that I see with 3. is that the monitoring could
> > probably influence the kernel's decision to start pages migration (I'm not 100%
> > sure but I could imagine it may influence the kernel's decision due to having to
> > read/touch the pages).
> >
> > But I'm thinking: do we really need to know the page location of every single page?
> > I think what we want to see is if the pages are "equally" distributed on all
> > the nodes or are somehow "stuck" to one (or more) nodes. In that case what about
> > using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer
> > or contiguous chunks). We could create a new function that would accept a
> > "sampling distance" as parameter for example, thoughts?
> 
> The way I envision it (and I think what Andres wanted, not sure, still
> yet to see him comment on all of this) is to give PG devs a way to
> quickly spot NUMA imbalances, even for single relation.

Yea. E.g. for some benchmark workloads the difference whether the root btree
page is on the same NUMA node as the workload or not makes a roughly 2x perf
difference. It's really hard to determine that today.


> If there would be agreement that this is the way we want to have it
> (from the backend and not from checkpointer), here's what's left on
> the table to be done here:

> a. isn't there something quicker for touching / page-faulting memory ?

If you actually fault in a page the kernel actually has to allocate memory and
then zero it out. That rather severely limits the throughput...


> If not then maybe add CHECKS_FOR_INTERRUPTS() there?

Should definitely be there.


> BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> help (it probably only works for parent//postmaster).

Yes, needs to be in postmaster.

Does the issue with "new" backends seeing pages as not present exist both with
and without huge pages?


FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.




> b. refactor shared code so that it goes into src/port (but with
> Linux-only support so far)
> c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?

You mean a specific context instead of CurrentMemoryContext?


> diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> index 91b51142d2e..e3b7554d9e8 100644
> --- a/.cirrus.tasks.yml
> +++ b/.cirrus.tasks.yml
> @@ -436,6 +436,10 @@ task:
>          SANITIZER_FLAGS: -fsanitize=address
>          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
>  
> +
> +      # FIXME: use or not the libnuma?
> +      #      --with-libnuma \
> +      #
>        # Normally, the "relation segment" code basically has no coverage in our
>        # tests, because we (quite reasonably) don't generate tables large
>        # enough in tests. We've had plenty bugs that we didn't notice due
> the

I don't see why not.


> diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> new file mode 100644
> index 00000000000..e5b3d1f7dd2
> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,30 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the function.
> +DROP FUNCTION pg_buffercache_pages() CASCADE;

Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might turn in depend on pg_buffercache.  I think
CASCADE is rarely, if ever, ok to use in an extension scripot.


> +CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
> +RETURNS SETOF RECORD
> +AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
> +LANGUAGE C PARALLEL SAFE;
>
> +-- Create a view for convenient access.
> +CREATE OR REPLACE VIEW pg_buffercache AS
> +    SELECT P.* FROM pg_buffercache_pages(false) AS P
> +    (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +     relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +     pinning_backends int4);
> +
> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
> +    SELECT P.* FROM pg_buffercache_pages(true) AS P
> +    (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +     relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +     pinning_backends int4, numa_zone_id int4);
> +
> +-- Don't want these to be available to public.
> +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;

We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
for _numa?


> @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
>              else
>                  fctx->record[i].isvalid = false;
>  
> +#ifdef USE_LIBNUMA
> +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> +#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +            blk2page = (int) i * pages_per_blk;

BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.



> +            j = 0;
> +            do {
> +                /*
> +                 * Many buffers can point to the same page, but we want to
> +                 * query just first address.
> +                 *
> +                 * In order to get reliable results we also need to touch memory pages
> +                 * so that inquiry about NUMA zone doesn't return -2.
> +                 */
> +                if(os_page_ptrs[blk2page+j] == 0) {
> +                    volatile uint64 touch pg_attribute_unused();
> +                    os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> +                    touch = *(uint64 *)os_page_ptrs[blk2page+j];
> +                }
> +                j++;
> +            } while(j < (int)pages_per_blk);
> +#endif
> +

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.


> +#ifdef USE_LIBNUMA
> +        if(query_numa) {
> +            /* According to numa(3) it is required to initialize library even if that's no-op. */
> +            if(numa_available() == -1) {
> +                pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> +                elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> +            } else {
> +                /* Amortize the number of pages we need to query about */
> +                if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> +                    elog(ERROR, "failed NUMA pages inquiry status");
> +                }

I wonder if we ought to override numa_error() so we can display more useful
errors.


> +
> +    LWLockAcquire(ShmemIndexLock, LW_SHARED);

Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.


> +        os_page_ptrs = palloc(sizeof(void *) * os_page_count);
> +        os_pages_status = palloc(sizeof(int) * os_page_count);

Why do this in very loop iteration?


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

24 February, 19:11:15

Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

That's a good point and from what I can see it's correct with huge pages being
used (it means all processes see the same NUMA node assignment regardless of
access patterns).

That said, wouldn't that be too strong to impose a restriction that huge_pages
must be enabled?

Jakub, thanks for the new patch version! FWIW, I did not look closely to the
code yet (just did the minor changes already shared to have valid result with non
tiny shared buffer size). I'll look closely at the code for sure once we all agree
on the design part of it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

26 February, 11:38:20

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:

> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK first sane version attached with new src/port/pg_numa.c boilerplate
in 0001. Fixed some bugs too, there is one remaining optimization to
be done (see that `static` question later). Docs/tests are still
missing.

QQ: I'm still wondering if we there's better way of exposing  multiple
pg's shma entries pointing to the same page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via additional column
(pg_get_shmem_numa_allocations.shared_pages bool?) ? I'm thinking of
an easy way of showing that a potential NUMA auto balancing could lead
to TLB NUMA shootdowns (not that it is happening or counting , just
identifying it as a problem in allocation). Or that stuff doesn't make
sense as we already have pg_shm_allocations.{off|size} and we could
derive such info from it (after all it is for devs?)?

postgres@postgres:1234 : 18843 # select
name,off,off+allocated_size,allocated_size from pg_shmem_allocations
order by off;
                      name                      |    off    | ?column?
 | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 |
147114240 |            128
 Proc Array                                     | 147274752 |
147275392 |            640
 KnownAssignedXids                              | 147275392 |
147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 |
147319808 |           8960
 Backend Status Array                           | 147319808 |
147381248 |          61440

postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. ProcArray ends and right afterwards KnownAssignedXids start, both
are hot , but on the same HP and NUMA node

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, no comments about that madvise(MADV_POPULATE_READ), so I'm
sticking to pointers.

> > If not then maybe add CHECKS_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see attached file for more verbose results, but in short it is
like below:

patch(-touchpages)               hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on        INVALID RESULTS (-2)
patch(touchpages)                hugepages=off        CORRECT RESULT
patch(touchpages)                hugepages=on        CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off        INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on        INVALID RESULTS (-2)

IMHVO, the only other thing that could work here (but still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome, maybe it might be kernel version dependent.

BTW: and yes you can "feel" the timing impact of
MAP_SHARED|MAP_POPULATE during startup, it seems that for our case
child backends that don't come-up with pre-faulted page attachments
across fork() apparently.

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >          SANITIZER_FLAGS: -fsanitize=address
> >          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +      # FIXME: use or not the libnuma?
> > +      #      --with-libnuma \
> > +      #
> >        # Normally, the "relation segment" code basically has no coverage in our
> >        # tests, because we (quite reasonably) don't generate tables large
> >        # enough in tests. We've had plenty bugs that we didn't notice due
> > the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might turn in depend on pg_buffercache.  I think
> CASCADE is rarely, if ever, ok to use in an extension scripot.

... it's just me cutting corners :^), fixed now.

[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
> for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >                       else
> >                               fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr)        ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +                     blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks I was looking for something like this! Is that +1 in v4 good?

> > +                     j = 0;
> > +                     do {
> > +                             /*
> > +                              * Many buffers can point to the same page, but we want to
> > +                              * query just first address.
> > +                              *
> > +                              * In order to get reliable results we also need to touch memory pages
> > +                              * so that inquiry about NUMA zone doesn't return -2.
> > +                              */
> > +                             if(os_page_ptrs[blk2page+j] == 0) {
> > +                                     volatile uint64 touch pg_attribute_unused();
> > +                                     os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +                                     touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +                             }
> > +                             j++;
> > +                     } while(j < (int)pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with
`has_this_run_earlier` and just perform this work only once during the
first time?

> > +#ifdef USE_LIBNUMA
> > +             if(query_numa) {
> > +                     /* According to numa(3) it is required to initialize library even if that's no-op. */
> > +                     if(numa_available() == -1) {
> > +                             pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                             elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> > +                     } else {
> > +                             /* Amortize the number of pages we need to query about */
> > +                             if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                                     elog(ERROR, "failed NUMA pages inquiry status");
> > +                             }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine...

[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in very loop iteration?

Both fixed.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

26 February, 11:48:41

On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> > Does the issue with "new" backends seeing pages as not present exist both with
> > and without huge pages?
>
> That's a good point and from what I can see it's correct with huge pages being
> used (it means all processes see the same NUMA node assignment regardless of
> access patterns).

Hi Bertrand , please see that nearby thread. I've got quite the
opposite results. I need page-fault memory or I get invalid results
("-2"). What kernel version are you using ? (I've tried it on two
6.10.x series kernels , virtualized in both cases; one was EPYC [real
NUMA, but not VM so not real hardware]).

> That said, wouldn't that be too strong to impose a restriction that huge_pages
> must be enabled?
>
> Jakub, thanks for the new patch version! FWIW, I did not look closely to the
> code yet (just did the minor changes already shared to have valid result with non
> tiny shared buffer size). I'll look closely at the code for sure once we all agree
> on the design part of it.

Cool, I think we are pretty close actually, but others might have
different perspective.

-J.

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

26 February, 12:58:28

Hi,

On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > FWIW, what you posted fails on CI:
> > https://cirrus-ci.com/task/5114213770723328
> >
> > Probably some ifdefs are missing. The sanity-check task configures with
> > minimal dependencies, which is why you're seeing this even on linux.
> 
> Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> of this CI stuff...

FYI, you can enable CI on a github repo, to see results without posting to the
list:
https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

26 February, 16:05:59

On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > > FWIW, what you posted fails on CI:
> > > https://cirrus-ci.com/task/5114213770723328
> > >
> > > Probably some ifdefs are missing. The sanity-check task configures with
> > > minimal dependencies, which is why you're seeing this even on linux.
> >
> > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> > of this CI stuff...
>
> FYI, you can enable CI on a github repo, to see results without posting to the
> list:
> https://github.com/postgres/postgres/blob/master/src/tools/ci/README

Thanks, I'll take a look into it.

Meanwhile v5 is attached with slight changes to try to make cfbot happy:
1. fixed tests and added tiny copy-cat basic tests for
pg_buffercache_numa and pg_shm_numa_allocations views
2. win32 doesn't have sysconf()

No docs yet.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

26 February, 20:13:19

Hi,

On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote:
> On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote:
> > > > FWIW, what you posted fails on CI:
> > > > https://cirrus-ci.com/task/5114213770723328
> > > >
> > > > Probably some ifdefs are missing. The sanity-check task configures with
> > > > minimal dependencies, which is why you're seeing this even on linux.
> > >
> > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all
> > > of this CI stuff...
> >
> > FYI, you can enable CI on a github repo, to see results without posting to the
> > list:
> > https://github.com/postgres/postgres/blob/master/src/tools/ci/README
> 
> Thanks, I'll take a look into it.
> 
> Meanwhile v5 is attached with slight changes to try to make cfbot happy:

Thanks for the updated version!

FWIW, I had to do a few changes to get an error free compiling experience with
autoconf/or meson and both with or without the libnuma configure option.

Sharing here as .txt files:

v5-0004-configure-changes.txt: changes in configure + add a test on numa.h
availability and a call to numa_available.

v5-0005-pg_numa.c-changes.txt: moving the <unistd.h> outside of USE_LIBNUMA
because the file is still using sysconf() in the non-NUMA code path. Also,
removed a ";" in "#endif;" in the non-NUMA code path.

v5-0006-meson.build-changes.txt.

Those apply on top of your v5.

Also the pg_buffercache test fails without the libnuma configure option. Maybe
some tests should depend of the libnuma configure option.

Still did not look closely to the code.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

27 February, 12:00:53

Hi Jakub,

On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote:
> On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> > > Does the issue with "new" backends seeing pages as not present exist both with
> > > and without huge pages?
> >
> > That's a good point and from what I can see it's correct with huge pages being
> > used (it means all processes see the same NUMA node assignment regardless of
> > access patterns).
> 
> Hi Bertrand , please see that nearby thread. I've got quite the
> opposite results. I need page-fault memory or I get invalid results
> ("-2"). What kernel version are you using ? (I've tried it on two
> 6.10.x series kernels , virtualized in both cases; one was EPYC [real
> NUMA, but not VM so not real hardware])

Thanks for sharing your numbers!

It looks like that with hp enabled then the shared_buffers plays a role.

1. With hp, shared_buffers 4GB:

 huge_pages_status
-------------------
 on
(1 row)

 shared_buffers
----------------
 4GB
(1 row)

NOTICE:  os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id | count
--------------+--------
              | 507618
            0 |   1054
           -2 |  15616
(3 rows)

2. With hp, shared_buffers 23GB:

 huge_pages_status
-------------------
 on
(1 row)

 shared_buffers
----------------
 23GB
(1 row)

NOTICE:  os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id |  count
--------------+---------
              | 2997974
            0 |   16682
(2 rows)

3. no hp, shared_buffers 23GB:

 huge_pages_status
-------------------
 off
(1 row)

 shared_buffers
----------------
 23GB
(1 row)

ERROR:  extension "pg_buffercache" already exists
NOTICE:  os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 2997975
           -2 |   16482
            1 |     199
(3 rows)

Maybe the kernel is taking some decisions based on the HugePages_Rsvd, I've
no ideas. Anyway there is little than we can do and the "touchpages" patch
seems to provide "accurate" results.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

27 February, 12:05:46

On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
[..]
> > Meanwhile v5 is attached with slight changes to try to make cfbot happy:
>
> Thanks for the updated version!
>
> FWIW, I had to do a few changes to get an error free compiling experience with
> autoconf/or meson and both with or without the libnuma configure option.
>
> Sharing here as .txt files:
>
> Also the pg_buffercache test fails without the libnuma configure option. Maybe
> some tests should depend of the libnuma configure option.
[..]

Thank you so much for this Bertrand !

I've applied those , played a little bit with configure and meson and
reproduced the test error and fixed it by silencing that NOTICE in
tests. So v6 is attached even before I get a chance to start using
that CI. Still waiting for some input and tests regarding that earlier
touchpages attempt, docs are still missing...

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

27 February, 18:34:25

Hi,

On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote:
> On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> [..]
> > > Meanwhile v5 is attached with slight changes to try to make cfbot happy:
> >
> > Thanks for the updated version!
> >
> > FWIW, I had to do a few changes to get an error free compiling experience with
> > autoconf/or meson and both with or without the libnuma configure option.
> >
> > Sharing here as .txt files:
> >
> > Also the pg_buffercache test fails without the libnuma configure option. Maybe
> > some tests should depend of the libnuma configure option.
> [..]
> 
> Thank you so much for this Bertrand !
> 
> I've applied those , played a little bit with configure and meson and
> reproduced the test error and fixed it by silencing that NOTICE in
> tests. So v6 is attached even before I get a chance to start using
> that CI. Still waiting for some input and tests regarding that earlier
> touchpages attempt, docs are still missing...

Thanks for the new version!

I did some tests and it looks like it's giving correct results. I don't see -2
anymore and every backend reports correct distribution (with or without hp,
with "small" or "large" shared buffer).

A few random comments:

=== 1

+               /*
+                * This is for gathering some NUMA statistics. We might be using
+                * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
+                * allocated in various different OS memory pages sizes, so first we
+                * need to understand the OS memory page size before calling
+                * move_pages()
+                */
+               os_page_size = pg_numa_get_pagesize();
+               os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;
+               pages_per_blk = (float) BLCKSZ / os_page_size;
+
+               elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f",
+                        os_page_count, os_page_size, pages_per_blk);
+
+               os_page_ptrs = palloc(sizeof(void *) * os_page_count);
+               os_pages_status = palloc(sizeof(int) * os_page_count);
+               memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
+
+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here
+                */
+               memset(os_pages_status, 0xff, sizeof(int) * os_page_count);

I think that if (query_numa) check should also wrap that entire section of code.

=== 2

+                       if (query_numa)
+                       { 
+                          blk2page = (int) i * pages_per_blk;
+                          j = 0;
+                          do
+                          {

This check is done for every page. I wonder if it would not make sense
to create a brand new function for pg_buffercache_numa and just let the
current pg_buffercache_pages() as it is. That said it would be great to avoid 
code duplication as much a possible though, maybe using a shared 
populate_buffercache_entry() or such helper function?

=== 3

+#define ONE_GIGABYTE 1024*1024*1024
+                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
+                                                       CHECK_FOR_INTERRUPTS();
+                                       }

Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
for every page instead? (I don't see with a 30GB shared buffer). I've the
feeling that we could get rid of the "ONE_GIGABYTE" check.

=== 4

+               pfree(os_page_ptrs);
+               pfree(os_pages_status);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 5

+/* SQL SRF showing NUMA zones for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{

That's a good idea.

+               for (i = 0; i < shm_ent_page_count; i++)
+               {
+                       /*
+                        * In order to get reliable results we also need to touch memory
+                        * pages so that inquiry about NUMA zone doesn't return -2.
+                        */
+                       volatile uint64 touch pg_attribute_unused();
+
+                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);

That sounds right.

Could we also avoid some code duplication with pg_get_shmem_allocations()?

Also same remarks about pfree() and ONE_GIGABYTE as above.

A few other things:

==== 6

+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "port/pg_numa.h"

Not at the right position, should be between those 2:

#include "miscadmin.h"
#include "storage/lwlock.h"

==== 7

+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *       Miscellaneous functions for bit-wise operations.

description is not correct. Also the "Copyright (c) 2019-2025" might be
"Copyright (c) 2025" instead.

=== 8

+++ b/src/port/pg_numa.c
@@ -0,0 +1,150 @@
+/*-------------------------------------------------------------------------
+ *
+ * numa.c
+ *             Basic NUMA portability routines

s/numa.c/pg_numa.c/ ?

=== 9

+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -6,6 +6,7 @@
  *       contrib/pg_buffercache/pg_buffercache_pages.c
  *-------------------------------------------------------------------------
  */
+#include "pg_config.h"
 #include "postgres.h"

Is this new include needed?

 #include "access/htup_details.h"
@@ -13,10 +14,12 @@
 #include "funcapi.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"

not in the right order.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

27 February, 18:42:45

Hi,

On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
> On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Why is this done before we even have gotten -2 back? Even if we need it, it
> > seems like we ought to defer this until necessary.
> 
> Not fixed yet: maybe we could even do a `static` with
> `has_this_run_earlier` and just perform this work only once during the
> first time?

Not sure I get your idea, could you share what the code would look like?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

04 March, 13:48:31

Hi!

On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> I did some tests and it looks like it's giving correct results. I don't see -2
> anymore and every backend reports correct distribution (with or without hp,
> with "small" or "large" shared buffer).

Cool! Attached is v7 that is fully green on cirrus CI that Andres
recommended, we will see how cfbot reacts to this. BTW docs are still
missing. When started with proper interleave=all with s_b=64GB,
hugepages and 4 NUMA nodes (1socket with 4 CCDs) after small pgbench:

postgres=# select buffers_used, buffers_unused from pg_buffercache_summary();
 buffers_used | buffers_unused
--------------+----------------
       170853 |        8217755
(
ostgres=# select numa_zone_id, count(*) from pg_buffercache_numa group
by numa_zone_id order by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906
 numa_zone_id |  count
--------------+---------
            0 |   42752
            1 |   42752
            2 |   42752
            3 |   42597
              | 8217755
Time: 5828.100 ms (00:05.828)
postgres=# select 3*42752+42597;
 ?column?
----------
   170853

postgres=# select * from pg_shmem_numa_allocations order by numa_size
desc limit 12;
DEBUG:  NUMA: page-faulting shared memory segments for proper NUMA readouts
        name        | numa_zone_id |  numa_size
--------------------+--------------+-------------
 Buffer Blocks      |            0 | 17179869184
 Buffer Blocks      |            1 | 17179869184
 Buffer Blocks      |            3 | 17179869184
 Buffer Blocks      |            2 | 17179869184
 Buffer Descriptors |            2 |   134217728
 Buffer Descriptors |            1 |   134217728
 Buffer Descriptors |            0 |   134217728
 Buffer Descriptors |            3 |   134217728
 Checkpointer Data  |            1 |    67108864
 Checkpointer Data  |            0 |    67108864
 Checkpointer Data  |            2 |    67108864
 Checkpointer Data  |            3 |    67108864

Time: 68.579 ms


> A few random comments:
>
> === 1
>
[..]
> I think that the if (query_numa) check should also wrap that entire section of code.

Done.


> === 2
>
> +                       if (query_numa)
> +                       {
> +                          blk2page = (int) i * pages_per_blk;
> +                          j = 0;
> +                          do
> +                          {
>
> This check is done for every page. I wonder if it would not make sense
> to create a brand new function for pg_buffercache_numa and just let the
> current pg_buffercache_pages() as it is. That said it would be great to avoid
> code duplication as much a possible though, maybe using a shared
> populate_buffercache_entry() or such helper function?

Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...
IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
there maybe to hint compiler with some smaller routine to reduce
complexity of that main routine? (assuming NUMA inquiry is going to be
rare).

> === 3
>
> +#define ONE_GIGABYTE 1024*1024*1024
> +                                               if ((i * os_page_size) % ONE_GIGABYTE == 0)
> +                                                       CHECK_FOR_INTERRUPTS();
> +                                       }
>
> Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
> for every page instead? (I don't see with a 30GB shared buffer). I've the
> feeling that we could get rid of the "ONE_GIGABYTE" check.

You are right and no it was simply my premature optimization attempt,
as apparently CFI on closer looks seems to be already having
unlikely() and looks really cheap, so yeah I've removed that.

> === 4
>
> +               pfree(os_page_ptrs);
> +               pfree(os_pages_status);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).

Yes, I wanted to have it just for illustrative and stylish purposes,
but right, removed.

> === 5
>
> +/* SQL SRF showing NUMA zones for allocated shared memory */
> +Datum
> +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
> +{
[..]
>
> +               for (i = 0; i < shm_ent_page_count; i++)
> +               {
> +                       /*
> +                        * In order to get reliable results we also need to touch memory
> +                        * pages so that inquiry about NUMA zone doesn't return -2.
> +                        */
> +                       volatile uint64 touch pg_attribute_unused();
> +
> +                       page_ptrs[i] = (char *) ent->location + (i * os_page_size);
> +                       pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
>
> That sounds right.
>
> Could we also avoid some code duplication with pg_get_shmem_allocations()?

Not sure I understand do you want to avoid code duplication
pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
true) ?

>
> Also same remarks about pfree() and ONE_GIGABYTE as above.

Fixed.

> A few other things:
>
> ==== 6
>
> +#include "port/pg_numa.h"
> Not at the right position, should be between those 2:
>
> #include "miscadmin.h"
> #include "storage/lwlock.h"

Fixed.


> ==== 7
>
> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *       Miscellaneous functions for bit-wise operations.
>
> description is not correct. Also the "Copyright (c) 2019-2025" might be
> "Copyright (c) 2025" instead.

Fixed.

> === 8
>
> +++ b/src/port/pg_numa.c
> @@ -0,0 +1,150 @@
> +/*-------------------------------------------------------------------------
> + *
> + * numa.c
> + *             Basic NUMA portability routines
>
> s/numa.c/pg_numa.c/ ?

Fixed.

> === 9
>
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -6,6 +6,7 @@
>   *       contrib/pg_buffercache/pg_buffercache_pages.c
>   *-------------------------------------------------------------------------
>   */
> +#include "pg_config.h"
>  #include "postgres.h"
>
> Is this new include needed?

Removed, don't remember how it arrived here, most have been some
artifact of earlier attempts.

>  #include "access/htup_details.h"
> @@ -13,10 +14,12 @@
>  #include "funcapi.h"
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
>
> not in the right order.

Fixed.

And also those from nearby message:

On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
> > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Why is this done before we even have gotten -2 back? Even if we need it, it
> > > seems like we ought to defer this until necessary.
> >
> > Not fixed yet: maybe we could even do a `static` with
> > `has_this_run_earlier` and just perform this work only once during the
> > first time?
>
> Not sure I get your idea, could you share what the code would look like?

Please see pg_buffercache_pages I've just added static bool firstUseInBackend:

postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
[..]
postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG:  NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
[..]

Same was done to the pg_get_shmem_numa_allocations.

Also CFbot/cirrus was getting:

> [11:01:05.027] In file included from ../../src/include/postgres.h:49,
> [11:01:05.027]                  from pg_buffercache_pages.c:10:
> [11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’:
> [11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3
hastype ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=] 

Fixed with %zu (for size_t) instead of %ld.

> Linux - Debian Bookworm - Autoconf got:
> [10:42:59.216] checking numa.h usability... no
> [10:42:59.268] checking numa.h presence... no
> [10:42:59.286] checking for numa.h... no
> [10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma

I've added libnuma1 to cirrus in a similar vein like libcurl to avoid this.

> [13:50:47.449] gcc -m32 @src/backend/postgres.rsp
> [13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format

I've also got an error in 32-bit build as libnuma.h is there, but
apparently libnuma provides only x86_64. Anyway 32-bit(even with PAE)
and NUMA doesn't seem to make sense so I've added -Dlibnuma=off for
such build.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

04 March, 19:02:17

Hi,

On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote:
> Hi!
> 
> On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > I did some tests and it looks like it's giving correct results. I don't see -2
> > anymore and every backend reports correct distribution (with or without hp,
> > with "small" or "large" shared buffer).
> 
> Cool! Attached is v7

Thanks for the new version!

> > === 2
> >
> > +                       if (query_numa)
> > +                       {
> > +                          blk2page = (int) i * pages_per_blk;
> > +                          j = 0;
> > +                          do
> > +                          {
> >
> > This check is done for every page. I wonder if it would not make sense
> > to create a brand new function for pg_buffercache_numa and just let the
> > current pg_buffercache_pages() as it is. That said it would be great to avoid
> > code duplication as much a possible though, maybe using a shared
> > populate_buffercache_entry() or such helper function?
> 
> Well, I've made query_numa a parameter there simply to avoid that code
> duplication in the first place, look at those TupleDescInitEntry()...

Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
or such function: to put the "duplicated" code in it and then use this
shared function in pg_buffercache_pages() and in the new numa related one.

> IMHO rarely anybody uses pg_buffercache, but we could add unlikely()

I think unlikely() should be used for optimization based on code path likelihood,
not based on how often users might use a feature.

> > === 5
> >
> > Could we also avoid some code duplication with pg_get_shmem_allocations()?
> 
> Not sure I understand do you want to avoid code duplication
> pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
> pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
> true) ?

I meant to say avoid code duplication between pg_get_shmem_allocations() and
pg_get_shmem_numa_allocations(). It might be possible to create a shared 
function for them too. That said, it looks like that the savings (if any), would
not be that much, so maybe just forget about it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

05 March, 12:30:02

On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > Cool! Attached is v7
> Thanks for the new version!

... and another one: 7b ;)

> > > === 2
[..]
> > Well, I've made query_numa a parameter there simply to avoid that code
> > duplication in the first place, look at those TupleDescInitEntry()...
>
> Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> or such function: to put the "duplicated" code in it and then use this
> shared function in pg_buffercache_pages() and in the new numa related one.

OK, so hastily attempted that in 7b , I had to do a larger refactor
there to avoid code duplication between those two. I don't know which
attempt is better though (7 vs 7b)..

> > IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
>
> I think unlikely() should be used for optimization based on code path likelihood,
> not based on how often users might use a feature.

In 7b I've removed the unlikely() - For a moment I was thinking that
you are concerned about this loop for NBuffers to be as much optimized
as it can and that's the reason for splitting the routines.

> > > === 5
[..]
> I meant to say avoid code duplication between pg_get_shmem_allocations() and
> pg_get_shmem_numa_allocations(). It might be possible to create a shared
> function for them too. That said, it looks like that the savings (if any), would
> not be that much, so maybe just forget about it.

 Yeah, OK, so let's leave it at that.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

07 March, 13:20:03

Hi,
On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>Hi,

> > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > or such function: to put the "duplicated" code in it and then use this
> > shared function in pg_buffercache_pages() and in the new numa related one.
>
> OK, so hastily attempted that in 7b , I had to do a larger refactor
> there to avoid code duplication between those two. I don't know which
> attempt is better though (7 vs 7b)..
>

I'm attaching basically the earlier stuff (v7b) as v8 with the
following minor changes:
- docs are included
- changed int8 to int4 in one function definition for numa_zone_id

-J.

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

07 March, 14:33:27

On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> Hi,
> On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >Hi,
>
> > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > or such function: to put the "duplicated" code in it and then use this
> > > shared function in pg_buffercache_pages() and in the new numa related one.
> >
> > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > there to avoid code duplication between those two. I don't know which
> > attempt is better though (7 vs 7b)..
> >
>
> I'm attaching basically the earlier stuff (v7b) as v8 with the
> following minor changes:
> - docs are included
> - changed int8 to int4 in one function definition for numa_zone_id

.. and v9 attached because cfbot partially complained about
.cirrus.tasks.yml being adjusted recently (it seems master is hot
these days).

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

10 March, 13:14:07

Hi,

On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote:
> On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > Hi,
> > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > >Hi,
> >
> > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > > or such function: to put the "duplicated" code in it and then use this
> > > > shared function in pg_buffercache_pages() and in the new numa related one.
> > >
> > > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > > there to avoid code duplication between those two. I don't know which
> > > attempt is better though (7 vs 7b)..
> > >
> >
> > I'm attaching basically the earlier stuff (v7b) as v8 with the
> > following minor changes:
> > - docs are included
> > - changed int8 to int4 in one function definition for numa_zone_id
> 
> .. and v9 attached because cfbot partially complained about
> .cirrus.tasks.yml being adjusted recently (it seems master is hot
> these days).
> 

Thanks for the new version!

Some random comments on 0001:

=== 1

It does not compiles "alone". It's missing:

+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader  /* standard header for all Postgres shmem */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;

and

--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int  ssl_renegotiation_limit;
  */
 int            huge_pages = HUGE_PAGES_TRY;
 int            huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int            huge_pages_status = HUGE_PAGES_UNKNOWN;

That come with 0002. So it has to be in 0001 instead.

=== 2

+else
+  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi

Maybe we should add the same test (checking for numa.h) for meson?

=== 3

+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)

It looks to me that we can remove those 2 lines.

=== 4

+         Only supported on Linux.

s/on Linux/on platforms for which the libnuma library is implemented/?

I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.

=== 5

+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>

I had a closer look to other header files and it looks like it "should" be:

#include "c.h"
#include "postgres.h"

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#include "port/pg_numa.h"
#include "storage/pg_shmem.h"

And is "#include "c.h"" really needed?

=== 6

+/* FIXME not tested, might crash */

That's a bit scary.

=== 7

+        * XXX: for now we issue just WARNING, but long-term that might depend on
+        * numa_set_strict() here

s/here/here./

Did not look carefully at all the comments in 0001, 0002 and 0003 though.

A few random comments regarding 0002:

=== 8

# create extension pg_buffercache;
ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages
 
CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7

While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.

=== 9

+       do
+       {
+               if (os_page_ptrs[blk2page + j] == 0)

blk2page + j will be repeated multiple times, better to store it in a local
variable instead?

=== 10

+                       if (firstUseInBackend == true)


if (firstUseInBackend) instead?

=== 11

+       int                     j = 0,
+                               blk2page = (int) i * pages_per_blk;

I wonder if size_t is more appropriate for blk2page:

size_t blk2page = (size_t)(i * pages_per_blk) maybe?

=== 12

as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?

"
for (size_t j = 0; j < pages_per_blk; j++)
{
    if (os_page_ptrs[blk2page + j] == 0)
    {
"

=== 13

+       if (buf_state & BM_DIRTY)
+               fctx->record[i].isdirty = true;
+       else
+               fctx->record[i].isdirty = false;

  fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?

=== 14

+       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+               fctx->record[i].isvalid = true;
+       else
+               fctx->record[i].isvalid = false;

fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?

=== 15

+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)

I wonder if "i" should get a more descriptive name?

=== 16

s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).

I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.

=== 17

+       static bool firstUseInBackend = true;

maybe we should give it a more descriptive name?

Also I need to think more about how firstUseInBackend is used, for example:

   ==== 17.1

   would it be better to define it as a file scope variable? (at least that
   would avoid to pass it as an extra parameter in some functions).

   === 17.2 

   what happens if firstUseInBackend is set to false and later on the pages
   are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
   called again by a backend that already had set firstUseInBackend to false,
   would that provide accurate information?

=== 18

+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+       FuncCallContext *funcctx;
+       BufferCachePagesContext *fctx;  /* User function context. */
+       bool            query_numa = true;

I don't think we need query_numa anymore in pg_buffercache_numa_pages().

I think that we can just ERROR (or NOTICE and return) here:

+               if (pg_numa_init() == -1)
+               {
+                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMAdata might be unavailable.");
 
+                       query_numa = false;

and fully get rid of query_numa.

=== 19

And then here:

        for (i = 0; i < NBuffers; i++)
        {
            populate_buffercache_entry(i, fctx);
            if (query_numa)
                pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
        }

        if (query_numa)
        {
            if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
                elog(ERROR, "failed NUMA pages inquiry: %m");

            for (i = 0; i < NBuffers; i++)
            {
                int         blk2page = (int) i * pages_per_blk;

                /*
                 * Technically we can get errors too here and pass that to
                 * user. Also we could somehow report single DB block spanning
                 * more than one NUMA zone, but it should be rare.
                 */
                fctx->record[i].numa_zone_id = os_pages_status[blk2page];
            }

maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

A few random comments regarding 0003:

=== 20

+  <indexterm zone="view-pg-shmem-numa-allocations">
+   <primary>pg_shmem_numa_allocations</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes

s/pg_shmem_allocations/pg_shmem_numa_allocations/?

=== 21

+  /* FIXME: should we release LWlock here ? */
+  elog(ERROR, "failed NUMA pages inquiry status: %m");

There is no need, see src/backend/storage/lmgr/README.

=== 22

+#define MAX_NUMA_ZONES 32              /* FIXME? */
+               Size            zones[MAX_NUMA_ZONES];

could we rely on pg_numa_get_max_node() instead?

=== 23

+                       if (s >= 0)
+                               zones[s]++;

should we also check that s is below a limit?

=== 24

Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?


Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I 
did a little in 0001 though)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

12 March, 18:41:15

On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Thanks for the new version!

v10 is attached with most fixes after review and one new thing
introduced: pg_numa_available() for run-time decision inside tests
which was needed after simplifying code a little bit as you wanted.
I've also fixed -Dlibnuma=disabled as it was not properly implemented.
There are couple open items (minor/decision things), but most is fixed
or answered:

> Some random comments on 0001:
>
> === 1
>
> It does not compiles "alone". It's missing:
>
[..]
> +extern PGDLLIMPORT int huge_pages_status;
[..]
> -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> +int            huge_pages_status = HUGE_PAGES_UNKNOWN;
>
> That come with 0002. So it has to be in 0001 instead.

Ugh, good catch, I haven't thought about it in isolation, they are
separate to just ease review, but should be committed together. Fixed.

> === 2
>
> +else
> +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> +fi
>
> Maybe we should add the same test (checking for numa.h) for meson?

TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
between meson and autoconf?

> === 3
>
> +# FIXME: filter-out / with/without with_libnuma?
> +LIBS += $(LIBNUMA_LIBS)
>
> It looks to me that we can remove those 2 lines.

Done.

> === 4
>
> +         Only supported on Linux.
>
> s/on Linux/on platforms for which the libnuma library is implemented/?
>
> I did a quick grep on "Only supported on" and it looks like that could be
> a more consistent wording.

Fixed.

> === 5
>
> +#include "c.h"
> +#include "postgres.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
> +#include <unistd.h>
>
> I had a closer look to other header files and it looks like it "should" be:
>
> #include "c.h"
> #include "postgres.h"
>
> #include <unistd.h>
> #ifdef WIN32
> #include <windows.h>
> #endif
>
> #include "port/pg_numa.h"
> #include "storage/pg_shmem.h"
>
> And is "#include "c.h"" really needed?

Fixed both. It seems to compile without c.h.

> === 6
>
> +/* FIXME not tested, might crash */
>
> That's a bit scary.

When you are in support for long enough it is becoming the norm ;) But
on serious note Andres wanted have numa error/warning handlers (used
by libnuma), but current code has no realistic code-path to hit it
from numa_available(3), numa_move_pages(3) or numa_max_node(3). The
situation won't change until one day in future (I hope!) we start
using some more advanced libnuma functionality for interleaving
memory, please see my earlier reply:
https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
 E.g. numa_available(3) is tiny wrapper , see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872

For now, I've adjusted that FIXME to XXX, but still don't know we
could inject failure to see this triggered...

> === 7
>
> +        * XXX: for now we issue just WARNING, but long-term that might depend on
> +        * numa_set_strict() here
>
> s/here/here./

Done.

> Did not look carefully at all the comments in 0001, 0002 and 0003 though.
>
> A few random comments regarding 0002:
>
> === 8
>
> # create extension pg_buffercache;
> ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages 
> CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE"
> extension script file "pg_buffercache--1.2.sql", near line 7
>
> While that's ok if 0003 is applied. I think that each individual patch should
> "fully" work individually.

STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about lack of
pg_numa_query_pages() which is part of 0001 (?) Should I merge those
patches or keep them separate?

> === 9
>
> +       do
> +       {
> +               if (os_page_ptrs[blk2page + j] == 0)
>
> blk2page + j will be repeated multiple times, better to store it in a local
> variable instead?

Done.

> === 10
>
> +                       if (firstUseInBackend == true)
>
>
> if (firstUseInBackend) instead?

Done everywhere.

> === 11
>
> +       int                     j = 0,
> +                               blk2page = (int) i * pages_per_blk;
>
> I wonder if size_t is more appropriate for blk2page:
>
> size_t blk2page = (size_t)(i * pages_per_blk) maybe?

Sure, done.

> === 12
>
> as we know that we'll iterate until pages_per_blk then would a for loop be more
> appropriate here, something like?
>
> "
> for (size_t j = 0; j < pages_per_blk; j++)
> {
>     if (os_page_ptrs[blk2page + j] == 0)
>     {
> "

Sure.

> === 13
>
> +       if (buf_state & BM_DIRTY)
> +               fctx->record[i].isdirty = true;
> +       else
> +               fctx->record[i].isdirty = false;
>
>   fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
>
> === 14
>
> +       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> +               fctx->record[i].isvalid = true;
> +       else
> +               fctx->record[i].isvalid = false;
>
> fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> maybe?

It is coming from the original pg_buffercache and it is less readable
to me, so I don't want to touch that too much because that might open
refactoring doors too wide.

> === 15
>
> +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
>
> I wonder if "i" should get a more descriptive name?

Done, buffer_id.

> === 16
>
> s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> with pg_stat_io_build_tuples for example).
>
> I now realize that I did propose populate_buffercache_entry() up-thread,
> sorry for changing my mind.

OK, I've tried to create a better name, but all my ideas were too
long, but pg_buffercache_build_tuple sounds nice, so lets use that.

> === 17
>
> +       static bool firstUseInBackend = true;
>
> maybe we should give it a more descriptive name?

I couldn't come up with anything that wouldn't look too long, so
instead I've added a comment explaining the meaning behind this
variable, hope that's good enough.

> Also I need to think more about how firstUseInBackend is used, for example:
>
>    ==== 17.1
>
>    would it be better to define it as a file scope variable? (at least that
>    would avoid to pass it as an extra parameter in some functions).

I have no strong opinion on this, but I have one doubt (for future):
isn't creating global variables making life harder for upcoming
multithreading guys ?

>    === 17.2
>
>    what happens if firstUseInBackend is set to false and later on the pages
>    are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
>    called again by a backend that already had set firstUseInBackend to false,
>    would that provide accurate information?

It is still the correct result. That "touching" (paging-in) is only
necessary probably to properly resolve PTEs as the fork() does not
seem to carry them over from parent:

postgres=# select 'create table foo' || s || ' as select
generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id; -- before:
 numa_zone_id |  count
--------------+---------
            0 |     256
            1 |    4131
              | 8384221
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           1451

-- now use another root (!) session to "migratepages(8)" from numactl
will also shift shm segment:
# migratepages 1451 1 3

-- while above will be in progress for a lot of time , but the outcome
is visible much faster in that backend (pg is functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id;
 numa_zone_id |  count
--------------+---------
            0 |     256
            3 |    4480
              | 8383872

So the above clearly shows that initial touching of shm is required,
but just once and it stays valid afterwards.

BTW: migratepages(8) was stuck for 1-2 minutes there on
"__wait_rcu_gp" according to it's wchan, without any sign of activity
on the OS and then out of blue completed just fine, s_b=64GB,HP=on.

> === 18
>
> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> +       FuncCallContext *funcctx;
> +       BufferCachePagesContext *fctx;  /* User function context. */
> +       bool            query_numa = true;
>
> I don't think we need query_numa anymore in pg_buffercache_numa_pages().
>
> I think that we can just ERROR (or NOTICE and return) here:
>
> +               if (pg_numa_init() == -1)
> +               {
> +                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMAdata might be unavailable."); 
> +                       query_numa = false;
>
> and fully get rid of query_numa.

Right... fixed it here, but it made the tests  blow up, so I've had to
find a way to conditionally launch tests based on NUMA availability
and that's how pg_numa_available() was born. It's in ipc/shmem.c
because I couldn't find a better place for it...

> === 19
>
> And then here:
>
>         for (i = 0; i < NBuffers; i++)
>         {
>             populate_buffercache_entry(i, fctx);
>             if (query_numa)
>                 pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
>         }
>
>         if (query_numa)
>         {
>             if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
>                 elog(ERROR, "failed NUMA pages inquiry: %m");
>
>             for (i = 0; i < NBuffers; i++)
>             {
>                 int         blk2page = (int) i * pages_per_blk;
>
>                 /*
>                  * Technically we can get errors too here and pass that to
>                  * user. Also we could somehow report single DB block spanning
>                  * more than one NUMA zone, but it should be rare.
>                  */
>                 fctx->record[i].numa_zone_id = os_pages_status[blk2page];
>             }
>
> maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
that prepares **os_page_ptrs, which is then used by
pg_numa_query_pages() and then we fill the data. But now after
removing query_numa , it reads smoother anyway. Can you please take a
look again on this, is this up to the project standards?

> A few random comments regarding 0003:
>
> === 20
>
> +   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
>
> s/pg_shmem_allocations/pg_shmem_numa_allocations/?

Fixed.

> === 21
>
> +  /* FIXME: should we release LWlock here ? */
> +  elog(ERROR, "failed NUMA pages inquiry status: %m");
>
> There is no need, see src/backend/storage/lmgr/README.

Thanks, fixed.

> === 22
>
> +#define MAX_NUMA_ZONES 32              /* FIXME? */
> +               Size            zones[MAX_NUMA_ZONES];
>
> could we rely on pg_numa_get_max_node() instead?

Sure, done.

> === 23
>
> +                       if (s >= 0)
> +                               zones[s]++;
>
> should we also check that s is below a limit?

STILL OPEN QUESTION: I'm not sure it would give us value to report
e.g. -2 on per shm entry/per numa node entry, would it? If it would we
could somehow overallocate that array and remember negative ones too.

> === 24
>
> Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> no possible holes? I mean could a system have node 0,1 and 3 but not 2?

I have never seen a hole in numbering as it is established during
boot, and the only way that could get close to adjusting it could be
making processor books (CPU and RAM together) offline. Even *if*
someone would be doing some preventive hardware maintenance, that
still wouldn't hurt, as we are just using the Id of the zone to
display it -- the max would be already higher. I mean, technically one
could use lsmem(1) (it mentions removable flag, after let's pages and
processes migrated away and from there) and then use chcpu(1) to
--disable CPUs on that zone (to prevent new ones coming there and
allocating new local memory there) and then offlining that memory
region via chmem(1) --disable. Even with all of that, that still
shouldn't cause issues for this code I think, because
`numa_max_node()` says it `returns the >> highest node number <<<
available on the current system.`

> Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> did a little in 0001 though)

That was an attempt to say "Thank You", OK, I've aligned it that way,
so you are still referenced in 0001. I wouldn't find motivation to
work on this if you won't respond to those emails ;)

There is one known issue, CI returned for numa.out test, that I need
to get bottom to (it does not reproduce for me locally) inside
numa.out/sql:

SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90


-J.

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

13 March, 12:22:25

On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> > Thanks for the new version!
>
> v10 is attached with most fixes after review and one new thing
> introduced: pg_numa_available() for run-time decision inside tests
> which was needed after simplifying code a little bit as you wanted.
> I've also fixed -Dlibnuma=disabled as it was not properly implemented.
> There are couple open items (minor/decision things), but most is fixed
> or answered:
>
> > Some random comments on 0001:
> >
> > === 1
> >
> > It does not compiles "alone". It's missing:
> >
> [..]
> > +extern PGDLLIMPORT int huge_pages_status;
> [..]
> > -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> > +int            huge_pages_status = HUGE_PAGES_UNKNOWN;
> >
> > That come with 0002. So it has to be in 0001 instead.
>
> Ugh, good catch, I haven't thought about it in isolation, they are
> separate to just ease review, but should be committed together. Fixed.
>
> > === 2
> >
> > +else
> > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
>
> TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> between meson and autoconf?
>
> > === 3
> >
> > +# FIXME: filter-out / with/without with_libnuma?
> > +LIBS += $(LIBNUMA_LIBS)
> >
> > It looks to me that we can remove those 2 lines.
>
> Done.
>
> > === 4
> >
> > +         Only supported on Linux.
> >
> > s/on Linux/on platforms for which the libnuma library is implemented/?
> >
> > I did a quick grep on "Only supported on" and it looks like that could be
> > a more consistent wording.
>
> Fixed.
>
> > === 5
> >
> > +#include "c.h"
> > +#include "postgres.h"
> > +#include "port/pg_numa.h"
> > +#include "storage/pg_shmem.h"
> > +#include <unistd.h>
> >
> > I had a closer look to other header files and it looks like it "should" be:
> >
> > #include "c.h"
> > #include "postgres.h"
> >
> > #include <unistd.h>
> > #ifdef WIN32
> > #include <windows.h>
> > #endif
> >
> > #include "port/pg_numa.h"
> > #include "storage/pg_shmem.h"
> >
> > And is "#include "c.h"" really needed?
>
> Fixed both. It seems to compile without c.h.
>
> > === 6
> >
> > +/* FIXME not tested, might crash */
> >
> > That's a bit scary.
>
> When you are in support for long enough it is becoming the norm ;) But
> on serious note Andres wanted have numa error/warning handlers (used
> by libnuma), but current code has no realistic code-path to hit it
> from numa_available(3), numa_move_pages(3) or numa_max_node(3). The
> situation won't change until one day in future (I hope!) we start
> using some more advanced libnuma functionality for interleaving
> memory, please see my earlier reply:
> https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
>  E.g. numa_available(3) is tiny wrapper , see
> https://github.com/numactl/numactl/blob/master/libnuma.c#L872
>
> For now, I've adjusted that FIXME to XXX, but still don't know we
> could inject failure to see this triggered...
>
> > === 7
> >
> > +        * XXX: for now we issue just WARNING, but long-term that might depend on
> > +        * numa_set_strict() here
> >
> > s/here/here./
>
> Done.
>
> > Did not look carefully at all the comments in 0001, 0002 and 0003 though.
> >
> > A few random comments regarding 0002:
> >
> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages 
> > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's ok if 0003 is applied. I think that each individual patch should
> > "fully" work individually.
>
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> patches or keep them separate?
>
> > === 9
> >
> > +       do
> > +       {
> > +               if (os_page_ptrs[blk2page + j] == 0)
> >
> > blk2page + j will be repeated multiple times, better to store it in a local
> > variable instead?
>
> Done.
>
> > === 10
> >
> > +                       if (firstUseInBackend == true)
> >
> >
> > if (firstUseInBackend) instead?
>
> Done everywhere.
>
> > === 11
> >
> > +       int                     j = 0,
> > +                               blk2page = (int) i * pages_per_blk;
> >
> > I wonder if size_t is more appropriate for blk2page:
> >
> > size_t blk2page = (size_t)(i * pages_per_blk) maybe?
>
> Sure, done.
>
> > === 12
> >
> > as we know that we'll iterate until pages_per_blk then would a for loop be more
> > appropriate here, something like?
> >
> > "
> > for (size_t j = 0; j < pages_per_blk; j++)
> > {
> >     if (os_page_ptrs[blk2page + j] == 0)
> >     {
> > "
>
> Sure.
>
> > === 13
> >
> > +       if (buf_state & BM_DIRTY)
> > +               fctx->record[i].isdirty = true;
> > +       else
> > +               fctx->record[i].isdirty = false;
> >
> >   fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
> >
> > === 14
> >
> > +       if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> > +               fctx->record[i].isvalid = true;
> > +       else
> > +               fctx->record[i].isvalid = false;
> >
> > fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> > maybe?
>
> It is coming from the original pg_buffercache and it is less readable
> to me, so I don't want to touch that too much because that might open
> refactoring doors too wide.
>
> > === 15
> >
> > +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
> >
> > I wonder if "i" should get a more descriptive name?
>
> Done, buffer_id.
>
> > === 16
> >
> > s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> > with pg_stat_io_build_tuples for example).
> >
> > I now realize that I did propose populate_buffercache_entry() up-thread,
> > sorry for changing my mind.
>
> OK, I've tried to create a better name, but all my ideas were too
> long, but pg_buffercache_build_tuple sounds nice, so lets use that.
>
> > === 17
> >
> > +       static bool firstUseInBackend = true;
> >
> > maybe we should give it a more descriptive name?
>
> I couldn't come up with anything that wouldn't look too long, so
> instead I've added a comment explaining the meaning behind this
> variable, hope that's good enough.
>
> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> >    ==== 17.1
> >
> >    would it be better to define it as a file scope variable? (at least that
> >    would avoid to pass it as an extra parameter in some functions).
>
> I have no strong opinion on this, but I have one doubt (for future):
> isn't creating global variables making life harder for upcoming
> multithreading guys ?
>
> >    === 17.2
> >
> >    what happens if firstUseInBackend is set to false and later on the pages
> >    are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> >    called again by a backend that already had set firstUseInBackend to false,
> >    would that provide accurate information?
>
> It is still the correct result. That "touching" (paging-in) is only
> necessary probably to properly resolve PTEs as the fork() does not
> seem to carry them over from parent:
>
> postgres=# select 'create table foo' || s || ' as select
> generate_series(1, 100000);' from generate_series(1, 4) s;
> postgres=# \gexec
> SELECT 100000
> SELECT 100000
> SELECT 100000
> SELECT 100000
> postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
> group by numa_zone_id order by numa_zone_id; -- before:
>  numa_zone_id |  count
> --------------+---------
>             0 |     256
>             1 |    4131
>               | 8384221
> postgres=# select pg_backend_pid();
>  pg_backend_pid
> ----------------
>            1451
>
> -- now use another root (!) session to "migratepages(8)" from numactl
> will also shift shm segment:
> # migratepages 1451 1 3
>
> -- while above will be in progress for a lot of time , but the outcome
> is visible much faster in that backend (pg is functioning):
> postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
> group by numa_zone_id order by numa_zone_id;
>  numa_zone_id |  count
> --------------+---------
>             0 |     256
>             3 |    4480
>               | 8383872
>
> So the above clearly shows that initial touching of shm is required,
> but just once and it stays valid afterwards.
>
> BTW: migratepages(8) was stuck for 1-2 minutes there on
> "__wait_rcu_gp" according to it's wchan, without any sign of activity
> on the OS and then out of blue completed just fine, s_b=64GB,HP=on.
>
> > === 18
> >
> > +Datum
> > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> > +{
> > +       FuncCallContext *funcctx;
> > +       BufferCachePagesContext *fctx;  /* User function context. */
> > +       bool            query_numa = true;
> >
> > I don't think we need query_numa anymore in pg_buffercache_numa_pages().
> >
> > I think that we can just ERROR (or NOTICE and return) here:
> >
> > +               if (pg_numa_init() == -1)
> > +               {
> > +                       elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some
NUMAdata might be unavailable."); 
> > +                       query_numa = false;
> >
> > and fully get rid of query_numa.
>
> Right... fixed it here, but it made the tests  blow up, so I've had to
> find a way to conditionally launch tests based on NUMA availability
> and that's how pg_numa_available() was born. It's in ipc/shmem.c
> because I couldn't find a better place for it...
>
> > === 19
> >
> > And then here:
> >
> >         for (i = 0; i < NBuffers; i++)
> >         {
> >             populate_buffercache_entry(i, fctx);
> >             if (query_numa)
> >                 pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
> >         }
> >
> >         if (query_numa)
> >         {
> >             if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
> >                 elog(ERROR, "failed NUMA pages inquiry: %m");
> >
> >             for (i = 0; i < NBuffers; i++)
> >             {
> >                 int         blk2page = (int) i * pages_per_blk;
> >
> >                 /*
> >                  * Technically we can get errors too here and pass that to
> >                  * user. Also we could somehow report single DB block spanning
> >                  * more than one NUMA zone, but it should be rare.
> >                  */
> >                 fctx->record[i].numa_zone_id = os_pages_status[blk2page];
> >             }
> >
> > maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?
>
> Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
> that prepares **os_page_ptrs, which is then used by
> pg_numa_query_pages() and then we fill the data. But now after
> removing query_numa , it reads smoother anyway. Can you please take a
> look again on this, is this up to the project standards?
>
> > A few random comments regarding 0003:
> >
> > === 20
> >
> > +   The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
> >
> > s/pg_shmem_allocations/pg_shmem_numa_allocations/?
>
> Fixed.
>
> > === 21
> >
> > +  /* FIXME: should we release LWlock here ? */
> > +  elog(ERROR, "failed NUMA pages inquiry status: %m");
> >
> > There is no need, see src/backend/storage/lmgr/README.
>
> Thanks, fixed.
>
> > === 22
> >
> > +#define MAX_NUMA_ZONES 32              /* FIXME? */
> > +               Size            zones[MAX_NUMA_ZONES];
> >
> > could we rely on pg_numa_get_max_node() instead?
>
> Sure, done.
>
> > === 23
> >
> > +                       if (s >= 0)
> > +                               zones[s]++;
> >
> > should we also check that s is below a limit?
>
> STILL OPEN QUESTION: I'm not sure it would give us value to report
> e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> could somehow overallocate that array and remember negative ones too.
>
> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> > no possible holes? I mean could a system have node 0,1 and 3 but not 2?
>
> I have never seen a hole in numbering as it is established during
> boot, and the only way that could get close to adjusting it could be
> making processor books (CPU and RAM together) offline. Even *if*
> someone would be doing some preventive hardware maintenance, that
> still wouldn't hurt, as we are just using the Id of the zone to
> display it -- the max would be already higher. I mean, technically one
> could use lsmem(1) (it mentions removable flag, after let's pages and
> processes migrated away and from there) and then use chcpu(1) to
> --disable CPUs on that zone (to prevent new ones coming there and
> allocating new local memory there) and then offlining that memory
> region via chmem(1) --disable. Even with all of that, that still
> shouldn't cause issues for this code I think, because
> `numa_max_node()` says it `returns the >> highest node number <<<
> available on the current system.`
>
> > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> > did a little in 0001 though)
>
> That was an attempt to say "Thank You", OK, I've aligned it that way,
> so you are still referenced in 0001. I wouldn't find motivation to
> work on this if you won't respond to those emails ;)
>
> There is one known issue, CI returned for numa.out test, that I need
> to get bottom to (it does not reproduce for me locally) inside
> numa.out/sql:
>
> SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
>

Hi, ok, so v11 should fix this last problem and make cfbot happy.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

13 March, 17:15:14

Hi,

On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote:
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Thanks for the new version!
> 
> v10 is attached with most fixes after review and one new thing
> introduced: pg_numa_available() for run-time decision inside tests
> which was needed after simplifying code a little bit as you wanted.
> I've also fixed -Dlibnuma=disabled as it was not properly implemented.
> There are couple open items (minor/decision things), but most is fixed
> or answered:

Thanks for the new version!

> > === 2
> >
> > +else
> > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
> 
> TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> between meson and autoconf?

Yeah, idea was to have both in sync. 

> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages
 
> > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's ok if 0003 is applied. I think that each individual patch should
> > "fully" work individually.
> 
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> patches or keep them separate?

Applying 0001 + 0002 only does produce the issue above. I think that each
sub-patch once applied should pass make check-world, that means:

0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass

> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> >    ==== 17.1
> >
> >    would it be better to define it as a file scope variable? (at least that
> >    would avoid to pass it as an extra parameter in some functions).
> 
> I have no strong opinion on this, but I have one doubt (for future):
> isn't creating global variables making life harder for upcoming
> multithreading guys ?

That would be "just" one more. I think it's better to use it to avoid the "current"
code using "useless" function parameters.

> >    === 17.2
> >
> >    what happens if firstUseInBackend is set to false and later on the pages
> >    are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> >    called again by a backend that already had set firstUseInBackend to false,
> >    would that provide accurate information?
> 
> It is still the correct result. That "touching" (paging-in) is only
> necessary probably to properly resolve PTEs as the fork() does not
> seem to carry them over from parent:
> 
> So the above clearly shows that initial touching of shm is required,
> but just once and it stays valid afterwards.

Great, thanks for the demo and the explanation!

> > === 19
> >
> 
> Can you please take a look again on this

Sure, will do.

> > === 23
> >
> > +                       if (s >= 0)
> > +                               zones[s]++;
> >
> > should we also check that s is below a limit?
> 
> STILL OPEN QUESTION: I'm not sure it would give us value to report
> e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> could somehow overallocate that array and remember negative ones too.

I meant to say, ensure that it is below the max node number.

> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> > no possible holes? I mean could a system have node 0,1 and 3 but not 2?
> 
> I have never seen a hole in numbering as it is established during
> boot, and the only way that could get close to adjusting it could be
> making processor books (CPU and RAM together) offline.

Yeah probably.

> Even with all of that, that still
> shouldn't cause issues for this code I think, because
> `numa_max_node()` says it `returns the >> highest node number <<<
> available on the current system.`

Yeah, I agree, thanks for clearing my doubts on it.

> > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> > did a little in 0001 though)
> 
> That was an attempt to say "Thank You",

Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-)

I'll have a look at v11 soon.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

13 March, 18:38:33

Hi,

On Thu, Mar 13, 2025 at 02:15:14PM +0000, Bertrand Drouvot wrote:
> > > === 19
> > >
> > 
> > Can you please take a look again on this
> 
> Sure, will do.

> I'll have a look at v11 soon.

About 0001:

=== 1

git am produces:

.git/rebase-apply/patch:378: new blank line at EOF.
+
.git/rebase-apply/patch:411: new blank line at EOF.
+
warning: 2 lines add whitespace errors.

=== 2

+  Returns true if a <acronym>NUMA</acronym> support is available.

What about "Returns true if the server has been compiled with NUMA support"?

=== 3

+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+       if(pg_numa_init() == -1)
+               PG_RETURN_BOOL(false);
+       PG_RETURN_BOOL(true);
+}

What about PG_RETURN_BOOL(pg_numa_init() != -1)?

Also I wonder if pg_numa.c would not be a better place for it.

=== 4

+{ oid => '5102', descr => 'Is NUMA compilation available?',
+  proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
+  proargtypes => '', prosrc => 'pg_numa_available' },
+

Not sure that 5102 is a good choice.

src/include/catalog/unused_oids is telling me:

Best practice is to start with a random choice in the range 8000-9999.
Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)

So maybe use 9685 instead?

=== 5

@@ -12477,3 +12481,4 @@
   prosrc => 'gist_stratnum_common' },

 ]
+

garbage?

=== 6

Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
and src/port/pg_numa.c.

=== 7

+Size
+pg_numa_get_pagesize(void)
+{
+       Size os_page_size = sysconf(_SC_PAGESIZE);
+       if (huge_pages_status == HUGE_PAGES_ON)
+                GetHugePageSize(&os_page_size, NULL);
+       return os_page_size;
+}

I think that's more usual to:

Size
pg_numa_get_pagesize(void)
{
       Size os_page_size = sysconf(_SC_PAGESIZE);

       if (huge_pages_status == HUGE_PAGES_ON)
                GetHugePageSize(&os_page_size, NULL);

       return os_page_size;
}

I think that makes sense to check huge_pages_status as you did.

=== 8

+extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
+extern void numa_error(char *where);

I wonder if that would make sense to add a comment mentioning
github.com/numactl/numactl/blob/master/libnuma.c here.

I still need to look at 0002 and 0003.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

14 March, 13:05:28

On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi,

Thank you very much for the review! I'm answering to both reviews in
one go and the results is attached v12, seems it all should be solved
now:

> > > === 2
> > >
> > > +else
> > > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > > +fi
> > >
> > > Maybe we should add the same test (checking for numa.h) for meson?
> >
> > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> > between meson and autoconf?
>
> Yeah, idea was to have both in sync.

Added.

> > > === 8
> > >
> > > # create extension pg_buffercache;
> > > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so":
/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages 
> > > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > > RETURNS SETOF RECORD
> > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > > LANGUAGE C PARALLEL SAFE"
> > > extension script file "pg_buffercache--1.2.sql", near line 7
> > >
> > > While that's ok if 0003 is applied. I think that each individual patch should
> > > "fully" work individually.
> >
> > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> > 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> > pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> > patches or keep them separate?
>
> Applying 0001 + 0002 only does produce the issue above. I think that each
> sub-patch once applied should pass make check-world, that means:
>
> 0001 : should pass
> 0001 + 0002 : should pass
> 0001 + 0002 + 0003 : should pass

OK, I've retested v11 for all three of them. It worked fine (I think
in v10 I've moved one function to 0001, but pg_numa_query_pages() as
per error message above was always in the 0001).

> > > Also I need to think more about how firstUseInBackend is used, for example:
> > >
> > >    ==== 17.1
> > >
> > >    would it be better to define it as a file scope variable? (at least that
> > >    would avoid to pass it as an extra parameter in some functions).
> >
> > I have no strong opinion on this, but I have one doubt (for future):
> > isn't creating global variables making life harder for upcoming
> > multithreading guys ?
>
> That would be "just" one more. I think it's better to use it to avoid the "current"
> code using "useless" function parameters.

Done.

> > > === 23
> > >
> > > +                       if (s >= 0)
> > > +                               zones[s]++;
> > >
> > > should we also check that s is below a limit?
> >
> > STILL OPEN QUESTION: I'm not sure it would give us value to report
> > e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> > could somehow overallocate that array and remember negative ones too.
>
> I meant to say, ensure that it is below the max node number.

Done, but I doubt the kernel would return a value higher than
numa_max_nodes(), but who knows. Additional defense for this array is
now there.

> SECOND REVIEW//v11-0001 review
>
> === 1
>
> git am produces:
>
> .git/rebase-apply/patch:378: new blank line at EOF.
> +
> .git/rebase-apply/patch:411: new blank line at EOF.
> +
> warning: 2 lines add whitespace errors.

Should be gone, but in at least one case (0003/numa.out) we need to
have empty EOF because otherwise expected tests don't pass (even if
numa.sql doesnt have EOF in numa.sql)

> === 2
>
> +  Returns true if a <acronym>NUMA</acronym> support is available.
>
> What about "Returns true if the server has been compiled with NUMA support"?

Done.

> === 3
>
> +Datum
> +pg_numa_available(PG_FUNCTION_ARGS)
> +{
> +       if(pg_numa_init() == -1)
> +               PG_RETURN_BOOL(false);
> +       PG_RETURN_BOOL(true);
> +}
>
> What about PG_RETURN_BOOL(pg_numa_init() != -1)?
>
> Also I wonder if pg_numa.c would not be a better place for it.

Both done.

> === 4
>
> +{ oid => '5102', descr => 'Is NUMA compilation available?',
> +  proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool',
> +  proargtypes => '', prosrc => 'pg_numa_available' },
> +
>
> Not sure that 5102 is a good choice.
>
> src/include/catalog/unused_oids is telling me:
>
> Best practice is to start with a random choice in the range 8000-9999.
> Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here)
>
> So maybe use 9685 instead?

It's because i've got 5101 there earlier (for pg_shm_numa_allocs
view), but I've aligned both (5102@0001 and 5101@0003) to 968[56].

> === 5
>
> @@ -12477,3 +12481,4 @@
>    prosrc => 'gist_stratnum_common' },
>
>  ]
> +
>
> garbage?

Yea, fixed.

> === 6
>
> Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c
> and src/port/pg_numa.c.

Fixed.

> === 7
>
> +Size
> +pg_numa_get_pagesize(void)
> +{
> +       Size os_page_size = sysconf(_SC_PAGESIZE);
> +       if (huge_pages_status == HUGE_PAGES_ON)
> +                GetHugePageSize(&os_page_size, NULL);
> +       return os_page_size;
> +}
>
> I think that's more usual to:
>
> Size
> pg_numa_get_pagesize(void)
> {
>        Size os_page_size = sysconf(_SC_PAGESIZE);
>
>        if (huge_pages_status == HUGE_PAGES_ON)
>                 GetHugePageSize(&os_page_size, NULL);
>
>        return os_page_size;
> }
>
> I think that makes sense to check huge_pages_status as you did.

Done.

> === 8
>
> +extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3);
> +extern void numa_error(char *where);
>
> I wonder if that would make sense to add a comment mentioning
> github.com/numactl/numactl/blob/master/libnuma.c here.

Sure thing, added.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

14 March, 15:08:46

Hi,

On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> Hi,
> 
> Thank you very much for the review! I'm answering to both reviews in
> one go and the results is attached v12, seems it all should be solved
> now:

Thanks for v12! 

I'll review 0001 and 0003 later, but want to share what I've done for 0002.

I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
on top of v11 0002 (I just rebased it a bit so that it now applies on top of 
v12 0002).

The main changes are:

=== 1

Some doc wording changes.

=== 2

Some comments, pgindent and friends changes.

=== 3

         relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
-        pinning_backends int4, numa_zone_id int4);
+        pinning_backends int4, zone_id int4);

I changed numa_zone_id to zone_id as we are already in "numa" related functions
and/or views.

=== 4

-                       CHECK_FOR_INTERRUPTS();
                }
+
+               CHECK_FOR_INTERRUPTS();

I think that it's better to put the CFI outside of the if condition, so that it
can be called for every page.

=== 5

-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 6

-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 7

-               int                     os_page_count = 0;
+               uint64          os_page_count = 0;

I think that's better.

=== 8

But then, here, we need to change its format to %lu and and to cast to (unsigned long)
to make the CI CompilerWarnings happy.

That's fine because we don't provide NUMA support for 32 bits anyway.

-               elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
-                               os_page_count, os_page_size, pages_per_blk);
+               elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+                        (unsigned long) os_page_count, os_page_size, pages_per_blk);

=== 9

-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;

Looks better to me but still not 100% convinced by the name.

=== 10

 static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)

as PG_FUNCTION_ARGS is usually used for fmgr-compatible function and there is
a lof of examples in the code that make use of "FunctionCallInfo" for non
fmgr-compatible function.

and also:

=== 11

I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages().

But I'm having having hard time finding a better approach given the fact that
pg_numa_query_pages() needs all the pointers "prepared" before it can be called...

Those 2 loops are probably the best approach, unless someone has a better idea.

=== 12

Upthread you asked "Can you please take a look again on this, is this up to the
project standards?"

Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?

What do you think? The comments, doc and code changes are just proposals and are
fully open to discussion.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

on_top_of_v12_0002.patch

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

17 March, 10:28:46

On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you very much for the review! I'm answering to both reviews in
> > one go and the results is attached v12, seems it all should be solved
> > now:
>
> Thanks for v12!
>
> I'll review 0001 and 0003 later, but want to share what I've done for 0002.
>
> I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
> on top of v11 0002 (I just rebased it a bit so that it now applies on top of
> v12 0002).

Hey Bertrand,

all LGTM (good ideas), so here's v13 attached with applied all of that
(rebased, tested). BTW: I'm sending to make cfbot as it still tried to
apply that .patch (on my side it .patch, not .txt)

> === 9
>
> -static bool firstUseInBackend = true;
> +static bool firstNumaTouch = true;
>
> Looks better to me but still not 100% convinced by the name.

IMHO, Yes, it looks much better.

> === 10
>
>  static BufferCachePagesContext *
> -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
> +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
>
> as PG_FUNCTION_ARGS is usually used for fmgr-compatible function and there is
> a lof of examples in the code that make use of "FunctionCallInfo" for non
> fmgr-compatible function.

Cool, thanks.

> and also:
>
> === 11
>
> I don't like the fact that we iterate 2 times over NBuffers in
> pg_buffercache_numa_pages().
>
> But I'm having having hard time finding a better approach given the fact that
> pg_numa_query_pages() needs all the pointers "prepared" before it can be called...
>
> Those 2 loops are probably the best approach, unless someone has a better idea.

IMHO, it doesn't hurt and I've also not been able to think of any better idea.

> === 12
>
> Upthread you asked "Can you please take a look again on this, is this up to the
> project standards?"
>
> Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper?

Yes, this was for an earlier doubt regarding question "19" about
reviewing the code after removal of `query_numa` variable. This is the
same code for 2 loops, IMHO it is good now.

> What do you think? The comments, doc and code changes are just proposals and are
> fully open to discussion.

They are great, thank You!

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

17 March, 19:11:27

Hi,

On Mon, Mar 17, 2025 at 08:28:46AM +0100, Jakub Wartak wrote:
> On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply
> > on top of v11 0002 (I just rebased it a bit so that it now applies on top of
> > v12 0002).
> 
> Hey Bertrand,
> 
> all LGTM (good ideas), so here's v13 attached with applied all of that
> (rebased, tested).

Thanks for v13! 

Looking at 0003:

=== 1

+      <entry>NUMA mappings for shared memory allocations</entry>

s/NUMA mappings/NUMA node mappings/ maybe?

=== 2

+  <para>
+   The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
+   assigned allocations made from the server's main shared memory segment.

What about?

"
shows how shared memory allocations in the server's main shared memory segment
are distributed across NUMA nodes" ?

=== 3

+       <structfield>numa_zone_id</structfield> <type>int4</type>

s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
0002.

BTW, I wonder if "node_id" would be better (to match the descriptions...).
If so, would also need to be done in 0002.

=== 4

+      ID of NUMA node

<acronym>NUMA</acronym> node ID? (to be consistent with 0002).

=== 5

+static bool firstUseInBackend = true;

Let's use firstNumaTouch to be consistent with 0002.

=== 6

+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be
unavailable.");;

There is 2 ";" + I think that we should used the same wording as in
pg_buffercache_numa_pages().

=== 7

What about using ERROR instead? (like in pg_buffercache_numa_pages())

=== 8

+       /*
+        * This is for gathering some NUMA statistics. We might be using various
+        * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
+        * various different OS memory pages sizes, so first we need to understand
+        * the OS memory page size before calling move_pages()
+        */
+       os_page_size = pg_numa_get_pagesize();

Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling 
pg_numa_get_pagesize()?

=== 9

+       max_zones = pg_numa_get_max_node();

I think we are mixing "zone" and "node". I think we should standardize on one
and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
vote for node, but zone is fine too if you prefer.

=== 10

+       /*
+        * Preallocate memory all at once without going into details which shared
+        * memory segment is the biggest (technically min s_b can be as low as
+        * 16xBLCKSZ)
+        */

What about? 

"
Allocate memory for page pointers and status based on total shared memory size.
This simplified approach allocates enough space for all pages in shared memory
rather than calculating the exact requirements for each segment.
" instead?

=== 11

+       int                     shm_total_page_count,
+                               shm_ent_page_count,

I think those 2 should be uint64.

=== 12

+       /*
+        * XXX: We are ignoring in NUMA version reporting of the following regions
+        * (compare to pg_get_shmem_allocations() case): 1. output shared memory
+        * allocated but not counted via the shmem index 2. output as-of-yet
+        * unused shared memory
+        */

why XXX?

what about?

"
We are ignoring the following memory regions (as compared to
pg_get_shmem_allocations())....

=== 13

+       page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
+       memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);

maybe we could use palloc0() here?

=== 14

and I realize that we could probably use it in 0002 for os_page_ptrs.

=== 15

I think there is still some multi-lines comments that are missing a period. I
probably also missed some in 0002 during the previous review. I think that's
worth another check.

=== 16

+    * In order to get reliable results we also need to touch memory
+    * pages so that inquiry about NUMA zone doesn't return -2.
+    */

maybe use the same wording as in 0002?

=== 17

The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
but (as for 0002) it looks like we can not avoid it (or at least I don't see
a way to avoid it).

I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
updated.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

18 March, 13:19:32

On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Thanks for v13!

Rebased and fixes inside in the attached v14 (it passes CI too):

> Looking at 0003:
>
> === 1
>
> +      <entry>NUMA mappings for shared memory allocations</entry>
>
> s/NUMA mappings/NUMA node mappings/ maybe?

Done.

> === 2
>
> +  <para>
> +   The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
> +   assigned allocations made from the server's main shared memory segment.
>
> What about?
>
> "
> shows how shared memory allocations in the server's main shared memory segment
> are distributed across NUMA nodes" ?

Done.

> === 3
>
> +       <structfield>numa_zone_id</structfield> <type>int4</type>
>
> s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
> 0002.
>
> BTW, I wonder if "node_id" would be better (to match the descriptions...).
> If so, would also need to be done in 0002.

Somewhat duplicate, please see answer for #9

> === 4
>
> +      ID of NUMA node
>
> <acronym>NUMA</acronym> node ID? (to be consistent with 0002).
>
> === 5
>
> +static bool firstUseInBackend = true;
>
> Let's use firstNumaTouch to be consistent with 0002.

Done.

> === 6
>
> + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be
unavailable.");;
>
> There is 2 ";" + I think that we should used the same wording as in
> pg_buffercache_numa_pages().
>
> === 7
>
> What about using ERROR instead? (like in pg_buffercache_numa_pages())

Both are synced now.

> === 8
>
> +       /*
> +        * This is for gathering some NUMA statistics. We might be using various
> +        * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
> +        * various different OS memory pages sizes, so first we need to understand
> +        * the OS memory page size before calling move_pages()
> +        */
> +       os_page_size = pg_numa_get_pagesize();
>
> Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling
> pg_numa_get_pagesize()?

Done, improved style of the comment there and synced pg_buffercache
one to shmem.c one.

> === 9
>
> +       max_zones = pg_numa_get_max_node();
>
> I think we are mixing "zone" and "node". I think we should standardize on one
> and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> vote for node, but zone is fine too if you prefer.

Given that numa(7) does not use "zone" keyword at all and both
/proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
into zones, we can conclude that "zone" is simply a subdivision within
a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
internal information about zones), therefore we should stay away from
using "zone" within the patch at all, as we are just fetching NUMA
node info. My bad, it's a terminology error on my side from start -
I've probably saw "zone" info in some command output back then when we
had that workshop and started using it and somehow it propagated
through the patchset up to this day... I've adjusted it all and
settled on "numa_node_id" column name.

> === 10
>
> +       /*
> +        * Preallocate memory all at once without going into details which shared
> +        * memory segment is the biggest (technically min s_b can be as low as
> +        * 16xBLCKSZ)
> +        */
>
> What about?
>
> "
> Allocate memory for page pointers and status based on total shared memory size.
> This simplified approach allocates enough space for all pages in shared memory
> rather than calculating the exact requirements for each segment.
> " instead?

Done.

> === 11
>
> +       int                     shm_total_page_count,
> +                               shm_ent_page_count,
>
> I think those 2 should be uint64.

Right...

> === 12
>
> +       /*
> +        * XXX: We are ignoring in NUMA version reporting of the following regions
> +        * (compare to pg_get_shmem_allocations() case): 1. output shared memory
> +        * allocated but not counted via the shmem index 2. output as-of-yet
> +        * unused shared memory
> +        */
>
> why XXX?
>
> what about?
>
> "
> We are ignoring the following memory regions (as compared to
> pg_get_shmem_allocations())....

Fixed , it was apparently leftover of when I was thinking if we should
still report it.

> === 13
>
> +       page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
> +       memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
>
> maybe we could use palloc0() here?

Of course!++

> === 14
>
> and I realize that we could probably use it in 0002 for os_page_ptrs.

Of course!++

> === 15
>
> I think there is still some multi-lines comments that are missing a period. I
> probably also missed some in 0002 during the previous review. I think that's
> worth another check.

Please do such a check, I've tried pgident on all .c files, but I'm
apparently blind to such issues. BTW if patch has anything left that
causes pgident to fix, that is not picked by CI but it is picked by
buildfarm??

> === 16
>
> +    * In order to get reliable results we also need to touch memory
> +    * pages so that inquiry about NUMA zone doesn't return -2.
> +    */
>
> maybe use the same wording as in 0002?

But 0002 used:

  "In order to get reliable results we also need to touch memory pages, so that
  inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated
  pages)"

or are you looking at something different?

> === 17
>
> The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> but (as for 0002) it looks like we can not avoid it (or at least I don't see
> a way to avoid it).

Hm, it's literally debug code. Why would we care so much if it is 2
loops rather than 1? (as stated earlier we need to pack ptrs and then
analyze it)

> I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> updated.

Cool, thanks in advance.


-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

18 March, 17:29:44

Hi,

On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
> On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Thanks for v13!
> 
> Rebased and fixes inside in the attached v14 (it passes CI too):

Thanks!

> > === 9
> >
> > +       max_zones = pg_numa_get_max_node();
> >
> > I think we are mixing "zone" and "node". I think we should standardize on one
> > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> > vote for node, but zone is fine too if you prefer.
> 
> Given that numa(7) does not use "zone" keyword at all and both
> /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
> into zones, we can conclude that "zone" is simply a subdivision within
> a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
> ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
> internal information about zones), therefore we should stay away from
> using "zone" within the patch at all, as we are just fetching NUMA
> node info. My bad, it's a terminology error on my side from start -
> I've probably saw "zone" info in some command output back then when we
> had that workshop and started using it and somehow it propagated
> through the patchset up to this day...

Thanks for the explanation.

> I've adjusted it all and settled on "numa_node_id" column name.

Yeah, I can see, things like:

+  <para>
+   The definitions of the columns exposed are identical to the
+   <structname>pg_buffercache</structname> view, except that this one includes
+   one additional <structfield>numa_node_id</structfield> column as defined in
+   <xref linkend="pgbuffercache-numa-columns"/>.

and like:

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>numa_size</structfield> <type>int4</type>
+      </para>

I think that you re-introduced the "numa_" in the column(s) name that we get
rid (or agreed to) of previously.

I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.

I think "node_id", "size" as column(s) name should be enough.

Or maybe that re-adding "numa_" was intentional?

> > === 15
> >
> > I think there is still some multi-lines comments that are missing a period. I
> > probably also missed some in 0002 during the previous review. I think that's
> > worth another check.
> 
> Please do such a check,

Found much more:


+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)

is missing the period

and

+       /*
+        * Switch context when allocating stuff to be used in later calls
+        */

should be as before, meaning on current HEAD:

/* Switch context when allocating stuff to be used in later calls */

and

+       /*
+        * Return to original context when allocating transient memory
+        */

should be as before, meaning on current HEAD:

/* Return to original context when allocating transient memory */

and

+       /*
+        * Note if the buffer is valid, and has storage created
+        */

should be as before, meaning on current HEAD:

/* Note if the buffer is valid, and has storage created */

and

+               /*
+                * unused for v1.0 callers, but the array is always long enough
+                */


should be as before, meaning on current HEAD:

/* unused for v1.0 callers, but the array is always long enough */

and

+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings

is missing the period

and

+                * To correctly map between them, we need to: - Determine the OS
+                * memory page size - Calculate how many OS pages are used by all
+                * buffer blocks - Calculate how many OS pages are contained within
+                * each database block

is missing the period (2 times as this comment appears in 0002 and 0003)

and

+               /*
+                * If we ever get 0xff back from kernel inquiry, then we probably have
+                * bug in our buffers to OS page mapping code here


is missing the period (2 times as this comment appears in 0002 and 0003)

and

+       /*
+        * We are ignoring the following memory regions (as compared to
+        * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+        * counted via the shmem index 2. output as-of-yet unused shared memory

is missing the period

> I've tried pgident on all .c files, but I'm
> apparently blind to such issues.

I don't think pgident would report missing period.

> BTW if patch has anything left that
> causes pgident to fix, that is not picked by CI but it is picked by
> buildfarm??

I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.

> > === 16
> >
> > +    * In order to get reliable results we also need to touch memory
> > +    * pages so that inquiry about NUMA zone doesn't return -2.
> > +    */
> >
> > maybe use the same wording as in 0002?
> 
> But 0002 used:
> 
>   "In order to get reliable results we also need to touch memory pages, so that
>   inquiry about NUMA zone doesn't return -2 (which indicates
> unmapped/unallocated
>   pages)"
> 
> or are you looking at something different?

Nope, I meant to say that it could make sense to have the exact same comment.

> > === 17
> >
> > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> > but (as for 0002) it looks like we can not avoid it (or at least I don't see
> > a way to avoid it).
> 
> Hm, it's literally debug code. Why would we care so much if it is 2
> loops rather than 1? (as stated earlier we need to pack ptrs and then
> analyze it)

Yeah, but if we could just loop one time I'm pretty sure we'd have done that.

> > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> > updated.
> :w

> Cool, thanks in advance.

0001 looks in a good shape from my point of view.

For 0002:

=== 1

I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.

Maybe something like:

0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above

We did it that way in c2a50ac678e and ff7c40d7fd6 for example.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

19 March, 12:06:09

On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

Hi! v15 attached, rebased, CI-tested, all fixed incorporated.

> > I've adjusted it all and settled on "numa_node_id" column name.
>
[...]
> I think that we can get rid of the "numa_" stuff in column(s) name as
> the column(s) are part of "numa" relation views/output anyway.
[...]

Done, you are probably right (it was done to keep consistency between
those two views probably), I'm just not that strongly attached to the
naming things.

> > Please do such a check,
>
> Found much more:
>
[.. 9 issues with missing dots at the end of sentences in comments +
fixes to comment structure in relation to HEAD..]

All fixed.

> > BTW if patch has anything left that
> > causes pgident to fix, that is not picked by CI but it is picked by
> > buildfarm??
>
> I think it has to be done manually before each commit and that this is anyway
> done at least once per release cycle.

OK, thanks for clarification.

[..]
> >
> > But 0002 used:
> >
> >   "In order to get reliable results we also need to touch memory pages, so that
> >   inquiry about NUMA zone doesn't return -2 (which indicates
> > unmapped/unallocated
> >   pages)"
> >
> > or are you looking at something different?
>
> Nope, I meant to say that it could make sense to have the exact same comment.

Synced those two.

[..]
>
> 0001 looks in a good shape from my point of view.

Cool!

> For 0002:
>
> === 1
>
> I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
> deserve their own patch. That would ease the review for the "real" numa stuff.
>

Done, 0001+0002 alone passes the meson test.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Nazir Bilal Yavuz

Date:

27 March, 14:30:48

Hi,

Thank you for working on this!

On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi! v15 attached, rebased, CI-tested, all fixed incorporated.

This needs to be rebased after 8eadd5c73c.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

27 March, 16:02:03

On Thu, Mar 27, 2025 at 12:31 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Wed, 19 Mar 2025 at 12:06, Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi! v15 attached, rebased, CI-tested, all fixed incorporated.
>
> This needs to be rebased after 8eadd5c73c.

Hi Nazir, thanks for spotting! I've not connected the dots with AIO
going in and my libnuma dependency blowing up... Attached is rebased
v16 that passed my CI run.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Álvaro Herrera

Date:

27 March, 16:15:42

Hello

I think you should remove numa_warn() and numa_error() from 0001.
AFAICS they are dead code (even with all your patches applied), and
furthermore would get you in trouble regarding memory allocation because
src/port is not allowed to use palloc et al.  If you wanted to keep them
you'd have to have them in src/common, but looking at the rest of the
code in that patch, ISTM src/port is the right place for it.  If in the
future you discover that you do need numa_warn(), you can create a
src/common/ file for it then.

Is pg_buffercache really the best place for these NUMA introspection
routines?  I'm not saying that it isn't, maybe we're okay with that
(particularly if we can avoid duplicated code), but it seems a bit weird
to me.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo."                 (Augusto Pinochet a una corte de justicia)

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

27 March, 16:40:29

Hi,

On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
>    setup_additional_packages_script: |
> -    #apt-get update
> -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> +    apt-get update
> +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> +      libnuma1 \
> +      libnuma-dev

I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.

> +#
> +# libnuma
> +#
> +AC_MSG_CHECKING([whether to build with libnuma support])
> +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],

Most other dependencies say "build with libxyz ..."


> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + *      Basic NUMA portability routines
> + *
> + *
> + * Copyright (c) 2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *     src/include/port/pg_numa.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NUMA_H
> +#define PG_NUMA_H
> +
> +#include "c.h"
> +#include "postgres.h"

Headers should never include either of those headers.  Nor should .c files
include both.


> @@ -200,6 +200,8 @@ pgxs_empty = [
>  
>    'ICU_LIBS',
>  
> +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> +
>    'LIBURING_CFLAGS', 'LIBURING_LIBS',
>  ]

Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?


> +Size
> +pg_numa_get_pagesize(void)
> +{
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +
> +    return os_page_size;
> +}

Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?


> +#ifndef FRONTEND
> +/*
> + * XXX: not really tested as there is no way to trigger this in our
> + * current usage of libnuma.
> + *
> + * The libnuma built-in code can be seen here:
> + * https://github.com/numactl/numactl/blob/master/libnuma.c
> + *
> + */
> +void
> +numa_warn(int num, char *fmt,...)
> +{
> +    va_list        ap;
> +    int            olde = errno;
> +    int            needed;
> +    StringInfoData msg;
> +
> +    initStringInfo(&msg);
> +
> +    va_start(ap, fmt);
> +    needed = appendStringInfoVA(&msg, fmt, ap);
> +    va_end(ap);
> +    if (needed > 0)
> +    {
> +        enlargeStringInfo(&msg, needed);
> +        va_start(ap, fmt);
> +        appendStringInfoVA(&msg, fmt, ap);
> +        va_end(ap);
> +    }
> +
> +    ereport(WARNING,
> +            (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
> +             errmsg_internal("libnuma: WARNING: %s", msg.data)));

I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.

> +Size
> +pg_numa_get_pagesize(void)
> +{
> +#ifndef WIN32
> +    Size        os_page_size = sysconf(_SC_PAGESIZE);
> +#else
> +    Size        os_page_size;
> +    SYSTEM_INFO sysinfo;
> +
> +    GetSystemInfo(&sysinfo);
> +    os_page_size = sysinfo.dwPageSize;
> +#endif
> +    if (huge_pages_status == HUGE_PAGES_ON)
> +        GetHugePageSize(&os_page_size, NULL);
> +    return os_page_size;
> +}


I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.


> From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Wed, 19 Mar 2025 09:34:56 +0100
> Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
>  primary function to separate functions.
> 
> This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> and get_buffercache_tuple() that help fill result tuplestores based on the
> buffercache contents. This will be used in a follow-up commit that implements
> NUMA observability in pg_buffercache.
> 
> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
>  1 file changed, 178 insertions(+), 139 deletions(-)
> 
> diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> index 62602af1775..86e0b8afe01 100644
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -14,7 +14,6 @@
>  #include "storage/buf_internals.h"
>  #include "storage/bufmgr.h"
>  
> -
>  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM    8

Independent change.


>  #define NUM_BUFFERCACHE_PAGES_ELEM    9
>  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
>  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
>  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
>  
> -Datum
> -pg_buffercache_pages(PG_FUNCTION_ARGS)
> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> + * memory mapping inquiries to display NUMA node information for each buffer.
> + */

If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?



> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * Build buffer cache information for a single buffer.
> + */
> +static void
> +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> +{

This isn't really building a tuple tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.


> +    BufferDesc *bufHdr;
> +    uint32        buf_state;
> +
> +    bufHdr = GetBufferDescriptor(record_id);
> +    /* Lock each buffer header before inspecting. */
> +    buf_state = LockBufHdr(bufHdr);
> +
> +    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> +    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> +    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> +    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> +    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> +    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> +    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> +    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);

As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.



> +static Datum
> +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> +{
> +    Datum        values[NUM_BUFFERCACHE_PAGES_ELEM];
> +    bool        nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> +    HeapTuple    tuple;
> +
> +    values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> +    nulls[0] = false;
> +
> +    /*
> +     * Set all fields except the bufferid to null if the buffer is unused or
> +     * not valid.
> +     */
> +    if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> +        fctx->record[record_id].isvalid == false)
> +    {
> +        nulls[1] = true;
> +        nulls[2] = true;
> +        nulls[3] = true;
> +        nulls[4] = true;
> +        nulls[5] = true;
> +        nulls[6] = true;
> +        nulls[7] = true;
> +
> +        /* unused for v1.0 callers, but the array is always long enough */
> +        nulls[8] = true;

I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.


> +    }
> +    else
> +    {
> +        values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> +        nulls[1] = false;
> +        values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> +        nulls[2] = false;
> +        values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> +        nulls[3] = false;
> +        values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> +        nulls[4] = false;
> +        values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> +        nulls[5] = false;
> +        values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> +        nulls[6] = false;
> +        values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> +        nulls[7] = false;

Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable.  Unfortunately that'd probably
be best done in one more commit ahead of the rest of the this one...


> @@ -0,0 +1,28 @@
> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif

You could avoid the need for an alternative output file if you instead made
the queries do something like
  SELECT NOT pg_numa_available() OR count(*) ...



> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,42 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the new function.
> +DROP VIEW pg_buffercache;
> +DROP FUNCTION pg_buffercache_pages();

I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.


(Sorry, ran out of time / energy here, i had originally just wanted to comment
on the apt-get thing in the tests)


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

31 March, 12:27:42

On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Hello

Good morning :)

> I think you should remove numa_warn() and numa_error() from 0001.
> AFAICS they are dead code (even with all your patches applied), and
> furthermore would get you in trouble regarding memory allocation because
> src/port is not allowed to use palloc et al.  If you wanted to keep them
> you'd have to have them in src/common, but looking at the rest of the
> code in that patch, ISTM src/port is the right place for it.  If in the
> future you discover that you do need numa_warn(), you can create a
> src/common/ file for it then.

Understood, trimmed it out from the patch. I'm going to respond also
within minutes to Andres' review and I'm going to post a new version
(v17) there.

> Is pg_buffercache really the best place for these NUMA introspection
> routines?  I'm not saying that it isn't, maybe we're okay with that
> (particularly if we can avoid duplicated code), but it seems a bit weird
> to me.

I think it is, because as I understand, Andres wanted to have
observability per single database *page* and to avoid code duplication
we are just putting it there (it's natural fit). Imagine looking up an
8kB root btree memory page being hit hard from CPUs on other NUMA
nodes (this just gives ability to see that, but you could of course
also get aggregation to get e.g. NUMA node balance for single relation
and so on).

-J.

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

31 March, 12:27:50

On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

Hi Andres,

> On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
> >    setup_additional_packages_script: |
> > -    #apt-get update
> > -    #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> > +    apt-get update
> > +    DEBIAN_FRONTEND=noninteractive apt-get -y install \
> > +      libnuma1 \
> > +      libnuma-dev
>
> I think libnuma is installed on the relevant platforms, so you shouldn't need
> to install it manually.

Fixed. Right, you mentioned this earlier, I just didnt know when it went online.

> > +#
> > +# libnuma
> > +#
> > +AC_MSG_CHECKING([whether to build with libnuma support])
> > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
>
> Most other dependencies say "build with libxyz ..."

Done.

> > + * pg_numa.h
[..]
> > +#include "c.h"
> > +#include "postgres.h"
>
> Headers should never include either of those headers.  Nor should .c files
> include both.

Fixed, huh, I've found explanation:
https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us

> > @@ -200,6 +200,8 @@ pgxs_empty = [
> >
> >    'ICU_LIBS',
> >
> > +  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> > +
> >    'LIBURING_CFLAGS', 'LIBURING_LIBS',
> >  ]
>
> Maybe I am missing something, but are you actually defining and using those
> LIBNUMA_* vars anywhere?

OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)`
in configure.ac that would set those *FLAGS. I'm little bit loss
dependent in how to gurantee that meson is synced with autoconf as per
project requirements - trying to use past commits as reference, but I
still could get something wrong here (especially in
src/Makefile.global.in)

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> Should this have a comment or an assertion that it can only be used after
> shared memory startup? Because before that huge_pages_status won't be
> meaningful?

Added both.

> > +#ifndef FRONTEND
> > +/*
> > + * XXX: not really tested as there is no way to trigger this in our
> > + * current usage of libnuma.
> > + *
> > + * The libnuma built-in code can be seen here:
> > + * https://github.com/numactl/numactl/blob/master/libnuma.c
> > + *
> > + */
> > +void
> > +numa_warn(int num, char *fmt,...)
[..]
>
> I think you would at least have to hold interrupts across this, as
> ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
> out of libnuma in case an interrupt has arrived.

On request by Alvaro I've removed it as that code is simply
unreachable/untestable, but point taken - I'm planning to re-add this
with holding interrupting in future when we start using proper
numa_interleave() one day. Anyway, please let me know if you want
still to keep it as deadcode. BTW for context , why this is deadcode
is explained in the latter part of [1] message (TL;DR; unless we use
pining/numa_interleave/local_alloc() we probably never reach that
warnings/error handlers).

"Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine..."

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> I would probably implement this once, outside of the big ifdef, with one more
> ifdef inside, given that you're sharing the same implementation.

Done.

> > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> > From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> > Date: Wed, 19 Mar 2025 09:34:56 +0100
> > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
> >  primary function to separate functions.
> >
> > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> > and get_buffercache_tuple() that help fill result tuplestores based on the
> > buffercache contents. This will be used in a follow-up commit that implements
> > NUMA observability in pg_buffercache.
> >
> > Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> > ---
> >  contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
> >  1 file changed, 178 insertions(+), 139 deletions(-)
> >
> > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> > index 62602af1775..86e0b8afe01 100644
> > --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> > @@ -14,7 +14,6 @@
> >  #include "storage/buf_internals.h"
> >  #include "storage/bufmgr.h"
> >
> > -
> >  #define NUM_BUFFERCACHE_PAGES_MIN_ELEM       8
>
> Independent change.

Fixed.

> >  #define NUM_BUFFERCACHE_PAGES_ELEM   9
> >  #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
> >  PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
> >  PG_FUNCTION_INFO_V1(pg_buffercache_evict);
> >
> > -Datum
> > -pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> > + * memory mapping inquiries to display NUMA node information for each buffer.
> > + */
>
> If it's a helper routine it's probably not identical to
> pg_buffercache_numa_pages()?

Of course not, fixed by removing that comment.

> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * Build buffer cache information for a single buffer.
> > + */
> > +static void
> > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
>
> This isn't really building a tuple tuple? Seems somewhat confusing, because
> get_buffercache_tuple() does actually build one.

s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g , unless
someone wants to come with better name.

> > +     BufferDesc *bufHdr;
> > +     uint32          buf_state;
> > +
> > +     bufHdr = GetBufferDescriptor(record_id);
> > +     /* Lock each buffer header before inspecting. */
> > +     buf_state = LockBufHdr(bufHdr);
> > +
> > +     fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> > +     fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> > +     fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> > +     fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> > +     fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> > +     fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> > +     fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> > +     fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
>
> As above, I think this would be more readable if you put
> fctx->record[record_id] into a local var.

Done.

> > +static Datum
> > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
> > +     Datum           values[NUM_BUFFERCACHE_PAGES_ELEM];
> > +     bool            nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> > +     HeapTuple       tuple;
> > +
> > +     values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> > +     nulls[0] = false;
> > +
> > +     /*
> > +      * Set all fields except the bufferid to null if the buffer is unused or
> > +      * not valid.
> > +      */
> > +     if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> > +             fctx->record[record_id].isvalid == false)
> > +     {
> > +             nulls[1] = true;
> > +             nulls[2] = true;
> > +             nulls[3] = true;
> > +             nulls[4] = true;
> > +             nulls[5] = true;
> > +             nulls[6] = true;
> > +             nulls[7] = true;
> > +
> > +             /* unused for v1.0 callers, but the array is always long enough */
> > +             nulls[8] = true;
>
> I'd probably just memset the entire nulls array to either true or false,
> instead of doing it one-by-one.

Done.

> > +     }
> > +     else
> > +     {
> > +             values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> > +             nulls[1] = false;
> > +             values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> > +             nulls[2] = false;
> > +             values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> > +             nulls[3] = false;
> > +             values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> > +             nulls[4] = false;
> > +             values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> > +             nulls[5] = false;
> > +             values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> > +             nulls[6] = false;
> > +             values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> > +             nulls[7] = false;
>
> Seems like it would end up a lot more readable if you put
> fctx->record[record_id] into a local variable.  Unfortunately that'd probably
> be best done in one more commit ahead of the rest of the this one...

Done, i've put it those refactorig changes into the commit already
dedicated only for a refactor. For the record Bertrand also asked for
something about this, but I was somehow afraid to touch Tom's code.

> > @@ -0,0 +1,28 @@
> > +SELECT NOT(pg_numa_available()) AS skip_test \gset
> > +\if :skip_test
> > +\quit
> > +\endif
>
> You could avoid the need for an alternative output file if you instead made
> the queries do something like
>   SELECT NOT pg_numa_available() OR count(*) ...

ITEM REMAINING: Is this for the future or can it stay like that? I
don't have a hard opinion on this, but I've already wasted lots of
cycles to discover that one can have those ".1" alternative expected
result files.

> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,42 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the new function.
> > +DROP VIEW pg_buffercache;
> > +DROP FUNCTION pg_buffercache_pages();
>
> I don't think we can just drop a view in the upgrade script. That will fail if
> anybody created a view depending on pg_buffercache.

Ugh, fixed, thanks. That must have been some leftover (we later do
CREATE OR REPLACE those anyway).

> (Sorry, ran out of time / energy here, i had originally just wanted to comment
> on the apt-get thing in the tests)

Thanks! AIO intensifies ... :)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

31 March, 17:59:13

Hi,

On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
> On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > +Size
> > > +pg_numa_get_pagesize(void)
> [..]
> >
> > Should this have a comment or an assertion that it can only be used after
> > shared memory startup? Because before that huge_pages_status won't be
> > meaningful?
> 
> Added both.

Thanks for the updated version!

+       Assert(IsUnderPostmaster);

I wonder if that would make more sense to add an assertion on huge_pages_status
and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
CreateSharedMemoryAndSemaphores()).

=== About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch

Once applied I can see mention to pg_buffercache_numa_pages() while it
only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch.

I think pg_buffercache_numa_pages() should not be mentioned before it's actually
implemented.

=== 1

+               bufRecord->isvalid == false)
        {
                int                     i;

-               funcctx = SRF_FIRSTCALL_INIT();
-
-               /* Switch context when allocating stuff to be used in later calls */
-               oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
-
-               /* Create a user function context for cross-call persistence */
-               fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+               for (i = 1; i <= 9; i++)
+                       nulls[i] = true; 

"i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
is increased to 10).

In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).

That could also make sense to remove the loop and use memset() that way:

"
memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
"

instead. It's done that way in some other places (hbafuncs.c for example).

=== 2

-               if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
-                       TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
-                                                          INT4OID, -1, 0);

+       if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
+               TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+                                                  INT4OID, -1, 0);

I think we should not change the "expected_tupledesc->natts" check here until
v17-0003 is applied.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

01 April, 13:56:06

On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Hi,

Hi Bertrand, happy to see you back, thanks for review and here's v18
attached (an ideal fit for PG18 ;))

> On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote:
> > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > > +Size
> > > > +pg_numa_get_pagesize(void)
> > [..]
> > >
> > > Should this have a comment or an assertion that it can only be used after
> > > shared memory startup? Because before that huge_pages_status won't be
> > > meaningful?
> >
> > Added both.
>
> Thanks for the updated version!
>
> +       Assert(IsUnderPostmaster);
>
> I wonder if that would make more sense to add an assertion on huge_pages_status
> and HUGE_PAGES_UNKNOWN instead (more or less as it is done in
> CreateSharedMemoryAndSemaphores()).

Ok, let's have both just in case (this status is by default
initialized to _UNKNOWN, so I hope you haven't had in mind using
GetConfigOption() as this would need guc.h in port?)

> === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch
[..]
> I think pg_buffercache_numa_pages() should not be mentioned before it's actually
> implemented.

Right, fixed.

> === 1
>
[..]
> "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM
> is increased to 10).
>
> In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM).
>
> That could also make sense to remove the loop and use memset() that way:
>
> "
> memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
> "
>
> instead. It's done that way in some other places (hbafuncs.c for example).

Ouch, good catch.

> === 2
>
> -               if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
> -                       TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
> -                                                          INT4OID, -1, 0);
>
> +       if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
> +               TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
> +                                                  INT4OID, -1, 0);
>
> I think we should not change the "expected_tupledesc->natts" check here until
> v17-0003 is applied.

Right, I've moved that into 003 where it belongs and now 002 has no
single NUMA reference. I've thrown 0001+0002 alone onto CI and it
passed too.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

01 April, 18:13:27

Hi Jakub,

On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
> On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> 
> > Hi,
> 
> Hi Bertrand, happy to see you back, thanks for review and here's v18
> attached (an ideal fit for PG18 ;))

Thanks for the new version!

=== About v18-0002

It looks in a good shape to me. The helper's name might still be debatable
though.

I just have 2 comments:

=== 1

+       if (bufRecord->blocknum == InvalidBlockNumber ||
+               bufRecord->isvalid == false)

It seems to me that this check could now fit in one line.

=== 2

+       {
                SRF_RETURN_DONE(funcctx);
+       }

Extra parentheses are not needed.

=== About v18-0003

=== 3

I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
only the necessary changes when updating from 1.5. It means that it should "only"
create the new objects (views and functions in our case) that come in v18-0003 
and grant appropriate privs.

Also it should mention:

"
\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
"
and not:

"
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
"

The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.

=== 4

+               for (i = 0; i < NBuffers; i++)
+               {
+                       int                     blk2page = (int) i * pages_per_blk;
+

I think that should be:

int blk2page = (int) (i * pages_per_blk);

=== About v18-0004

=== 5

When running:

select c.name, c.size as num_size, s.size as shmem_size
from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocationss

where c.name = s.name;

I can see:

- pg_shmem_numa_allocations reporting a lot of times the same size
- pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size

Do you observe the same?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

01 April, 23:17:16

Hi,

I've spent a bit of time reviewing this. In general I haven't found
anything I'd call a bug, but here's a couple comments for v18 ... Most
of this is in separate "review" commits, with a couple exceptions.

1) Please update the commit messages, with proper formatting, etc. I
tried to do that in the attached v19, but please go through that, add
relevant details, update list of reviewers, etc. The subject should not
be overly long, etc.

2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
use just "libnuma support" similar to other libraries.

3) I don't think we need pg_numa.h to have this:

extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);

AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
would it be necessary here? It's enough to have a prototype in .c file.

4) Improved .sgml to have acronym/productname in a couple places.

5) I don't think the comment for pg_buffercache_init_entries() is very
useful. That it's helper for pg_buffercache_pages() tells me nothing
about how to use it, what the arguments are, etc.

6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
comment too. I mean, there's no info about what the arguments are, which
arguments are input or output, etc. And it only discussed one option
(block page < memory page), but what happens in the other case? The
formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
saying it's wrong).

However, it seems rather suspicious that pages_per_blk is calculated as
float, and pg_buffercache_numa_prepare_ptrs() then does this:

    for (size_t j = 0; j < pages_per_blk; j++)
    { ... }

I mean, isn't this vulnerable to rounding errors, which might trigger
some weird behavior? If not, it'd be good to add a comment why this is
fine, it confuses me a lot. I personally would probably prefer doing
just integer arithmetic here.


7) This para in the docs seems self-contradictory:

 <para>
  The <function>pg_buffercache_numa_pages()</function> provides the same
information
  as <function>pg_buffercache_pages()</function> but is slower because
it also
  provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
  The <structname>pg_buffercache_numa</structname> view wraps the
function for
  convenient use.
 </para>

I mean, "provides the same information, but is slower because it
provides different information" is strange. I understand the point, but
maybe rephrase somehow?

8) Why is pg_numa_available() marked as volatile? Should not change in a
running cluster, no?

9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
let's not do that.

   <sect2 id="pgbuffercache-pg-buffercache_numa">

10) I think it'd be good to mention/explain why move_pages is used
instead of get_mempolicy - more efficient with batching, etc. This would
be useful both in the commit message and before the move_pages call (and
in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
pointers like this etc.).

11) This could use UINT64_FORMAT, instead of a cast:

    elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
pages_per_blk=%.2f",
        (unsigned long) os_page_count, os_page_size, pages_per_blk);


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

02 April, 17:45:53

On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,
>
> On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
> > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > > Hi,
> >
> > Hi Bertrand, happy to see you back, thanks for review and here's v18
> > attached (an ideal fit for PG18 ;))
>
> Thanks for the new version!
>

Hi Bertrand,

I'll be attaching v20 when responding to the follow-up review by Tomas
to avoid double-sending this in the list, but review findings from
here are fixed in v20.

> === About v18-0002
>
> It looks in a good shape to me. The helper's name might still be debatable
> though.
>
> I just have 2 comments:
>
> === 1
>
> +       if (bufRecord->blocknum == InvalidBlockNumber ||
> +               bufRecord->isvalid == false)
>
> It seems to me that this check could now fit in one line.

OK.

> === 2
>
> +       {
>                 SRF_RETURN_DONE(funcctx);
> +       }
>
> Extra parentheses are not needed.

This also should be fixed in v20.

>
> === About v18-0003
>
> === 3
>
> I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
> only the necessary changes when updating from 1.5. It means that it should "only"
> create the new objects (views and functions in our case) that come in v18-0003
> and grant appropriate privs.
>
> Also it should mention:
>
> "
> \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
> "
> and not:
>
> "
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> "
>
> The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.

Hm, good find, it might be leftover from first attempts where we were
aiming just for adding column numa_node_id to pg_buffercache_pages()
(rather than adding new view), all hopefully fixed.

> === 4
>
> +               for (i = 0; i < NBuffers; i++)
> +               {
> +                       int                     blk2page = (int) i * pages_per_blk;
> +
>
> I think that should be:
>
> int blk2page = (int) (i * pages_per_blk);

OK, but I still fail to grasp why pg_indent doesnt fix this stuff on
it's own... I believe orginal ident, would fix this on it's own?

> === About v18-0004
>
> === 5
>
> When running:
>
> select c.name, c.size as num_size, s.size as shmem_size
> from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocationss 
> where c.name = s.name;
>
> I can see:
>
> - pg_shmem_numa_allocations reporting a lot of times the same size
> - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
>
> Do you observe the same?
>

Yes, it is actually by design: the pg_shmem_allocations.size is sum of
page sizes not size of struct, e.g. with "order by 3 desc":

                      name                      | num_size  | shmem_size
------------------------------------------------+-----------+------------
 WaitEventCustomCounterData                     |      4096 |          8
 Archiver Data                                  |      4096 |          8
 SerialControlData                              |      4096 |         16

I was even wondering if it does make sense to output "shm pointer
address" (or at least offset) there to see which shm structures are on
the same page, but we have the same pg_shm_allocations already.

-J.

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

02 April, 17:46:53

On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I've spent a bit of time reviewing this. In general I haven't found
> anything I'd call a bug, but here's a couple comments for v18 ... Most
> of this is in separate "review" commits, with a couple exceptions.

Hi, thank you very much for help on this, yes I did not anticipate
this patch to organically grow like that...
I've squashed those review findings into v20 and provided answers for
the "review:".

> 1) Please update the commit messages, with proper formatting, etc. I
> tried to do that in the attached v19, but please go through that, add
> relevant details, update list of reviewers, etc. The subject should not
> be overly long, etc.

Fixed by you.

> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
> use just "libnuma support" similar to other libraries.

Fixed by you.

> 3) I don't think we need pg_numa.h to have this:
>
> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>
> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
> would it be necessary here? It's enough to have a prototype in .c file.

Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.

> 4) Improved .sgml to have acronym/productname in a couple places.

Great.

> 5) I don't think the comment for pg_buffercache_init_entries() is very
> useful. That it's helper for pg_buffercache_pages() tells me nothing
> about how to use it, what the arguments are, etc.

I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much of
that(?)

> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
> comment too. I mean, there's no info about what the arguments are, which
> arguments are input or output, etc. And it only discussed one option
> (block page < memory page), but what happens in the other case? The
> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
> saying it's wrong).
>
> However, it seems rather suspicious that pages_per_blk is calculated as
> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>
>     for (size_t j = 0; j < pages_per_blk; j++)
>     { ... }
>
> I mean, isn't this vulnerable to rounding errors, which might trigger
> some weird behavior? If not, it'd be good to add a comment why this is
> fine, it confuses me a lot. I personally would probably prefer doing
> just integer arithmetic here.

Please bear with me: If you set client_min_messages to debug1 and then
pg_buffercache_numa will dump:
a) without HP,  DEBUG:  NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG:  NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906

so we need to be agile to support two cases as you mention (BLCKSZ >
PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
4kB..1GB, thus we can get in that float the following sample values:
BLCKSZ pagesize
2kB    4kB                    = 0.5
2kB    2048kb                 = .0009765625
2kB    1024*1024kb # 1GB      = .0000019073486328125 # worst-case?
8kB    4kB                    = 2
8kB    2048kb                 = .003906250 # example from above (x86_64, 2M HP)
8kB    1024*1024kb # 1GB      = .00000762939453
32kB   4kB                    = 8
32kB   2048kb                 = .0156250
32kB   1024*1024kb # 1GB      = .000030517578125

So that loop:
    for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and launches in both cases. I've somehow failed to
somehow come up with integer-based math and generic code for this
(without special cases which seem to be no-go here?). So, that loop
then will:
a) launch many times to support BLCKSZ > pagesize, that is when single
DB block spans multiple memory pages
b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
example above)

Loop touches && stores addresses into os_page_ptrs[] as input to this
one big move_pages(2) query. So we basically ask for all memory pages
for NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to
Buffers, if anywhere it could be a problem here.

So let's say we have s_b=4TB (it wont work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
which multiplied by 2 = INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)

Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0 .

OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
    float f = 2097152 * 0.003906250;
 under clang  -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either it is:
- we somehow rewrite all of the core arithmetics here to integer?
- or simply go with doubles just to be sure? I went with doubles in
v20, comments explaining are not there yet.

> 7) This para in the docs seems self-contradictory:
>
>  <para>
>   The <function>pg_buffercache_numa_pages()</function> provides the same
> information
>   as <function>pg_buffercache_pages()</function> but is slower because
> it also
>   provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
>   The <structname>pg_buffercache_numa</structname> view wraps the
> function for
>   convenient use.
>  </para>
>
> I mean, "provides the same information, but is slower because it
> provides different information" is strange. I understand the point, but
> maybe rephrase somehow?

Oh my... yes, now it looks way better.

> 8) Why is pg_numa_available() marked as volatile? Should not change in a
> running cluster, no?

No it shouldn't, good find, made it 's'table.

> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
> let's not do that.
>
>    <sect2 id="pgbuffercache-pg-buffercache_numa">

Fixed.

> 10) I think it'd be good to mention/explain why move_pages is used
> instead of get_mempolicy - more efficient with batching, etc. This would
> be useful both in the commit message and before the move_pages call

Ok, added in 0001.

> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
> pointers like this etc.).

Added reference to that earlier new comment here too in 0003.

> 11) This could use UINT64_FORMAT, instead of a cast:
>
>     elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
> pages_per_blk=%.2f",
>         (unsigned long) os_page_count, os_page_size, pages_per_blk);

Done.

12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"

OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.

13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"

OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1

Should we add such functionality?


-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

02 April, 18:27:07

Hi Jakub,

On Wed, Apr 02, 2025 at 04:45:53PM +0200, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > === 4
> >
> > +               for (i = 0; i < NBuffers; i++)
> > +               {
> > +                       int                     blk2page = (int) i * pages_per_blk;
> > +
> >
> > I think that should be:
> >
> > int blk2page = (int) (i * pages_per_blk);
> 
> OK, but I still fail to grasp why pg_indent doesnt fix this stuff on
> it's own... I believe orginal ident, would fix this on it's own?

My comment was not about indention but about the fact that I think that the
casting is not a the right place. I think that's the result of the multiplication
that we want to be casted (cast operator has higher precedence than Multiplication
operator).

> >
> > select c.name, c.size as num_size, s.size as shmem_size
> > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocationss
 
> > where c.name = s.name;
> >
> > I can see:
> >
> > - pg_shmem_numa_allocations reporting a lot of times the same size
> > - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
> >
> > Do you observe the same?
> >
> 
> Yes, it is actually by design: the pg_shmem_allocations.size is sum of
> page sizes not size of struct,

Ok, but then does it make sense to see some num_size < shmem_size?

postgres=# select c.name, c.size as num_size, s.size as shmem_size
 from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocationss
 
 where c.name = s.name and s.size > c.size;
     name      | num_size  | shmem_size
---------------+-----------+------------
 XLOG Ctl      |   4194304 |    4208200
 Buffer Blocks | 134217728 |  134221824
 AioHandleIOV  |   2097152 |    2850816
(3 rows)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

02 April, 19:40:54


On 4/2/25 16:46, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've spent a bit of time reviewing this. In general I haven't found
>> anything I'd call a bug, but here's a couple comments for v18 ... Most
>> of this is in separate "review" commits, with a couple exceptions.
> 
> Hi, thank you very much for help on this, yes I did not anticipate
> this patch to organically grow like that...
> I've squashed those review findings into v20 and provided answers for
> the "review:".
> 

Thanks.

>> 1) Please update the commit messages, with proper formatting, etc. I
>> tried to do that in the attached v19, but please go through that, add
>> relevant details, update list of reviewers, etc. The subject should not
>> be overly long, etc.
> 
> Fixed by you.
> 

OK, so you agree the commit messages are complete / correct?

>> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
>> use just "libnuma support" similar to other libraries.
> 
> Fixed by you.
> 

OK. FWIW if you disagree with some of my proposed changes, feel free to
push back. I'm sure some may be more a matter of personal preference.

>> 3) I don't think we need pg_numa.h to have this:
>>
>> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>>
>> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
>> would it be necessary here? It's enough to have a prototype in .c file.
> 
> Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
> 
>> 4) Improved .sgml to have acronym/productname in a couple places.
> 
> Great.
> 
>> 5) I don't think the comment for pg_buffercache_init_entries() is very
>> useful. That it's helper for pg_buffercache_pages() tells me nothing
>> about how to use it, what the arguments are, etc.
> 
> I've added an explanation (in 0003 though), so that this is covered.
> I've always assumed that 'static' functions don't need that much of
> that(?)
> 

I think that's mostly true - the (non-static) functions that are part of
the API for a module need better / more detailed docs. But that doesn't
mean static functions shouldn't have at least basic docs (unless the
function is trivial / obvious, but I don't think that's the case here).

If I happen to work a pg_buffercache patch in a couple months, I'll
still need to understand what the function does. It won't save me that
I'm working on the same file ...

I'm not saying this needs a detailed docs, but "helper for X" adds very
little information - I can easily see where it's called from, right?

>> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
>> comment too. I mean, there's no info about what the arguments are, which
>> arguments are input or output, etc. And it only discussed one option
>> (block page < memory page), but what happens in the other case? The
>> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
>> saying it's wrong).
>>
>> However, it seems rather suspicious that pages_per_blk is calculated as
>> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>>
>>     for (size_t j = 0; j < pages_per_blk; j++)
>>     { ... }
>>
>> I mean, isn't this vulnerable to rounding errors, which might trigger
>> some weird behavior? If not, it'd be good to add a comment why this is
>> fine, it confuses me a lot. I personally would probably prefer doing
>> just integer arithmetic here.
> 
> Please bear with me: If you set client_min_messages to debug1 and then
> pg_buffercache_numa will dump:
> a) without HP,  DEBUG:  NUMA: os_page_count=32768 os_page_size=4096
> pages_per_blk=2.00
> b) with HP (2M) DEBUG:  NUMA: os_page_count=64 os_page_size=2097152
> pages_per_blk=0.003906
> 
> so we need to be agile to support two cases as you mention (BLCKSZ >
> PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
> 4kB..1GB, thus we can get in that float the following sample values:
> BLCKSZ pagesize
> 2kB    4kB                    = 0.5
> 2kB    2048kb                 = .0009765625
> 2kB    1024*1024kb # 1GB      = .0000019073486328125 # worst-case?
> 8kB    4kB                    = 2
> 8kB    2048kb                 = .003906250 # example from above (x86_64, 2M HP)
> 8kB    1024*1024kb # 1GB      = .00000762939453
> 32kB   4kB                    = 8
> 32kB   2048kb                 = .0156250
> 32kB   1024*1024kb # 1GB      = .000030517578125
> 
> So that loop:
>     for (size_t j = 0; j < pages_per_blk; j++)
> is quite generic and launches in both cases. I've somehow failed to
> somehow come up with integer-based math and generic code for this
> (without special cases which seem to be no-go here?). So, that loop
> then will:
> a) launch many times to support BLCKSZ > pagesize, that is when single
> DB block spans multiple memory pages
> b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
> example above)
> 

Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
confusing and difficult to reason about ...

I agree we don't want special cases for each possible combination of
page sizes (I'm not sure we even know all the combinations). What I was
thinking about is two branches, one for (block >= page) and another for
(block < page). AFAICK both values have to be 2^k, so this would
guarantee we have either (block/page) or (page/block) as integer.

I wonder if you could even just calculate both, and have one loop that
deals with both.

> Loop touches && stores addresses into os_page_ptrs[] as input to this
> one big move_pages(2) query. So we basically ask for all memory pages
> for NBuffers. Once we get our NUMA information we then use blk2page =
> up_to_NBuffers * pages_per_blk to resolve memory pointers back to
> Buffers, if anywhere it could be a problem here.
> 

IMHO this is the information I'd expect in the function comment.

> So let's say we have s_b=4TB (it wont work for sure for other reasons,
> let's assume we have it), let's also assume we have no huge
> pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
> which multiplied by 2 = INT_MAX (integer overflow bug), so I think
> that int is not big enough there in pg_buffercache_numa_pages() (it
> should be "size_t blk2page" there as in
> pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)
> 
> Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
> NBuffers=2097152 * 0.003906250 = 8192.0 .
> 
> OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
>     float f = 2097152 * 0.003906250;
>  under clang  -Weverything I got "implicit conversion increases
> floating-point precision: 'float' to 'double'", so either it is:
> - we somehow rewrite all of the core arithmetics here to integer?
> - or simply go with doubles just to be sure? I went with doubles in
> v20, comments explaining are not there yet.
> 

When I say "integer arithmetic" I don't mean it should use 32-bit ints,
or any other data type. I mean that it works with non-floating point
values. It could be int64, Size or whatever is large enough to not
overflow. I really don't see how changing stuff to double makes this
easier to understand.

>> 7) This para in the docs seems self-contradictory:
>>
>>  <para>
>>   The <function>pg_buffercache_numa_pages()</function> provides the same
>> information
>>   as <function>pg_buffercache_pages()</function> but is slower because
>> it also
>>   provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
>>   The <structname>pg_buffercache_numa</structname> view wraps the
>> function for
>>   convenient use.
>>  </para>
>>
>> I mean, "provides the same information, but is slower because it
>> provides different information" is strange. I understand the point, but
>> maybe rephrase somehow?
> 
> Oh my... yes, now it looks way better.
> 
>> 8) Why is pg_numa_available() marked as volatile? Should not change in a
>> running cluster, no?
> 
> No it shouldn't, good find, made it 's'table.
> 
>> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
>> let's not do that.
>>
>>    <sect2 id="pgbuffercache-pg-buffercache_numa">
> 
> Fixed.
> 
>> 10) I think it'd be good to mention/explain why move_pages is used
>> instead of get_mempolicy - more efficient with batching, etc. This would
>> be useful both in the commit message and before the move_pages call
> 
> Ok, added in 0001.
> 
>> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
>> pointers like this etc.).
> 
> Added reference to that earlier new comment here too in 0003.
> 

Will take a look in the evening.

>> 11) This could use UINT64_FORMAT, instead of a cast:
>>
>>     elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
>> pages_per_blk=%.2f",
>>         (unsigned long) os_page_count, os_page_size, pages_per_blk);
> 
> Done.
> 
> 12) You have also raised "why not pg_shm_allocations_numa" instead of
> "pg_shm_numa_allocations"
> 
> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> naming things in general), I can change if you want.
> 

Me neither. I wonder if there's some precedent when adding similar
variants for other catalogs ... can you check? I've been thinking about
pg_stats and pg_stats_ext, but maybe there's a better example?

> 13) In the patch: "review: What if we get multiple pages per buffer
> (the default). Could we get multiple nodes per buffer?"
> 
> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> output multiple rows per single buffer (with "page_no") then we could
> get this:
> buffer1:..:page0:numanodeID1
> buffer1:..:page1:numanodeID2
> buffer2:..:page0:numanodeID1
> 
> Should we add such functionality?
> 


When you say "today no" does that mean we know all pages will be on the
same node, or that there may be pages from different nodes and we can't
display that? That'd not be great, IMHO.

I'm not a huge fan of returning multiple rows per buffer, with one row
per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
rest of the fields is for the whole buffer, it'd be wrong to duplicate
that for each page.

I wonder if we should have a bitmap of nodes for the buffer (but then
what if there are multiple pages from the same node?), or maybe just an
array of nodes, with one element per page.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

03 April, 09:56:41

On Wed, Apr 2, 2025 at 5:27 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,

Hi Bertrand,

> > OK, but I still fail to grasp why pg_indent doesnt fix this stuff on
> > it's own... I believe orginal ident, would fix this on it's own?
>
> My comment was not about indention but about the fact that I think that the
> casting is not a the right place. I think that's the result of the multiplication
> that we want to be casted (cast operator has higher precedence than Multiplication
> operator).

Oh! I've missed that, but v21 got a rewrite (still not polished) just
to show it can be done without float points as Tomas requested.

[..]
> Ok, but then does it make sense to see some num_size < shmem_size?
>
> postgres=# select c.name, c.size as num_size, s.size as shmem_size
>  from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c,
pg_shmem_allocationss 
>  where c.name = s.name and s.size > c.size;
>      name      | num_size  | shmem_size
> ---------------+-----------+------------
>  XLOG Ctl      |   4194304 |    4208200
>  Buffer Blocks | 134217728 |  134221824
>  AioHandleIOV  |   2097152 |    2850816

This was a real bug, fixed in v21 , the ent->allocated_size was not
properly page aligned. Thanks for the attention to detail.

-J.

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

03 April, 10:01:43

On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> OK, so you agree the commit messages are complete / correct?

Yes.

> OK. FWIW if you disagree with some of my proposed changes, feel free to
> push back. I'm sure some may be more a matter of personal preference.

No, it's all fine. I will probably have lots of questions about
setting proper env for development that cares itself about style, but
that's for another day.

> [..floats..]
> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
> confusing and difficult to reason about ...
>
> I agree we don't want special cases for each possible combination of
> page sizes (I'm not sure we even know all the combinations). What I was
> thinking about is two branches, one for (block >= page) and another for
> (block < page). AFAICK both values have to be 2^k, so this would
> guarantee we have either (block/page) or (page/block) as integer.
>
> I wonder if you could even just calculate both, and have one loop that
> deals with both.
>
[..]
> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
> or any other data type. I mean that it works with non-floating point
> values. It could be int64, Size or whatever is large enough to not
> overflow. I really don't see how changing stuff to double makes this
> easier to understand.

I hear you, attached v21 / 0003 is free of float/double arithmetics
and uses non-float point values. It should be more readable too with
those comments. I have not put it into its own function, because now
it fits the whole screen, so hopefully one can follow visually. Please
let me know if that code solves the doubts or feel free to reformat
it. That _numa_prepare_ptrs() is unused and will need to be removed,
but we can still move some code there if necessary.

> > 12) You have also raised "why not pg_shm_allocations_numa" instead of
> > "pg_shm_numa_allocations"
> >
> > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
> > naming things in general), I can change if you want.
> >
>
> Me neither. I wonder if there's some precedent when adding similar
> variants for other catalogs ... can you check? I've been thinking about
> pg_stats and pg_stats_ext, but maybe there's a better example?

Hm, it seems we always go with suffix "_somethingnew":

* pg_stat_database -> pg_stat_database_conflicts
* pg_stat_subscription -> pg_stat_subscription_stats
* even here: pg_buffercache -> pg_buffercache_numa

@Bertrand: do you have anything against pg_shm_allocations_numa
instead of pg_shm_numa_allocations? I don't mind changing it...

> > 13) In the patch: "review: What if we get multiple pages per buffer
> > (the default). Could we get multiple nodes per buffer?"
> >
> > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
> > output multiple rows per single buffer (with "page_no") then we could
> > get this:
> > buffer1:..:page0:numanodeID1
> > buffer1:..:page1:numanodeID2
> > buffer2:..:page0:numanodeID1
> >
> > Should we add such functionality?
>
> When you say "today no" does that mean we know all pages will be on the
> same node, or that there may be pages from different nodes and we can't
> display that? That'd not be great, IMHO.
>
> I'm not a huge fan of returning multiple rows per buffer, with one row
> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
> rest of the fields is for the whole buffer, it'd be wrong to duplicate
> that for each page.

OPEN_QUESTION: With v21 we have all the information available, we are
just unable to display this in pg_buffercache_numa right now. We could
trim the view so that it has 3 columns (and user needs to JOIN to
pg_buffercache for more details like relationoid), but then what the
whole refactor (0002) was for if we would just return bufferId like
below:

buffer1:page0:numanodeID1
buffer1:page1:numanodeID2
buffer2:page0:numanodeID1
buffer2:page1:numanodeID1

There's also the problem that reading/joining could be inconsistent
and even slower.

> I wonder if we should have a bitmap of nodes for the buffer (but then
> what if there are multiple pages from the same node?), or maybe just an
> array of nodes, with one element per page.

AFAIR this has been discussed back in end of January, and the
conclusion was more or less - on Discord - that everything sucks
(bitmaps, BIT() datatype, arrays,...) either from implementation or
user side, but apparently arrays [] would suck the least from
implementation side. So we could probably do something like up to
node_max_nodes():
buffer1:..:{0, 2, 0, 0}
buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
buffer3:..:{0, 0, 0, 2}

Other idea is JSON or even simple string with numa_node_id<->count:
buffer1:..:"1=2"
buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
buffer3:..:"3=2"

I find all of those non-user friendly and I'm afraid I won't be able
to pull that alone in time...

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

03 April, 11:23:01

Hi,

On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
> > OK, so you agree the commit messages are complete / correct?
> 
> Yes.

Not 100% sure on my side. 

=== v21-0002

Says:

"
This introduces three new functions:

    - pg_buffercache_init_entries
    - pg_buffercache_build_tuple
    - get_buffercache_tuple
"

While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
is).

About v21-0002:

=== 1

I can see that the pg_buffercache_init_entries() helper comments are added in
v21-0003 but I think that it would be better to add them in v21-0002
(where the helper is actually created).

About v21-0003:

=== 2

> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values.

+               if (buffers_per_page > 1)
+                       os_page_query_count = NBuffers;
+               else
+                       os_page_query_count = NBuffers * pages_per_buffer;

yeah, that's more elegant. I think that it properly handles the relationships
between buffer and page sizes without relying on float arithmetic.

=== 3

+                       if (buffers_per_page > 1)
+                       {

As buffers_per_page does not change, I think I'd put this check outside of the
for (i = 0; i < NBuffers; i++) loop, something like:

"
if (buffers_per_page > 1) {
    /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
    for (i = 0; i < NBuffers; i++) {
.
.
.
.
} else {
    /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
    for (i = 0; i < NBuffers; i++) {
.
.
.
"

=== 4

> That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.

Yeah I think that it can be simply removed then.

=== 5

> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...

I think that pg_shm_allocations_numa() is better (given the examples you just
shared).

=== 6

> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

Maybe we could add a few words in the doc to mention the "multiple nodes per
buffer" case? And try to improve it for say 19? Also maybe we should just focus
till v21-0003 (and discard v21-0004 for 18).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

03 April, 14:52:04

On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
> Hi Tomas,
> 
>> OK, so you agree the commit messages are complete / correct?
> 
> Yes.
> 
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
> 
> No, it's all fine. I will probably have lots of questions about
> setting proper env for development that cares itself about style, but
> that's for another day.
> 
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
> 
> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
> 

IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
> 
> Hm, it seems we always go with suffix "_somethingnew":
> 
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
> 
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
> 

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
>> rest of the fields is for the whole buffer, it'd be wrong to duplicate
>> that for each page.
> 
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and user needs to JOIN to
> pg_buffercache for more details like relationoid), but then what the
> whole refactor (0002) was for if we would just return bufferId like
> below:
> 
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
> 
> There's also the problem that reading/joining could be inconsistent
> and even slower.
> 

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

>> I wonder if we should have a bitmap of nodes for the buffer (but then
>> what if there are multiple pages from the same node?), or maybe just an
>> array of nodes, with one element per page.
> 
> AFAIR this has been discussed back in end of January, and the
> conclusion was more or less - on Discord - that everything sucks
> (bitmaps, BIT() datatype, arrays,...) either from implementation or
> user side, but apparently arrays [] would suck the least from
> implementation side. So we could probably do something like up to
> node_max_nodes():
> buffer1:..:{0, 2, 0, 0}
> buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
> buffer3:..:{0, 0, 0, 2}
> 
> Other idea is JSON or even simple string with numa_node_id<->count:
> buffer1:..:"1=2"
> buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
> buffer3:..:"3=2"
> 
> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

I'm -1 on JSON, I don't see how would that solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches, it should be easy to convert between those approaches using
unnest() and array_agg().

Attached is v22, with some minor review comments:

1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?

2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).

3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

03 April, 15:15:12

On 4/3/25 10:23, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
>> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi Tomas,
>>
>>> OK, so you agree the commit messages are complete / correct?
>>
>> Yes.
> 
> Not 100% sure on my side. 
> 
> === v21-0002
> 
> Says:
> 
> "
> This introduces three new functions:
> 
>     - pg_buffercache_init_entries
>     - pg_buffercache_build_tuple
>     - get_buffercache_tuple
> "
> 
> While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
> is).
> 

Ah, OK. Jakub, can you correct (and double-check) this in the next
version of the patch?

> About v21-0002:
> 
> === 1
> 
> I can see that the pg_buffercache_init_entries() helper comments are added in
> v21-0003 but I think that it would be better to add them in v21-0002
> (where the helper is actually created).
> 

+1 to that

> About v21-0003:
> 
> === 2
> 
>> I hear you, attached v21 / 0003 is free of float/double arithmetics
>> and uses non-float point values.
> 
> +               if (buffers_per_page > 1)
> +                       os_page_query_count = NBuffers;
> +               else
> +                       os_page_query_count = NBuffers * pages_per_buffer;
> 
> yeah, that's more elegant. I think that it properly handles the relationships
> between buffer and page sizes without relying on float arithmetic.
> 

In the review I just submitted, I changed this to

    os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);

but maybe it's less clear. Feel free to undo my change.

> === 3
> 
> +                       if (buffers_per_page > 1)
> +                       {
> 
> As buffers_per_page does not change, I think I'd put this check outside of the
> for (i = 0; i < NBuffers; i++) loop, something like:
> 
> "
> if (buffers_per_page > 1) {
>     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> .
> } else {
>     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
>     for (i = 0; i < NBuffers; i++) {
> .
> .

I don't know. It's a matter of opinion, but I find the current code
fairly understandable. Maybe if it meaningfully reduced the code
nesting, but even with the extra branch we'd still need the for loop.
I'm not against doing this differently, but I'd have to see how that
looks. Until then I think it's fine to have the code as is.

> .
> "
> 
> === 4
> 
>> That _numa_prepare_ptrs() is unused and will need to be removed,
>> but we can still move some code there if necessary.
> 
> Yeah I think that it can be simply removed then.
> 

I'm not particularly attached on having the _ptrs() function, but why
couldn't it build the os_page_ptrs array as before?

> === 5
> 
>> @Bertrand: do you have anything against pg_shm_allocations_numa
>> instead of pg_shm_numa_allocations? I don't mind changing it...
> 
> I think that pg_shm_allocations_numa() is better (given the examples you just
> shared).
> 
> === 6
> 
>> I find all of those non-user friendly and I'm afraid I won't be able
>> to pull that alone in time...
> 
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19? Also maybe we should just focus
> till v21-0003 (and discard v21-0004 for 18).
> 

IMHO it's not enough to paper this over by mentioning it in the docs.

I'm a bit confused about which patch you suggest to leave out. I think
0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
We've been discussing how to represent nodes for buffers in 0003.

I understand it'd require a bit of coding, but AFAICS adding an array
would be fairly trivial amount of code. Something like

  values[i++]
    = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));

would do, I think. But if we decide to do the 3-column view, that'd be
even simpler, I think. AFAICS it means we could mostly ditch the
pg_buffercache refactoring in patch 0002, and 0003 would get much
simpler too I think.

If Jakub doesn't have time to work on this, I can take a stab at it
later today.

regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

03 April, 15:36:57

On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,

Hi Bertrand,

> On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
[..]

> === v21-0002
> While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple()
> is).

Fixed

> About v21-0002:
>
> === 1
>
> I can see that the pg_buffercache_init_entries() helper comments are added in
> v21-0003 but I think that it would be better to add them in v21-0002
> (where the helper is actually created).

Moved

> About v21-0003:
>
> === 2
>
> > I hear you, attached v21 / 0003 is free of float/double arithmetics
> > and uses non-float point values.
>
> +               if (buffers_per_page > 1)
> +                       os_page_query_count = NBuffers;
> +               else
> +                       os_page_query_count = NBuffers * pages_per_buffer;
>
> yeah, that's more elegant. I think that it properly handles the relationships
> between buffer and page sizes without relying on float arithmetic.

Cool

> === 3
>
> +                       if (buffers_per_page > 1)
> +                       {
>
> As buffers_per_page does not change, I think I'd put this check outside of the
> for (i = 0; i < NBuffers; i++) loop, something like:
>
> "
> if (buffers_per_page > 1) {
>     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> .
> } else {
>     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
>     for (i = 0; i < NBuffers; i++) {
> .
> .
> .
> "

Done.

> === 4
>
> > That _numa_prepare_ptrs() is unused and will need to be removed,
> > but we can still move some code there if necessary.
>
> Yeah I think that it can be simply removed then.

Removed.

> === 5
>
> > @Bertrand: do you have anything against pg_shm_allocations_numa
> > instead of pg_shm_numa_allocations? I don't mind changing it...
>
> I think that pg_shm_allocations_numa() is better (given the examples you just
> shared).

OK, let's go with this name then (in v22).

> === 6
>
> > I find all of those non-user friendly and I'm afraid I won't be able
> > to pull that alone in time...
>
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19?

Right, we could also put it as a limitation. I would be happy to leave
it as it must be a rare condition, but Tomas stated he's not.

> Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).

Do you mean discard pg_buffercache_numa (0002+0003) and instead go
with pg_shm_allocations_numa (0004) ?

BTW: I've noticed that Tomas responded with his v22 to this after I've
solved all of the above in mine v22, so I'll drop v23 soon and then
let's continue there.


-J.

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

03 April, 15:41:48

Hi Jakub,

On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> Right, we could also put it as a limitation. I would be happy to leave
> it as it must be a rare condition, but Tomas stated he's not.
> 
> > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
> 
> Do you mean discard pg_buffercache_numa (0002+0003) and instead go
> with pg_shm_allocations_numa (0004) ?

No I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas is
confident enough to also focus in addition to 0004, that's fine too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

03 April, 15:42:32

On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra <tomas@vondra.me> wrote:

> Ah, OK. Jakub, can you correct (and double-check) this in the next
> version of the patch?

Done.

> > About v21-0002:
> >
> > === 1
> >
> > I can see that the pg_buffercache_init_entries() helper comments are added in
> > v21-0003 but I think that it would be better to add them in v21-0002
> > (where the helper is actually created).
> >
>
> +1 to that

Done.

> > About v21-0003:
> >
> > === 2
> >
> >> I hear you, attached v21 / 0003 is free of float/double arithmetics
> >> and uses non-float point values.
> >
> > +               if (buffers_per_page > 1)
> > +                       os_page_query_count = NBuffers;
> > +               else
> > +                       os_page_query_count = NBuffers * pages_per_buffer;
> >
> > yeah, that's more elegant. I think that it properly handles the relationships
> > between buffer and page sizes without relying on float arithmetic.
> >
>
> In the review I just submitted, I changed this to
>
>     os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
>
> but maybe it's less clear. Feel free to undo my change.

Cool, thanks, will send shortly v23 with this applied.

>
> > === 3
> >
> > +                       if (buffers_per_page > 1)
> > +                       {
> >
> > As buffers_per_page does not change, I think I'd put this check outside of the
> > for (i = 0; i < NBuffers; i++) loop, something like:
> >
> > "
> > if (buffers_per_page > 1) {
> >     /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
> >     for (i = 0; i < NBuffers; i++) {
> > .
> > .
> > .
> > .
> > } else {
> >     /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
> >     for (i = 0; i < NBuffers; i++) {
> > .
> > .
>
> I don't know. It's a matter of opinion, but I find the current code
> fairly understandable. Maybe if it meaningfully reduced the code
> nesting, but even with the extra branch we'd still need the for loop.
> I'm not against doing this differently, but I'd have to see how that
> looks. Until then I think it's fine to have the code as is.

v23 will have incorporated Bertrand's idea soon. No hard feelings, but
it's kind of painful to switch like that ;)

> > === 4
> >
> >> That _numa_prepare_ptrs() is unused and will need to be removed,
> >> but we can still move some code there if necessary.
> >
> > Yeah I think that it can be simply removed then.
> >
>
> I'm not particularly attached on having the _ptrs() function, but why
> couldn't it build the os_page_ptrs array as before?

I've removed it in v23, the code for me just didn't have flow...

> > === 5
> >
> >> @Bertrand: do you have anything against pg_shm_allocations_numa
> >> instead of pg_shm_numa_allocations? I don't mind changing it...
> >
> > I think that pg_shm_allocations_numa() is better (given the examples you just
> > shared).

Done.

> > === 6
> >
> >> I find all of those non-user friendly and I'm afraid I won't be able
> >> to pull that alone in time...
> >
> > Maybe we could add a few words in the doc to mention the "multiple nodes per
> > buffer" case? And try to improve it for say 19? Also maybe we should just focus
> > till v21-0003 (and discard v21-0004 for 18).
> >
>
> IMHO it's not enough to paper this over by mentioning it in the docs.

OK

> I'm a bit confused about which patch you suggest to leave out. I think
> 0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
> We've been discussing how to represent nodes for buffers in 0003.

IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but
we settled on arrays right?

> I understand it'd require a bit of coding, but AFAICS adding an array
> would be fairly trivial amount of code. Something like
>
>   values[i++]
>     = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID));
>
> would do, I think. But if we decide to do the 3-column view, that'd be
> even simpler, I think. AFAICS it means we could mostly ditch the
> pg_buffercache refactoring in patch 0002, and 0003 would get much
> simpler too I think.
>
> If Jakub doesn't have time to work on this, I can take a stab at it
> later today.

I won't be able to even start on this today, so if you have cycles for
please do so...

-J.

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

03 April, 16:12:38

On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:

> On 4/3/25 09:01, Jakub Wartak wrote:
> > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

Here's v23 attached (had to rework it because the you sent v22 just
the moment you I wanted to send it) Change include:
- your review should be incorporated
- name change for shm view
- pg_buffercache_numa refactor a little to keep out the if outside
loops (as Bertrand wanted initially)

So let's continue in this subthread.

> I think a view with just 3 columns would be a good solution. It's what
> pg_shmem_allocations_numa already does, so it'd be consistent with that
> part too.
>
> I'm not too worried about the cost of the extra join - it's going to be
> a couple dozen milliseconds at worst, I guess, and that's negligible in
> the bigger scheme of things (e.g. compared to how long the move_pages is
> expected to take). Also, it's not like having everything in the same
> view is free - people would have to do some sort of post-processing, and
> that has a cost too.
>
> So unless someone can demonstrate a use case where this would matter,
> I'd not worry about it too much.

OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
it's just that I don't have cycles left today and probably lack skills
(i've never dealt with arrays so far) thus it would be slow to get it
right... but I can pick up anything tomorrow morning.

> I'm -1 on JSON, I don't see how would that solve anything better than
> e.g. a regular array, and it's going to be harder to work with. So if we
> don't want to go with the 3-column view proposed earlier, I'd stick to a
> simple array. I don't think there's a huge difference between those two
> approaches, it should be easy to convert between those approaches using
> unnest() and array_agg().
>
> Attached is v22, with some minor review comments:
>
> 1) I suggested we should just use "libnuma support" in configure,
> instead of talking about "NUMA awareness support", and AFAICS you
> agreed. But I still see the old text in configure ... is that
> intentional or a bit of forgotten text?

It was my bad, too many rebases and mishaps with git voodoo..

> 2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
> and simplified a couple lines (but that's a matter of preference).

Great, thanks.

> 3) I don't think it's correct for pg_get_shmem_numa_allocations to just
> silently ignore nodes outside the valid range. I suggest we simply do
> elog(ERROR), as it's an internal error we don't expect to happen.

It's there too.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

03 April, 21:53:57

On 4/3/25 15:12, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
>> ...
>>
>> So unless someone can demonstrate a use case where this would matter,
>> I'd not worry about it too much.
> 
> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> it's just that I don't have cycles left today and probably lack skills
> (i've never dealt with arrays so far) thus it would be slow to get it
> right... but I can pick up anything tomorrow morning.
> 

OK, I took a stab at reworking/simplifying this the way I proposed.
Here's v24 - needs more polishing, but hopefully enough to show what I
had in mind.

It does these changes:

1) Drops 0002 with the pg_buffercache refactoring, because the new view
is not "extending" the existing one.

2) Reworks pg_buffercache_num to return just three columns, bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.

3) It now builds an array of records, with one record per buffer/page.

4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page"
per buffer, even if an incomplete one, so we can do this:

   os_page_count = NBuffers * Max(1, pages_per_buffer);

and then

  for (i = 0; i < NBuffers; i++)
  {
      for (j = 0; j < Max(1, pages_per_buffer); j++)
      {
          ..
      }
  }

and everything just works fine, I think.

Opinions? I personally find this much cleaner / easier to understand.

regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

04 April, 09:50:07

Hi,

On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> On 4/3/25 15:12, Jakub Wartak wrote:
> > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> > 
> >> ...
> >>
> >> So unless someone can demonstrate a use case where this would matter,
> >> I'd not worry about it too much.
> > 
> > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > it's just that I don't have cycles left today and probably lack skills
> > (i've never dealt with arrays so far) thus it would be slow to get it
> > right... but I can pick up anything tomorrow morning.
> > 
> 
> OK, I took a stab at reworking/simplifying this the way I proposed.
> Here's v24 - needs more polishing, but hopefully enough to show what I
> had in mind.
> 
> It does these changes:
> 
> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> is not "extending" the existing one.

I think that makes sense. One would just need to join on the pg_buffercache
view to get more information about the buffer if needed.

The pg_buffercache_numa_pages() doc needs an update though as I don't think that
"+  The <function>pg_buffercache_numa_pages()</function> provides the same
information as <function>pg_buffercache_pages()</function>" is still true.

> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> page_num and node_id. page_num is a sequence starting from 0 for each
> buffer.

+1 on the idea

> 3) It now builds an array of records, with one record per buffer/page.
> 
> 4) I realized we don't really need to worry about buffers_per_page very
> much, except for logging/debugging. There's always "at least one page"
> per buffer, even if an incomplete one, so we can do this:
> 
>    os_page_count = NBuffers * Max(1, pages_per_buffer);
> 
> and then
> 
>   for (i = 0; i < NBuffers; i++)
>   {
>       for (j = 0; j < Max(1, pages_per_buffer); j++)

That's a nice simplification as we always need to take care of at least one page
per buffer.

> and everything just works fine, I think.

I think the same.

> Opinions? I personally find this much cleaner / easier to understand.

I agree that's easier to understand and that that looks correct.

A few random comments:

=== 1

It looks like that firstNumaTouch is not set to false anymore.

=== 2

+               pfree(os_page_status);
+               pfree(os_page_ptrs);

Not sure that's needed, we should be in a short-lived memory context here
(ExprContext or such).

=== 3

+       ro_volatile_var = *(uint64 *)ptr

space missing before "ptr"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

04 April, 10:35:53

On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
> > On 4/3/25 15:12, Jakub Wartak wrote:
> > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> > >
> > >> ...
> > >>
> > >> So unless someone can demonstrate a use case where this would matter,
> > >> I'd not worry about it too much.
> > >
> > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> > > it's just that I don't have cycles left today and probably lack skills
> > > (i've never dealt with arrays so far) thus it would be slow to get it
> > > right... but I can pick up anything tomorrow morning.
> > >
> >
> > OK, I took a stab at reworking/simplifying this the way I proposed.
> > Here's v24 - needs more polishing, but hopefully enough to show what I
> > had in mind.
> >
> > It does these changes:
> >
> > 1) Drops 0002 with the pg_buffercache refactoring, because the new view
> > is not "extending" the existing one.
>
> I think that makes sense. One would just need to join on the pg_buffercache
> view to get more information about the buffer if needed.
>
> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
> information as <function>pg_buffercache_pages()</function>" is still true.
>
> > 2) Reworks pg_buffercache_num to return just three columns, bufferid,
> > page_num and node_id. page_num is a sequence starting from 0 for each
> > buffer.
>
> +1 on the idea
>
> > 3) It now builds an array of records, with one record per buffer/page.
> >
> > 4) I realized we don't really need to worry about buffers_per_page very
> > much, except for logging/debugging. There's always "at least one page"
> > per buffer, even if an incomplete one, so we can do this:
> >
> >    os_page_count = NBuffers * Max(1, pages_per_buffer);
> >
> > and then
> >
> >   for (i = 0; i < NBuffers; i++)
> >   {
> >       for (j = 0; j < Max(1, pages_per_buffer); j++)
>
> That's a nice simplification as we always need to take care of at least one page
> per buffer.
>
> > and everything just works fine, I think.
>
> I think the same.
>
> > Opinions? I personally find this much cleaner / easier to understand.
>
> I agree that's easier to understand and that that looks correct.
>
> A few random comments:
>
> === 1
>
> It looks like that firstNumaTouch is not set to false anymore.
>
> === 2
>
> +               pfree(os_page_status);
> +               pfree(os_page_ptrs);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).
>
> === 3
>
> +       ro_volatile_var = *(uint64 *)ptr
>
> space missing before "ptr"?
>

+my feedback as I've noticed that Bertrand already provided a review.

Right, the code is now simple , and that Max() is brilliant. I've
attached some review findings as .txt

0001 100%LGTM
0002 doc fix + pgident + Tomas, you should take Authored-by yourself
there for sure, I couldn't pull this off alone in time! So big thanks!
0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have
NUMA on ming32...:))

When started with interleave=all on serious hardware, I'm getting (~5s
for s_b=64GB) from pg_buffercache_numa

     node_id |  count
    ---------+---------
           3 | 2097152
           0 | 2097152
           2 | 2097152
           1 | 2097152

so this is valid result (2097152*4 numa nodes*8192 buffer
size/1024/1024/1024 = 64GB)

Also with pgbench -i -s 20 , after ~8s:
    select c.relname, n.node_id, count(*) from pg_buffercache_numa n
    join pg_buffercache b on (b.bufferid = n.bufferid)
    join pg_class c on (c.relfilenode = b.relfilenode)
    group by c.relname, n.node_id order by count(*) desc;
                        relname                    | node_id | count
    -----------------------------------------------+---------+-------
     pgbench_accounts                              |       2 |  8217
     pgbench_accounts                              |       0 |  8190
     pgbench_accounts                              |       3 |  8189
     pgbench_accounts                              |       1 |  8187
     pg_statistic                                  |       2 |    32
     pg_operator                                   |       2 |    14
     pg_depend                                     |       3 |    14
     [..]

pg_shm_allocations_numa also looks good.

I think v24+tiny fixes is good enough to go in.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

04 April, 17:36:15


On 4/4/25 09:35, Jakub Wartak wrote:
> On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Hi,
>>
>> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
>>> On 4/3/25 15:12, Jakub Wartak wrote:
>>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>>
>>>>> ...
>>>>>
>>>>> So unless someone can demonstrate a use case where this would matter,
>>>>> I'd not worry about it too much.
>>>>
>>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
>>>> it's just that I don't have cycles left today and probably lack skills
>>>> (i've never dealt with arrays so far) thus it would be slow to get it
>>>> right... but I can pick up anything tomorrow morning.
>>>>
>>>
>>> OK, I took a stab at reworking/simplifying this the way I proposed.
>>> Here's v24 - needs more polishing, but hopefully enough to show what I
>>> had in mind.
>>>
>>> It does these changes:
>>>
>>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
>>> is not "extending" the existing one.
>>
>> I think that makes sense. One would just need to join on the pg_buffercache
>> view to get more information about the buffer if needed.
>>
>> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
>> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
>> information as <function>pg_buffercache_pages()</function>" is still true.
>>
>>> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
>>> page_num and node_id. page_num is a sequence starting from 0 for each
>>> buffer.
>>
>> +1 on the idea
>>
>>> 3) It now builds an array of records, with one record per buffer/page.
>>>
>>> 4) I realized we don't really need to worry about buffers_per_page very
>>> much, except for logging/debugging. There's always "at least one page"
>>> per buffer, even if an incomplete one, so we can do this:
>>>
>>>    os_page_count = NBuffers * Max(1, pages_per_buffer);
>>>
>>> and then
>>>
>>>   for (i = 0; i < NBuffers; i++)
>>>   {
>>>       for (j = 0; j < Max(1, pages_per_buffer); j++)
>>
>> That's a nice simplification as we always need to take care of at least one page
>> per buffer.
>>
>>> and everything just works fine, I think.
>>
>> I think the same.
>>
>>> Opinions? I personally find this much cleaner / easier to understand.
>>
>> I agree that's easier to understand and that that looks correct.
>>
>> A few random comments:
>>
>> === 1
>>
>> It looks like that firstNumaTouch is not set to false anymore.
>>
>> === 2
>>
>> +               pfree(os_page_status);
>> +               pfree(os_page_ptrs);
>>
>> Not sure that's needed, we should be in a short-lived memory context here
>> (ExprContext or such).
>>
>> === 3
>>
>> +       ro_volatile_var = *(uint64 *)ptr
>>
>> space missing before "ptr"?
>>
> 
> +my feedback as I've noticed that Bertrand already provided a review.
> 
> Right, the code is now simple , and that Max() is brilliant. I've
> attached some review findings as .txt
> 
> 0001 100%LGTM
> 0002 doc fix + pgident + Tomas, you should take Authored-by yourself
> there for sure, I couldn't pull this off alone in time! So big thanks!
> 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have
> NUMA on ming32...:))
> 

OK

> When started with interleave=all on serious hardware, I'm getting (~5s
> for s_b=64GB) from pg_buffercache_numa
> 
>      node_id |  count
>     ---------+---------
>            3 | 2097152
>            0 | 2097152
>            2 | 2097152
>            1 | 2097152
> 
> so this is valid result (2097152*4 numa nodes*8192 buffer
> size/1024/1024/1024 = 64GB)
> 
> Also with pgbench -i -s 20 , after ~8s:
>     select c.relname, n.node_id, count(*) from pg_buffercache_numa n
>     join pg_buffercache b on (b.bufferid = n.bufferid)
>     join pg_class c on (c.relfilenode = b.relfilenode)
>     group by c.relname, n.node_id order by count(*) desc;
>                         relname                    | node_id | count
>     -----------------------------------------------+---------+-------
>      pgbench_accounts                              |       2 |  8217
>      pgbench_accounts                              |       0 |  8190
>      pgbench_accounts                              |       3 |  8189
>      pgbench_accounts                              |       1 |  8187
>      pg_statistic                                  |       2 |    32
>      pg_operator                                   |       2 |    14
>      pg_depend                                     |       3 |    14
>      [..]
> 
> pg_shm_allocations_numa also looks good.
> 
> I think v24+tiny fixes is good enough to go in.
> 

OK.

Do you have any suggestions regarding the column names in the new view?
I'm not sure I like node_id and page_num.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

04 April, 17:40:28

On 4/4/25 08:50, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote:
>> On 4/3/25 15:12, Jakub Wartak wrote:
>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>>> ...
>>>>
>>>> So unless someone can demonstrate a use case where this would matter,
>>>> I'd not worry about it too much.
>>>
>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
>>> it's just that I don't have cycles left today and probably lack skills
>>> (i've never dealt with arrays so far) thus it would be slow to get it
>>> right... but I can pick up anything tomorrow morning.
>>>
>>
>> OK, I took a stab at reworking/simplifying this the way I proposed.
>> Here's v24 - needs more polishing, but hopefully enough to show what I
>> had in mind.
>>
>> It does these changes:
>>
>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view
>> is not "extending" the existing one.
> 
> I think that makes sense. One would just need to join on the pg_buffercache
> view to get more information about the buffer if needed.
> 
> The pg_buffercache_numa_pages() doc needs an update though as I don't think that
> "+  The <function>pg_buffercache_numa_pages()</function> provides the same
> information as <function>pg_buffercache_pages()</function>" is still true.
> 

Right, thanks for checking the docs.

>> 2) Reworks pg_buffercache_num to return just three columns, bufferid,
>> page_num and node_id. page_num is a sequence starting from 0 for each
>> buffer.
> 
> +1 on the idea
> 
>> 3) It now builds an array of records, with one record per buffer/page.
>>
>> 4) I realized we don't really need to worry about buffers_per_page very
>> much, except for logging/debugging. There's always "at least one page"
>> per buffer, even if an incomplete one, so we can do this:
>>
>>    os_page_count = NBuffers * Max(1, pages_per_buffer);
>>
>> and then
>>
>>   for (i = 0; i < NBuffers; i++)
>>   {
>>       for (j = 0; j < Max(1, pages_per_buffer); j++)
> 
> That's a nice simplification as we always need to take care of at least one page
> per buffer.
> 

OK. I think I'll consider moving some of this code "building" the
entries into a separate function, to keep the main function easier to
understand.

>> and everything just works fine, I think.
> 
> I think the same.
> 
>> Opinions? I personally find this much cleaner / easier to understand.
> 
> I agree that's easier to understand and that that looks correct.
> 
> A few random comments:
> 
> === 1
> 
> It looks like that firstNumaTouch is not set to false anymore.
> 

Damn, my mistake.

> === 2
> 
> +               pfree(os_page_status);
> +               pfree(os_page_ptrs);
> 
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).
> 

Yeah, maybe. It's not allocated in the multi-call context, but I wasn't
sure. Will check.

> === 3
> 
> +       ro_volatile_var = *(uint64 *)ptr
> 
> space missing before "ptr"?
> 

Interesting the pgindent didn't tweak this.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

04 April, 20:07:12

On Fri, Apr 4, 2025 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> Do you have any suggestions regarding the column names in the new view?
> I'm not sure I like node_id and page_num.

They actually look good to me. We've discussed earlier dropping
s/numa_//g for column names (after all views contain it already) so
they are fine in this regard.
There's also the question of consistency: (bufferid, page_num,
node_id) -- maybe should just drop "_" and that's it?
Well I would even possibly consider page_num -> ospagenumber, but that's ugly.

-J.

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

04 April, 22:25:57

OK,

here's v25 after going through the patches once more, fixing the issues
mentioned by Bertrand, etc. I think 0001 and 0002 are fine, I have a
couple minor questions about 0003.

0002
----
- Adds the new types to typedefs.list, to make pgindent happy.
- Improves comment for pg_buffercache_numa_pages
- Minor formatting tweaks.

- I was wondering if maybe we should have some "global ID" of memory
page, so that with large memory pages it's indicated the buffers are on
the same memory page. Right now each buffer starts page_num from 0, but
it should not be very hard to have a global counter. Opinions?


0003
----
- Minor formatting tweaks, comment improvements.
- Isn't this comment a bit confusing / misleading?

    /* Get number of OS aligned pages */

AFAICS the point is to adjust the allocated_size to be a multiple of
os-page_size, to get "all" memory pages the segment uses. But that's not
what I understand by "aligned page" (which is about there the page is
expected to start). Or did I get this wrong?

- There's a comment at the end which talks about "ignored segments".
IMHO that type of information should be in the function comment, but I'm
also not quite sure I understand what "output shared memory" is ...


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

05 April, 12:37:51

Hi,

On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
> OK,
> 
> here's v25 after going through the patches once more, fixing the issues
> mentioned by Bertrand, etc.

Thanks!

> I think 0001 and 0002 are fine,

Agree, I just have some cosmetic nits comments: please find them in
nit-bertrand-0002.txt attached.

> I have a
> couple minor questions about 0003.
> 
> - I was wondering if maybe we should have some "global ID" of memory
> page, so that with large memory pages it's indicated the buffers are on
> the same memory page. Right now each buffer starts page_num from 0, but
> it should not be very hard to have a global counter. Opinions?

I think that's a good idea. We could then add a new column (say os_page_id) that
would help identify which buffers are sharing the same "physical" page.

> 0003
> ----
> - Minor formatting tweaks, comment improvements.
> - Isn't this comment a bit confusing / misleading?
> 
>     /* Get number of OS aligned pages */
> 
> AFAICS the point is to adjust the allocated_size to be a multiple of
> os-page_size, to get "all" memory pages the segment uses. But that's not
> what I understand by "aligned page" (which is about there the page is
> expected to start).

Agree, what about? 

"
Align the start of the allocated size to an OS page size boundary and then get
the total number of OS pages used by this segment"
"

> - There's a comment at the end which talks about "ignored segments".
> IMHO that type of information should be in the function comment,
> but I'm
> also not quite sure I understand what "output shared memory" is ...

I think that comes from the comments that are already in
pg_get_shmem_allocations().

I think that those are located here and worded that way to ease to understand 
what is not in with pg_get_shmem_allocations_numa() if one look at both
functions. That said, I'm +1 to put this kind of comments in the function comment.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

nit-bertrand-0002.txt

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

05 April, 16:23:38

On 4/5/25 11:37, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
>> OK,
>>
>> here's v25 after going through the patches once more, fixing the issues
>> mentioned by Bertrand, etc.
> 
> Thanks!
> 
>> I think 0001 and 0002 are fine,
> 
> Agree, I just have some cosmetic nits comments: please find them in
> nit-bertrand-0002.txt attached.
> 
>> I have a
>> couple minor questions about 0003.
>>
>> - I was wondering if maybe we should have some "global ID" of memory
>> page, so that with large memory pages it's indicated the buffers are on
>> the same memory page. Right now each buffer starts page_num from 0, but
>> it should not be very hard to have a global counter. Opinions?
> 
> I think that's a good idea. We could then add a new column (say os_page_id) that
> would help identify which buffers are sharing the same "physical" page.
> 

I was thinking we'd change the definition of the existing page_num
column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
page ID. But I don't know if this would be useful in practice.

>> 0003
>> ----
>> - Minor formatting tweaks, comment improvements.
>> - Isn't this comment a bit confusing / misleading?
>>
>>     /* Get number of OS aligned pages */
>>
>> AFAICS the point is to adjust the allocated_size to be a multiple of
>> os-page_size, to get "all" memory pages the segment uses. But that's not
>> what I understand by "aligned page" (which is about there the page is
>> expected to start).
> 
> Agree, what about? 
> 
> "
> Align the start of the allocated size to an OS page size boundary and then get
> the total number of OS pages used by this segment"
> "
> 

Something like that. But I think it should be "align the size of ...",
we're not aligning the start.

>> - There's a comment at the end which talks about "ignored segments".
>> IMHO that type of information should be in the function comment,
>> but I'm
>> also not quite sure I understand what "output shared memory" is ...
> 
> I think that comes from the comments that are already in
> pg_get_shmem_allocations().
> 
> I think that those are located here and worded that way to ease to understand 
> what is not in with pg_get_shmem_allocations_numa() if one look at both
> functions. That said, I'm +1 to put this kind of comments in the function comment.
> 

OK. But I'm still not sure what "output shared memory" is about. Can you
explain what shmem segments are not included?


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

05 April, 17:33:28

On 4/5/25 15:23, Tomas Vondra wrote:
> On 4/5/25 11:37, Bertrand Drouvot wrote:
>> Hi,
>>
>> On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote:
>>> OK,
>>>
>>> here's v25 after going through the patches once more, fixing the issues
>>> mentioned by Bertrand, etc.
>>
>> Thanks!
>>
>>> I think 0001 and 0002 are fine,
>>
>> Agree, I just have some cosmetic nits comments: please find them in
>> nit-bertrand-0002.txt attached.
>>
>>> I have a
>>> couple minor questions about 0003.
>>>
>>> - I was wondering if maybe we should have some "global ID" of memory
>>> page, so that with large memory pages it's indicated the buffers are on
>>> the same memory page. Right now each buffer starts page_num from 0, but
>>> it should not be very hard to have a global counter. Opinions?
>>
>> I think that's a good idea. We could then add a new column (say os_page_id) that
>> would help identify which buffers are sharing the same "physical" page.
>>
> 
> I was thinking we'd change the definition of the existing page_num
> column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
> page ID. But I don't know if this would be useful in practice.
> 

See the attached v25 with a draft of this in patch 0003.

While working on this, I realized it's probably wrong to use TYPEALIGN()
to calculate the OS page pointer. The code did this:

    os_page_ptrs[idx]
        = (char *) TYPEALIGN(os_page_size,
                             buffptr + (os_page_size * j));

but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB
memory pages, and that the first buffer is aligned to 4kB (i.e. it
starts right at the beginning of a memory page). Then we expect to get
page_num sequence:

    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...

with 4 buffers per memory page. But we get this:

    0, 1, 1, 1, 1, 2, 2, 2, 2, ...

So I've changed this to TYPEALIGN_DOWN(), which fixes the result.

The pg_shmem_allocations_numa had a variant of this issue when
calculating the number of pages, I think. I believe the shmem segment
may start half-way through a page, and the allocated_size may not be a
multiple of os_page_size (otherwise why use TYPEALIGN). In which case we
might skip the first page, I think.

The one question I have about this is whether we know the pointer
returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or
before the first shared buffer) ...

regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

05 April, 19:58:32

Hi,

On Sat, Apr 05, 2025 at 04:33:28PM +0200, Tomas Vondra wrote:
> On 4/5/25 15:23, Tomas Vondra wrote:
> > I was thinking we'd change the definition of the existing page_num
> > column, i.e. it wouldn't be 0..N sequence for each buffer, but a global
> > page ID. But I don't know if this would be useful in practice.
> > 
> 
> See the attached v25 with a draft of this in patch 0003.

I see, thanks for sharing. I think that's useful because that could help
identify which buffers share the same OS page.

> While working on this, I realized it's probably wrong to use TYPEALIGN()
> to calculate the OS page pointer. The code did this:
> 
>     os_page_ptrs[idx]
>         = (char *) TYPEALIGN(os_page_size,
>                              buffptr + (os_page_size * j));
> 
> but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB
> memory pages, and that the first buffer is aligned to 4kB (i.e. it
> starts right at the beginning of a memory page). Then we expect to get
> page_num sequence:
> 
>     0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...
> 
> with 4 buffers per memory page. But we get this:
> 
>     0, 1, 1, 1, 1, 2, 2, 2, 2, ...
> 
> So I've changed this to TYPEALIGN_DOWN(), which fixes the result.

Good catch, that makes fully sense.

But now I can see some page_num < 0 :

postgres=# select page_num,node_id,count(*) from pg_buffercache_numa group by page_num,node_id order by 1 limit 4;
 page_num | node_id | count
----------+---------+-------
       -1 |       0 |   386
        0 |       1 |  1024
        1 |       0 |  1024
        2 |       1 |  1024

I think that can be solved that way:

-  startptr = (char *) BufferGetBlock(1);
+  startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));

so that startptr is aligned to the same boundaries. But I guess that we'll
have the same question as the following one:

> The one question I have about this is whether we know the pointer
> returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or
> before the first shared buffer) ...

Yeah, I'm not 100% sure about that... Maybe for safety we could use TYPEALIGN_DOWN()
for the reporting and use the actual buffer address when pg_numa_touch_mem_if_required()
is called? (to avoid touching "invalid" memory).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

05 April, 20:14:32

Hi,

On Sat, Apr 05, 2025 at 03:23:38PM +0200, Tomas Vondra wrote:
> Something like that. But I think it should be "align the size of ...",
> we're not aligning the start.
> 
> >> - There's a comment at the end which talks about "ignored segments".
> >> IMHO that type of information should be in the function comment,
> >> but I'm
> >> also not quite sure I understand what "output shared memory" is ...
> > 
> > I think that comes from the comments that are already in
> > pg_get_shmem_allocations().
> > 
> > I think that those are located here and worded that way to ease to understand 
> > what is not in with pg_get_shmem_allocations_numa() if one look at both
> > functions. That said, I'm +1 to put this kind of comments in the function comment.
> > 
> 
> OK. But I'm still not sure what "output shared memory" is about. Can you
> explain what shmem segments are not included?

Looking at pg_get_shmem_allocations() and the pg_shmem_allocations view
documentation, I would say the wording is linked to "anonymous allocations" and
"unused memory" (i.e the ones reported with <anonymous> or NULL as name in the
pg_shmem_allocations view). Using output in this comment sounds confusing while
it makes sense in pg_get_shmem_allocations() because it really reports those.

I think that we could just mention in the function comment that
pg_get_shmem_allocations_numa() does not handle anonymous allocations and unused
memory.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

06 April, 01:29:22

Hi,

I just played around with this for a bit.  As noted somewhere further down,
pg_buffercache_numa.page_num ends up wonky in different ways for the different
pages.

I think one thing that the docs should mention is that calling the numa
functions/views will force the pages to be allocated, even if they're
currently unused.

Newly started server, with s_b of 32GB an 2MB huge pages:

  grep ^Huge /proc/meminfo
  HugePages_Total:   34802
  HugePages_Free:    34448
  HugePages_Rsvd:    16437
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:        76517376 kB

run
  SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;

Now the pages that previously were marked as reserved are actually allocated:

  grep ^Huge /proc/meminfo
  HugePages_Total:   34802
  HugePages_Free:    18012
  HugePages_Rsvd:        1
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:        76517376 kB


I don't see how we can avoid that right now, but at the very least we ought to
document it.


On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
> The libnuma library is not available on 32-bit builds (there's no shared
> object for i386), so we disable it in that case. The i386 is very memory
> limited anyway, even with PAE, so NUMA is mostly irrelevant.

Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
library.


> +###############################################################
> +# Library: libnuma
> +###############################################################
> +
> +libnumaopt = get_option('libnuma')
> +if not libnumaopt.disabled()
> +  # via pkg-config
> +  libnuma = dependency('numa', required: libnumaopt)
> +  if not libnuma.found()
> +    libnuma = cc.find_library('numa', required: libnumaopt)
> +  endif

This fallback isn't going to work if -dlibnuma=enabled is used, as
dependency() will error out, due to not finding a required dependency. You'd
need to use required: false there.

Do we actually need a non-dependency() fallback here?  It's linux only and a
new dependency, so just requiring a .pc file seems like it'd be fine?


> +#ifdef USE_LIBNUMA
> +
> +/*
> + * This is required on Linux, before pg_numa_query_pages() as we
> + * need to page-fault before move_pages(2) syscall returns valid results.
> + */
> +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
> +    ro_volatile_var = *(uint64 *) ptr

Does it really work that way? A volatile variable to assign the result of
dereferencing ptr ensures that the *store* isn't removed by the compiler, but
it doesn't at all guarantee that the *load* isn't removed, since that memory
isn't marked as volatile.

I think you'd need to cast the source pointer to a volatile uint64* to ensure
the load happens.


> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
> +
> +-- Register the new functions.
> +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
> +RETURNS SETOF RECORD
> +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
> +LANGUAGE C PARALLEL SAFE;
> +
> +-- Create a view for convenient access.
> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
> +    SELECT P.* FROM pg_buffercache_numa_pages() AS P
> +    (bufferid integer, page_num int4, node_id int4);


Why CREATE OR REPLACE?


> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> ...
> +
> +        /*
> +         * To smoothly support upgrades from version 1.0 of this extension
> +         * transparently handle the (non-)existence of the pinning_backends
> +         * column. We unfortunately have to get the result type for that... -
> +         * we can't use the result type determined by the function definition
> +         * without potentially crashing when somebody uses the old (or even
> +         * wrong) function definition though.
> +         */
> +        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
> +            elog(ERROR, "return type must be a row type");
> +

Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?


> +
> +        if (firstNumaTouch)
> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");

Over the patchseries the related code is duplicated. Seems like it'd be good
to put it into pg_numa instead?  This seems like the thing that's good to
abstract away in one central spot.


> +        /*
> +         * Scan through all the buffers, saving the relevant fields in the
> +         * fctx->record structure.
> +         *
> +         * We don't hold the partition locks, so we don't get a consistent
> +         * snapshot across all buffers, but we do grab the buffer header
> +         * locks, so the information of each buffer is self-consistent.
> +         *
> +         * This loop touches and stores addresses into os_page_ptrs[] as input
> +         * to one big big move_pages(2) inquiry system call. Basically we ask
> +         * for all memory pages for NBuffers.
> +         */
> +        idx = 0;
> +        for (i = 0; i < NBuffers; i++)
> +        {
> +            BufferDesc *bufHdr;
> +            uint32        buf_state;
> +            uint32        bufferid;
> +
> +            CHECK_FOR_INTERRUPTS();
> +
> +            bufHdr = GetBufferDescriptor(i);
> +
> +            /* Lock each buffer header before inspecting. */
> +            buf_state = LockBufHdr(bufHdr);
> +            bufferid = BufferDescriptorGetBuffer(bufHdr);
> +
> +            UnlockBufHdr(bufHdr, buf_state);

Given that the only thing you're getting here is the buffer id, it's probably
not worth getting the spinlock, a single 4 byte read is always atomic on our
platforms.


> +            /*
> +             * If we have multiple OS pages per buffer, fill those in too. We
> +             * always want at least one OS page, even if there are multiple
> +             * buffers per page.
> +             *
> +             * Altough we could query just once per each OS page, we do it
> +             * repeatably for each Buffer and hit the same address as
> +             * move_pages(2) requires page aligment. This also simplifies
> +             * retrieval code later on. Also NBuffers starts from 1.
> +             */
> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
> +            {
> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
> +
> +                fctx->record[idx].bufferid = bufferid;
> +                fctx->record[idx].numa_page = j;
> +
> +                os_page_ptrs[idx]
> +                    = (char *) TYPEALIGN(os_page_size,
> +                                         buffptr + (os_page_size * j));

FWIW, this bit here is the most expensive part of the function itself, as the
compiler has no choice than to implement it as an actual division, as
os_page_size is runtime variable.

It'd be fine to leave it like that, the call to numa_move_pages() is way more
expensive. But it shouldn't be too hard to do that alignment once, rather than
having to do it over and over.


FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
works quite right for me.

This definition afaict is always 0 when using huge pages and just 0 and 1 for
4k pages.  But my understanding of numa_page is that it's the "id" of the numa
pages, which isn't that?

With "adjust page_num" I get a number that starts at -1 and then increments
from there. More correct, but doesn't quite seem right either.


> +    <tbody>
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>bufferid</structfield> <type>integer</type>
> +      </para>
> +      <para>
> +       ID, in the range 1..<varname>shared_buffers</varname>
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>page_num</structfield> <type>int</type>
> +      </para>
> +      <para>
> +       number of OS memory page for this buffer
> +      </para></entry>
> +     </row>

"page_num" doesn't really seem very descriptive for "number of OS memory page
for this buffer".  To me "number of" sounds like it's counting the number of
associated pages, but it's really just a "page id" or something like that.

Maybe rename page_num to "os_page_id" and rephrase the comment a bit?




> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Date: Fri, 21 Feb 2025 14:20:18 +0100
> Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
>
> Introduce new pg_shmem_alloctions_numa view with information about how
> shared memory is distributed across NUMA nodes.

Nice.


> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
> Reviewed-by: Tomas Vondra <tomas@vondra.me>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
>  doc/src/sgml/system-views.sgml           |  79 ++++++++++++
>  src/backend/catalog/system_views.sql     |   8 ++
>  src/backend/storage/ipc/shmem.c          | 152 +++++++++++++++++++++++
>  src/include/catalog/pg_proc.dat          |   8 ++
>  src/test/regress/expected/numa.out       |  12 ++
>  src/test/regress/expected/numa_1.out     |   3 +
>  src/test/regress/expected/privileges.out |  16 ++-
>  src/test/regress/expected/rules.out      |   4 +
>  src/test/regress/parallel_schedule       |   2 +-
>  src/test/regress/sql/numa.sql            |   9 ++
>  src/test/regress/sql/privileges.sql      |   6 +-
>  11 files changed, 294 insertions(+), 5 deletions(-)
>  create mode 100644 src/test/regress/expected/numa.out
>  create mode 100644 src/test/regress/expected/numa_1.out
>  create mode 100644 src/test/regress/sql/numa.sql
>
> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
> index 4f336ee0adf..a83365ae24a 100644
> --- a/doc/src/sgml/system-views.sgml
> +++ b/doc/src/sgml/system-views.sgml
> @@ -181,6 +181,11 @@
>        <entry>shared memory allocations</entry>
>       </row>
>
> +     <row>
> +      <entry><link
linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
> +      <entry>NUMA node mappings for shared memory allocations</entry>
> +     </row>
> +
>       <row>
>        <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
>        <entry>planner statistics</entry>
> @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
>    </para>
>   </sect1>
>
> + <sect1 id="view-pg-shmem-allocations-numa">
> +  <title><structname>pg_shmem_allocations_numa</structname></title>
> +
> +  <indexterm zone="view-pg-shmem-allocations-numa">
> +   <primary>pg_shmem_allocations_numa</primary>
> +  </indexterm>
> +
> +  <para>
> +   The <structname>pg_shmem_allocations_numa</structname> shows how shared
> +   memory allocations in the server's main shared memory segment are distributed
> +   across NUMA nodes. This includes both memory allocated by
> +   <productname>PostgreSQL</productname> itself and memory allocated
> +   by extensions using the mechanisms detailed in
> +   <xref linkend="xfunc-shared-addin" />.
> +  </para>

I think it'd be good to describe that the view will include multiple rows for
each name if spread across multiple numa nodes.

Perhaps also that querying this view is expensive and that
pg_shmem_allocations should be used if numa information isn't needed?


> +    /*
> +     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> +     * the OS may have different memory page sizes.
> +     *
> +     * To correctly map between them, we need to: 1. Determine the OS memory
> +     * page size 2. Calculate how many OS pages are used by all buffer blocks
> +     * 3. Calculate how many OS pages are contained within each database
> +     * block.
> +     *
> +     * This information is needed before calling move_pages() for NUMA memory
> +     * node inquiry.
> +     */
> +    os_page_size = pg_numa_get_pagesize();
> +
> +    /*
> +     * Allocate memory for page pointers and status based on total shared
> +     * memory size. This simplified approach allocates enough space for all
> +     * pages in shared memory rather than calculating the exact requirements
> +     * for each segment.
> +     *
> +     * XXX Isn't this wasteful? But there probably is one large segment of
> +     * shared memory, much larger than the rest anyway.
> +     */
> +    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> +    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> +    pages_status = palloc(sizeof(int) * shm_total_page_count);

There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
could more be moved to a shared helper function?


> +    hash_seq_init(&hstat, ShmemIndex);
> +
> +    /* output all allocated entries */
> +    memset(nulls, 0, sizeof(nulls));
> +    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> +    {

One thing that's interesting with using ShmemIndex is that this won't get
anonymous allocations. I think that's ok for now, it'd be complicated to
figure out the unmapped "regions".  But I guess it' be good to mention it in
the ocs?




> +        int            i;
> +
> +        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
> +         * It's a bit misleading to call that "aligned", no? */
> +
> +        /* Get number of OS aligned pages */
> +        shm_ent_page_count
> +            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
> +
> +        /*
> +         * If we get ever 0xff back from kernel inquiry, then we probably have
> +         * bug in our buffers to OS page mapping code here.
> +         */
> +        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);

There's obviously no guarantee that shm_ent_page_count is a multiple of
os_page_size.  I think it'd be interesting to show in the view when one shmem
allocation shares a page with the prior allocation - that can contribute a bit
to contention.  What about showing a start_os_page_id and end_os_page_id or
something?  That could be a feature for later though.


> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif
> +-- switch to superuser
> +\c -
> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> + ok
> +----
> + t
> +(1 row)

Could it be worthwhile to run the test if !pg_numa_available(), to test that
we do the right thing in that case? We need an alternative output anyway, so
that might be fine?


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

06 April, 02:00:13

Hi,

On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
> I think one thing that the docs should mention is that calling the numa
> functions/views will force the pages to be allocated, even if they're
> currently unused.
> 
> Newly started server, with s_b of 32GB an 2MB huge pages:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    34448
>   HugePages_Rsvd:    16437
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> run
>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> 
> Now the pages that previously were marked as reserved are actually allocated:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    18012
>   HugePages_Rsvd:        1
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> 
> I don't see how we can avoid that right now, but at the very least we ought to
> document it.

The only allocation where that really matters is shared_buffers. I wonder if
we could special case the logic for that, by only probing if at least one of
the buffers in the range is valid.

Then we could treat a page status of -ENOENT as "page is not mapped" and
display NULL for the node_id?

Of course that would mean that we'd always need to
pg_numa_touch_mem_if_required(), not just the first time round, because we
previously might not have for a page that is now valid.  But compared to the
cost of actually allocating pages, the cost for that seems small.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

06 April, 14:51:34

On 4/6/25 00:29, Andres Freund wrote:
> Hi,
> 
> I just played around with this for a bit.  As noted somewhere further down,
> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> pages.
> 
> I think one thing that the docs should mention is that calling the numa
> functions/views will force the pages to be allocated, even if they're
> currently unused.
> 
> Newly started server, with s_b of 32GB an 2MB huge pages:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    34448
>   HugePages_Rsvd:    16437
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> run
>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> 
> Now the pages that previously were marked as reserved are actually allocated:
> 
>   grep ^Huge /proc/meminfo
>   HugePages_Total:   34802
>   HugePages_Free:    18012
>   HugePages_Rsvd:        1
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   Hugetlb:        76517376 kB
> 
> 
> I don't see how we can avoid that right now, but at the very least we ought to
> document it.
> 

+1 to documenting this

> 
> On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
>> The libnuma library is not available on 32-bit builds (there's no shared
>> object for i386), so we disable it in that case. The i386 is very memory
>> limited anyway, even with PAE, so NUMA is mostly irrelevant.
> 
> Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
> library.
> 
> 
>> +###############################################################
>> +# Library: libnuma
>> +###############################################################
>> +
>> +libnumaopt = get_option('libnuma')
>> +if not libnumaopt.disabled()
>> +  # via pkg-config
>> +  libnuma = dependency('numa', required: libnumaopt)
>> +  if not libnuma.found()
>> +    libnuma = cc.find_library('numa', required: libnumaopt)
>> +  endif
> 
> This fallback isn't going to work if -dlibnuma=enabled is used, as
> dependency() will error out, due to not finding a required dependency. You'd
> need to use required: false there.
> 
> Do we actually need a non-dependency() fallback here?  It's linux only and a
> new dependency, so just requiring a .pc file seems like it'd be fine?
> 

No idea.

> 
>> +#ifdef USE_LIBNUMA
>> +
>> +/*
>> + * This is required on Linux, before pg_numa_query_pages() as we
>> + * need to page-fault before move_pages(2) syscall returns valid results.
>> + */
>> +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
>> +    ro_volatile_var = *(uint64 *) ptr
> 
> Does it really work that way? A volatile variable to assign the result of
> dereferencing ptr ensures that the *store* isn't removed by the compiler, but
> it doesn't at all guarantee that the *load* isn't removed, since that memory
> isn't marked as volatile.
> 
> I think you'd need to cast the source pointer to a volatile uint64* to ensure
> the load happens.
> 
> 
>> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
>> +
>> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
>> +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
>> +
>> +-- Register the new functions.
>> +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
>> +RETURNS SETOF RECORD
>> +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
>> +LANGUAGE C PARALLEL SAFE;
>> +
>> +-- Create a view for convenient access.
>> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
>> +    SELECT P.* FROM pg_buffercache_numa_pages() AS P
>> +    (bufferid integer, page_num int4, node_id int4);
> 
> 
> Why CREATE OR REPLACE?
> 

I think this is simply due to copy-pasting the code, a plain CREATE
would be enough here.

> 
>> +Datum
>> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
>> +{
>> ...
>> +
>> +        /*
>> +         * To smoothly support upgrades from version 1.0 of this extension
>> +         * transparently handle the (non-)existence of the pinning_backends
>> +         * column. We unfortunately have to get the result type for that... -
>> +         * we can't use the result type determined by the function definition
>> +         * without potentially crashing when somebody uses the old (or even
>> +         * wrong) function definition though.
>> +         */
>> +        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
>> +            elog(ERROR, "return type must be a row type");
>> +
> 
> Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?
> 

Yes, good catch.

> 
>> +
>> +        if (firstNumaTouch)
>> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> 
> Over the patchseries the related code is duplicated. Seems like it'd be good
> to put it into pg_numa instead?  This seems like the thing that's good to
> abstract away in one central spot.
> 

Abstract away which part, exactly? I thought about moving some of the
code to port/pg_numa.c, but it didn't seem worth it.

> 
>> +        /*
>> +         * Scan through all the buffers, saving the relevant fields in the
>> +         * fctx->record structure.
>> +         *
>> +         * We don't hold the partition locks, so we don't get a consistent
>> +         * snapshot across all buffers, but we do grab the buffer header
>> +         * locks, so the information of each buffer is self-consistent.
>> +         *
>> +         * This loop touches and stores addresses into os_page_ptrs[] as input
>> +         * to one big big move_pages(2) inquiry system call. Basically we ask
>> +         * for all memory pages for NBuffers.
>> +         */
>> +        idx = 0;
>> +        for (i = 0; i < NBuffers; i++)
>> +        {
>> +            BufferDesc *bufHdr;
>> +            uint32        buf_state;
>> +            uint32        bufferid;
>> +
>> +            CHECK_FOR_INTERRUPTS();
>> +
>> +            bufHdr = GetBufferDescriptor(i);
>> +
>> +            /* Lock each buffer header before inspecting. */
>> +            buf_state = LockBufHdr(bufHdr);
>> +            bufferid = BufferDescriptorGetBuffer(bufHdr);
>> +
>> +            UnlockBufHdr(bufHdr, buf_state);
> 
> Given that the only thing you're getting here is the buffer id, it's probably
> not worth getting the spinlock, a single 4 byte read is always atomic on our
> platforms.
> 

Fine with me. I don't expect this to make a measurable difference so I
kept the spinlock, but if we want to remove it, I won't obect.

> 
>> +            /*
>> +             * If we have multiple OS pages per buffer, fill those in too. We
>> +             * always want at least one OS page, even if there are multiple
>> +             * buffers per page.
>> +             *
>> +             * Altough we could query just once per each OS page, we do it
>> +             * repeatably for each Buffer and hit the same address as
>> +             * move_pages(2) requires page aligment. This also simplifies
>> +             * retrieval code later on. Also NBuffers starts from 1.
>> +             */
>> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
>> +            {
>> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
>> +
>> +                fctx->record[idx].bufferid = bufferid;
>> +                fctx->record[idx].numa_page = j;
>> +
>> +                os_page_ptrs[idx]
>> +                    = (char *) TYPEALIGN(os_page_size,
>> +                                         buffptr + (os_page_size * j));
> 
> FWIW, this bit here is the most expensive part of the function itself, as the
> compiler has no choice than to implement it as an actual division, as
> os_page_size is runtime variable.
> 
> It'd be fine to leave it like that, the call to numa_move_pages() is way more
> expensive. But it shouldn't be too hard to do that alignment once, rather than
> having to do it over and over.
> 

Division? It's entirely possible I'm missing something obvious, but I
don't see any divisions in this code. You're however right we could get
rid of most of this, because we could get the buffer pointer once (it's
a bit silly we get it for each page), align that, and then simply add
the page size. Something like this:

    /* align to start of OS page, determine pointer to end of buffer */
    char    *buffptr = (char *) BufferGetBlock(i + 1);
    char    *ptr = buffptr - (buffptr % os_page_size);
    char    *endptr = buffptr + BLCKSZ;

    while (ptr < endptr)
    {
        os_page_ptrs[idx] = ptr;
        ...
        ptr += os_page_size;
    }

This also made me think a bit more about how the blocks and pages might
align / overlap. AFAIK the buffers are aligned to PG_IO_ALIGN_SIZE,
which on x86 is 4K, i.e. the same as OS page size. But let's say the OS
page size is larger, say 1MB. AFAIK it could happen a buffer could span
multiple larger memory pages.

For example, 8K buffer could start 4K before the 1MB page boundary, and
use 4K from the next memory page. This would mean the current formulas
for buffer_per_page and pages_per_buffer can be off by 1.

This would complicate calculating os_page_count a bit, because only some
buffers would actually need the +1 (in the array / view output).

Or what do I miss? It there something that guarantees this won't happen?

> 
> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> works quite right for me.
> 
> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
> pages, which isn't that?
> 
> With "adjust page_num" I get a number that starts at -1 and then increments
> from there. More correct, but doesn't quite seem right either.
> >
>> +    <tbody>
>> +     <row>
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>bufferid</structfield> <type>integer</type>
>> +      </para>
>> +      <para>
>> +       ID, in the range 1..<varname>shared_buffers</varname>
>> +      </para></entry>
>> +     </row>
>> +
>> +     <row>
>> +      <entry role="catalog_table_entry"><para role="column_definition">
>> +       <structfield>page_num</structfield> <type>int</type>
>> +      </para>
>> +      <para>
>> +       number of OS memory page for this buffer
>> +      </para></entry>
>> +     </row>
> 
> "page_num" doesn't really seem very descriptive for "number of OS memory page
> for this buffer".  To me "number of" sounds like it's counting the number of
> associated pages, but it's really just a "page id" or something like that.
> 
> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> 

Yeah, I haven't updated the docs in 0003 when adjusting the page_num
definition. It was more an experiment to see if others like this change.

> 
> 
> 
>> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001
>> From: Jakub Wartak <jakub.wartak@enterprisedb.com>
>> Date: Fri, 21 Feb 2025 14:20:18 +0100
>> Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view
>>
>> Introduce new pg_shmem_alloctions_numa view with information about how
>> shared memory is distributed across NUMA nodes.
> 
> Nice.
> 
> 
>> Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
>> Reviewed-by: Andres Freund <andres@anarazel.de>
>> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
>> Reviewed-by: Tomas Vondra <tomas@vondra.me>
>> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
>> ---
>>  doc/src/sgml/system-views.sgml           |  79 ++++++++++++
>>  src/backend/catalog/system_views.sql     |   8 ++
>>  src/backend/storage/ipc/shmem.c          | 152 +++++++++++++++++++++++
>>  src/include/catalog/pg_proc.dat          |   8 ++
>>  src/test/regress/expected/numa.out       |  12 ++
>>  src/test/regress/expected/numa_1.out     |   3 +
>>  src/test/regress/expected/privileges.out |  16 ++-
>>  src/test/regress/expected/rules.out      |   4 +
>>  src/test/regress/parallel_schedule       |   2 +-
>>  src/test/regress/sql/numa.sql            |   9 ++
>>  src/test/regress/sql/privileges.sql      |   6 +-
>>  11 files changed, 294 insertions(+), 5 deletions(-)
>>  create mode 100644 src/test/regress/expected/numa.out
>>  create mode 100644 src/test/regress/expected/numa_1.out
>>  create mode 100644 src/test/regress/sql/numa.sql
>>
>> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
>> index 4f336ee0adf..a83365ae24a 100644
>> --- a/doc/src/sgml/system-views.sgml
>> +++ b/doc/src/sgml/system-views.sgml
>> @@ -181,6 +181,11 @@
>>        <entry>shared memory allocations</entry>
>>       </row>
>>
>> +     <row>
>> +      <entry><link
linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
>> +      <entry>NUMA node mappings for shared memory allocations</entry>
>> +     </row>
>> +
>>       <row>
>>        <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
>>        <entry>planner statistics</entry>
>> @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
>>    </para>
>>   </sect1>
>>
>> + <sect1 id="view-pg-shmem-allocations-numa">
>> +  <title><structname>pg_shmem_allocations_numa</structname></title>
>> +
>> +  <indexterm zone="view-pg-shmem-allocations-numa">
>> +   <primary>pg_shmem_allocations_numa</primary>
>> +  </indexterm>
>> +
>> +  <para>
>> +   The <structname>pg_shmem_allocations_numa</structname> shows how shared
>> +   memory allocations in the server's main shared memory segment are distributed
>> +   across NUMA nodes. This includes both memory allocated by
>> +   <productname>PostgreSQL</productname> itself and memory allocated
>> +   by extensions using the mechanisms detailed in
>> +   <xref linkend="xfunc-shared-addin" />.
>> +  </para>
> 
> I think it'd be good to describe that the view will include multiple rows for
> each name if spread across multiple numa nodes.
> 
> Perhaps also that querying this view is expensive and that
> pg_shmem_allocations should be used if numa information isn't needed?
> 
> 
>> +    /*
>> +     * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
>> +     * the OS may have different memory page sizes.
>> +     *
>> +     * To correctly map between them, we need to: 1. Determine the OS memory
>> +     * page size 2. Calculate how many OS pages are used by all buffer blocks
>> +     * 3. Calculate how many OS pages are contained within each database
>> +     * block.
>> +     *
>> +     * This information is needed before calling move_pages() for NUMA memory
>> +     * node inquiry.
>> +     */
>> +    os_page_size = pg_numa_get_pagesize();
>> +
>> +    /*
>> +     * Allocate memory for page pointers and status based on total shared
>> +     * memory size. This simplified approach allocates enough space for all
>> +     * pages in shared memory rather than calculating the exact requirements
>> +     * for each segment.
>> +     *
>> +     * XXX Isn't this wasteful? But there probably is one large segment of
>> +     * shared memory, much larger than the rest anyway.
>> +     */
>> +    shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
>> +    page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
>> +    pages_status = palloc(sizeof(int) * shm_total_page_count);
> 
> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> could more be moved to a shared helper function?
> 
> 
>> +    hash_seq_init(&hstat, ShmemIndex);
>> +
>> +    /* output all allocated entries */
>> +    memset(nulls, 0, sizeof(nulls));
>> +    while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
>> +    {
> 
> One thing that's interesting with using ShmemIndex is that this won't get
> anonymous allocations. I think that's ok for now, it'd be complicated to
> figure out the unmapped "regions".  But I guess it' be good to mention it in
> the ocs?
> 

Agreed.

> 
> 
> 
>> +        int            i;
>> +
>> +        /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
>> +         * It's a bit misleading to call that "aligned", no? */
>> +
>> +        /* Get number of OS aligned pages */
>> +        shm_ent_page_count
>> +            = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
>> +
>> +        /*
>> +         * If we get ever 0xff back from kernel inquiry, then we probably have
>> +         * bug in our buffers to OS page mapping code here.
>> +         */
>> +        memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
> 
> There's obviously no guarantee that shm_ent_page_count is a multiple of
> os_page_size.  I think it'd be interesting to show in the view when one shmem
> allocation shares a page with the prior allocation - that can contribute a bit
> to contention.  What about showing a start_os_page_id and end_os_page_id or
> something?  That could be a feature for later though.
> 

Yeah, adding first/last page might be interesting.

> 
>> +SELECT NOT(pg_numa_available()) AS skip_test \gset
>> +\if :skip_test
>> +\quit
>> +\endif
>> +-- switch to superuser
>> +\c -
>> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
>> + ok
>> +----
>> + t
>> +(1 row)
> 
> Could it be worthwhile to run the test if !pg_numa_available(), to test that
> we do the right thing in that case? We need an alternative output anyway, so
> that might be fine?
> 

+1


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

06 April, 14:56:54


On 4/6/25 01:00, Andres Freund wrote:
> Hi,
> 
> On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
>> I think one thing that the docs should mention is that calling the numa
>> functions/views will force the pages to be allocated, even if they're
>> currently unused.
>>
>> Newly started server, with s_b of 32GB an 2MB huge pages:
>>
>>   grep ^Huge /proc/meminfo
>>   HugePages_Total:   34802
>>   HugePages_Free:    34448
>>   HugePages_Rsvd:    16437
>>   HugePages_Surp:        0
>>   Hugepagesize:       2048 kB
>>   Hugetlb:        76517376 kB
>>
>> run
>>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
>>
>> Now the pages that previously were marked as reserved are actually allocated:
>>
>>   grep ^Huge /proc/meminfo
>>   HugePages_Total:   34802
>>   HugePages_Free:    18012
>>   HugePages_Rsvd:        1
>>   HugePages_Surp:        0
>>   Hugepagesize:       2048 kB
>>   Hugetlb:        76517376 kB
>>
>>
>> I don't see how we can avoid that right now, but at the very least we ought to
>> document it.
> 
> The only allocation where that really matters is shared_buffers. I wonder if
> we could special case the logic for that, by only probing if at least one of
> the buffers in the range is valid.
> 
> Then we could treat a page status of -ENOENT as "page is not mapped" and
> display NULL for the node_id?
> 
> Of course that would mean that we'd always need to
> pg_numa_touch_mem_if_required(), not just the first time round, because we
> previously might not have for a page that is now valid.  But compared to the
> cost of actually allocating pages, the cost for that seems small.
> 

I don't think this would be a good trade off. The buffers already have a
NUMA node, and users would be interested in that. It's just that we
don't have the memory mapped in the current backend, so I'd bet people
would not be happy with NULL, and would proceed to force the allocation
in some other way (say, a large query of some sort). Which obviously
causes a lot of other problems.

I can imagine having a flag that makes the allocation optional, but
there's no convenient way to pass that to a view, and I think most
people want the allocation anyway.

Especially for monitoring purposes, which usually happens in a new
connection, so the backend has little opportunity to allocate the pages
"naturally."

regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

06 April, 15:57:05

On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

Hi Andres/Tomas,

I've noticed that Tomas responded to this while writing this, so I'm
attaching git-am patches based on his v25 (no squash) and there's only
one new (last one contains fixes based on this review) + slight commit
amendment to 0001.

> I just played around with this for a bit.  As noted somewhere further down,
> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> pages.

I think page_num is under heavy work in progress... I'm still not sure
is it worth exposing this (is it worth the hassle). If we scratch it
it won't be perfect, but we have everything , otherwise we risk this
feature as we are going into a feature freeze literally tomorrow.

> I think one thing that the docs should mention is that calling the numa
> functions/views will force the pages to be allocated, even if they're
> currently unused.
[..]
> I don't see how we can avoid that right now, but at the very least we ought to
> document it.

Added statement about this.

> On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
> > The libnuma library is not available on 32-bit builds (there's no shared
> > object for i386), so we disable it in that case. The i386 is very memory
> > limited anyway, even with PAE, so NUMA is mostly irrelevant.
>
> Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
> library.

OK, removed from the commit message, as yeah google states it really
exists (somehow I couldn't find it back then at least here)...

> > +###############################################################
> > +# Library: libnuma
> > +###############################################################
> > +
> > +libnumaopt = get_option('libnuma')
> > +if not libnumaopt.disabled()
> > +  # via pkg-config
> > +  libnuma = dependency('numa', required: libnumaopt)
> > +  if not libnuma.found()
> > +    libnuma = cc.find_library('numa', required: libnumaopt)
> > +  endif
>
> This fallback isn't going to work if -dlibnuma=enabled is used, as
> dependency() will error out, due to not finding a required dependency. You'd
> need to use required: false there.
>
> Do we actually need a non-dependency() fallback here?  It's linux only and a
> new dependency, so just requiring a .pc file seems like it'd be fine?

I'm not sure pkg-config is present everywhere, but I'm not expernt and
AFAIR we are not consistent how various libs are handled there. It's
quite late, but for now i've now just follwed Your's recommendation
for dependency() with false.

> > +#ifdef USE_LIBNUMA
> > +
> > +/*
> > + * This is required on Linux, before pg_numa_query_pages() as we
> > + * need to page-fault before move_pages(2) syscall returns valid results.
> > + */
> > +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
> > +     ro_volatile_var = *(uint64 *) ptr
>
> Does it really work that way? A volatile variable to assign the result of
> dereferencing ptr ensures that the *store* isn't removed by the compiler, but
> it doesn't at all guarantee that the *load* isn't removed, since that memory
> isn't marked as volatile.
>
> I think you'd need to cast the source pointer to a volatile uint64* to ensure
> the load happens.

OK, thanks, good finding, I was not aware compiler can bypass it like that.

[..]
> > +CREATE OR REPLACE VIEW pg_buffercache_numa AS
>
> Why CREATE OR REPLACE?

Fixed.

> > +Datum
> > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> > +{
> > ...
> > +
> > +             /*
> > +              * To smoothly support upgrades from version 1.0 of this extension
> > +              * transparently handle the (non-)existence of the pinning_backends
> > +              * column. We unfortunately have to get the result type for that... -
> > +              * we can't use the result type determined by the function definition
> > +              * without potentially crashing when somebody uses the old (or even
> > +              * wrong) function definition though.
> > +              */
> > +             if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
> > +                     elog(ERROR, "return type must be a row type");
> > +
>
> Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?

Removed.

> > +
> > +             if (firstNumaTouch)
> > +                     elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
>
> Over the patchseries the related code is duplicated. Seems like it'd be good
> to put it into pg_numa instead?  This seems like the thing that's good to
> abstract away in one central spot.

I hear you, but we are using those statistics for per-localized-code
(shm.c's firstTouch != pgbuffercache.c's firstTouch).

> > +             /*
> > +              * Scan through all the buffers, saving the relevant fields in the
> > +              * fctx->record structure.
> > +              *
> > +              * We don't hold the partition locks, so we don't get a consistent
> > +              * snapshot across all buffers, but we do grab the buffer header
> > +              * locks, so the information of each buffer is self-consistent.
> > +              *
> > +              * This loop touches and stores addresses into os_page_ptrs[] as input
> > +              * to one big big move_pages(2) inquiry system call. Basically we ask
> > +              * for all memory pages for NBuffers.
> > +              */
> > +             idx = 0;
> > +             for (i = 0; i < NBuffers; i++)
> > +             {
> > +                     BufferDesc *bufHdr;
> > +                     uint32          buf_state;
> > +                     uint32          bufferid;
> > +
> > +                     CHECK_FOR_INTERRUPTS();
> > +
> > +                     bufHdr = GetBufferDescriptor(i);
> > +
> > +                     /* Lock each buffer header before inspecting. */
> > +                     buf_state = LockBufHdr(bufHdr);
> > +                     bufferid = BufferDescriptorGetBuffer(bufHdr);
> > +
> > +                     UnlockBufHdr(bufHdr, buf_state);
>
> Given that the only thing you're getting here is the buffer id, it's probably
> not worth getting the spinlock, a single 4 byte read is always atomic on our
> platforms.

Well, I think this is just copy&pasted from original function, so we
follow the pattern for consistency.

> > +                     /*
> > +                      * If we have multiple OS pages per buffer, fill those in too. We
> > +                      * always want at least one OS page, even if there are multiple
> > +                      * buffers per page.
> > +                      *
> > +                      * Altough we could query just once per each OS page, we do it
> > +                      * repeatably for each Buffer and hit the same address as
> > +                      * move_pages(2) requires page aligment. This also simplifies
> > +                      * retrieval code later on. Also NBuffers starts from 1.
> > +                      */
> > +                     for (j = 0; j < Max(1, pages_per_buffer); j++)
> > +                     {
> > +                             char       *buffptr = (char *) BufferGetBlock(i + 1);
> > +
> > +                             fctx->record[idx].bufferid = bufferid;
> > +                             fctx->record[idx].numa_page = j;
> > +
> > +                             os_page_ptrs[idx]
> > +                                     = (char *) TYPEALIGN(os_page_size,
> > +                                                                              buffptr + (os_page_size * j));
>
> FWIW, this bit here is the most expensive part of the function itself, as the
> compiler has no choice than to implement it as an actual division, as
> os_page_size is runtime variable.
>
> It'd be fine to leave it like that, the call to numa_move_pages() is way more
> expensive. But it shouldn't be too hard to do that alignment once, rather than
> having to do it over and over.

TBH, I don't think we should spend lot of time optimizing, after all
it's debugging view (at the start I was actually considering putting
it as developer-only compile time option, but with shm view it is
actually usuable for others too and well... we want to have it as
foundation for real NUMA optimizations)

> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> works quite right for me.
>
> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
> pages, which isn't that?
>
> With "adjust page_num" I get a number that starts at -1 and then increments
> from there. More correct, but doesn't quite seem right either.

Apparently handling this special case of splitted buffers edge-cases
was Pandora box ;) Tomas what do we do about it? Does that has chance
to get in before freeze?

> > +    <tbody>
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>bufferid</structfield> <type>integer</type>
> > +      </para>
> > +      <para>
> > +       ID, in the range 1..<varname>shared_buffers</varname>
> > +      </para></entry>
> > +     </row>
> > +
> > +     <row>
> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>page_num</structfield> <type>int</type>
> > +      </para>
> > +      <para>
> > +       number of OS memory page for this buffer
> > +      </para></entry>
> > +     </row>
>
> "page_num" doesn't really seem very descriptive for "number of OS memory page
> for this buffer".  To me "number of" sounds like it's counting the number of
> associated pages, but it's really just a "page id" or something like that.
>
> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?

Tomas, are you good with rename? I think I've would also prefer os_page_id.

> > diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
[..]
> > +  <para>
> > +   The <structname>pg_shmem_allocations_numa</structname> shows how shared
> > +   memory allocations in the server's main shared memory segment are distributed
> > +   across NUMA nodes. This includes both memory allocated by
> > +   <productname>PostgreSQL</productname> itself and memory allocated
> > +   by extensions using the mechanisms detailed in
> > +   <xref linkend="xfunc-shared-addin" />.
> > +  </para>
>
> I think it'd be good to describe that the view will include multiple rows for
> each name if spread across multiple numa nodes.

Added.

> Perhaps also that querying this view is expensive and that
> pg_shmem_allocations should be used if numa information isn't needed?

Already covered by 1st finding fix.

> > +     /*
> > +      * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> > +      * the OS may have different memory page sizes.
> > +      *
> > +      * To correctly map between them, we need to: 1. Determine the OS memory
> > +      * page size 2. Calculate how many OS pages are used by all buffer blocks
> > +      * 3. Calculate how many OS pages are contained within each database
> > +      * block.
> > +      *
> > +      * This information is needed before calling move_pages() for NUMA memory
> > +      * node inquiry.
> > +      */
> > +     os_page_size = pg_numa_get_pagesize();
> > +
> > +     /*
> > +      * Allocate memory for page pointers and status based on total shared
> > +      * memory size. This simplified approach allocates enough space for all
> > +      * pages in shared memory rather than calculating the exact requirements
> > +      * for each segment.
> > +      *
> > +      * XXX Isn't this wasteful? But there probably is one large segment of
> > +      * shared memory, much larger than the rest anyway.
> > +      */
> > +     shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> > +     page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> > +     pages_status = palloc(sizeof(int) * shm_total_page_count);
>
> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> could more be moved to a shared helper function?

-> Tomas?

> > +     hash_seq_init(&hstat, ShmemIndex);
> > +
> > +     /* output all allocated entries */
> > +     memset(nulls, 0, sizeof(nulls));
> > +     while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> > +     {
>
> One thing that's interesting with using ShmemIndex is that this won't get
> anonymous allocations. I think that's ok for now, it'd be complicated to
> figure out the unmapped "regions".  But I guess it' be good to mention it in
> the ocs?

Added.

> > +             int                     i;
> > +
> > +             /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
> > +              * It's a bit misleading to call that "aligned", no? */
> > +
> > +             /* Get number of OS aligned pages */
> > +             shm_ent_page_count
> > +                     = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
> > +
> > +             /*
> > +              * If we get ever 0xff back from kernel inquiry, then we probably have
> > +              * bug in our buffers to OS page mapping code here.
> > +              */
> > +             memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
>
> There's obviously no guarantee that shm_ent_page_count is a multiple of
> os_page_size.  I think it'd be interesting to show in the view when one shmem
> allocation shares a page with the prior allocation - that can contribute a bit
> to contention.  What about showing a start_os_page_id and end_os_page_id or
> something?  That could be a feature for later though.

I was thinking about it, but it could be done when analyzing this
together with data from pg_shmem_allocations(?) My worry is timing :(
Anyway, we could extend this view in future revisions.

> > +SELECT NOT(pg_numa_available()) AS skip_test \gset
> > +\if :skip_test
> > +\quit
> > +\endif
> > +-- switch to superuser
> > +\c -
> > +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> > + ok
> > +----
> > + t
> > +(1 row)
>
> Could it be worthwhile to run the test if !pg_numa_available(), to test that
> we do the right thing in that case? We need an alternative output anyway, so
> that might be fine?

Added. the meson test passes, but I'm sending it as fast as possible
to avoid a clash with Tomas.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

06 April, 16:52:20


On 4/6/25 14:57, Jakub Wartak wrote:
> On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
> 
> Hi Andres/Tomas,
> 
> I've noticed that Tomas responded to this while writing this, so I'm
> attaching git-am patches based on his v25 (no squash) and there's only
> one new (last one contains fixes based on this review) + slight commit
> amendment to 0001.
> 

I'm not working on this at the moment. I may have a bit of time in the
evening, but more likely I'll get back to this on Monday.

>> I just played around with this for a bit.  As noted somewhere further down,
>> pg_buffercache_numa.page_num ends up wonky in different ways for the different
>> pages.
> 
> I think page_num is under heavy work in progress... I'm still not sure
> is it worth exposing this (is it worth the hassle). If we scratch it
> it won't be perfect, but we have everything , otherwise we risk this
> feature as we are going into a feature freeze literally tomorrow.
> 

IMHO it's not difficult to change the definition of page_num this way,
it's pretty much a one line change. It's more a question of whether we
actually want to expose this.

[snip]

>>> +
>>> +             if (firstNumaTouch)
>>> +                     elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
>>
>> Over the patchseries the related code is duplicated. Seems like it'd be good
>> to put it into pg_numa instead?  This seems like the thing that's good to
>> abstract away in one central spot.
> 
> I hear you, but we are using those statistics for per-localized-code
> (shm.c's firstTouch != pgbuffercache.c's firstTouch).
> 

Yeah, I don't moving this is quite possible / useful. We could pass the
flag somewhere, but we still need to check & update it in local code.

>>> +                     /*
>>> +                      * If we have multiple OS pages per buffer, fill those in too. We
>>> +                      * always want at least one OS page, even if there are multiple
>>> +                      * buffers per page.
>>> +                      *
>>> +                      * Altough we could query just once per each OS page, we do it
>>> +                      * repeatably for each Buffer and hit the same address as
>>> +                      * move_pages(2) requires page aligment. This also simplifies
>>> +                      * retrieval code later on. Also NBuffers starts from 1.
>>> +                      */
>>> +                     for (j = 0; j < Max(1, pages_per_buffer); j++)
>>> +                     {
>>> +                             char       *buffptr = (char *) BufferGetBlock(i + 1);
>>> +
>>> +                             fctx->record[idx].bufferid = bufferid;
>>> +                             fctx->record[idx].numa_page = j;
>>> +
>>> +                             os_page_ptrs[idx]
>>> +                                     = (char *) TYPEALIGN(os_page_size,
>>> +                                                                              buffptr + (os_page_size * j));
>>
>> FWIW, this bit here is the most expensive part of the function itself, as the
>> compiler has no choice than to implement it as an actual division, as
>> os_page_size is runtime variable.
>>
>> It'd be fine to leave it like that, the call to numa_move_pages() is way more
>> expensive. But it shouldn't be too hard to do that alignment once, rather than
>> having to do it over and over.
> 
> TBH, I don't think we should spend lot of time optimizing, after all
> it's debugging view (at the start I was actually considering putting
> it as developer-only compile time option, but with shm view it is
> actually usuable for others too and well... we want to have it as
> foundation for real NUMA optimizations)
> 

I agree with this.

>> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
>> works quite right for me.
>>
>> This definition afaict is always 0 when using huge pages and just 0 and 1 for
>> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
>> pages, which isn't that?
>>
>> With "adjust page_num" I get a number that starts at -1 and then increments
>> from there. More correct, but doesn't quite seem right either.
> 
> Apparently handling this special case of splitted buffers edge-cases
> was Pandora box ;) Tomas what do we do about it? Does that has chance
> to get in before freeze?
> 

I don't think the split buffers are pandora box on their own, it's more
that it made us notice other issues / questions. I don't think handling
it is particularly complex - the most difficult part seems to be
figuring out how many rows we'll return, and mapping them to pages.
But that's not very difficult, IMO.

The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
which may return a pointer from before the first buffer. But I guess
that's OK, thanks to how memory is allocated - at least, that's what all
the move_pages() examples I found do, so unless those are all broken,
that's OK.

>>> +    <tbody>
>>> +     <row>
>>> +      <entry role="catalog_table_entry"><para role="column_definition">
>>> +       <structfield>bufferid</structfield> <type>integer</type>
>>> +      </para>
>>> +      <para>
>>> +       ID, in the range 1..<varname>shared_buffers</varname>
>>> +      </para></entry>
>>> +     </row>
>>> +
>>> +     <row>
>>> +      <entry role="catalog_table_entry"><para role="column_definition">
>>> +       <structfield>page_num</structfield> <type>int</type>
>>> +      </para>
>>> +      <para>
>>> +       number of OS memory page for this buffer
>>> +      </para></entry>
>>> +     </row>
>>
>> "page_num" doesn't really seem very descriptive for "number of OS memory page
>> for this buffer".  To me "number of" sounds like it's counting the number of
>> associated pages, but it's really just a "page id" or something like that.
>>
>> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> 
> Tomas, are you good with rename? I think I've would also prefer os_page_id.
> 

Fine with me.

>>> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
> [..]
>>> +  <para>
>>> +   The <structname>pg_shmem_allocations_numa</structname> shows how shared
>>> +   memory allocations in the server's main shared memory segment are distributed
>>> +   across NUMA nodes. This includes both memory allocated by
>>> +   <productname>PostgreSQL</productname> itself and memory allocated
>>> +   by extensions using the mechanisms detailed in
>>> +   <xref linkend="xfunc-shared-addin" />.
>>> +  </para>
>>
>> I think it'd be good to describe that the view will include multiple rows for
>> each name if spread across multiple numa nodes.
> 
> Added.
> 
>> Perhaps also that querying this view is expensive and that
>> pg_shmem_allocations should be used if numa information isn't needed?
> 
> Already covered by 1st finding fix.
> 
>>> +     /*
>>> +      * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
>>> +      * the OS may have different memory page sizes.
>>> +      *
>>> +      * To correctly map between them, we need to: 1. Determine the OS memory
>>> +      * page size 2. Calculate how many OS pages are used by all buffer blocks
>>> +      * 3. Calculate how many OS pages are contained within each database
>>> +      * block.
>>> +      *
>>> +      * This information is needed before calling move_pages() for NUMA memory
>>> +      * node inquiry.
>>> +      */
>>> +     os_page_size = pg_numa_get_pagesize();
>>> +
>>> +     /*
>>> +      * Allocate memory for page pointers and status based on total shared
>>> +      * memory size. This simplified approach allocates enough space for all
>>> +      * pages in shared memory rather than calculating the exact requirements
>>> +      * for each segment.
>>> +      *
>>> +      * XXX Isn't this wasteful? But there probably is one large segment of
>>> +      * shared memory, much larger than the rest anyway.
>>> +      */
>>> +     shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
>>> +     page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
>>> +     pages_status = palloc(sizeof(int) * shm_total_page_count);
>>
>> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
>> could more be moved to a shared helper function?
> 
> -> Tomas?
> 

I'm not against that in principle, but when I tried it didn't quite help
that much. But maybe it's better with the current patch version.

>>> +     hash_seq_init(&hstat, ShmemIndex);
>>> +
>>> +     /* output all allocated entries */
>>> +     memset(nulls, 0, sizeof(nulls));
>>> +     while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
>>> +     {
>>
>> One thing that's interesting with using ShmemIndex is that this won't get
>> anonymous allocations. I think that's ok for now, it'd be complicated to
>> figure out the unmapped "regions".  But I guess it' be good to mention it in
>> the ocs?
> 
> Added.
> 
>>> +             int                     i;
>>> +
>>> +             /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
>>> +              * It's a bit misleading to call that "aligned", no? */
>>> +
>>> +             /* Get number of OS aligned pages */
>>> +             shm_ent_page_count
>>> +                     = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
>>> +
>>> +             /*
>>> +              * If we get ever 0xff back from kernel inquiry, then we probably have
>>> +              * bug in our buffers to OS page mapping code here.
>>> +              */
>>> +             memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
>>
>> There's obviously no guarantee that shm_ent_page_count is a multiple of
>> os_page_size.  I think it'd be interesting to show in the view when one shmem
>> allocation shares a page with the prior allocation - that can contribute a bit
>> to contention.  What about showing a start_os_page_id and end_os_page_id or
>> something?  That could be a feature for later though.
> 
> I was thinking about it, but it could be done when analyzing this
> together with data from pg_shmem_allocations(?) My worry is timing :(
> Anyway, we could extend this view in future revisions.
> 

I'd leave this out for now. It's not difficult, but let's focus on the
other issues.

>>> +SELECT NOT(pg_numa_available()) AS skip_test \gset
>>> +\if :skip_test
>>> +\quit
>>> +\endif
>>> +-- switch to superuser
>>> +\c -
>>> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
>>> + ok
>>> +----
>>> + t
>>> +(1 row)
>>
>> Could it be worthwhile to run the test if !pg_numa_available(), to test that
>> we do the right thing in that case? We need an alternative output anyway, so
>> that might be fine?
> 
> Added. the meson test passes, but I'm sending it as fast as possible
> to avoid a clash with Tomas.
> 

Please keep working on this. I may hava a bit of time in the evening,
but in the worst case I'll merge it into your patch.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

07 April, 11:09:26

On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 4/6/25 14:57, Jakub Wartak wrote:
> > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> Hi,
> >
> > Hi Andres/Tomas,
> >
> > I've noticed that Tomas responded to this while writing this, so I'm
> > attaching git-am patches based on his v25 (no squash) and there's only
> > one new (last one contains fixes based on this review) + slight commit
> > amendment to 0001.
> >
>
> I'm not working on this at the moment. I may have a bit of time in the
> evening, but more likely I'll get back to this on Monday.

OK, tried to fix all outstanding issues (except maybe some tiny code
refactors for beatification)

> >> I just played around with this for a bit.  As noted somewhere further down,
> >> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> >> pages.
> >
> > I think page_num is under heavy work in progress... I'm still not sure
> > is it worth exposing this (is it worth the hassle). If we scratch it
> > it won't be perfect, but we have everything , otherwise we risk this
> > feature as we are going into a feature freeze literally tomorrow.
> >
>
> IMHO it's not difficult to change the definition of page_num this way,
> it's pretty much a one line change. It's more a question of whether we
> actually want to expose this.

Bertrand noticed this first in
https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal
:

-               startptr = (char *) BufferGetBlock(1);
+               startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char
*) BufferGetBlock(1));

With the above I'm also not getting wonky (-1) results anymore. The
rest of reply assumes we are using this.

> [snip]
>
> >>> +
> >>> +             if (firstNumaTouch)
> >>> +                     elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> >>
> >> Over the patchseries the related code is duplicated. Seems like it'd be good
> >> to put it into pg_numa instead?  This seems like the thing that's good to
> >> abstract away in one central spot.
> >
> > I hear you, but we are using those statistics for per-localized-code
> > (shm.c's firstTouch != pgbuffercache.c's firstTouch).
> >
>
> Yeah, I don't moving this is quite possible / useful. We could pass the
> flag somewhere, but we still need to check & update it in local code.

No idea how such code should be looking, but it would be more code?
Are you thinking about some code still touching local &firstNumaTouch
?

> >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> >> works quite right for me.
> >>
> >> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> >> 4k pages.  But my understanding of numa_page is that it's the "id" of the numa
> >> pages, which isn't that?
> >>
> >> With "adjust page_num" I get a number that starts at -1 and then increments
> >> from there. More correct, but doesn't quite seem right either.
> >
> > Apparently handling this special case of splitted buffers edge-cases
> > was Pandora box ;) Tomas what do we do about it? Does that has chance
> > to get in before freeze?
> >
>
> I don't think the split buffers are pandora box on their own, it's more
> that it made us notice other issues / questions. I don't think handling
> it is particularly complex - the most difficult part seems to be
> figuring out how many rows we'll return, and mapping them to pages.
> But that's not very difficult, IMO.

OK, with a fresh week, and fresh mind and a different name (ospageid)
it looks better to me now.

> The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
> which may return a pointer from before the first buffer. But I guess
> that's OK, thanks to how memory is allocated - at least, that's what all
> the move_pages() examples I found do, so unless those are all broken,
> that's OK.

I agree. This took some more time, but in case of

a) pg_buffercache_numa and HP=off view we shouldn't access ptr below
buffercache, because my understanding is that shm memory would be page
aligned anyway as per BufferManagerShmemInit() which uses
TYPEALGIN(PG_IO_ALIGN_SIZE) for it anyway. So when you query
pg_buffercache_numa, one gets the following pages:

# strace -fp 14364 -e move_pages
strace: Process 14364 attached
move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,...

(gdb) print BufferBlocks
$1 = 0x7f8bfe4b5000 ""

while BufferBlocks actually starts there, so we are good without HP.
With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs (gdb)  print
BufferBlocks
=> $1 = 0x7ff33c879000
so we are actually accessing earlier pointer (!), but Buffer Blocks is
like 15-th structure there (and there's always going be something
earlier to my understanding):

select row_number() over (order by off),* from pg_shmem_allocations
order by off asc limit 15;
 row_number |          name          |   off   |   size    | allocated_size
------------+------------------------+---------+-----------+----------------
[..]
         13 | Shared MultiXact State | 5735936 |      1148 |           1152
         14 | Buffer Descriptors     | 5737088 |   1048576 |        1048576
         15 | Buffer Blocks          | 6785664 | 134221824 |      134221824

To me this is finding in itself, with HP shared_buffers being located
on page with something else...

b) in pg_shmem_allocations_numa view without HP we don't even get to
as low as ShmemBase for move_pages(), but with HP=on we hit as low as
ShmemBase but not lower, so we are good IMHO.

Wouldn't buildfarm actually tell us this if that's bad as an insurance policy?

> >>> +    <tbody>
> >>> +     <row>
> >>> +      <entry role="catalog_table_entry"><para role="column_definition">
> >>> +       <structfield>bufferid</structfield> <type>integer</type>
> >>> +      </para>
> >>> +      <para>
> >>> +       ID, in the range 1..<varname>shared_buffers</varname>
> >>> +      </para></entry>
> >>> +     </row>
> >>> +
> >>> +     <row>
> >>> +      <entry role="catalog_table_entry"><para role="column_definition">
> >>> +       <structfield>page_num</structfield> <type>int</type>
> >>> +      </para>
> >>> +      <para>
> >>> +       number of OS memory page for this buffer
> >>> +      </para></entry>
> >>> +     </row>
> >>
> >> "page_num" doesn't really seem very descriptive for "number of OS memory page
> >> for this buffer".  To me "number of" sounds like it's counting the number of
> >> associated pages, but it's really just a "page id" or something like that.
> >>
> >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> >
> > Tomas, are you good with rename? I think I've would also prefer os_page_id.
> >
>
> Fine with me.

Done, s/page_num/ospageid/g as whole pg_buffercache does not use "_"
anywhere, so let's stick to that. Done that for nodeid too.

> >>> +     /*
> >>> +      * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> >>> +      * the OS may have different memory page sizes.
> >>> +      *
> >>> +      * To correctly map between them, we need to: 1. Determine the OS memory
> >>> +      * page size 2. Calculate how many OS pages are used by all buffer blocks
> >>> +      * 3. Calculate how many OS pages are contained within each database
> >>> +      * block.
> >>> +      *
> >>> +      * This information is needed before calling move_pages() for NUMA memory
> >>> +      * node inquiry.
> >>> +      */
> >>> +     os_page_size = pg_numa_get_pagesize();
> >>> +
> >>> +     /*
> >>> +      * Allocate memory for page pointers and status based on total shared
> >>> +      * memory size. This simplified approach allocates enough space for all
> >>> +      * pages in shared memory rather than calculating the exact requirements
> >>> +      * for each segment.
> >>> +      *
> >>> +      * XXX Isn't this wasteful? But there probably is one large segment of
> >>> +      * shared memory, much larger than the rest anyway.
> >>> +      */
> >>> +     shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> >>> +     page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> >>> +     pages_status = palloc(sizeof(int) * shm_total_page_count);
> >>
> >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> >> could more be moved to a shared helper function?
> >
> > -> Tomas?
> >
>
> I'm not against that in principle, but when I tried it didn't quite help
> that much. But maybe it's better with the current patch version.
>
> >>> +     hash_seq_init(&hstat, ShmemIndex);
> >>> +
> >>> +     /* output all allocated entries */
> >>> +     memset(nulls, 0, sizeof(nulls));
> >>> +     while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> >>> +     {
> >>
> >> One thing that's interesting with using ShmemIndex is that this won't get
> >> anonymous allocations. I think that's ok for now, it'd be complicated to
> >> figure out the unmapped "regions".  But I guess it' be good to mention it in
> >> the ocs?
> >
> > Added.
> >

BTW: it's also in the function comment too. now.

> > [..], but I'm sending it as fast as possible
> > to avoid a clash with Tomas.
> >
>
> Please keep working on this. I may hava a bit of time in the evening,
> but in the worst case I'll merge it into your patch.

So, attached is still patchset v25, still with one-off patches (please
git am most, but just `patch -p1 < file` last two and just squash it
if you are happy. I've intentionally not squash it to provide
changelog). This LGTM.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

07 April, 12:53:23

Hi,

On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote:
> Bertrand noticed this first in
> https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal
> :
> 
> -               startptr = (char *) BufferGetBlock(1);
> +               startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char
> *) BufferGetBlock(1));
> 
> With the above I'm also not getting wonky (-1) results anymore. The
> rest of reply assumes we are using this.

yeah, I can see that you added it in v25-0007. In the same message I mentioned
to "use the actual buffer address when pg_numa_touch_mem_if_required()
is called?"

So, to be extra cautious we could do something like:

@@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)

                                /* Only need to touch memory once per backend process lifetime */
                                if (firstNumaTouch)
-                                       pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+                                       pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size));


what do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

07 April, 13:32:52

On Mon, Apr 7, 2025 at 11:53 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote:
> > Bertrand noticed this first in
> > https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal
> > :
> >
> > -               startptr = (char *) BufferGetBlock(1);
> > +               startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char
> > *) BufferGetBlock(1));
> >
> > With the above I'm also not getting wonky (-1) results anymore. The
> > rest of reply assumes we are using this.
>
> yeah, I can see that you added it in v25-0007. In the same message I mentioned
> to "use the actual buffer address when pg_numa_touch_mem_if_required()
> is called?"
>
> So, to be extra cautious we could do something like:
>
> @@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
>
>                                 /* Only need to touch memory once per backend process lifetime */
>                                 if (firstNumaTouch)
> -                                       pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
> +                                       pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size));
>
>
> what do you think?

Yeah, I think we could include this too as it looks safer (sry I've
missed that one). Attached v25 as it was , with this little tweak.

-J.

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

07 April, 18:11:11

Hi,

Here's a v26 of this patch series, merging the various fixup patches.
I've also reordered the patches so that the pg_buffercache part is last.
The two other patches are ready to go, and it seems better to push the
built-in catalog before the pg_buffercache contrib module.

For 0001 and 0002, I only have some minor tweaks - comment rewordings
etc. I kept them in separate patches to make it obvious, but I think
it's fine. The one specific tweak is to account for the +1 page, which
might happen if the buffers/pages are shifted in some way.

For 0003, I made some more serious changes - I reworked how buffers and
pages are mapped. I concluded that relying on pages_per_buffer and
buffer_per_pages is way too fiddly, if we can't guarantee the buffers
are not "shifted" in some way. Which I think we can't.

So I reworked that so that os_page_ptrs always points to actual pages
without duplicate pointers. And then for each buffer we calculate the
first page / last page, and iterate over those. A comment suggested the
old code made the retrieval simpler, but I find the new approach much
easier. And it doesn't have to worry how are the pages/buffers shifted
and so on.

The one caveat is that we can't know how many entries the function will
produce until after going through the buffers. We can say only calculate
the maximum number of entries. I think that's fine - we may allocate a
bit more space than needed, but we also save space in os_page_status.

I intend to push 0001 and 0002 shortly, and 0003 after a bit more review
and testing, unless I hear objections.


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

07 April, 18:51:29

Hi,

On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
> On 4/6/25 01:00, Andres Freund wrote:
> > On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
> >> I think one thing that the docs should mention is that calling the numa
> >> functions/views will force the pages to be allocated, even if they're
> >> currently unused.
> >>
> >> Newly started server, with s_b of 32GB an 2MB huge pages:
> >>
> >>   grep ^Huge /proc/meminfo
> >>   HugePages_Total:   34802
> >>   HugePages_Free:    34448
> >>   HugePages_Rsvd:    16437
> >>   HugePages_Surp:        0
> >>   Hugepagesize:       2048 kB
> >>   Hugetlb:        76517376 kB
> >>
> >> run
> >>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> >>
> >> Now the pages that previously were marked as reserved are actually allocated:
> >>
> >>   grep ^Huge /proc/meminfo
> >>   HugePages_Total:   34802
> >>   HugePages_Free:    18012
> >>   HugePages_Rsvd:        1
> >>   HugePages_Surp:        0
> >>   Hugepagesize:       2048 kB
> >>   Hugetlb:        76517376 kB
> >>
> >>
> >> I don't see how we can avoid that right now, but at the very least we ought to
> >> document it.
> > 
> > The only allocation where that really matters is shared_buffers. I wonder if
> > we could special case the logic for that, by only probing if at least one of
> > the buffers in the range is valid.
> > 
> > Then we could treat a page status of -ENOENT as "page is not mapped" and
> > display NULL for the node_id?
> > 
> > Of course that would mean that we'd always need to
> > pg_numa_touch_mem_if_required(), not just the first time round, because we
> > previously might not have for a page that is now valid.  But compared to the
> > cost of actually allocating pages, the cost for that seems small.
> > 
> 
> I don't think this would be a good trade off. The buffers already have a
> NUMA node, and users would be interested in that.

The thing is that the buffer might *NOT* have a numa node.  That's e.g. the
case in the above example - otherwise we wouldn't initially have seen the
large HugePages_Rsvd.

Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that.  The problem is that that leads to all those allocations to
happen on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal first
use woul have allocated it.


> It's just that we don't have the memory mapped in the current backend, so
> I'd bet people would not be happy with NULL, and would proceed to force the
> allocation in some other way (say, a large query of some sort). Which
> obviously causes a lot of other problems.

I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

07 April, 19:27:46

Hi,

On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote:
> On 4/6/25 00:29, Andres Freund wrote:
> >> +
> >> +        if (firstNumaTouch)
> >> +            elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> > 
> > Over the patchseries the related code is duplicated. Seems like it'd be good
> > to put it into pg_numa instead?  This seems like the thing that's good to
> > abstract away in one central spot.
> > 
> 
> Abstract away which part, exactly? I thought about moving some of the
> code to port/pg_numa.c, but it didn't seem worth it.

The easiest would be to just have a single function that does this for the
whole shared memory allocation, without having to integrate it with the
per-allocation or per-buffer loop.


> > 
> >> +            /*
> >> +             * If we have multiple OS pages per buffer, fill those in too. We
> >> +             * always want at least one OS page, even if there are multiple
> >> +             * buffers per page.
> >> +             *
> >> +             * Altough we could query just once per each OS page, we do it
> >> +             * repeatably for each Buffer and hit the same address as
> >> +             * move_pages(2) requires page aligment. This also simplifies
> >> +             * retrieval code later on. Also NBuffers starts from 1.
> >> +             */
> >> +            for (j = 0; j < Max(1, pages_per_buffer); j++)
> >> +            {
> >> +                char       *buffptr = (char *) BufferGetBlock(i + 1);
> >> +
> >> +                fctx->record[idx].bufferid = bufferid;
> >> +                fctx->record[idx].numa_page = j;
> >> +
> >> +                os_page_ptrs[idx]
> >> +                    = (char *) TYPEALIGN(os_page_size,
> >> +                                         buffptr + (os_page_size * j));
> > 
> > FWIW, this bit here is the most expensive part of the function itself, as the
> > compiler has no choice than to implement it as an actual division, as
> > os_page_size is runtime variable.
> > 
> > It'd be fine to leave it like that, the call to numa_move_pages() is way more
> > expensive. But it shouldn't be too hard to do that alignment once, rather than
> > having to do it over and over.
> > 
> 
> Division? It's entirely possible I'm missing something obvious, but I
> don't see any divisions in this code.

Oops. The division was only added in a subsequent patch, not the quoted
code. At the time it was:

+                               /* calculate ID of the OS memory page */
+                               fctx->record[idx].numa_page
+                                       = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

07 April, 19:36:24

On 4/7/25 17:51, Andres Freund wrote:
> Hi,
> 
> On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
>> On 4/6/25 01:00, Andres Freund wrote:
>>> On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
>>>> I think one thing that the docs should mention is that calling the numa
>>>> functions/views will force the pages to be allocated, even if they're
>>>> currently unused.
>>>>
>>>> Newly started server, with s_b of 32GB an 2MB huge pages:
>>>>
>>>>   grep ^Huge /proc/meminfo
>>>>   HugePages_Total:   34802
>>>>   HugePages_Free:    34448
>>>>   HugePages_Rsvd:    16437
>>>>   HugePages_Surp:        0
>>>>   Hugepagesize:       2048 kB
>>>>   Hugetlb:        76517376 kB
>>>>
>>>> run
>>>>   SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
>>>>
>>>> Now the pages that previously were marked as reserved are actually allocated:
>>>>
>>>>   grep ^Huge /proc/meminfo
>>>>   HugePages_Total:   34802
>>>>   HugePages_Free:    18012
>>>>   HugePages_Rsvd:        1
>>>>   HugePages_Surp:        0
>>>>   Hugepagesize:       2048 kB
>>>>   Hugetlb:        76517376 kB
>>>>
>>>>
>>>> I don't see how we can avoid that right now, but at the very least we ought to
>>>> document it.
>>>
>>> The only allocation where that really matters is shared_buffers. I wonder if
>>> we could special case the logic for that, by only probing if at least one of
>>> the buffers in the range is valid.
>>>
>>> Then we could treat a page status of -ENOENT as "page is not mapped" and
>>> display NULL for the node_id?
>>>
>>> Of course that would mean that we'd always need to
>>> pg_numa_touch_mem_if_required(), not just the first time round, because we
>>> previously might not have for a page that is now valid.  But compared to the
>>> cost of actually allocating pages, the cost for that seems small.
>>>
>>
>> I don't think this would be a good trade off. The buffers already have a
>> NUMA node, and users would be interested in that.
> 
> The thing is that the buffer might *NOT* have a numa node.  That's e.g. the
> case in the above example - otherwise we wouldn't initially have seen the
> large HugePages_Rsvd.
> 
> Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
> itself wouldn't be too bad - in fact I'd rather like to have an explicit way
> of doing that.  The problem is that that leads to all those allocations to
> happen on the *current* numa node (unless you have started postgres with
> numactl --interleave=all or such), rather than the node where the normal first
> use woul have allocated it.
> 

I agree, forcing those allocations to happen on a single node seems
rather unfortunate. But really, how likely is it that someone will run
this function on a cluster that hasn't already allocated this memory?

I'm not saying it can't happen, but we already have this issue if you
start and do a warmup from a single connection ...

> 
>> It's just that we don't have the memory mapped in the current backend, so
>> I'd bet people would not be happy with NULL, and would proceed to force the
>> allocation in some other way (say, a large query of some sort). Which
>> obviously causes a lot of other problems.
> 
> I don't think that really would be the case with what I proposed? If any
> buffer in the region were valid, we would force the allocation to become known
> to the current backend.
> 

It's not quite clear to me what exactly are you proposing :-(

I believe you're referring to this:

> The only allocation where that really matters is shared_buffers. I wonder if
> we could special case the logic for that, by only probing if at least one of
> the buffers in the range is valid.
> 
> Then we could treat a page status of -ENOENT as "page is not mapped" and
> display NULL for the node_id?
> 
> Of course that would mean that we'd always need to
> pg_numa_touch_mem_if_required(), not just the first time round, because we
> previously might not have for a page that is now valid.  But compared to the
> cost of actually allocating pages, the cost for that seems small.

I suppose by "range" you mean buffers on a given memory page, and
"valid" means BufferIsValid. Yeah, that probably means the memory page
is allocated. But if the buffer is invalid, it does not mean the memory
is not allocated, right? So does it make the buffer not interesting?

I'd find this ambiguity rather confusing, i.e. we'd never know if NULL
means just "invalid buffer" or "not allocated". Maybe we should simply
return rows only for valid buffers, to make it more explicit that we say
nothing about NUMA nodes for the invalid ones.



I think we need to decide whether the current patches are good enough
for PG18, with the current behavior, and then maybe improve that in
PG19. Or whether this is so serious we have to leave all of it for PG19.
I'd go with the former, but perhaps I'm wrong. I don't feel like I want
to be reworking this less than a day before the feature freeze.

Attached is v27, which I planned to push, but I'll hold off.


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

07 April, 19:42:21

Hi,

On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
> > Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
> > itself wouldn't be too bad - in fact I'd rather like to have an explicit way
> > of doing that.  The problem is that that leads to all those allocations to
> > happen on the *current* numa node (unless you have started postgres with
> > numactl --interleave=all or such), rather than the node where the normal first
> > use woul have allocated it.
> > 
> 
> I agree, forcing those allocations to happen on a single node seems
> rather unfortunate. But really, how likely is it that someone will run
> this function on a cluster that hasn't already allocated this memory?

I think it's not at all unlikely to have parts of shared buffers unused at the
start of a benchmark, e.g. because the table sizes grow over time.


> I'm not saying it can't happen, but we already have this issue if you
> start and do a warmup from a single connection ...

Indeed!  We really need to fix this...


> > 
> >> It's just that we don't have the memory mapped in the current backend, so
> >> I'd bet people would not be happy with NULL, and would proceed to force the
> >> allocation in some other way (say, a large query of some sort). Which
> >> obviously causes a lot of other problems.
> > 
> > I don't think that really would be the case with what I proposed? If any
> > buffer in the region were valid, we would force the allocation to become known
> > to the current backend.
> > 
> 
> It's not quite clear to me what exactly are you proposing :-(
> 
> I believe you're referring to this:
> 
> > The only allocation where that really matters is shared_buffers. I wonder if
> > we could special case the logic for that, by only probing if at least one of
> > the buffers in the range is valid.
> > 
> > Then we could treat a page status of -ENOENT as "page is not mapped" and
> > display NULL for the node_id?
> > 
> > Of course that would mean that we'd always need to
> > pg_numa_touch_mem_if_required(), not just the first time round, because we
> > previously might not have for a page that is now valid.  But compared to the
> > cost of actually allocating pages, the cost for that seems small.
> 
> I suppose by "range" you mean buffers on a given memory page

Correct.

> and  "valid" means BufferIsValid.

I was thinking of checking if the BufferDesc indicates BM_VALID or
BM_TAG_VALID.

BufferIsValid() just does a range check :(.


> Yeah, that probably means the memory page is allocated. But if the buffer is
> invalid, it does not mean the memory is not allocated, right? So does it
> make the buffer not interesting?

Well, you don't have contents in it it can't really affect performance.  But
yea, I agree, it's not perfect either.


> I think we need to decide whether the current patches are good enough
> for PG18, with the current behavior, and then maybe improve that in
> PG19.

I think as long as the docs mention this with <note> or <warning> it's ok for
now.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

07 April, 20:16:58

On 4/7/25 18:42, Andres Freund wrote:
> ...
>>>
>>> Of course that would mean that we'd always need to
>>> pg_numa_touch_mem_if_required(), not just the first time round, because we
>>> previously might not have for a page that is now valid.  But compared to the
>>> cost of actually allocating pages, the cost for that seems small.
>>
>> I suppose by "range" you mean buffers on a given memory page
> 
> Correct.
> 
>> and  "valid" means BufferIsValid.
> 
> I was thinking of checking if the BufferDesc indicates BM_VALID or
> BM_TAG_VALID.
> 
> BufferIsValid() just does a range check :(.
> 

Well, I guess BufferIsValid() seems a tad confusing ...

> 
>> Yeah, that probably means the memory page is allocated. But if the buffer is
>> invalid, it does not mean the memory is not allocated, right? So does it
>> make the buffer not interesting?
> 
> Well, you don't have contents in it it can't really affect performance.  But
> yea, I agree, it's not perfect either.
> 
> 
>> I think we need to decide whether the current patches are good enough
>> for PG18, with the current behavior, and then maybe improve that in
>> PG19.
> 
> I think as long as the docs mention this with <note> or <warning> it's ok for
> now.
> 

OK, I'll add a warning explaining this.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

07 April, 20:24:20

On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote:
> They actually look good to me. We've discussed earlier dropping
> s/numa_//g for column names (after all views contain it already) so
> they are fine in this regard.
> There's also the question of consistency: (bufferid, page_num,
> node_id) -- maybe should just drop "_" and that's it?
> Well I would even possibly consider page_num -> ospagenumber, but that's ugly.

I'd go for os_page_num.

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

07 April, 20:59:59

On 4/7/25 19:24, Andres Freund wrote:
> On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote:
>> They actually look good to me. We've discussed earlier dropping
>> s/numa_//g for column names (after all views contain it already) so
>> they are fine in this regard.
>> There's also the question of consistency: (bufferid, page_num,
>> node_id) -- maybe should just drop "_" and that's it?
>> Well I would even possibly consider page_num -> ospagenumber, but that's ugly.
> 
> I'd go for os_page_num.

WFM. I've renamed "ospageid" to "os_page_num" in 0003, and I've also
renamed "node_id" to "numa_node" in 0002+0003, to make it clearer what
kind of node this is.

This reminds me whether it's fine to have "os_page_num" as int. Should
we make it bigint, perhaps?

Attached is v28, with the commit messages updated, added <warning> about
allocation of the memory, etc. I'll let the CI run the tests on it, and
then will push, unless someone has more comments.

regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

07 April, 21:11:28

Hi,

On 2025-04-07 19:59:59 +0200, Tomas Vondra wrote:
> This reminds me whether it's fine to have "os_page_num" as int. Should
> we make it bigint, perhaps?

Yes, that's better. Seems very unlikely anybody will encounter this in the
next few years, but it's basically free to use the larger range here.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

07 April, 21:11:44

Hi,

On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
> 
> I was thinking of checking if the BufferDesc indicates BM_VALID or
> BM_TAG_VALID.

Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy())
and I think that would make sense as future improvement.

> 
> 
> > I think we need to decide whether the current patches are good enough
> > for PG18, with the current behavior, and then maybe improve that in
> > PG19.
> 
> I think as long as the docs mention this with <note> or <warning> it's ok for
> now.

+1

A few comments on v27:

=== 1

pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa()
reports it as node_id. Maybe we should use the same "naming" in both.

=== 2

postgres=# select count(*) from pg_buffercache;
 count
-------
 65536
(1 row)

but 

postgres=# select count(*) from pg_buffercache_numa;
 count
-------
    64
(1 row)

with:

postgres=# show block_size;
 block_size
------------
 2048

and Hugepagesize:       2048 kB.

and

postgres=#  show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

And even if for testing I set:

-               funcctx->max_calls = idx;
+               funcctx->max_calls = 65536;

then I start to see weird results:

postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache);
 count
-------
 65472
(1 row)

So it looks like that the new way to iterate on the buffers that has been introduced
in v26/v27 has some issue?

[1]: https://www.postgresql.org/message-id/Z64Pr8CTG0RTrGR3%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

07 April, 22:51:17


On 4/7/25 20:11, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote:
>> Hi,
>>
>> On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
>>
>> I was thinking of checking if the BufferDesc indicates BM_VALID or
>> BM_TAG_VALID.
> 
> Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy())
> and I think that would make sense as future improvement.
> 
>>
>>
>>> I think we need to decide whether the current patches are good enough
>>> for PG18, with the current behavior, and then maybe improve that in
>>> PG19.
>>
>> I think as long as the docs mention this with <note> or <warning> it's ok for
>> now.
> 
> +1
> 
> A few comments on v27:
> 
> === 1
> 
> pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa()
> reports it as node_id. Maybe we should use the same "naming" in both.
> 

This was renamed in v28 to "numa_node" in both parts.

> === 2
> 
> postgres=# select count(*) from pg_buffercache;
>  count
> -------
>  65536
> (1 row)
> 
> but 
> 
> postgres=# select count(*) from pg_buffercache_numa;
>  count
> -------
>     64
> (1 row)
> 
> with:
> 
> postgres=# show block_size;
>  block_size
> ------------
>  2048
> 
> and Hugepagesize:       2048 kB.
> 
> and
> 
> postgres=#  show shared_buffers;
>  shared_buffers
> ----------------
>  128MB
> (1 row)
> 
> And even if for testing I set:
> 
> -               funcctx->max_calls = idx;
> +               funcctx->max_calls = 65536;
> 
> then I start to see weird results:
> 
> postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache);
>  count
> -------
>  65472
> (1 row)
> 
> So it looks like that the new way to iterate on the buffers that has been introduced
> in v26/v27 has some issue?
> 

Yeah, the calculations of the end pointers were wrong - we need to round
up (using TYPEALIGN()) when calculating number of pages, and just add
BLCKSZ (without any rounding) when calculating end of buffer. The 0004
fixes this for me (I tried this with various blocksizes / page sizes).

Thanks for noticing this!


regards

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

08 April, 00:01:17

On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:

> > So it looks like that the new way to iterate on the buffers that has been introduced
> > in v26/v27 has some issue?
> >
>
> Yeah, the calculations of the end pointers were wrong - we need to round
> up (using TYPEALIGN()) when calculating number of pages, and just add
> BLCKSZ (without any rounding) when calculating end of buffer. The 0004
> fixes this for me (I tried this with various blocksizes / page sizes).
>
> Thanks for noticing this!

Hi,

v28-0001 LGTM
v28-0002 got this warning Andres was talking about, so LGTM
v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite
some time we have 2nd bug there, but it appears that PG never properly
aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume
count(*) pg_buffercache_numa == count(*) pg_buffercache.

So before anybody else reports this as bug about duplicate bufferids:

# select * from pg_buffercache_numa where os_page_num <= 2;
 bufferid | os_page_num | numa_node
----------+-------------+-----------
[..]
      195 |           0 |         0
      196 |           0 |         0 <-- duplicate?
      196 |           1 |         0 <-- duplicate?
      197 |           1 |         0
      198 |           1 |         0

That is strange because on first look one could assume we get 257x
8192 blocks per os_page (2^21) that way, which is impossible.
Exercises in pointers show this:

# select * from pg_buffercache_numa where os_page_num <= 2;
DEBUG:  NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
-- custom elog(DEBUG1)
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
buffptr=0x7f8661079000
[..]
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
buffptr=0x7f86611fb000
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
buffptr=0x7f86611fd000
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
buffptr=0x7f86611ff000 (!)
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661000000
endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
buffptr=0x7f86611ff000 (!)
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
buffptr=0x7f8661201000
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
buffptr=0x7f8661203000

so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB big
(and ends up at 0x7f8661201000), while we also have HP that hosts it
between 0x7f8661000000 and 0x7f8661200000. So Buffer 196 spans 2
hugepages. Open question for another day is shouldn't (of course
outside of this $thread) align s_b to HP size or not? As per above
even bufferid=1 has 0x7f8661079000 while page starts on 0x7f8661000000
(that's 495616 bytes difference).

-J.

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

08 April, 00:27:11

Hi,

I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).

On 4/7/25 23:01, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
>>> So it looks like that the new way to iterate on the buffers that has been introduced
>>> in v26/v27 has some issue?
>>>
>>
>> Yeah, the calculations of the end pointers were wrong - we need to round
>> up (using TYPEALIGN()) when calculating number of pages, and just add
>> BLCKSZ (without any rounding) when calculating end of buffer. The 0004
>> fixes this for me (I tried this with various blocksizes / page sizes).
>>
>> Thanks for noticing this!
> 
> Hi,
> 
> v28-0001 LGTM
> v28-0002 got this warning Andres was talking about, so LGTM
> v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite
> some time we have 2nd bug there, but it appears that PG never properly
> aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume
> count(*) pg_buffercache_numa == count(*) pg_buffercache.
> 

AFAIK v29 fixed this, the end pointer calculations were wrong. With that
it passed for me with/without THP, different blocks sizes etc.

We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
which is 4kB or so. And it's determined at compile time, while THP is
determined when starting the cluster.

> So before anybody else reports this as bug about duplicate bufferids:
> 
> # select * from pg_buffercache_numa where os_page_num <= 2;
>  bufferid | os_page_num | numa_node
> ----------+-------------+-----------
> [..]
>       195 |           0 |         0
>       196 |           0 |         0 <-- duplicate?
>       196 |           1 |         0 <-- duplicate?
>       197 |           1 |         0
>       198 |           1 |         0
> 
> That is strange because on first look one could assume we get 257x
> 8192 blocks per os_page (2^21) that way, which is impossible.
> Exercises in pointers show this:
>> # select * from pg_buffercache_numa where os_page_num <= 2;
> DEBUG:  NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
> DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
> -- custom elog(DEBUG1)
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
> buffptr=0x7f8661079000
> [..]
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
> buffptr=0x7f86611fb000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
> buffptr=0x7f86611fd000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
> buffptr=0x7f8661201000
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
> buffptr=0x7f8661203000
> 
> so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB big
> (and ends up at 0x7f8661201000), while we also have HP that hosts it
> between 0x7f8661000000 and 0x7f8661200000. So Buffer 196 spans 2
> hugepages. Open question for another day is shouldn't (of course
> outside of this $thread) align s_b to HP size or not? As per above
> even bufferid=1 has 0x7f8661079000 while page starts on 0x7f8661000000
> (that's 495616 bytes difference).
>

Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages. This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.

No opinion on the aligment - maybe we should do that, but it's not
something this patch needs to worry about.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

08 April, 00:50:07

On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I've pushed all three parts of v29, with some additional corrections
> (picked lower OIDs, bumped catversion, fixed commit messages).

Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for
such incredible support on the last mile of this and also to Bertrand
(for persistence!), Andres and Alvaro for lots of babysitting.

> AFAIK v29 fixed this, the end pointer calculations were wrong. With that
> it passed for me with/without THP, different blocks sizes etc.

Yeah, that was a typo, I've started writing about v28, but then in the
middle of that v29 landed and I still was chasing that finding, I've
just forgotten to bump this.

> We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
> which is 4kB or so. And it's determined at compile time, while THP is
> determined when starting the cluster.
[..]
> Right, this is because that's where the THP boundary happens to be. And
> that one "duplicate" entry is for a buffer that happens to span two
> pages. This is *exactly* the misalignment of blocks and pages that I was
> wondering about earlier, and with the fixed endptr calculation we handle
> that just fine.
>
> No opinion on the aligment - maybe we should do that, but it's not
> something this patch needs to worry about.

Agreed.I was wondering even if there are other drawbacks of the
situation, but other than not reporting duplicates here in this
pg_buffercache view, I cannot identify anything worthwhile.

-J.

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

08 April, 00:58:58


On 4/7/25 23:50, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've pushed all three parts of v29, with some additional corrections
>> (picked lower OIDs, bumped catversion, fixed commit messages).
> 
> Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for
> such incredible support on the last mile of this and also to Bertrand
> (for persistence!), Andres and Alvaro for lots of babysitting.
> 

Glad I could help, thanks for the patch.

>> AFAIK v29 fixed this, the end pointer calculations were wrong. With that
>> it passed for me with/without THP, different blocks sizes etc.
> 
> Yeah, that was a typo, I've started writing about v28, but then in the
> middle of that v29 landed and I still was chasing that finding, I've
> just forgotten to bump this.
> 
>> We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE,
>> which is 4kB or so. And it's determined at compile time, while THP is
>> determined when starting the cluster.
> [..]
>> Right, this is because that's where the THP boundary happens to be. And
>> that one "duplicate" entry is for a buffer that happens to span two
>> pages. This is *exactly* the misalignment of blocks and pages that I was
>> wondering about earlier, and with the fixed endptr calculation we handle
>> that just fine.
>>
>> No opinion on the aligment - maybe we should do that, but it's not
>> something this patch needs to worry about.
> 
> Agreed.I was wondering even if there are other drawbacks of the
> situation, but other than not reporting duplicates here in this
> pg_buffercache view, I cannot identify anything worthwhile.
> 

Well, the drawback is that accessing the buffer may require hitting two
different NUMA nodes. I'm not 100% sure it can actually happen, though.
the buffer should be initialized as a whole, so it should got to the
same node. But maybe it could be "split" by THP migration, or something
like that.

In any case, that's not caused by this patch, and it's less serious with
huge pages - it's only affect buffers on the boundaries. But with the
small 4K pages it can happen for *every* buffer.


regards

-- 
Tomas Vondra

RE: Draft for basic NUMA observability

From

"Shinoda, Noriyoshi (SXD Japan FSI)"

Date:

08 April, 02:26:26

Hi, 

Thanks for developing this great feature. 
The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is
'int8'.
 
The attached small patch fixes the manual.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Tomas Vondra <tomas@vondra.me> 
Sent: Tuesday, April 8, 2025 6:59 AM
To: Jakub Wartak <jakub.wartak@enterprisedb.com>
Cc: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>; Andres Freund <andres@anarazel.de>; Alvaro Herrera
<alvherre@alvh.no-ip.org>;Nazir Bilal Yavuz <byavuz81@gmail.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>
 
Subject: Re: Draft for basic NUMA observability



On 4/7/25 23:50, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've pushed all three parts of v29, with some additional corrections 
>> (picked lower OIDs, bumped catversion, fixed commit messages).
> 
> Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for 
> such incredible support on the last mile of this and also to Bertrand 
> (for persistence!), Andres and Alvaro for lots of babysitting.
> 

Glad I could help, thanks for the patch.

>> AFAIK v29 fixed this, the end pointer calculations were wrong. With 
>> that it passed for me with/without THP, different blocks sizes etc.
> 
> Yeah, that was a typo, I've started writing about v28, but then in the 
> middle of that v29 landed and I still was chasing that finding, I've 
> just forgotten to bump this.
> 
>> We don't align buffers to os_page_size, we align them 
>> PG_IO_ALIGN_SIZE, which is 4kB or so. And it's determined at compile 
>> time, while THP is determined when starting the cluster.
> [..]
>> Right, this is because that's where the THP boundary happens to be. 
>> And that one "duplicate" entry is for a buffer that happens to span 
>> two pages. This is *exactly* the misalignment of blocks and pages 
>> that I was wondering about earlier, and with the fixed endptr 
>> calculation we handle that just fine.
>>
>> No opinion on the aligment - maybe we should do that, but it's not 
>> something this patch needs to worry about.
> 
> Agreed.I was wondering even if there are other drawbacks of the 
> situation, but other than not reporting duplicates here in this 
> pg_buffercache view, I cannot identify anything worthwhile.
> 

Well, the drawback is that accessing the buffer may require hitting two different NUMA nodes. I'm not 100% sure it can
actuallyhappen, though.
 
the buffer should be initialized as a whole, so it should got to the same node. But maybe it could be "split" by THP
migration,or something like that.
 

In any case, that's not caused by this patch, and it's less serious with huge pages - it's only affect buffers on the
boundaries.But with the small 4K pages it can happen for *every* buffer.
 


regards

--
Tomas Vondra

Attachment

pg_shmem_allocations_numa_doc_v1.diff

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

08 April, 13:46:16

On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote:
> Hi, 
> 
> Thanks for developing this great feature. 
> The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is
'int8'.

> The attached small patch fixes the manual.
> 

Thank you for noticing this and for the fix! Pushed.

This also reminded me we agreed to change page_num to bigint, which I
forgot to change before commit. So I adjusted that too, separately.

regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Kirill Reshke

Date:

08 April, 15:44:19

On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
> I'll let the CI run the tests on it, and
> then will push, unless someone has more comments.
>


Hi! I noticed strange failure after this commit[0]

Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48

In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
                 from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
 [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
 [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
directory
   79 | #include  [01;31m [K"utils/errcodes.h" [m [K
      |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
compilation terminated.




[0] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dogfish&dt=2025-04-08%2011%3A25%3A11



-- 
Best regards,
Kirill Reshke

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

08 April, 16:06:09

Hi,

On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
> > I'll let the CI run the tests on it, and
> > then will push, unless someone has more comments.
> >
> 
> 
> Hi! I noticed strange failure after this commit[0]
> 
> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
> 
> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
> directory
>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
> compilation terminated.

$ ninja -t missingdeps
Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
Processed 2384 nodes.
Error: There are 2 missing dependency paths.
2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to the
generator.
There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory.

I think it's not right that something in src/port defines an SQL callable
function. The set of headers for that pull in a lot of things.

Since the file relies on things like GUCs, I suspect this should be in
src/backend/port or such instead.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

08 April, 16:21:57

On 4/8/25 15:06, Andres Freund wrote:
> Hi,
> 
> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
>>> I'll let the CI run the tests on it, and
>>> then will push, unless someone has more comments.
>>>
>>
>>
>> Hi! I noticed strange failure after this commit[0]
>>
>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
>>
>> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
>>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
>>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
>>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
>> directory
>>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
>>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
>> compilation terminated.
> 
> $ ninja -t missingdeps
> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by
CUSTOM_COMMAND)
> Processed 2384 nodes.
> Error: There are 2 missing dependency paths.
> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to
thegenerator.
 
> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory.
 
> 
> 
> I think it's not right that something in src/port defines an SQL callable
> function. The set of headers for that pull in a lot of things.
> 
> Since the file relies on things like GUCs, I suspect this should be in
> src/backend/port or such instead.
> 

Yeah, I think you're right, src/backend/port seems like a better place
for this. I'll look into moving that in the evening.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

08 April, 16:35:37

Hi,

On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
>On 4/8/25 15:06, Andres Freund wrote:
>> Hi,
>>
>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
>>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
>>>> I'll let the CI run the tests on it, and
>>>> then will push, unless someone has more comments.
>>>>
>>>
>>>
>>> Hi! I noticed strange failure after this commit[0]
>>>
>>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
>>>
>>> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
>>>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
>>>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
>>>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
>>> directory
>>>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
>>>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
>>> compilation terminated.
>>
>> $ ninja -t missingdeps
>> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
>> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by
CUSTOM_COMMAND)
>> Processed 2384 nodes.
>> Error: There are 2 missing dependency paths.
>> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to
thegenerator. 
>> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory. 
>>
>>
>> I think it's not right that something in src/port defines an SQL callable
>> function. The set of headers for that pull in a lot of things.
>>
>> Since the file relies on things like GUCs, I suspect this should be in
>> src/backend/port or such instead.
>>
>
>Yeah, I think you're right, src/backend/port seems like a better place
>for this. I'll look into moving that in the evening.

On a second look I wonder if just the SQL function and perhaps the page size function should be moved. There are FE
programsthat could potentially benefit from num a awareness (e.g. pgbench). 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

08 April, 17:59:25

Hi,

On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
> >On 4/8/25 15:06, Andres Freund wrote:
> >> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
> >> I think it's not right that something in src/port defines an SQL callable
> >> function. The set of headers for that pull in a lot of things.
> >>
> >> Since the file relies on things like GUCs, I suspect this should be in
> >> src/backend/port or such instead.
> >>
> >
> >Yeah, I think you're right, src/backend/port seems like a better place
> >for this. I'll look into moving that in the evening.
>
> On a second look I wonder if just the SQL function and perhaps the page size
> function should be moved. There are FE programs that could potentially
> benefit from num a awareness (e.g. pgbench).

I would move pg_numa_available() to something like
src/backend/storage/ipc/shmem.c.

I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's
not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
with shmem.c or such.

Or we could just make it work in FE code by making this part

    Assert(IsUnderPostmaster);
    Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);

    if (huge_pages_status == HUGE_PAGES_ON)
        GetHugePageSize(&os_page_size, NULL);

#ifndef FRONTEND - we don't currently support using huge pages in FE programs
after all. But querying the page size might still be useful.

Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
needed?

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 01:47:59


On 4/8/25 16:59, Andres Freund wrote:
> Hi,
> 
> On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
>> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
>>> On 4/8/25 15:06, Andres Freund wrote:
>>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
>>>> I think it's not right that something in src/port defines an SQL callable
>>>> function. The set of headers for that pull in a lot of things.
>>>>
>>>> Since the file relies on things like GUCs, I suspect this should be in
>>>> src/backend/port or such instead.
>>>>
>>>
>>> Yeah, I think you're right, src/backend/port seems like a better place
>>> for this. I'll look into moving that in the evening.
>>
>> On a second look I wonder if just the SQL function and perhaps the page size
>> function should be moved. There are FE programs that could potentially
>> benefit from num a awareness (e.g. pgbench).
> 
> I would move pg_numa_available() to something like
> src/backend/storage/ipc/shmem.c.
> 

Makes sense, done in the attached patch.

> I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's
> not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
> with shmem.c or such.
> 

True. It's true it's not really "NUMA page", but page size for the shmem
segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
same as pg_numa_available().

The rest of pg_numa.c is moved to src/backend/port

> Or we could just make it work in FE code by making this part
> 
>     Assert(IsUnderPostmaster);
>     Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
> 
>     if (huge_pages_status == HUGE_PAGES_ON)
>         GetHugePageSize(&os_page_size, NULL);
> 
> #ifndef FRONTEND - we don't currently support using huge pages in FE programs
> after all. But querying the page size might still be useful.
> 

I don't really like this. Why shouldn't the FE program simply call
sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also
verify huge page status.

> 
> Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
> needed?
> 

Right, that was left there after moving the prototype into the .c file.


regards

-- 
Tomas Vondra

Attachment

numa-fixes.patch

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 02:10:09


On 4/8/25 15:06, Andres Freund wrote:
> Hi,
> 
> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
>>> I'll let the CI run the tests on it, and
>>> then will push, unless someone has more comments.
>>>
>>
>>
>> Hi! I noticed strange failure after this commit[0]
>>
>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
>>
>> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
>>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
>>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
>>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
>> directory
>>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
>>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
>> compilation terminated.
> 
> $ ninja -t missingdeps
> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by
CUSTOM_COMMAND)
> Processed 2384 nodes.
> Error: There are 2 missing dependency paths.
> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to
thegenerator.
 
> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory.
 
> 

Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
ran those tests many times, and had it failed at least once I'd have
fixed it before commit.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

09 April, 02:27:26

Hi,

On 2025-04-09 00:47:59 +0200, Tomas Vondra wrote:
> On 4/8/25 16:59, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
> >> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
> >>> On 4/8/25 15:06, Andres Freund wrote:
> >>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
> >>>> I think it's not right that something in src/port defines an SQL callable
> >>>> function. The set of headers for that pull in a lot of things.
> >>>>
> >>>> Since the file relies on things like GUCs, I suspect this should be in
> >>>> src/backend/port or such instead.
> >>>>
> >>>
> >>> Yeah, I think you're right, src/backend/port seems like a better place
> >>> for this. I'll look into moving that in the evening.
> >>
> >> On a second look I wonder if just the SQL function and perhaps the page size
> >> function should be moved. There are FE programs that could potentially
> >> benefit from num a awareness (e.g. pgbench).
> > 
> > I would move pg_numa_available() to something like
> > src/backend/storage/ipc/shmem.c.
> > 
> 
> Makes sense, done in the attached patch.
> 
> > I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's
> > not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
> > with shmem.c or such.
> > 
> 
> True. It's true it's not really "NUMA page", but page size for the shmem
> segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
> same as pg_numa_available().

Cool.


> The rest of pg_numa.c is moved to src/backend/port

Why move the remainder? It shouldn't have any dependency on postgres.h
afterwards? And I think it's not backend-specific.


Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

09 April, 02:29:08

Hi,

On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote:
> On 4/8/25 15:06, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
> >> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
> >>> I'll let the CI run the tests on it, and
> >>> then will push, unless someone has more comments.
> >>>
> >>
> >>
> >> Hi! I noticed strange failure after this commit[0]
> >>
> >> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
> >>
> >> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
> >>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
> >>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
> >>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
> >> directory
> >>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
> >>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
> >> compilation terminated.
> > 
> > $ ninja -t missingdeps
> > Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
> > Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by
CUSTOM_COMMAND)
> > Processed 2384 nodes.
> > Error: There are 2 missing dependency paths.
> > 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to
thegenerator.
 
> > There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory.
 
> > 
> 
> Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
> ran those tests many times, and had it failed at least once I'd have
> fixed it before commit.

Yes, we should. It's a somewhat newer feature, so we originally couldn't.
There only was a very clunky and slow python script when I was doing most of
the meson work.

I was actually thinking that it might make sense as a meson-registered test,
that way one quickly can find the issue both locally and on CI.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Jakub Wartak

Date:

09 April, 12:53:59

On Wed, Apr 9, 2025 at 12:48 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 4/8/25 16:59, Andres Freund wrote:
> > Hi,
> >
> > On 2025-04-08 09:35:37 -0400, Andres Freund wrote:
> >> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote:
> >>> On 4/8/25 15:06, Andres Freund wrote:
> >>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
> >>>> I think it's not right that something in src/port defines an SQL callable
> >>>> function. The set of headers for that pull in a lot of things.
> >>>>
> >>>> Since the file relies on things like GUCs, I suspect this should be in
> >>>> src/backend/port or such instead.
> >>>>
> >>>
> >>> Yeah, I think you're right, src/backend/port seems like a better place
> >>> for this. I'll look into moving that in the evening.
> >>
> >> On a second look I wonder if just the SQL function and perhaps the page size
> >> function should be moved. There are FE programs that could potentially
> >> benefit from num a awareness (e.g. pgbench).
> >
> > I would move pg_numa_available() to something like
> > src/backend/storage/ipc/shmem.c.
> >
>
> Makes sense, done in the attached patch.
>
> > I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's
> > not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in
> > with shmem.c or such.
> >
>
> True. It's true it's not really "NUMA page", but page size for the shmem
> segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c,
> same as pg_numa_available().
>
> The rest of pg_numa.c is moved to src/backend/port

Hi Tomas, sorry for this. I've tested the numa-fixes.patch, CI says it
is ok to me (but it was always like that), meson test also looks good
here, and some other quick use attempts are OK. I can confirm that
`ninja -t missingdeps` is clean with this (without it really wasn't;
+1 to adding this check to CI).

> > Or we could just make it work in FE code by making this part
> >
> >       Assert(IsUnderPostmaster);
> >       Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
> >
> >       if (huge_pages_status == HUGE_PAGES_ON)
> >               GetHugePageSize(&os_page_size, NULL);
> >
> > #ifndef FRONTEND - we don't currently support using huge pages in FE programs
> > after all. But querying the page size might still be useful.
> >
>
> I don't really like this. Why shouldn't the FE program simply call
> sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also
> verify huge page status.

True, the pg_shm_get_page_size() looks like a great middleground (not
in numa but still for shm w/ hugepages).

> >
> > Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is
> > needed?
> >
>
> Right, that was left there after moving the prototype into the .c file.


-J.

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 15:07:41

On 4/9/25 01:29, Andres Freund wrote:
> Hi,
> 
> On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote:
>> On 4/8/25 15:06, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote:
>>>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote:
>>>>> I'll let the CI run the tests on it, and
>>>>> then will push, unless someone has more comments.
>>>>>
>>>>
>>>>
>>>> Hi! I noticed strange failure after this commit[0]
>>>>
>>>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48
>>>>
>>>> In file included from  [01m [K../pgsql/src/include/postgres.h:49 [m [K,
>>>>                  from  [01m [K../pgsql/src/port/pg_numa.c:16 [m [K:
>>>>  [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K
>>>>  [01;31m [Kfatal error:  [m [Kutils/errcodes.h: No such file or
>>>> directory
>>>>    79 | #include  [01;31m [K"utils/errcodes.h" [m [K
>>>>       |           [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K
>>>> compilation terminated.
>>>
>>> $ ninja -t missingdeps
>>> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND)
>>> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by
CUSTOM_COMMAND)
>>> Processed 2384 nodes.
>>> Error: There are 2 missing dependency paths.
>>> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules)  without a non-depfile dep path to
thegenerator.
 
>>> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean
outputdirectory.
 
>>>
>>
>> Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I
>> ran those tests many times, and had it failed at least once I'd have
>> fixed it before commit.
> 
> Yes, we should. It's a somewhat newer feature, so we originally couldn't.
> There only was a very clunky and slow python script when I was doing most of
> the meson work.
> 
> I was actually thinking that it might make sense as a meson-registered test,
> that way one quickly can find the issue both locally and on CI.
> 

OK, here are two patches, where 0001 adds the missingdeps check to the
Debian meson build. It just adds that to the build script.

0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to
src/backend/port). It still needs to include c.h because of PGDLLIMPORT,
but I think that's fine.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 15:08:35

On 4/9/25 14:07, Tomas Vondra wrote:
> ...
> 
> OK, here are two patches, where 0001 adds the missingdeps check to the
> Debian meson build. It just adds that to the build script.
> 
> 0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to
> src/backend/port). It still needs to include c.h because of PGDLLIMPORT,
> but I think that's fine.
> 

Forgot to attach the patches ...

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 17:33:14

Updated patches with proper commit messages etc.

-- 
Tomas Vondra

Attachment

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

09 April, 18:14:32

Hi,

On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote:
> From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Tue, 8 Apr 2025 23:31:29 +0200
> Subject: [PATCH 1/2] Cleanup of pg_numa.c
> 
> This moves/renames some of the functions defined in pg_numa.c:
> 
> * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
>   moved to src/backend/storage/ipc/shmem.c. The new name better reflects
>   that the page size is not related to NUMA, and it's specifically about
>   the page size used for the main shared memory segment.
> 
> * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
>   the backend (which more appropriate for functions callable from SQL).
>   While at it, improve the comment to explain what page size it returns.
> 
> * remove unnecessary includes from src/port/pg_numa.c, adding
>   unnecessary dependencies (src/port should be suitable for frontent).
>   These were leftovers from earlier patch versions.

I don't think the include in src/port/pg_numa.c were leftover? Just the one in
pg_numa.h, right?

I'd mention that the includes of postgres.h/fmgr.h is what caused missing
build-time dependencies and via that failures on buildfarm member dogfish.


> diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
> index 5e2523cf798..63dff799436 100644
> --- a/src/port/pg_numa.c
> +++ b/src/port/pg_numa.c
> @@ -13,17 +13,14 @@
>   *-------------------------------------------------------------------------
>   */
>  
> -#include "postgres.h"
> +#include "c.h"
>  #include <unistd.h>
>  
>  #ifdef WIN32
>  #include <windows.h>
>  #endif

I think this may not be needed anymore, that was just there for
GetSystemInfo(), right?  Conversely, I suspect it may now be needed in the new
location of pg_numa_get_pagesize()?


> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Wed, 9 Apr 2025 13:29:31 +0200
> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
> 
> Extends the meson build on Debian to also check for missing dependencies
> by executing
> 
>     ninja -t missingdeps
> 
> right after the build. This highlights unindended dependencies.
> 
> Reviewed-by: Andres Freund <andres@anarazel.de>
> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com

FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
have it just in CI until we have that.  I would however also add it to the
windows job, as that's the most "different" type of build / source of missed
dependencies that wouldn't show up on our development systems.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 18:28:31

On 4/9/25 17:14, Andres Freund wrote:
> Hi,
> 
> On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote:
>> From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas@vondra.me>
>> Date: Tue, 8 Apr 2025 23:31:29 +0200
>> Subject: [PATCH 1/2] Cleanup of pg_numa.c
>>
>> This moves/renames some of the functions defined in pg_numa.c:
>>
>> * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
>>   moved to src/backend/storage/ipc/shmem.c. The new name better reflects
>>   that the page size is not related to NUMA, and it's specifically about
>>   the page size used for the main shared memory segment.
>>
>> * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
>>   the backend (which more appropriate for functions callable from SQL).
>>   While at it, improve the comment to explain what page size it returns.
>>
>> * remove unnecessary includes from src/port/pg_numa.c, adding
>>   unnecessary dependencies (src/port should be suitable for frontent).
>>   These were leftovers from earlier patch versions.
> 
> I don't think the include in src/port/pg_numa.c were leftover? Just the one in
> pg_numa.h, right?
> 

Right, that wasn't quite accurate. The miscadmin.h and pg_shmem.h are
unnecessary thanks to moving stuff to shmem.c.

> I'd mention that the includes of postgres.h/fmgr.h is what caused missing
> build-time dependencies and via that failures on buildfarm member dogfish.
> 

Not really, I also need to include "c.h" instead of "postgres.h" (which
is also causing the same failure).

> 
>> diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
>> index 5e2523cf798..63dff799436 100644
>> --- a/src/port/pg_numa.c
>> +++ b/src/port/pg_numa.c
>> @@ -13,17 +13,14 @@
>>   *-------------------------------------------------------------------------
>>   */
>>  
>> -#include "postgres.h"
>> +#include "c.h"
>>  #include <unistd.h>
>>  
>>  #ifdef WIN32
>>  #include <windows.h>
>>  #endif
> 
> I think this may not be needed anymore, that was just there for
> GetSystemInfo(), right?  Conversely, I suspect it may now be needed in the new
> location of pg_numa_get_pagesize()?
> 

Good question. But if it's needed there, shouldn't it have failed on CI?

> 
>> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas@vondra.me>
>> Date: Wed, 9 Apr 2025 13:29:31 +0200
>> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
>>
>> Extends the meson build on Debian to also check for missing dependencies
>> by executing
>>
>>     ninja -t missingdeps
>>
>> right after the build. This highlights unindended dependencies.
>>
>> Reviewed-by: Andres Freund <andres@anarazel.de>
>> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
> 
> FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
> have it just in CI until we have that.  I would however also add it to the
> windows job, as that's the most "different" type of build / source of missed
> dependencies that wouldn't show up on our development systems.
> 

We can add it as a meson.build test, sure. I was going for the CI first,
because then it fires no matter what build I do locally (I'm kinda still
used to autotools).

If you agree adding it to build_script is the right way to do that, I'll
do the same thing for the windows job.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Andres Freund

Date:

09 April, 18:51:12

Hi,

On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote:
> On 4/9/25 17:14, Andres Freund wrote:
> > I'd mention that the includes of postgres.h/fmgr.h is what caused missing
> > build-time dependencies and via that failures on buildfarm member dogfish.
> > 
> 
> Not really, I also need to include "c.h" instead of "postgres.h" (which
> is also causing the same failure).

I did mention postgres.h :)



> > I think this may not be needed anymore, that was just there for
> > GetSystemInfo(), right?  Conversely, I suspect it may now be needed in the new
> > location of pg_numa_get_pagesize()?
> > 
> 
> Good question. But if it's needed there, shouldn't it have failed on CI?

Oh. No. It shouldn't have - because that include was completely
unnecessary. We always include windows.h on windows.



> >> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
> >> From: Tomas Vondra <tomas@vondra.me>
> >> Date: Wed, 9 Apr 2025 13:29:31 +0200
> >> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
> >>
> >> Extends the meson build on Debian to also check for missing dependencies
> >> by executing
> >>
> >>     ninja -t missingdeps
> >>
> >> right after the build. This highlights unindended dependencies.
> >>
> >> Reviewed-by: Andres Freund <andres@anarazel.de>
> >> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
> > 
> > FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
> > have it just in CI until we have that.  I would however also add it to the
> > windows job, as that's the most "different" type of build / source of missed
> > dependencies that wouldn't show up on our development systems.
> > 
> 
> We can add it as a meson.build test, sure. I was going for the CI first,
> because then it fires no matter what build I do locally (I'm kinda still
> used to autotools).

A meson test would do the same thing, it'd fail while running the tests, no?


> If you agree adding it to build_script is the right way to do that, I'll
> do the same thing for the windows job.

WFM.

Greetings,

Andres Freund

Re: Draft for basic NUMA observability

From

Tomas Vondra

Date:

09 April, 20:07:15

On 4/9/25 17:51, Andres Freund wrote:
> Hi,
> 
> On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote:
>> On 4/9/25 17:14, Andres Freund wrote:
>>> I'd mention that the includes of postgres.h/fmgr.h is what caused missing
>>> build-time dependencies and via that failures on buildfarm member dogfish.
>>>
>>
>> Not really, I also need to include "c.h" instead of "postgres.h" (which
>> is also causing the same failure).
> 
> I did mention postgres.h :)
> 

D'oh, I missed that. I was focused on the fmgr one.

> 
> 
>>> I think this may not be needed anymore, that was just there for
>>> GetSystemInfo(), right?  Conversely, I suspect it may now be needed in the new
>>> location of pg_numa_get_pagesize()?
>>>
>>
>> Good question. But if it's needed there, shouldn't it have failed on CI?
> 
> Oh. No. It shouldn't have - because that include was completely
> unnecessary. We always include windows.h on windows.
> 

Makes sense. I'll get rid of the windows.h include.

> 
> 
>>>> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001
>>>> From: Tomas Vondra <tomas@vondra.me>
>>>> Date: Wed, 9 Apr 2025 13:29:31 +0200
>>>> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build
>>>>
>>>> Extends the meson build on Debian to also check for missing dependencies
>>>> by executing
>>>>
>>>>     ninja -t missingdeps
>>>>
>>>> right after the build. This highlights unindended dependencies.
>>>>
>>>> Reviewed-by: Andres Freund <andres@anarazel.de>
>>>> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
>>>
>>> FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to
>>> have it just in CI until we have that.  I would however also add it to the
>>> windows job, as that's the most "different" type of build / source of missed
>>> dependencies that wouldn't show up on our development systems.
>>>
>>
>> We can add it as a meson.build test, sure. I was going for the CI first,
>> because then it fires no matter what build I do locally (I'm kinda still
>> used to autotools).
> 
> A meson test would do the same thing, it'd fail while running the tests, no?
> 

Sure, but only if you use meson. Which I still mostly don't, so I've
been thinking about the CI first, because I use that very consistently
before pushing something.

> 
>> If you agree adding it to build_script is the right way to do that, I'll
>> do the same thing for the windows job.
> 
> WFM.

Thanks. I'll polish this a bit more and push.


regards

-- 
Tomas Vondra

Re: Draft for basic NUMA observability

From

Bertrand Drouvot

Date:

09 April, 21:59:14

Hi,

On Tue, Apr 08, 2025 at 12:46:16PM +0200, Tomas Vondra wrote:
> 
> 
> On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote:
> > Hi, 
> > 
> > Thanks for developing this great feature. 
> > The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is
'int8'.
 
> > The attached small patch fixes the manual.
> > 
> 
> Thank you for noticing this and for the fix! Pushed.
> 
> This also reminded me we agreed to change page_num to bigint, which I
> forgot to change before commit. So I adjusted that too, separately.

I was doing some extra testing and just realized (did not think of it during the
review) that maybe we could add a pg_buffercache_numa usage example (like it's
already done for pg_buffercache).

That might sound obvious but OTOH I think that would not hurt.

Something like in the attached?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v1-0001-Add-pg_buffercache_numa-usage-example.patch