Thread: Re: Draft for basic NUMA observability
Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time ago,
> I'm attaching the very brief (and potentially way too rushed) draft of the
> first step into NUMA observability on PostgreSQL that was based on his
> presentation [0]. It might be rough, but it is to get us started. The
> patches were not really even basically tested; they are more like input for
> discussion - rather than solid code - to shake out what the proper form of
> this should be.
>
> Right now it gives:
>
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
> NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch! Not doing a code review but sharing some experimentation.

First, I had to apply:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
     Size        os_page_size;
     void      **os_page_ptrs;
     int        *os_pages_status;
-    int         os_page_count;
+    uint64      os_page_count;

and

-    os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+    os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;

to make it work with non-tiny shared_buffers.

Observations when using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'.
Session 2 does nothing but runs the same query.

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages
appear as unmapped when viewed from a process that hasn't accessed them, even
though those same pages appear as allocated on a NUMA node in session 1.

To double-check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).
Let me show the results.

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580

In session 2:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session 1's pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

Results show a correct NUMA distribution in session 1, and a correct NUMA
distribution in session 2 only when using pg_buffercache_pages_from_pid()
with the pid of session 1 (the session that actually accessed the buffers) as
a parameter.

Which makes me wonder if using numa_move_pages()/move_pages() is the right
approach. Would be curious to know if you observe the same behavior though.

The initial idea that you shared on discord was to use get_mempolicy() but as
Andres stated:

"One annoying thing about get_mempolicy() is this:

  If no page has yet been allocated for the specified address,
  get_mempolicy() will allocate a page as if the thread had performed a read
  (load) access to that address, and return the ID of the node where that
  page was allocated.
Forcing the allocation to happen inside a monitoring function is decidedly
not great."

The man page looks correct (verified with "perf record -e
page-faults,kmem:mm_page_alloc -p <pid>") while using get_mempolicy(). But
maybe we could use get_mempolicy() only on "valid" buffers, i.e.
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Jakub,

On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote:
> On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Bertrand,
>
> Thanks for playing with this!
>
> > Which makes me wonder if using numa_move_pages()/move_pages is the right
> > approach. Would be curious to know if you observe the same behavior
> > though.
>
> You are correct, I'm observing identical behaviour, please see attached.

Thanks for confirming!

> We probably would need to split it to some separate and new view within the
> pg_buffercache extension, but that is going to be slow, yet still provide
> valid results.

Yup.

> In the previous approach get_mempolicy() was allocating on 1st access, but
> it was slow not only because it was allocating but also because it was just
> 1 syscall per 1x addr (yikes!). I somehow struggle to imagine how e.g.
> scanning (really allocating) a 128GB buffer cache in future won't cause
> issues - that's like 16-17mln (* 2) syscalls to be issued when not using
> move_pages(2).

Yeah, get_mempolicy() not working on a range is not great.

> > But maybe we could use get_mempolicy() only on "valid" buffers i.e.
> > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
>
> Different perspective: I wanted to use the same approach in the new
> pg_shmemallocations_numa, but that won't cut it there. The other idea that
> came to my mind is to issue move_pages() from the backend that has already
> used all of those pages. That literally means one of the below ideas:
> 1. from somewhere like checkpointer / bgwriter?
> 2. add touching memory on backend startup like always (sic!)
> 3. or just attempt to read/touch memory addr just before calling
> move_pages(). E.g.
> this last option is just two lines:
>
> if (os_page_ptrs[blk2page + j] == 0) {
> +    volatile uint64 touch pg_attribute_unused();
>      os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) +
>          (os_page_size * j);
> +    touch = *(uint64 *) os_page_ptrs[blk2page + j];
> }
>
> and it seems to work while still issuing many fewer syscalls with
> move_pages() across backends, well at least here.

One of the main issues I see with 1. and 2. is that we would not get accurate
results should the kernel decide to migrate the pages. Indeed, the process
doing the move_pages() call needs to have accessed the pages more recently
than any kernel migrations to see accurate locations.

OTOH, one of the main issues that I see with 3. is that the monitoring could
probably influence the kernel's decision to start pages migration (I'm not
100% sure, but I could imagine it may influence the kernel's decision due to
having to read/touch the pages).

But I'm thinking: do we really need to know the page location of every single
page? I think what we want to see is whether the pages are "equally"
distributed on all the nodes or are somehow "stuck" to one (or more) nodes.
In that case what about using get_mempolicy() but on a subset of the buffer
cache (say every Nth buffer, or contiguous chunks)? We could create a new
function that would accept a "sampling distance" as parameter, for example,
thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Bertrand,

TL;DR: the main problem seems to be choosing how to page-fault the shared
memory before the backend uses numa_move_pages(), as the memory mappings
(fresh after fork()/CoW) seem not to be ready for a numa_move_pages()
inquiry.

On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> > We probably would need to split it to some separate and new view within
> > the pg_buffercache extension, but that is going to be slow, yet still
> > provide valid results.
>
> Yup.

OK, so I've made that NUMA inquiry (now with that "volatile touch" to get
valid results for not-yet-used memory) into a new and separate
pg_buffercache_numa view. This avoids the problem that somebody would
automatically run into this slow path when using pg_buffercache.

> > > But maybe we could use get_mempolicy() only on "valid" buffers i.e.
> > > ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
> >
> > Different perspective: I wanted to use the same approach in the new
> > pg_shmemallocations_numa, but that won't cut it there. The other idea
> > that came to my mind is to issue move_pages() from the backend that has
> > already used all of those pages. That literally means one of the below
> > ideas:
> > 1. from somewhere like checkpointer / bgwriter?
> > 2. add touching memory on backend startup like always (sic!)
> > 3. or just attempt to read/touch memory addr just before calling
> > move_pages(). E.g. this last option is just two lines:
> >
> > if (os_page_ptrs[blk2page + j] == 0) {
> > +    volatile uint64 touch pg_attribute_unused();
> >      os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) +
> >          (os_page_size * j);
> > +    touch = *(uint64 *) os_page_ptrs[blk2page + j];
> > }
> >
> > and it seems to work while still issuing many fewer syscalls with
> > move_pages() across backends, well at least here.
>
> One of the main issues I see with 1. and 2. is that we would not get
> accurate results should the kernel decide to migrate the pages.
> Indeed, the process doing the move_pages() call needs to have accessed the
> pages more recently than any kernel migrations to see accurate locations.

We never get a fully accurate state anyway, as zone memory migration might be
happening while we query it. In theory we could add something to e.g.
checkpointer/bgwriter that would run the inquiry on demand and report it back
somehow through shared memory(?), but I'm somewhat afraid of that because, as
stated at the end of this email, it might take some time (well, we probably
wouldn't need to "touch memory" then after all, as all of it is active), but
that's still an impact on those bgworkers. Somehow I feel safer if that code
is NOT part of a bgworker.

> OTOH, one of the main issues that I see with 3. is that the monitoring
> could probably influence the kernel's decision to start pages migration
> (I'm not 100% sure, but I could imagine it may influence the kernel's
> decision due to having to read/touch the pages).
>
> But I'm thinking: do we really need to know the page location of every
> single page? I think what we want to see is if the pages are "equally"
> distributed on all the nodes or are somehow "stuck" to one (or more)
> nodes. In that case what about using get_mempolicy() but on a subset of
> the buffer cache? (say every Nth buffer or contiguous chunks). We could
> create a new function that would accept a "sampling distance" as parameter
> for example, thoughts?

The way I envision it (and I think what Andres wanted, not sure, still yet to
see him comment on all of this) is to give PG devs a way to quickly spot NUMA
imbalances, even for a single relation. Probably some DBA in the wild could
also query it to see how PG/kernel distributes memory from time to time. It
seems to be more of a debugging and coding aid for future NUMA optimizations,
rather than something used by monitoring constantly.
I would even dare to say it would require --enable-debug (or some other
developer-only toggle), but apparently there's no need to hide it like that
if those are separate views.

Changes since the previous version:
0. rebase due to the recent OAuth commit introducing libcurl
1. cast to uint64 for NBuffers as you found out
2. put stuff into pg_buffercache_numa
3. 0003 adds pg_shmem_numa_allocations

Or should we rather call it pg_shmem_numa_zones or maybe just pg_shm_numa?

If there would be agreement that this is the way we want to have it (from the
backend and not from checkpointer), here's what's left on the table to be
done here:

a. isn't there something quicker for touching / page-faulting memory? If not
then maybe add CHECK_FOR_INTERRUPTS() there? BTW I've tried additional
MAP_POPULATE for PG_MMAP_FLAGS, but that didn't help (it probably only works
for the parent//postmaster). I've also tried MADV_POPULATE_READ (5.14+
kernels only) and that seems to work too:

+    rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ);
+    if (rc != 0) {
+        elog(NOTICE, "madvise() failed");
+    }
[..]
-    volatile uint64 touch pg_attribute_unused();
     os_page_ptrs[i] = (char *) ent->location + (i * os_page_size);
-    touch = *(uint64 *) os_page_ptrs[i];

With volatile-touching memory or MADV_POPULATE_READ the result seems to be
reliable (s_b 128MB here):

postgres@postgres:1234 : 14442 # select * from pg_shmem_numa_allocations order by numa_size desc;
                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 | 134221824
 XLOG Ctl                                       |            0 |   4206592
 Buffer Descriptors                             |            0 |   1048576
 transaction                                    |            0 |    528384
 Checkpointer Data                              |            0 |    524288
 Checkpoint BufferIds                           |            0 |    327680
 Shared Memory Stats                            |            0 |    311296
[..]
Without at least one of those two, a new backend reports complete garbage:

                      name                      | numa_zone_id | numa_size
------------------------------------------------+--------------+-----------
 Buffer Blocks                                  |            0 |    995328
 Shared Memory Stats                            |            0 |    245760
 shmInvalBuffer                                 |            0 |     65536
 Buffer Descriptors                             |            0 |     65536
 Backend Status Array                           |            0 |     61440
 serializable                                   |            0 |     57344
[..]

b. refactor shared code so that it goes into src/port (but with Linux-only
support so far)
c. should we use a MemoryContext in pg_get_shmem_numa_allocations or not?
d. fix tests, indent it, docs, make cfbot happy

As for the sampling: dunno, fine for me. As an optional argument? But
wouldn't it be better to find a way for it to actually be quick?

OK, so here's a larger test, on 512GB with 8x NUMA nodes and s_b set to
128GB, with numactl --interleave=all pg_ctl start:

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 Buffer Blocks                                  |            0 | 17179869184
 Buffer Blocks                                  |            1 | 17179869184
 Buffer Blocks                                  |            2 | 17179869184
 Buffer Blocks                                  |            3 | 17179869184
 Buffer Blocks                                  |            4 | 17179869184
 Buffer Blocks                                  |            5 | 17179869184
 Buffer Blocks                                  |            6 | 17179869184
 Buffer Blocks                                  |            7 | 17179869184
 Buffer IO Condition Variables                  |            0 |    33554432
 Buffer IO Condition Variables                  |            1 |    33554432
 Buffer IO Condition Variables                  |            2 |    33554432
[..]

But it takes 23s. Yes, it takes 23s just to gather that info with the memory
touch, but that's ~128GB of memory, and the backend is hardly responsive
meanwhile (lack of C_F_I()). By default, without numactl's interleave=all,
you get a clear picture of the lack of NUMA awareness in the PG shared
segment (just as Andres presented, but now it is evident; well, it is subject
to autobalancing of course):

postgres=# select * from pg_shmem_numa_allocations ;
                      name                      | numa_zone_id |  numa_size
------------------------------------------------+--------------+-------------
[..]
 commit_timestamp                               |            0 |     2097152
 commit_timestamp                               |            1 |     6291456
 commit_timestamp                               |            2 |           0
 commit_timestamp                               |            3 |           0
 commit_timestamp                               |            4 |           0
[..]
 transaction                                    |            0 |    14680064
 transaction                                    |            1 |           0
 transaction                                    |            2 |           0
 transaction                                    |            3 |           0
 transaction                                    |            4 |     2097152
[..]

Somehow without interleave it is very quick too.

-J.
Attachment
Hi,

On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
> On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > OTOH, one of the main issues that I see with 3. is that the monitoring
> > could probably influence the kernel's decision to start pages migration
> > (I'm not 100% sure, but I could imagine it may influence the kernel's
> > decision due to having to read/touch the pages).
> >
> > But I'm thinking: do we really need to know the page location of every
> > single page? I think what we want to see is if the pages are "equally"
> > distributed on all the nodes or are somehow "stuck" to one (or more)
> > nodes. In that case what about using get_mempolicy() but on a subset of
> > the buffer cache? (say every Nth buffer or contiguous chunks). We could
> > create a new function that would accept a "sampling distance" as
> > parameter for example, thoughts?
>
> The way I envision it (and I think what Andres wanted, not sure, still yet
> to see him comment on all of this) is to give PG devs a way to quickly
> spot NUMA imbalances, even for a single relation.

Yea. E.g. for some benchmark workloads, whether the root btree page is on the
same NUMA node as the workload or not makes a roughly 2x perf difference.
It's really hard to determine that today.

> If there would be agreement that this is the way we want to have it (from
> the backend and not from checkpointer), here's what's left on the table to
> be done here:
> a. isn't there something quicker for touching / page-faulting memory ?

If you actually fault in a page, the kernel has to allocate the memory and
then zero it out. That rather severely limits the throughput...

> If not then maybe add CHECK_FOR_INTERRUPTS() there?

Should definitely be there.

> BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> help (it probably only works for parent//postmaster).

Yes, needs to be in postmaster.
Does the issue with "new" backends seeing pages as not present exist both
with and without huge pages?

FWIW, what you posted fails on CI:
https://cirrus-ci.com/task/5114213770723328

Probably some ifdefs are missing. The sanity-check task configures with
minimal dependencies, which is why you're seeing this even on linux.

> b. refactor shared code so that it goes into src/port (but with Linux-only
> support so far)
> c. should we use a MemoryContext in pg_get_shmem_numa_allocations or not?

You mean a specific context instead of CurrentMemoryContext?

> diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> index 91b51142d2e..e3b7554d9e8 100644
> --- a/.cirrus.tasks.yml
> +++ b/.cirrus.tasks.yml
> @@ -436,6 +436,10 @@ task:
>      SANITIZER_FLAGS: -fsanitize=address
>      PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
>
> +
> +  # FIXME: use or not the libnuma?
> +  # --with-libnuma \
> +  #
>    # Normally, the "relation segment" code basically has no coverage in our
>    # tests, because we (quite reasonably) don't generate tables large
>    # enough in tests. We've had plenty bugs that we didn't notice due the

I don't see why not.

> diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> new file mode 100644
> index 00000000000..e5b3d1f7dd2
> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,30 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the function.
> +DROP FUNCTION pg_buffercache_pages() CASCADE;

Why? I think that's going to cause problems, as the pg_buffercache view
depends on it, and user views might in turn depend on pg_buffercache. I
think CASCADE is rarely, if ever, ok to use in an extension script.
> +CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean)
> +RETURNS SETOF RECORD
> +AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
> +LANGUAGE C PARALLEL SAFE;
> +
> +-- Create a view for convenient access.
> +CREATE OR REPLACE VIEW pg_buffercache AS
> +        SELECT P.* FROM pg_buffercache_pages(false) AS P
> +        (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +         relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +         pinning_backends int4);
> +
> +CREATE OR REPLACE VIEW pg_buffercache_numa AS
> +        SELECT P.* FROM pg_buffercache_pages(true) AS P
> +        (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
> +         relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
> +         pinning_backends int4, numa_zone_id int4);
> +
> +-- Don't want these to be available to public.
> +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;

We grant pg_monitor SELECT on pg_buffercache; I think we should do the same
for _numa?

> @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
>                  else
>                          fctx->record[i].isvalid = false;
>
> +#ifdef USE_LIBNUMA
> +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +                        blk2page = (int) i * pages_per_blk;

BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

> +                        j = 0;
> +                        do {
> +                                /*
> +                                 * Many buffers can point to the same page, but we want to
> +                                 * query just the first address.
> +                                 *
> +                                 * In order to get reliable results we also need to touch memory
> +                                 * pages so that inquiry about NUMA zone doesn't return -2.
> +                                 */
> +                                if (os_page_ptrs[blk2page + j] == 0) {
> +                                        volatile uint64 touch pg_attribute_unused();
> +                                        os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) + (os_page_size * j);
> +                                        touch = *(uint64 *) os_page_ptrs[blk2page + j];
> +                                }
> +                                j++;
> +                        } while (j < (int) pages_per_blk);
> +#endif
> +

Why is this done before we even have gotten -2 back? Even if we need it, it
seems like we ought to defer this until necessary.

> +#ifdef USE_LIBNUMA
> +        if (query_numa) {
> +                /* According to numa(3) it is required to initialize the library even if that's a no-op. */
> +                if (numa_available() == -1) {
> +                        pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> +                        elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
> +                } else {
> +                        /* Amortize the number of pages we need to query about */
> +                        if (numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> +                                elog(ERROR, "failed NUMA pages inquiry status");
> +                        }

I wonder if we ought to override numa_error() so we can display more useful
errors.

> +                        LWLockAcquire(ShmemIndexLock, LW_SHARED);

Doing multiple memory allocations while holding an lwlock is probably not a
great idea, even if the lock normally isn't contended.

> +        os_page_ptrs = palloc(sizeof(void *) * os_page_count);
> +        os_pages_status = palloc(sizeof(int) * os_page_count);

Why do this in every loop iteration?

Greetings,

Andres Freund
Hi,

On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote:
> Does the issue with "new" backends seeing pages as not present exist both
> with and without huge pages?

That's a good point, and from what I can see the result is correct with huge
pages being used (meaning all processes see the same NUMA node assignment
regardless of access patterns). That said, wouldn't it be too strong to
impose a restriction that huge_pages must be enabled?

Jakub, thanks for the new patch version! FWIW, I did not look closely at the
code yet (I just made the minor changes already shared, to get valid results
with a non-tiny shared buffer size). I'll look closely at the code for sure
once we all agree on the design part of it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK, the first sane version is attached, with the new src/port/pg_numa.c
boilerplate in 0001. Fixed some bugs too; there is one remaining optimization
to be done (see that `static` question later). Docs/tests are still missing.

QQ: I'm still wondering if there's a better way of exposing multiple PG shmem
entries pointing to the same page (think of something hot: PROCLOCK or
ProcArray). Wouldn't it make sense (in some future thread/patch) to expose
this info somehow via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of an easy
way of showing that potential NUMA auto balancing could lead to TLB NUMA
shootdowns (not that it is happening or counting, just identifying it as a
problem in allocation). Or does that stuff not make sense, as we already have
pg_shmem_allocations.{off|size} and we could derive such info from it (after
all it is for devs?)?

postgres@postgres:1234 : 18843 # select name,off,off+allocated_size,allocated_size from pg_shmem_allocations order by off;
                      name                      |    off    | ?column?  | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 | 147114240 |            128
 Proc Array                                     | 147274752 | 147275392 |            640
 KnownAssignedXids                              | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 | 147319808 |           8960
 Backend Status Array                           | 147319808 | 147381248 |          61440

postgres@postgres:1234 : 18924 # select * from pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array', 'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e.
ProcArray ends and right afterwards KnownAssignedXids starts; both are hot,
but they are on the same huge page and NUMA node.

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on the
> > table to be done here:
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory
> and then zero it out. That rather severely limits the throughput...

OK, no comments about that madvise(MADV_POPULATE_READ), so I'm sticking to
pointers.

> > If not then maybe add CHECK_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that
> > didn't help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both
> with and without huge pages?

Please see the attached file for more verbose results, but in short it is
like below:

patch(-touchpages)               hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on   INVALID RESULTS (-2)
patch(touchpages)                hugepages=off  CORRECT RESULT
patch(touchpages)                hugepages=on   CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on   INVALID RESULTS (-2)

IMVHO, the only other thing that could work here (but still page-faulting)
is that 5.14+ madvise(MADV_POPULATE_READ). Tests are welcome; maybe it is
kernel version dependent.

BTW: and yes, you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE
during startup; for our case, child backends apparently don't come up with
pre-faulted page attachments across fork().

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing.
The sanity-check task configures with minimal dependencies, which is why
you're seeing this even on linux.

Hopefully fixed; we'll see what cfbot says, I'm flying blind with all of
this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use a MemoryContext in pg_get_shmem_numa_allocations or
> > not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >      SANITIZER_FLAGS: -fsanitize=address
> >      PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +  # FIXME: use or not the libnuma?
> > +  # --with-libnuma \
> > +  #
> >    # Normally, the "relation segment" code basically has no coverage in
> >    # our tests, because we (quite reasonably) don't generate tables
> >    # large enough in tests. We've had plenty bugs that we didn't notice
> >    # due the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache. I
> think CASCADE is rarely, if ever, ok to use in an extension script.

... it's just me cutting corners :^), fixed now.
[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT on pg_buffercache; I think we should do the
> same for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >                  else
> >                          fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +                        blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed
> here.

Fixed, thanks, I was looking for something like this! Is that +1 in v4 good?

> > +                        j = 0;
> > +                        do {
> > +                                /*
> > +                                 * Many buffers can point to the same page, but we want to
> > +                                 * query just the first address.
> > +                                 *
> > +                                 * In order to get reliable results we also need to touch memory
> > +                                 * pages so that inquiry about NUMA zone doesn't return -2.
> > +                                 */
> > +                                if (os_page_ptrs[blk2page + j] == 0) {
> > +                                        volatile uint64 touch pg_attribute_unused();
> > +                                        os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) + (os_page_size * j);
> > +                                        touch = *(uint64 *) os_page_ptrs[blk2page + j];
> > +                                }
> > +                                j++;
> > +                        } while (j < (int) pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it,
> it seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even do a `static` with `has_this_run_earlier`
and just perform this work only once, during the first call?

> > +#ifdef USE_LIBNUMA
> > +        if (query_numa) {
> > +                /* According to numa(3) it is required to initialize the library even if that's a no-op.
> > +                 */
> > +                if (numa_available() == -1) {
> > +                        pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                        elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
> > +                } else {
> > +                        /* Amortize the number of pages we need to query about */
> > +                        if (numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                                elog(ERROR, "failed NUMA pages inquiry status");
> > +                        }
>
> I wonder if we ought to override numa_error() so we can display more
> useful errors.

Another question without an easy answer, as I never hit this error from
numa_move_pages(); one gets invalid stuff in *os_pages_status instead. BUT:
most of our patch just uses things that cannot fail as per libnuma usage.
One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's
hard to unmount it), and then still most of numactl stuff works as euid != 0,
but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance
information in sysfs: Permission denied", or the same story with numactl
-C 678 date. So unless we start way heavier use of libnuma (not just for
observability), there's like no point in that right now (?)

Contrary to that: we can just do a variadic elog() for that; I've put in
some code, but no idea if that works fine...

[..]
> Doing multiple memory allocations while holding an lwlock is probably not
> a great idea, even if the lock normally isn't contended.
[..]
> Why do this in every loop iteration?

Both fixed.

-J.
On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote: > > Does the issue with "new" backends seeing pages as not present exist both with > > and without huge pages? > > That's a good point and from what I can see it's correct with huge pages being > used (it means all processes see the same NUMA node assignment regardless of > access patterns). Hi Bertrand, please see the nearby thread. I've got quite the opposite results. I need to page-fault the memory or I get invalid results ("-2"). What kernel version are you using? (I've tried it on two 6.10.x series kernels, virtualized in both cases; one was EPYC [real NUMA, but a VM, so not real hardware]). > That said, wouldn't that be too strong to impose a restriction that huge_pages > must be enabled? > > Jakub, thanks for the new patch version! FWIW, I did not look closely to the > code yet (just did the minor changes already shared to have valid result with non > tiny shared buffer size). I'll look closely at the code for sure once we all agree > on the design part of it. Cool, I think we are pretty close actually, but others might have a different perspective. -J.
Hi, On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > FWIW, what you posted fails on CI: > > https://cirrus-ci.com/task/5114213770723328 > > > > Probably some ifdefs are missing. The sanity-check task configures with > > minimal dependencies, which is why you're seeing this even on linux. > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > of this CI stuff... FYI, you can enable CI on a github repo, to see results without posting to the list: https://github.com/postgres/postgres/blob/master/src/tools/ci/README Greetings, Andres Freund
On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > > FWIW, what you posted fails on CI: > > > https://cirrus-ci.com/task/5114213770723328 > > > > > > Probably some ifdefs are missing. The sanity-check task configures with > > > minimal dependencies, which is why you're seeing this even on linux. > > > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > > of this CI stuff... > > FYI, you can enable CI on a github repo, to see results without posting to the > list: > https://github.com/postgres/postgres/blob/master/src/tools/ci/README Thanks, I'll take a look into it. Meanwhile v5 is attached with slight changes to try to make cfbot happy: 1. fixed tests and added tiny copy-cat basic tests for pg_buffercache_numa and pg_shm_numa_allocations views 2. win32 doesn't have sysconf() No docs yet. -J.
Hi, On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote: > On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > > > FWIW, what you posted fails on CI: > > > > https://cirrus-ci.com/task/5114213770723328 > > > > > > > > Probably some ifdefs are missing. The sanity-check task configures with > > > > minimal dependencies, which is why you're seeing this even on linux. > > > > > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > > > of this CI stuff... > > > > FYI, you can enable CI on a github repo, to see results without posting to the > > list: > > https://github.com/postgres/postgres/blob/master/src/tools/ci/README > > Thanks, I'll take a look into it. > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: Thanks for the updated version! FWIW, I had to do a few changes to get an error-free compile with either autoconf or meson, both with and without the libnuma configure option. Sharing here as .txt files: v5-0004-configure-changes.txt: changes in configure + adds a test for numa.h availability and a call to numa_available. v5-0005-pg_numa.c-changes.txt: moving the <unistd.h> outside of USE_LIBNUMA because the file is still using sysconf() in the non-NUMA code path. Also, removed a ";" in "#endif;" in the non-NUMA code path. v5-0006-meson.build-changes.txt. Those apply on top of your v5. Also the pg_buffercache test fails without the libnuma configure option. Maybe some tests should depend on the libnuma configure option. Still did not look closely at the code. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi Jakub, On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote: > On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote: > > > Does the issue with "new" backends seeing pages as not present exist both with > > > and without huge pages? > > > > That's a good point and from what I can see it's correct with huge pages being > > used (it means all processes see the same NUMA node assignment regardless of > > access patterns). > > Hi Bertrand , please see that nearby thread. I've got quite the > opposite results. I need page-fault memory or I get invalid results > ("-2"). What kernel version are you using ? (I've tried it on two > 6.10.x series kernels , virtualized in both cases; one was EPYC [real > NUMA, but not VM so not real hardware]) Thanks for sharing your numbers! It looks like with hp enabled, the shared_buffers size plays a role. 1. With hp, shared_buffers 4GB: huge_pages_status ------------------- on (1 row) shared_buffers ---------------- 4GB (1 row) NOTICE: os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+-------- | 507618 0 | 1054 -2 | 15616 (3 rows) 2. With hp, shared_buffers 23GB: huge_pages_status ------------------- on (1 row) shared_buffers ---------------- 23GB (1 row) NOTICE: os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+--------- | 2997974 0 | 16682 (2 rows) 3. no hp, shared_buffers 23GB: huge_pages_status ------------------- off (1 row) shared_buffers ---------------- 23GB (1 row) ERROR: extension "pg_buffercache" already exists NOTICE: os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 2997975 -2 | 16482 1 | 199 (3 rows) Maybe the kernel is taking some decisions based on the HugePages_Rsvd, I've no idea. 
Anyway, there is little that we can do, and the "touchpages" patch seems to provide "accurate" results. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: [..] > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: > > Thanks for the updated version! > > FWIW, I had to do a few changes to get an error free compiling experience with > autoconf/or meson and both with or without the libnuma configure option. > > Sharing here as .txt files: > > Also the pg_buffercache test fails without the libnuma configure option. Maybe > some tests should depend of the libnuma configure option. [..] Thank you so much for this, Bertrand! I've applied those, played a little bit with configure and meson, reproduced the test error, and fixed it by silencing that NOTICE in tests. So v6 is attached even before I get a chance to start using that CI. Still waiting for some input and tests regarding that earlier touchpages attempt, docs are still missing... -J.
Hi, On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote: > On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > [..] > > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: > > > > Thanks for the updated version! > > > > FWIW, I had to do a few changes to get an error free compiling experience with > > autoconf/or meson and both with or without the libnuma configure option. > > > > Sharing here as .txt files: > > > > Also the pg_buffercache test fails without the libnuma configure option. Maybe > > some tests should depend of the libnuma configure option. > [..] > > Thank you so much for this Bertrand ! > > I've applied those , played a little bit with configure and meson and > reproduced the test error and fixed it by silencing that NOTICE in > tests. So v6 is attached even before I get a chance to start using > that CI. Still waiting for some input and tests regarding that earlier > touchpages attempt, docs are still missing... Thanks for the new version! I did some tests and it looks like it's giving correct results. I don't see -2 anymore and every backend reports correct distribution (with or without hp, with "small" or "large" shared buffer). A few random comments: === 1 + /* + * This is for gathering some NUMA statistics. We might be using + * various DB block sizes (4kB, 8kB , .. 
32kB) that end up being + allocated in various different OS memory pages sizes, so first we + need to understand the OS memory page size before calling + move_pages() + */ + os_page_size = pg_numa_get_pagesize(); + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; + pages_per_blk = (float) BLCKSZ / os_page_size; + + elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f", + os_page_count, os_page_size, pages_per_blk); + + os_page_ptrs = palloc(sizeof(void *) * os_page_count); + os_pages_status = palloc(sizeof(int) * os_page_count); + memset(os_page_ptrs, 0, sizeof(void *) * os_page_count); + + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page mapping code here + */ + memset(os_pages_status, 0xff, sizeof(int) * os_page_count); I think that the if (query_numa) check should also wrap that entire section of code. === 2 + if (query_numa) + { + blk2page = (int) i * pages_per_blk; + j = 0; + do + { This check is done for every page. I wonder if it would not make sense to create a brand new function for pg_buffercache_numa and just leave the current pg_buffercache_pages() as it is. That said, it would be great to avoid code duplication as much as possible though, maybe using a shared populate_buffercache_entry() or such helper function? === 3 +#define ONE_GIGABYTE 1024*1024*1024 + if ((i * os_page_size) % ONE_GIGABYTE == 0) + CHECK_FOR_INTERRUPTS(); + } Did you observe a noticeable performance impact when calling CHECK_FOR_INTERRUPTS() for every page instead? (I don't see any with a 30GB shared buffer). I have the feeling that we could get rid of the "ONE_GIGABYTE" check. === 4 + pfree(os_page_ptrs); + pfree(os_pages_status); Not sure that's needed, we should be in a short-lived memory context here (ExprContext or such). === 5 +/* SQL SRF showing NUMA zones for allocated shared memory */ +Datum +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS) +{ That's a good idea. 
+ for (i = 0; i < shm_ent_page_count; i++) + { + /* + * In order to get reliable results we also need to touch memory + * pages so that inquiry about NUMA zone doesn't return -2. + */ + volatile uint64 touch pg_attribute_unused(); + + page_ptrs[i] = (char *) ent->location + (i * os_page_size); + pg_numa_touch_mem_if_required(touch, page_ptrs[i]); That sounds right. Could we also avoid some code duplication with pg_get_shmem_allocations()? Also same remarks about pfree() and ONE_GIGABYTE as above. A few other things: ==== 6 +++ b/src/backend/storage/ipc/shmem.c @@ -73,6 +73,7 @@ #include "storage/shmem.h" #include "storage/spin.h" #include "utils/builtins.h" +#include "port/pg_numa.h" Not at the right position, should be between those 2: #include "miscadmin.h" #include "storage/lwlock.h" ==== 7 +/*------------------------------------------------------------------------- + * + * pg_numa.h + * Miscellaneous functions for bit-wise operations. description is not correct. Also the "Copyright (c) 2019-2025" might be "Copyright (c) 2025" instead. === 8 +++ b/src/port/pg_numa.c @@ -0,0 +1,150 @@ +/*------------------------------------------------------------------------- + * + * numa.c + * Basic NUMA portability routines s/numa.c/pg_numa.c/ ? === 9 +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -6,6 +6,7 @@ * contrib/pg_buffercache/pg_buffercache_pages.c *------------------------------------------------------------------------- */ +#include "pg_config.h" #include "postgres.h" Is this new include needed? #include "access/htup_details.h" @@ -13,10 +14,12 @@ #include "funcapi.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "port/pg_numa.h" +#include "storage/pg_shmem.h" not in the right order. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote: > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > Why is this done before we even have gotten -2 back? Even if we need it, it > > seems like we ought to defer this until necessary. > > Not fixed yet: maybe we could even do a `static` with > `has_this_run_earlier` and just perform this work only once during the > first time? Not sure I get your idea, could you share what the code would look like? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi! On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > I did some tests and it looks like it's giving correct results. I don't see -2 > anymore and every backend reports correct distribution (with or without hp, > with "small" or "large" shared buffer). Cool! Attached is v7, which is fully green on the cirrus CI that Andres recommended; we will see how cfbot reacts to this. BTW docs are still missing. When started with proper interleave=all with s_b=64GB, hugepages and 4 NUMA nodes (1 socket with 4 CCDs) after a small pgbench run: postgres=# select buffers_used, buffers_unused from pg_buffercache_summary(); buffers_used | buffers_unused --------------+---------------- 170853 | 8217755 (1 row) postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+--------- 0 | 42752 1 | 42752 2 | 42752 3 | 42597 | 8217755 Time: 5828.100 ms (00:05.828) postgres=# select 3*42752+42597; ?column? ---------- 170853 postgres=# select * from pg_shmem_numa_allocations order by numa_size desc limit 12; DEBUG: NUMA: page-faulting shared memory segments for proper NUMA readouts name | numa_zone_id | numa_size --------------------+--------------+------------- Buffer Blocks | 0 | 17179869184 Buffer Blocks | 1 | 17179869184 Buffer Blocks | 3 | 17179869184 Buffer Blocks | 2 | 17179869184 Buffer Descriptors | 2 | 134217728 Buffer Descriptors | 1 | 134217728 Buffer Descriptors | 0 | 134217728 Buffer Descriptors | 3 | 134217728 Checkpointer Data | 1 | 67108864 Checkpointer Data | 0 | 67108864 Checkpointer Data | 2 | 67108864 Checkpointer Data | 3 | 67108864 Time: 68.579 ms > A few random comments: > > === 1 > [..] > I think that the if (query_numa) check should also wrap that entire section of code. Done. 
> === 2 > > + if (query_numa) > + { > + blk2page = (int) i * pages_per_blk; > + j = 0; > + do > + { > > This check is done for every page. I wonder if it would not make sense > to create a brand new function for pg_buffercache_numa and just let the > current pg_buffercache_pages() as it is. That said it would be great to avoid > code duplication as much a possible though, maybe using a shared > populate_buffercache_entry() or such helper function? Well, I've made query_numa a parameter there simply to avoid that code duplication in the first place; look at those TupleDescInitEntry() calls... IMHO hardly anybody uses pg_buffercache, but maybe we could add unlikely() there to hint the compiler towards a smaller routine and reduce the complexity of the main one? (assuming NUMA inquiry is going to be rare). > === 3 > > +#define ONE_GIGABYTE 1024*1024*1024 > + if ((i * os_page_size) % ONE_GIGABYTE == 0) > + CHECK_FOR_INTERRUPTS(); > + } > > Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS() > for every page instead? (I don't see with a 30GB shared buffer). I've the > feeling that we could get rid of the "ONE_GIGABYTE" check. You are right, and no: it was simply a premature optimization attempt on my part. On closer look CFI apparently already uses unlikely() and is really cheap, so I've removed that. > === 4 > > + pfree(os_page_ptrs); > + pfree(os_pages_status); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). Yes, I had it there just for illustrative and stylistic purposes, but you're right; removed. > === 5 > > +/* SQL SRF showing NUMA zones for allocated shared memory */ > +Datum > +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS) > +{ [..] > > + for (i = 0; i < shm_ent_page_count; i++) > + { > + /* > + * In order to get reliable results we also need to touch memory > + * pages so that inquiry about NUMA zone doesn't return -2. 
> + */ > + volatile uint64 touch pg_attribute_unused(); > + > + page_ptrs[i] = (char *) ent->location + (i * os_page_size); > + pg_numa_touch_mem_if_required(touch, page_ptrs[i]); > > That sounds right. > > Could we also avoid some code duplication with pg_get_shmem_allocations()? Not sure I understand: do you want to avoid code duplication between pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(), or between pg_get_shmem_numa_allocations() and pg_buffercache_pages(query_numa = true)? > > Also same remarks about pfree() and ONE_GIGABYTE as above. Fixed. > A few other things: > > ==== 6 > > +#include "port/pg_numa.h" > Not at the right position, should be between those 2: > > #include "miscadmin.h" > #include "storage/lwlock.h" Fixed. > ==== 7 > > +/*------------------------------------------------------------------------- > + * > + * pg_numa.h > + * Miscellaneous functions for bit-wise operations. > > description is not correct. Also the "Copyright (c) 2019-2025" might be > "Copyright (c) 2025" instead. Fixed. > === 8 > > +++ b/src/port/pg_numa.c > @@ -0,0 +1,150 @@ > +/*------------------------------------------------------------------------- > + * > + * numa.c > + * Basic NUMA portability routines > > s/numa.c/pg_numa.c/ ? Fixed. > === 9 > > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > @@ -6,6 +6,7 @@ > * contrib/pg_buffercache/pg_buffercache_pages.c > *------------------------------------------------------------------------- > */ > +#include "pg_config.h" > #include "postgres.h" > > Is this new include needed? Removed; I don't remember how it arrived here, it must have been some artifact of earlier attempts. > #include "access/htup_details.h" > @@ -13,10 +14,12 @@ > #include "funcapi.h" > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > +#include "port/pg_numa.h" > +#include "storage/pg_shmem.h" > > not in the right order. Fixed. 
And also those from the nearby message: On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote: > > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Why is this done before we even have gotten -2 back? Even if we need it, it > > > seems like we ought to defer this until necessary. > > > > Not fixed yet: maybe we could even do a `static` with > > `has_this_run_earlier` and just perform this work only once during the > > first time? > > Not sure I get your idea, could you share what the code would look like? Please see pg_buffercache_pages; I've just added a static bool firstUseInBackend: postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000 DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts [..] postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count [..] The same was done to pg_get_shmem_numa_allocations. Also CFbot/cirrus was getting: > [11:01:05.027] In file included from ../../src/include/postgres.h:49, > [11:01:05.027] from pg_buffercache_pages.c:10: > [11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’: > [11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=] Fixed with %zu (for size_t) instead of %ld. > Linux - Debian Bookworm - Autoconf got: > [10:42:59.216] checking numa.h usability... no > [10:42:59.268] checking numa.h presence... no > [10:42:59.286] checking for numa.h... 
no > [10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma I've added libnuma1 to cirrus in a similar vein to libcurl to avoid this. > [13:50:47.449] gcc -m32 @src/backend/postgres.rsp > [13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format I also got an error in the 32-bit build, as numa.h is there, but apparently libnuma provides only x86_64 libraries. Anyway, 32-bit (even with PAE) plus NUMA doesn't seem to make sense, so I've added -Dlibnuma=off for that build. -J.
Hi, On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote: > Hi! > > On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > I did some tests and it looks like it's giving correct results. I don't see -2 > > anymore and every backend reports correct distribution (with or without hp, > > with "small" or "large" shared buffer). > > Cool! Attached is v7 Thanks for the new version! > > === 2 > > > > + if (query_numa) > > + { > > + blk2page = (int) i * pages_per_blk; > > + j = 0; > > + do > > + { > > > > This check is done for every page. I wonder if it would not make sense > > to create a brand new function for pg_buffercache_numa and just let the > > current pg_buffercache_pages() as it is. That said it would be great to avoid > > code duplication as much a possible though, maybe using a shared > > populate_buffercache_entry() or such helper function? > > Well, I've made query_numa a parameter there simply to avoid that code > duplication in the first place, look at those TupleDescInitEntry()... Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() or such function: to put the "duplicated" code in it and then use this shared function in pg_buffercache_pages() and in the new numa related one. > IMHO rarely anybody uses pg_buffercache, but we could add unlikely() I think unlikely() should be used for optimization based on code path likelihood, not based on how often users might use a feature. > > === 5 > > > > Could we also avoid some code duplication with pg_get_shmem_allocations()? > > Not sure I understand do you want to avoid code duplication > pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or > pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa = > true) ? I meant to say avoid code duplication between pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(). It might be possible to create a shared function for them too. 
That said, it looks like the savings (if any) would not be that much, so maybe just forget about it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Cool! Attached is v7 > Thanks for the new version! ... and another one: 7b ;) > > > === 2 [..] > > Well, I've made query_numa a parameter there simply to avoid that code > > duplication in the first place, look at those TupleDescInitEntry()... > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > or such function: to put the "duplicated" code in it and then use this > shared function in pg_buffercache_pages() and in the new numa related one. OK, so I hastily attempted that in 7b; I had to do a larger refactor there to avoid code duplication between those two. I don't know which attempt is better though (7 vs 7b). > > IMHO rarely anybody uses pg_buffercache, but we could add unlikely() > > I think unlikely() should be used for optimization based on code path likelihood, > not based on how often users might use a feature. In 7b I've removed the unlikely(). For a moment I was thinking that you were concerned about keeping the loop over NBuffers as optimized as it can be, and that that was the reason for splitting the routines. > > > === 5 [..] > I meant to say avoid code duplication between pg_get_shmem_allocations() and > pg_get_shmem_numa_allocations(). It might be possible to create a shared > function for them too. That said, it looks like that the savings (if any), would > not be that much, so maybe just forget about it. Yeah, OK, so let's leave it at that. -J.
Hi, On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: >Hi, > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > or such function: to put the "duplicated" code in it and then use this > > shared function in pg_buffercache_pages() and in the new numa related one. > > OK, so hastily attempted that in 7b , I had to do a larger refactor > there to avoid code duplication between those two. I don't know which > attempt is better though (7 vs 7b).. > I'm attaching basically the earlier stuff (v7b) as v8 with the following minor changes: - docs are included - changed int8 to int4 in one function definition for numa_zone_id -J.
On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > Hi, > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > >Hi, > > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > > or such function: to put the "duplicated" code in it and then use this > > > shared function in pg_buffercache_pages() and in the new numa related one. > > > > OK, so hastily attempted that in 7b , I had to do a larger refactor > > there to avoid code duplication between those two. I don't know which > > attempt is better though (7 vs 7b).. > > > > I'm attaching basically the earlier stuff (v7b) as v8 with the > following minor changes: > - docs are included > - changed int8 to int4 in one function definition for numa_zone_id .. and v9 attached because cfbot partially complained about .cirrus.tasks.yml being adjusted recently (it seems master is hot these days). -J.
Hi, On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote: > On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > > > Hi, > > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak > > <jakub.wartak@enterprisedb.com> wrote: > > >Hi, > > > > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > > > or such function: to put the "duplicated" code in it and then use this > > > > shared function in pg_buffercache_pages() and in the new numa related one. > > > > > > OK, so hastily attempted that in 7b , I had to do a larger refactor > > > there to avoid code duplication between those two. I don't know which > > > attempt is better though (7 vs 7b).. > > > > > > > I'm attaching basically the earlier stuff (v7b) as v8 with the > > following minor changes: > > - docs are included > > - changed int8 to int4 in one function definition for numa_zone_id > > .. and v9 attached because cfbot partially complained about > .cirrus.tasks.yml being adjusted recently (it seems master is hot > these days). > Thanks for the new version! Some random comments on 0001: === 1 It does not compile "alone". It's missing: +++ b/src/include/storage/pg_shmem.h @@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */ extern PGDLLIMPORT int shared_memory_type; extern PGDLLIMPORT int huge_pages; extern PGDLLIMPORT int huge_page_size; +extern PGDLLIMPORT int huge_pages_status; and --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -563,7 +563,7 @@ static int ssl_renegotiation_limit; */ int huge_pages = HUGE_PAGES_TRY; int huge_page_size; -static int huge_pages_status = HUGE_PAGES_UNKNOWN; +int huge_pages_status = HUGE_PAGES_UNKNOWN; Those come with 0002. So they have to be in 0001 instead. === 2 +else + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 +fi Maybe we should add the same test (checking for numa.h) for meson? 
=== 3 +# FIXME: filter-out / with/without with_libnuma? +LIBS += $(LIBNUMA_LIBS) It looks to me that we can remove those 2 lines. === 4 + Only supported on Linux. s/on Linux/on platforms for which the libnuma library is implemented/? I did a quick grep on "Only supported on" and it looks like that could be a more consistent wording. === 5 +#include "c.h" +#include "postgres.h" +#include "port/pg_numa.h" +#include "storage/pg_shmem.h" +#include <unistd.h> I had a closer look at other header files and it looks like it "should" be: #include "c.h" #include "postgres.h" #include <unistd.h> #ifdef WIN32 #include <windows.h> #endif #include "port/pg_numa.h" #include "storage/pg_shmem.h" And is "#include "c.h"" really needed? === 6 +/* FIXME not tested, might crash */ That's a bit scary. === 7 + * XXX: for now we issue just WARNING, but long-term that might depend on + * numa_set_strict() here s/here/here./ Did not look carefully at all the comments in 0001, 0002 and 0003 though. A few random comments regarding 0002: === 8 # create extension pg_buffercache; ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() RETURNS SETOF RECORD AS '$libdir/pg_buffercache', 'pg_buffercache_pages' LANGUAGE C PARALLEL SAFE" extension script file "pg_buffercache--1.2.sql", near line 7 While that's ok if 0003 is applied, I think that each individual patch should "fully" work individually. === 9 + do + { + if (os_page_ptrs[blk2page + j] == 0) blk2page + j will be repeated multiple times, better to store it in a local variable instead? === 10 + if (firstUseInBackend == true) if (firstUseInBackend) instead? === 11 + int j = 0, + blk2page = (int) i * pages_per_blk; I wonder if size_t is more appropriate for blk2page: size_t blk2page = (size_t)(i * pages_per_blk) maybe? 
=== 12 Since we know we'll iterate until pages_per_blk, would a for loop be more appropriate here? Something like: " for (size_t j = 0; j < pages_per_blk; j++) { if (os_page_ptrs[blk2page + j] == 0) { " === 13 + if (buf_state & BM_DIRTY) + fctx->record[i].isdirty = true; + else + fctx->record[i].isdirty = false; fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe? === 14 + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) + fctx->record[i].isvalid = true; + else + fctx->record[i].isvalid = false; fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) maybe? === 15 +populate_buffercache_entry(int i, BufferCachePagesContext *fctx) I wonder if "i" should get a more descriptive name? === 16 s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent with pg_stat_io_build_tuples for example). I now realize that I did propose populate_buffercache_entry() up-thread, sorry for changing my mind. === 17 + static bool firstUseInBackend = true; maybe we should give it a more descriptive name? Also I need to think more about how firstUseInBackend is used, for example: ==== 17.1 would it be better to define it as a file scope variable? (at least that would avoid passing it as an extra parameter in some functions). === 17.2 what happens if firstUseInBackend is set to false and later on the pages are moved to different NUMA nodes? Then pg_buffercache_numa_pages() is called again by a backend that already had set firstUseInBackend to false, would that provide accurate information? === 18 +Datum +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) +{ + FuncCallContext *funcctx; + BufferCachePagesContext *fctx; /* User function context. */ + bool query_numa = true; I don't think we need query_numa anymore in pg_buffercache_numa_pages(). 
I think that we can just ERROR (or NOTICE and return) here:

+ if (pg_numa_init() == -1)
+ {
+     elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable.");
+     query_numa = false;

and fully get rid of query_numa.

=== 19

And then here:

for (i = 0; i < NBuffers; i++)
{
    populate_buffercache_entry(i, fctx);
    if (query_numa)
        pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}

if (query_numa)
{
    if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
        elog(ERROR, "failed NUMA pages inquiry: %m");

    for (i = 0; i < NBuffers; i++)
    {
        int blk2page = (int) i * pages_per_blk;

        /*
         * Technically we can get errors too here and pass that to
         * user. Also we could somehow report single DB block spanning
         * more than one NUMA zone, but it should be rare.
         */
        fctx->record[i].numa_zone_id = os_pages_status[blk2page];
    }

maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

A few random comments regarding 0003:

=== 20

+     <indexterm zone="view-pg-shmem-numa-allocations">
+      <primary>pg_shmem_numa_allocations</primary>
+     </indexterm>
+
+     <para>
+      The <structname>pg_shmem_allocations</structname> view shows NUMA nodes

s/pg_shmem_allocations/pg_shmem_numa_allocations/?

=== 21

+        /* FIXME: should we release LWlock here ? */
+        elog(ERROR, "failed NUMA pages inquiry status: %m");

There is no need, see src/backend/storage/lmgr/README.

=== 22

+#define MAX_NUMA_ZONES 32 /* FIXME? */
+    Size        zones[MAX_NUMA_ZONES];

Could we rely on pg_numa_get_max_node() instead?

=== 23

+        if (s >= 0)
+            zones[s]++;

Should we also check that s is below a limit?

=== 24

Regarding how we make use of pg_numa_get_max_node(), are we sure there are no
possible holes? I mean, could a system have nodes 0, 1 and 3 but not 2?
Also, I don't think I'm a co-author; I think I'm just a reviewer (even if I
did a little in 0001).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

> Thanks for the new version!

v10 is attached with most fixes after review and one new thing introduced:
pg_numa_available() for run-time decisions inside tests, which was needed
after simplifying the code a little bit as you wanted. I've also fixed
-Dlibnuma=disabled as it was not properly implemented. There are a couple of
open items (minor/decision things), but most is fixed or answered:

> Some random comments on 0001:
>
> === 1
>
> It does not compile "alone". It's missing:
>
> [..]
> +extern PGDLLIMPORT int huge_pages_status;
> [..]
> -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> +int huge_pages_status = HUGE_PAGES_UNKNOWN;
>
> That comes with 0002, so it has to be in 0001 instead.

Ugh, good catch. I hadn't thought about it in isolation; they are separate
just to ease review, but should be committed together. Fixed.

> === 2
>
> +else
> +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> +fi
>
> Maybe we should add the same test (checking for numa.h) for meson?

TBH, I have no idea: the header may exist but it may not link, e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping meson and
autoconf in sync?

> === 3
>
> +# FIXME: filter-out / with/without with_libnuma?
> +LIBS += $(LIBNUMA_LIBS)
>
> It looks to me that we can remove those 2 lines.

Done.

> === 4
>
> +   Only supported on Linux.
>
> s/on Linux/on platforms for which the libnuma library is implemented/?
>
> I did a quick grep on "Only supported on" and it looks like that could be
> a more consistent wording.

Fixed.
> === 5
>
> +#include "c.h"
> +#include "postgres.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
> +#include <unistd.h>
>
> I had a closer look at other header files and it looks like it "should" be:
>
> #include "c.h"
> #include "postgres.h"
>
> #include <unistd.h>
> #ifdef WIN32
> #include <windows.h>
> #endif
>
> #include "port/pg_numa.h"
> #include "storage/pg_shmem.h"
>
> And is "#include "c.h"" really needed?

Fixed both. It seems to compile without c.h.

> === 6
>
> +/* FIXME not tested, might crash */
>
> That's a bit scary.

When you are in support for long enough, it becomes the norm ;) On a serious
note, Andres wanted to have NUMA error/warning handlers (used by libnuma),
but the current code has no realistic code path to hit them from
numa_available(3), numa_move_pages(3) or numa_max_node(3). The situation
won't change until one day in the future (I hope!) we start using some more
advanced libnuma functionality for interleaving memory; please see my
earlier reply:
https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com

E.g. numa_available(3) is a tiny wrapper, see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872

For now, I've adjusted that FIXME to XXX, but I still don't know how we could
inject a failure to see this triggered...

> === 7
>
> + * XXX: for now we issue just WARNING, but long-term that might depend on
> + * numa_set_strict() here
>
> s/here/here./

Done.

> Did not look carefully at all the comments in 0001, 0002 and 0003 though.
>
> A few random comments regarding 0002:
>
> === 8
>
> # create extension pg_buffercache;
> ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
> CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE"
> extension script file "pg_buffercache--1.2.sql", near line 7
>
> While that's ok if 0003 is applied, I think that each individual patch should
> "fully" work individually.

STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + 0002 or
0001 + 0003, but here 0002 is complaining about the lack of
pg_numa_query_pages(), which is part of 0001 (?). Should I merge those
patches or keep them separate?

> === 9
>
> + do
> + {
> +     if (os_page_ptrs[blk2page + j] == 0)
>
> blk2page + j will be repeated multiple times; better to store it in a local
> variable instead?

Done.

> === 10
>
> + if (firstUseInBackend == true)
>
> if (firstUseInBackend) instead?

Done everywhere.

> === 11
>
> + int j = 0,
> +     blk2page = (int) i * pages_per_blk;
>
> I wonder if size_t is more appropriate for blk2page:
>
> size_t blk2page = (size_t)(i * pages_per_blk) maybe?

Sure, done.

> === 12
>
> As we know that we'll iterate until pages_per_blk, would a for loop be more
> appropriate here? Something like:
>
> "
> for (size_t j = 0; j < pages_per_blk; j++)
> {
>     if (os_page_ptrs[blk2page + j] == 0)
>     {
> "

Sure.

> === 13
>
> + if (buf_state & BM_DIRTY)
> +     fctx->record[i].isdirty = true;
> + else
> +     fctx->record[i].isdirty = false;
>
> fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
>
> === 14
>
> + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> +     fctx->record[i].isvalid = true;
> + else
> +     fctx->record[i].isvalid = false;
>
> fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> maybe?

It is coming from the original pg_buffercache and it is less readable to me,
so I don't want to touch that too much, because that might open the
refactoring doors too wide.

> === 15
>
> +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
>
> I wonder if "i" should get a more descriptive name?

Done, buffer_id.

> === 16
>
> s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> with pg_stat_io_build_tuples for example).
>
> I now realize that I did propose populate_buffercache_entry() up-thread,
> sorry for changing my mind.

OK, I've tried to come up with a better name, but all my ideas were too long;
pg_buffercache_build_tuple sounds nice, so let's use that.

> === 17
>
> + static bool firstUseInBackend = true;
>
> maybe we should give it a more descriptive name?

I couldn't come up with anything that wouldn't look too long, so instead I've
added a comment explaining the meaning behind this variable; I hope that's
good enough.

> Also I need to think more about how firstUseInBackend is used, for example:
>
> ==== 17.1
>
> would it be better to define it as a file scope variable? (at least that
> would avoid to pass it as an extra parameter in some functions).

I have no strong opinion on this, but I have one doubt (for the future):
isn't creating global variables making life harder for the upcoming
multithreading guys?

> === 17.2
>
> what happens if firstUseInBackend is set to false and later on the pages
> are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> called again by a backend that already had set firstUseInBackend to false,
> would that provide accurate information?

It is still the correct result.
That "touching" (paging-in) is probably only necessary to properly resolve
PTEs, as fork() does not seem to carry them over from the parent:

postgres=# select 'create table foo' || s || ' as select generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id; -- before:
 numa_zone_id |  count
--------------+---------
            0 |     256
            1 |    4131
              | 8384221

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           1451

-- now use another root (!) session: "migratepages(8)" from numactl will also
-- shift the shm segment:
# migratepages 1451 1 3

-- while the above will be in progress for a long time, the outcome is visible
-- much faster in that backend (pg is functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id;
 numa_zone_id |  count
--------------+---------
            0 |     256
            3 |    4480
              | 8383872

So the above clearly shows that the initial touching of shm is required, but
just once, and it stays valid afterwards.

BTW: migratepages(8) was stuck for 1-2 minutes there on "__wait_rcu_gp"
according to its wchan, without any sign of activity on the OS, and then out
of the blue completed just fine; s_b=64GB, HP=on.

> === 18
>
> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> +    FuncCallContext *funcctx;
> +    BufferCachePagesContext *fctx;    /* User function context. */
> +    bool query_numa = true;
>
> I don't think we need query_numa anymore in pg_buffercache_numa_pages().
>
> I think that we can just ERROR (or NOTICE and return) here:
>
> + if (pg_numa_init() == -1)
> + {
> +     elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable.");
> +     query_numa = false;
>
> and fully get rid of query_numa.

Right...
fixed it here, but it made the tests blow up, so I've had to find a way to
conditionally launch tests based on NUMA availability, and that's how
pg_numa_available() was born. It's in ipc/shmem.c because I couldn't find a
better place for it...

> === 19
>
> And then here:
>
> for (i = 0; i < NBuffers; i++)
> {
>     populate_buffercache_entry(i, fctx);
>     if (query_numa)
>         pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
> }
>
> if (query_numa)
> {
>     if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
>         elog(ERROR, "failed NUMA pages inquiry: %m");
>
>     for (i = 0; i < NBuffers; i++)
>     {
>         int blk2page = (int) i * pages_per_blk;
>
>         /*
>          * Technically we can get errors too here and pass that to
>          * user. Also we could somehow report single DB block spanning
>          * more than one NUMA zone, but it should be rare.
>          */
>         fctx->record[i].numa_zone_id = os_pages_status[blk2page];
>     }
>
> maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper that
prepares **os_page_ptrs, which is then used by pg_numa_query_pages(), and
then we fill the data. But now, after removing query_numa, it reads smoother
anyway. Can you please take a look at this again; is it up to the project
standards?

> A few random comments regarding 0003:
>
> === 20
>
> +      The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
>
> s/pg_shmem_allocations/pg_shmem_numa_allocations/?

Fixed.

> === 21
>
> +        /* FIXME: should we release LWlock here ? */
> +        elog(ERROR, "failed NUMA pages inquiry status: %m");
>
> There is no need, see src/backend/storage/lmgr/README.

Thanks, fixed.

> === 22
>
> +#define MAX_NUMA_ZONES 32 /* FIXME? */
> +    Size        zones[MAX_NUMA_ZONES];
>
> could we rely on pg_numa_get_max_node() instead?

Sure, done.

> === 23
>
> +        if (s >= 0)
> +            zones[s]++;
>
> should we also check that s is below a limit?
STILL OPEN QUESTION: I'm not sure it would give us value to report e.g. -2
per shm entry / per NUMA node entry, would it? If it would, we could somehow
overallocate that array and remember the negative ones too.

> === 24
>
> Regarding how we make use of pg_numa_get_max_node(), are we sure there are
> no possible holes? I mean, could a system have nodes 0, 1 and 3 but not 2?

I have never seen a hole in the numbering, as it is established during boot,
and the only way to get close to adjusting it would be taking processor
books (CPU and RAM together) offline. Even *if* someone were doing some
preventive hardware maintenance, that still wouldn't hurt, as we are just
using the id of the zone to display it -- the max would already be higher. I
mean, technically one could use lsmem(1) (it mentions a removable flag,
after the pages and processes have been migrated away), then use chcpu(1) to
--disable CPUs on that zone (to prevent new ones coming there and allocating
new local memory there), and then offline that memory region via chmem(1)
--disable. Even with all of that, this still shouldn't cause issues for this
code, I think, because numa_max_node() says it "returns the >>highest node
number<< available on the current system".

> Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> did a little in 0001 though)

That was an attempt to say "Thank You". OK, I've aligned it that way, so you
are still referenced in 0001. I wouldn't find the motivation to work on this
if you didn't respond to those emails ;)

There is one known issue: CI returned, for the numa.out test, something that
I need to get to the bottom of (it does not reproduce for me locally) inside
numa.out/sql:

SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING:  detected write past chunk end in ExprContext 0x55bb7f1a8f90

-J.
Attachment
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> [...]

Hi, ok, so v11 should fix this last problem and make cfbot happy.

-J.
Attachment
Hi,

On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote:
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>
> > Thanks for the new version!
>
> v10 is attached with most fixes after review and one new thing
> introduced: pg_numa_available() for run-time decision inside tests
> which was needed after simplifying code a little bit as you wanted.
> I've also fixed -Dlibnuma=disabled as it was not properly implemented.
> There are couple open items (minor/decision things), but most is fixed
> or answered:

Thanks for the new version!

> > === 2
> >
> > +else
> > +  as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
>
> TBH, I have no idea, the header may exist but it may not link e.g. when
> cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> between meson and autoconf?

Yeah, the idea was to have both in sync.

> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR:  could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
> > CONTEXT:  SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's ok if 0003 is applied, I think that each individual patch should
> > "fully" work individually.
>
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> patches or keep them separate?

Applying 0001 + 0002 only does produce the issue above.
I think that each sub-patch, once applied, should pass make check-world; that
means:

0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass

> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> > ==== 17.1
> >
> > would it be better to define it as a file scope variable? (at least that
> > would avoid to pass it as an extra parameter in some functions).
>
> I have no strong opinion on this, but I have one doubt (for future):
> isn't creating global variables making life harder for upcoming
> multithreading guys ?

That would be "just" one more. I think it's better to use it to avoid the
"current" code using "useless" function parameters.

> > === 17.2
> >
> > what happens if firstUseInBackend is set to false and later on the pages
> > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> > called again by a backend that already had set firstUseInBackend to false,
> > would that provide accurate information?
>
> It is still the correct result. That "touching" (paging-in) is only
> necessary probably to properly resolve PTEs as the fork() does not
> seem to carry them over from parent:
>
> So the above clearly shows that initial touching of shm is required,
> but just once and it stays valid afterwards.

Great, thanks for the demo and the explanation!

> > === 19
> >
> > Can you please take a look again on this

Sure, will do.

> > === 23
> >
> > +        if (s >= 0)
> > +            zones[s]++;
> >
> > should we also check that s is below a limit?
>
> STILL OPEN QUESTION: I'm not sure it would give us value to report
> e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> could somehow overallocate that array and remember negative ones too.

I meant to say: ensure that it is below the max node number.

> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there are
> > no possible holes? I mean, could a system have nodes 0, 1 and 3 but not 2?
> > I have never seen a hole in numbering as it is established during > boot, and the only way that could get close to adjusting it could be > making processor books (CPU and RAM together) offline. Yeah probably. > Even with all of that, that still > shouldn't cause issues for this code I think, because > `numa_max_node()` says it `returns the >> highest node number <<< > available on the current system.` Yeah, I agree, thanks for clearing my doubts on it. > > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I > > did a little in 0001 though) > > That was an attempt to say "Thank You", Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-) I'll have a look at v11 soon. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Mar 13, 2025 at 02:15:14PM +0000, Bertrand Drouvot wrote: > > > === 19 > > > > > > > Can you please take a look again on this > > Sure, will do. > I'll have a look at v11 soon. About 0001: === 1 git am produces: .git/rebase-apply/patch:378: new blank line at EOF. + .git/rebase-apply/patch:411: new blank line at EOF. + warning: 2 lines add whitespace errors. === 2 + Returns true if a <acronym>NUMA</acronym> support is available. What about "Returns true if the server has been compiled with NUMA support"? === 3 +Datum +pg_numa_available(PG_FUNCTION_ARGS) +{ + if(pg_numa_init() == -1) + PG_RETURN_BOOL(false); + PG_RETURN_BOOL(true); +} What about PG_RETURN_BOOL(pg_numa_init() != -1)? Also I wonder if pg_numa.c would not be a better place for it. === 4 +{ oid => '5102', descr => 'Is NUMA compilation available?', + proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool', + proargtypes => '', prosrc => 'pg_numa_available' }, + Not sure that 5102 is a good choice. src/include/catalog/unused_oids is telling me: Best practice is to start with a random choice in the range 8000-9999. Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here) So maybe use 9685 instead? === 5 @@ -12477,3 +12481,4 @@ prosrc => 'gist_stratnum_common' }, ] + garbage? === 6 Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c and src/port/pg_numa.c. === 7 +Size +pg_numa_get_pagesize(void) +{ + Size os_page_size = sysconf(_SC_PAGESIZE); + if (huge_pages_status == HUGE_PAGES_ON) + GetHugePageSize(&os_page_size, NULL); + return os_page_size; +} I think that's more usual to: Size pg_numa_get_pagesize(void) { Size os_page_size = sysconf(_SC_PAGESIZE); if (huge_pages_status == HUGE_PAGES_ON) GetHugePageSize(&os_page_size, NULL); return os_page_size; } I think that makes sense to check huge_pages_status as you did. === 8 +extern void numa_warn(int num, char *fmt,...) 
pg_attribute_printf(2, 3); +extern void numa_error(char *where); I wonder if that would make sense to add a comment mentioning github.com/numactl/numactl/blob/master/libnuma.c here. I still need to look at 0002 and 0003. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: Hi, Thank you very much for the review! I'm answering both reviews in one go and the result is the attached v12; it all should be solved now: > > > === 2 > > > > > > +else > > > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > > > +fi > > > > > > Maybe we should add the same test (checking for numa.h) for meson? > > > > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when > > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync > > between meson and autoconf? > > Yeah, idea was to have both in sync. Added. > > > === 8 > > > > > > # create extension pg_buffercache; > > > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > > > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > > > RETURNS SETOF RECORD > > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > > > LANGUAGE C PARALLEL SAFE" > > > extension script file "pg_buffercache--1.2.sql", near line 7 > > > > > > While that's ok if 0003 is applied. I think that each individual patch should > > > "fully" work individually. > > > > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + > > 0002 or 0001 + 0003, but here 0002 is complaining about lack of > > pg_numa_query_pages() which is part of 0001 (?) Should I merge those > > patches or keep them separate? > > Applying 0001 + 0002 only does produce the issue above. I think that each > sub-patch once applied should pass make check-world, that means: > > 0001 : should pass > 0001 + 0002 : should pass > 0001 + 0002 + 0003 : should pass OK, I've retested v11 for all three of them. It worked fine (I think in v10 I've moved one function to 0001, but pg_numa_query_pages() as per the error message above was always in 0001). 
> > > Also I need to think more about how firstUseInBackend is used, for example: > > > > > > ==== 17.1 > > > > > > would it be better to define it as a file scope variable? (at least that > > > would avoid to pass it as an extra parameter in some functions). > > > > I have no strong opinion on this, but I have one doubt (for future): > > isn't creating global variables making life harder for upcoming > > multithreading guys ? > > That would be "just" one more. I think it's better to use it to avoid the "current" > code using "useless" function parameters. Done. > > > === 23 > > > > > > + if (s >= 0) > > > + zones[s]++; > > > > > > should we also check that s is below a limit? > > > > STILL OPEN QUESTION: I'm not sure it would give us value to report > > e.g. -2 on per shm entry/per numa node entry, would it? If it would we > > could somehow overallocate that array and remember negative ones too. > > I meant to say, ensure that it is below the max node number. Done, though I doubt the kernel would return a value higher than numa_max_node(), but who knows. Additional defense for this array is now there. > SECOND REVIEW//v11-0001 review > > === 1 > > git am produces: > > .git/rebase-apply/patch:378: new blank line at EOF. > + > .git/rebase-apply/patch:411: new blank line at EOF. > + > warning: 2 lines add whitespace errors. Should be gone, but in at least one case (0003/numa.out) we need to keep the empty line at EOF, because otherwise the expected tests don't pass (even though numa.sql itself doesn't have one). > === 2 > > + Returns true if a <acronym>NUMA</acronym> support is available. > > What about "Returns true if the server has been compiled with NUMA support"? Done. > === 3 > > +Datum > +pg_numa_available(PG_FUNCTION_ARGS) > +{ > + if(pg_numa_init() == -1) > + PG_RETURN_BOOL(false); > + PG_RETURN_BOOL(true); > +} > > What about PG_RETURN_BOOL(pg_numa_init() != -1)? > > Also I wonder if pg_numa.c would not be a better place for it. Both done. 
> === 4 > > +{ oid => '5102', descr => 'Is NUMA compilation available?', > + proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool', > + proargtypes => '', prosrc => 'pg_numa_available' }, > + > > Not sure that 5102 is a good choice. > > src/include/catalog/unused_oids is telling me: > > Best practice is to start with a random choice in the range 8000-9999. > Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here) > > So maybe use 9685 instead? It's because I've got 5101 there earlier (for the pg_shm_numa_allocs view), but I've aligned both (5102@0001 and 5101@0003) to 968[56]. > === 5 > > @@ -12477,3 +12481,4 @@ > prosrc => 'gist_stratnum_common' }, > > ] > + > > garbage? Yea, fixed. > === 6 > > Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c > and src/port/pg_numa.c. Fixed. > === 7 > > +Size > +pg_numa_get_pagesize(void) > +{ > + Size os_page_size = sysconf(_SC_PAGESIZE); > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + return os_page_size; > +} > > I think that's more usual to: > > Size > pg_numa_get_pagesize(void) > { > Size os_page_size = sysconf(_SC_PAGESIZE); > > if (huge_pages_status == HUGE_PAGES_ON) > GetHugePageSize(&os_page_size, NULL); > > return os_page_size; > } > > I think that makes sense to check huge_pages_status as you did. Done. > === 8 > > +extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3); > +extern void numa_error(char *where); > > I wonder if that would make sense to add a comment mentioning > github.com/numactl/numactl/blob/master/libnuma.c here. Sure thing, added. -J.
Hi, On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote: > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > Thank you very much for the review! I'm answering both reviews in > one go and the result is the attached v12; it all should be solved > now: Thanks for v12! I'll review 0001 and 0003 later, but want to share what I've done for 0002. I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply on top of v11 0002 (I just rebased it a bit so that it now applies on top of v12 0002). The main changes are: === 1 Some doc wording changes. === 2 Some comments, pgindent and friends changes. === 3 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, - pinning_backends int4, numa_zone_id int4); + pinning_backends int4, zone_id int4); I changed numa_zone_id to zone_id as we are already in "numa" related functions and/or views. === 4 - CHECK_FOR_INTERRUPTS(); } + + CHECK_FOR_INTERRUPTS(); I think that it's better to put the CFI outside of the if condition, so that it can be called for every page. === 5 -pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx) +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) for clarity. === 6 -get_buffercache_tuple(int i, BufferCachePagesContext *fctx) +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) for clarity. === 7 - int os_page_count = 0; + uint64 os_page_count = 0; I think that's better. === 8 But then, here, we need to change its format to %lu and to cast to (unsigned long) to make the CI CompilerWarnings happy. That's fine because we don't provide NUMA support for 32 bits anyway. 
- elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f", - os_page_count, os_page_size, pages_per_blk); + elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f", + (unsigned long) os_page_count, os_page_size, pages_per_blk); === 9 -static bool firstUseInBackend = true; +static bool firstNumaTouch = true; Looks better to me but still not 100% convinced by the name. === 10 static BufferCachePagesContext * -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS) +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo) as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are a lot of examples in the code that make use of "FunctionCallInfo" for non fmgr-compatible functions. and also: === 11 I don't like the fact that we iterate 2 times over NBuffers in pg_buffercache_numa_pages(). But I'm having a hard time finding a better approach given the fact that pg_numa_query_pages() needs all the pointers "prepared" before it can be called... Those 2 loops are probably the best approach, unless someone has a better idea. === 12 Upthread you asked "Can you please take a look again on this, is this up to the project standards?" Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper? What do you think? The comments, doc and code changes are just proposals and are fully open to discussion. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote: > > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > Thank you very much for the review! I'm answering both reviews in > > one go and the result is the attached v12; it all should be solved > > now: > > Thanks for v12! > > I'll review 0001 and 0003 later, but want to share what I've done for 0002. > > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply > on top of v11 0002 (I just rebased it a bit so that it now applies on top of > v12 0002). Hey Bertrand, all LGTM (good ideas), so here's v13 attached with all of that applied (rebased, tested). BTW: I'm sending this to make the cfbot happy, as it still tried to apply that .patch (on my side it's .patch, not .txt) > === 9 > > -static bool firstUseInBackend = true; > +static bool firstNumaTouch = true; > > Looks better to me but still not 100% convinced by the name. IMHO, yes, it looks much better. > === 10 > > static BufferCachePagesContext * > -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS) > +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo) > > as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and there are > a lot of examples in the code that make use of "FunctionCallInfo" for non > fmgr-compatible functions. Cool, thanks. > and also: > > === 11 > > I don't like the fact that we iterate 2 times over NBuffers in > pg_buffercache_numa_pages(). > > But I'm having a hard time finding a better approach given the fact that > pg_numa_query_pages() needs all the pointers "prepared" before it can be called... > > Those 2 loops are probably the best approach, unless someone has a better idea. IMHO, it doesn't hurt and I've also not been able to think of any better idea. 
> === 12 > > Upthread you asked "Can you please take a look again on this, is this up to the > project standards?" > > Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper? Yes, this was for an earlier doubt regarding question "19" about reviewing the code after the removal of the `query_numa` variable. This is the same code for 2 loops, IMHO it is good now. > What do you think? The comments, doc and code changes are just proposals and are > fully open to discussion. They are great, thank you! -J.
Hi, On Mon, Mar 17, 2025 at 08:28:46AM +0100, Jakub Wartak wrote: > On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply > > on top of v11 0002 (I just rebased it a bit so that it now applies on top of > > v12 0002). > > Hey Bertrand, > > all LGTM (good ideas), so here's v13 attached with all of that applied > (rebased, tested). Thanks for v13! Looking at 0003: === 1 + <entry>NUMA mappings for shared memory allocations</entry> s/NUMA mappings/NUMA node mappings/ maybe? === 2 + <para> + The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes + assigned allocations made from the server's main shared memory segment. What about? " shows how shared memory allocations in the server's main shared memory segment are distributed across NUMA nodes" ? === 3 + <structfield>numa_zone_id</structfield> <type>int4</type> s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in 0002. BTW, I wonder if "node_id" would be better (to match the descriptions...). If so, would also need to be done in 0002. === 4 + ID of NUMA node <acronym>NUMA</acronym> node ID? (to be consistent with 0002). === 5 +static bool firstUseInBackend = true; Let's use firstNumaTouch to be consistent with 0002. === 6 + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");; There are 2 ";" + I think that we should use the same wording as in pg_buffercache_numa_pages(). === 7 What about using ERROR instead? (like in pg_buffercache_numa_pages()) === 8 + /* + * This is for gathering some NUMA statistics. We might be using various + * DB block sizes (4kB, 8kB , .. 
32kB) that end up being allocated in + * various different OS memory pages sizes, so first we need to understand + * the OS memory page size before calling move_pages() + */ + os_page_size = pg_numa_get_pagesize(); Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling pg_numa_get_pagesize()? === 9 + max_zones = pg_numa_get_max_node(); I think we are mixing "zone" and "node". I think we should standardize on one and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to vote for node, but zone is fine too if you prefer. === 10 + /* + * Preallocate memory all at once without going into details which shared + * memory segment is the biggest (technically min s_b can be as low as + * 16xBLCKSZ) + */ What about? " Allocate memory for page pointers and status based on total shared memory size. This simplified approach allocates enough space for all pages in shared memory rather than calculating the exact requirements for each segment. " instead? === 11 + int shm_total_page_count, + shm_ent_page_count, I think those 2 should be uint64. === 12 + /* + * XXX: We are ignoring in NUMA version reporting of the following regions + * (compare to pg_get_shmem_allocations() case): 1. output shared memory + * allocated but not counted via the shmem index 2. output as-of-yet + * unused shared memory + */ why XXX? what about? " We are ignoring the following memory regions (as compared to pg_get_shmem_allocations()).... === 13 + page_ptrs = palloc(sizeof(void *) * shm_total_page_count); + memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count); maybe we could use palloc0() here? === 14 and I realize that we could probably use it in 0002 for os_page_ptrs. === 15 I think there is still some multi-lines comments that are missing a period. I probably also missed some in 0002 during the previous review. I think that's worth another check. 
=== 16 + * In order to get reliable results we also need to touch memory + * pages so that inquiry about NUMA zone doesn't return -2. + */ maybe use the same wording as in 0002? === 17 The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count but (as for 0002) it looks like we can not avoid it (or at least I don't see a way to avoid it). I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is updated. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Thanks for v13! Rebased, with fixes inside, in the attached v14 (it passes CI too): > Looking at 0003: > > === 1 > > + <entry>NUMA mappings for shared memory allocations</entry> > > s/NUMA mappings/NUMA node mappings/ maybe? Done. > === 2 > > + <para> > + The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes > + assigned allocations made from the server's main shared memory segment. > > What about? > > " > shows how shared memory allocations in the server's main shared memory segment > are distributed across NUMA nodes" ? Done. > === 3 > > + <structfield>numa_zone_id</structfield> <type>int4</type> > > s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in > 0002. > > BTW, I wonder if "node_id" would be better (to match the descriptions...). > If so, would also need to be done in 0002. Somewhat of a duplicate, please see the answer for #9 > === 4 > > + ID of NUMA node > > <acronym>NUMA</acronym> node ID? (to be consistent with 0002). > > === 5 > > +static bool firstUseInBackend = true; > > Let's use firstNumaTouch to be consistent with 0002. Done. > === 6 > > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");; > > There are 2 ";" + I think that we should use the same wording as in > pg_buffercache_numa_pages(). > > === 7 > > What about using ERROR instead? (like in pg_buffercache_numa_pages()) Both are synced now. > === 8 > > + /* > + * This is for gathering some NUMA statistics. We might be using various > + * DB block sizes (4kB, 8kB , .. 
32kB) that end up being allocated in > + * various different OS memory pages sizes, so first we need to understand > + * the OS memory page size before calling move_pages() > + */ > + os_page_size = pg_numa_get_pagesize(); > > Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling > pg_numa_get_pagesize()? Done, improved the style of the comment there and synced the pg_buffercache one to the shmem.c one. > === 9 > > + max_zones = pg_numa_get_max_node(); > > I think we are mixing "zone" and "node". I think we should standardize on one > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to > vote for node, but zone is fine too if you prefer. Given that numa(7) does not use the "zone" keyword at all and both /proc/zoneinfo and /proc/pagetypeinfo show that NUMA nodes are split into zones, we can conclude that "zone" is simply a subdivision within a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without internal information about zones), therefore we should stay away from using "zone" within the patch at all, as we are just fetching NUMA node info. My bad, it's a terminology error on my side from the start - I probably saw "zone" info in some command output back then when we had that workshop and started using it and somehow it propagated through the patchset up to this day... I've adjusted it all and settled on the "numa_node_id" column name. > === 10 > > + /* > + * Preallocate memory all at once without going into details which shared > + * memory segment is the biggest (technically min s_b can be as low as > + * 16xBLCKSZ) > + */ > > What about? > > " > Allocate memory for page pointers and status based on total shared memory size. > This simplified approach allocates enough space for all pages in shared memory > rather than calculating the exact requirements for each segment. > " instead? Done. 
> === 11 > > + int shm_total_page_count, > + shm_ent_page_count, > > I think those 2 should be uint64. Right... > === 12 > > + /* > + * XXX: We are ignoring in NUMA version reporting of the following regions > + * (compare to pg_get_shmem_allocations() case): 1. output shared memory > + * allocated but not counted via the shmem index 2. output as-of-yet > + * unused shared memory > + */ > > why XXX? > > what about? > > " > We are ignoring the following memory regions (as compared to > pg_get_shmem_allocations()).... Fixed, it was apparently a leftover from when I was thinking if we should still report it. > === 13 > > + page_ptrs = palloc(sizeof(void *) * shm_total_page_count); > + memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count); > > maybe we could use palloc0() here? Of course!++ > === 14 > > and I realize that we could probably use it in 0002 for os_page_ptrs. Of course!++ > === 15 > > I think there is still some multi-lines comments that are missing a period. I > probably also missed some in 0002 during the previous review. I think that's > worth another check. Please do such a check, I've tried pgindent on all .c files, but I'm apparently blind to such issues. BTW, if a patch has anything left that pgindent would fix, is that not picked up by CI but picked up by the buildfarm? > === 16 > > + * In order to get reliable results we also need to touch memory > + * pages so that inquiry about NUMA zone doesn't return -2. > + */ > > maybe use the same wording as in 0002? But 0002 used: "In order to get reliable results we also need to touch memory pages, so that inquiry about NUMA zone doesn't return -2 (which indicates unmapped/unallocated pages)" or are you looking at something different? > === 17 > > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count > but (as for 0002) it looks like we can not avoid it (or at least I don't see > a way to avoid it). Hm, it's literally debug code. 
Why would we care so much if it is 2 loops rather than 1? (as stated earlier we need to pack ptrs and then analyze it) > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is > updated. Cool, thanks in advance. -J.
Hi, On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote: > On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for v13! > > Rebased, with fixes inside, in the attached v14 (it passes CI too): Thanks! > > === 9 > > > > + max_zones = pg_numa_get_max_node(); > > > > I think we are mixing "zone" and "node". I think we should standardize on one > > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to > > vote for node, but zone is fine too if you prefer. > > Given that numa(7) does not use the "zone" keyword at all and both > /proc/zoneinfo and /proc/pagetypeinfo show that NUMA nodes are split > into zones, we can conclude that "zone" is simply a subdivision within > a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA, > ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without > internal information about zones), therefore we should stay away from > using "zone" within the patch at all, as we are just fetching NUMA > node info. My bad, it's a terminology error on my side from the start - > I probably saw "zone" info in some command output back then when we > had that workshop and started using it and somehow it propagated > through the patchset up to this day... Thanks for the explanation. > I've adjusted it all and settled on the "numa_node_id" column name. Yeah, I can see, things like: + <para> + The definitions of the columns exposed are identical to the + <structname>pg_buffercache</structname> view, except that this one includes + one additional <structfield>numa_node_id</structfield> column as defined in + <xref linkend="pgbuffercache-numa-columns"/>. and like: + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>numa_size</structfield> <type>int4</type> + </para> I think that you re-introduced the "numa_" in the column(s) name that we got rid of (or agreed to get rid of) previously. 
I think that we can get rid of the "numa_" stuff in column(s) name as the column(s) are part of "numa" relation views/output anyway. I think "node_id", "size" as column(s) name should be enough. Or maybe that re-adding "numa_" was intentional? > > === 15 > > > > I think there is still some multi-lines comments that are missing a period. I > > probably also missed some in 0002 during the previous review. I think that's > > worth another check. > > Please do such a check, Found much more: + * In order to get reliable results we also need to touch memory pages, so that + * inquiry about NUMA memory node doesn't return -2 (which indicates + * unmapped/unallocated pages) is missing the period and + /* + * Switch context when allocating stuff to be used in later calls + */ should be as before, meaning on current HEAD: /* Switch context when allocating stuff to be used in later calls */ and + /* + * Return to original context when allocating transient memory + */ should be as before, meaning on current HEAD: /* Return to original context when allocating transient memory */ and + /* + * Note if the buffer is valid, and has storage created + */ should be as before, meaning on current HEAD: /* Note if the buffer is valid, and has storage created */ and + /* + * unused for v1.0 callers, but the array is always long enough + */ should be as before, meaning on current HEAD: /* unused for v1.0 callers, but the array is always long enough */ and +/* + * This is almost identical to the above, but performs + * NUMA inuqiry about memory mappings is missing the period and + * To correctly map between them, we need to: - Determine the OS + * memory page size - Calculate how many OS pages are used by all + * buffer blocks - Calculate how many OS pages are contained within + * each database block is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page 
mapping code here is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * We are ignoring the following memory regions (as compared to + * pg_get_shmem_allocations()): 1. output shared memory allocated but not + * counted via the shmem index 2. output as-of-yet unused shared memory is missing the period > I've tried pgindent on all .c files, but I'm > apparently blind to such issues. I don't think pgindent would report a missing period. > BTW, if a patch has anything left that > pgindent would fix, is that not picked up by CI but picked up by the > buildfarm? I think it has to be done manually before each commit and that this is anyway done at least once per release cycle. > > === 16 > > > > + * In order to get reliable results we also need to touch memory > > + * pages so that inquiry about NUMA zone doesn't return -2. > > + */ > > > > maybe use the same wording as in 0002? > > But 0002 used: > > "In order to get reliable results we also need to touch memory pages, so that > inquiry about NUMA zone doesn't return -2 (which indicates > unmapped/unallocated > pages)" > > or are you looking at something different? Nope, I meant to say that it could make sense to have the exact same comment. > > === 17 > > > > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count > > but (as for 0002) it looks like we can not avoid it (or at least I don't see > > a way to avoid it). > > Hm, it's literally debug code. Why would we care so much if it is 2 > loops rather than 1? (as stated earlier we need to pack ptrs and then > analyze it) Yeah, but if we could just loop one time I'm pretty sure we'd have done that. > > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is > > updated. > > Cool, thanks in advance. 0001 looks in good shape from my point of view. For 0002: === 1 I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could deserve their own patch. 
That would ease the review for the "real" numa stuff. Maybe something like:

0001 : as it is
0002 : introduces (and uses) pg_buffercache_init_entries() and pg_buffercache_build_tuple()
0003 : current 0002 attached minus 0002 above

We did it that way in c2a50ac678e and ff7c40d7fd6 for example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: Hi! v15 attached, rebased, CI-tested, all fixes incorporated. > > I've adjusted it all and settled on "numa_node_id" column name. > [...] > I think that we can get rid of the "numa_" stuff in column(s) name as > the column(s) are part of "numa" relation views/output anyway. [...] Done, you are probably right (it was done to keep consistency between those two views), I'm just not that strongly attached to the naming. > > Please do such a check, > > Found much more: > [.. 9 issues with missing dots at the end of sentences in comments + fixes to comment structure in relation to HEAD..] All fixed. > > BTW, if a patch has anything left that > > pgindent would fix, is that not picked up by CI but picked up by the > > buildfarm? > > I think it has to be done manually before each commit and that this is anyway > done at least once per release cycle. OK, thanks for the clarification. [..] > > > > But 0002 used: > > > > "In order to get reliable results we also need to touch memory pages, so that > > inquiry about NUMA zone doesn't return -2 (which indicates > > unmapped/unallocated > > pages)" > > > > or are you looking at something different? > > Nope, I meant to say that it could make sense to have the exact same comment. Synced those two. [..] > > 0001 looks in good shape from my point of view. Cool! > For 0002: > > === 1 > > I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could > deserve their own patch. That would ease the review for the "real" numa stuff. > Done, 0001+0002 alone passes the meson test. -J.
Attachment
Hi, Thank you for working on this! On Wed, 19 Mar 2025 at 12:06, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Hi! v15 attached, rebased, CI-tested, all fixed incorporated. This needs to be rebased after 8eadd5c73c. -- Regards, Nazir Bilal Yavuz Microsoft
On Thu, Mar 27, 2025 at 12:31 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Hi, > > Thank you for working on this! > > On Wed, 19 Mar 2025 at 12:06, Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > > > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi! v15 attached, rebased, CI-tested, all fixed incorporated. > > This needs to be rebased after 8eadd5c73c. Hi Nazir, thanks for spotting! I've not connected the dots with AIO going in and my libnuma dependency blowing up... Attached is rebased v16 that passed my CI run. -J.
Attachment
Hello I think you should remove numa_warn() and numa_error() from 0001. AFAICS they are dead code (even with all your patches applied), and furthermore would get you in trouble regarding memory allocation because src/port is not allowed to use palloc et al. If you wanted to keep them you'd have to have them in src/common, but looking at the rest of the code in that patch, ISTM src/port is the right place for it. If in the future you discover that you do need numa_warn(), you can create a src/common/ file for it then. Is pg_buffercache really the best place for these NUMA introspection routines? I'm not saying that it isn't, maybe we're okay with that (particularly if we can avoid duplicated code), but it seems a bit weird to me. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia)
Hi, On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > setup_additional_packages_script: | > - #apt-get update > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > + apt-get update > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > + libnuma1 \ > + libnuma-dev I think libnuma is installed on the relevant platforms, so you shouldn't need to install it manually. > +# > +# libnuma > +# > +AC_MSG_CHECKING([whether to build with libnuma support]) > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], Most other dependencies say "build with libxyz ..." > +/*------------------------------------------------------------------------- > + * > + * pg_numa.h > + * Basic NUMA portability routines > + * > + * > + * Copyright (c) 2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/include/port/pg_numa.h > + * > + *------------------------------------------------------------------------- > + */ > +#ifndef PG_NUMA_H > +#define PG_NUMA_H > + > +#include "c.h" > +#include "postgres.h" Headers should never include either of those headers. Nor should .c files include both. > @@ -200,6 +200,8 @@ pgxs_empty = [ > > 'ICU_LIBS', > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > + > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > ] Maybe I am missing something, but are you actually defining and using those LIBNUMA_* vars anywhere? > +Size > +pg_numa_get_pagesize(void) > +{ > + Size os_page_size = sysconf(_SC_PAGESIZE); > + > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + > + return os_page_size; > +} Should this have a comment or an assertion that it can only be used after shared memory startup? Because before that huge_pages_status won't be meaningful? > +#ifndef FRONTEND > +/* > + * XXX: not really tested as there is no way to trigger this in our > + * current usage of libnuma. 
> + * > + * The libnuma built-in code can be seen here: > + * https://github.com/numactl/numactl/blob/master/libnuma.c > + * > + */ > +void > +numa_warn(int num, char *fmt,...) > +{ > + va_list ap; > + int olde = errno; > + int needed; > + StringInfoData msg; > + > + initStringInfo(&msg); > + > + va_start(ap, fmt); > + needed = appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + if (needed > 0) > + { > + enlargeStringInfo(&msg, needed); > + va_start(ap, fmt); > + appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + } > + > + ereport(WARNING, > + (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION), > + errmsg_internal("libnuma: WARNING: %s", msg.data))); I think you would at least have to hold interrupts across this, as ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump out of libnuma in case an interrupt has arrived. > +Size > +pg_numa_get_pagesize(void) > +{ > +#ifndef WIN32 > + Size os_page_size = sysconf(_SC_PAGESIZE); > +#else > + Size os_page_size; > + SYSTEM_INFO sysinfo; > + > + GetSystemInfo(&sysinfo); > + os_page_size = sysinfo.dwPageSize; > +#endif > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + return os_page_size; > +} I would probably implement this once, outside of the big ifdef, with one more ifdef inside, given that you're sharing the same implementation. > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > Date: Wed, 19 Mar 2025 09:34:56 +0100 > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > primary function to separate functions. > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > and get_buffercache_tuple() that help fill result tuplestores based on the > buffercache contents. This will be used in a follow-up commit that implements > NUMA observability in pg_buffercache. 
> > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > --- > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > 1 file changed, 178 insertions(+), 139 deletions(-) > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > index 62602af1775..86e0b8afe01 100644 > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > @@ -14,7 +14,6 @@ > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > > - > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 Independent change. > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > -Datum > -pg_buffercache_pages(PG_FUNCTION_ARGS) > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > + * memory mapping inquiries to display NUMA node information for each buffer. > + */ If it's a helper routine it's probably not identical to pg_buffercache_numa_pages()? > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * Build buffer cache information for a single buffer. > + */ > +static void > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > +{ This isn't really building a tuple tuple? Seems somewhat confusing, because get_buffercache_tuple() does actually build one. > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + bufHdr = GetBufferDescriptor(record_id); > + /* Lock each buffer header before inspecting. 
*/ > + buf_state = LockBufHdr(bufHdr); > + > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); As above, I think this would be more readable if you put fctx->record[record_id] into a local var. > +static Datum > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) > +{ > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; > + HeapTuple tuple; > + > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid); > + nulls[0] = false; > + > + /* > + * Set all fields except the bufferid to null if the buffer is unused or > + * not valid. > + */ > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > + fctx->record[record_id].isvalid == false) > + { > + nulls[1] = true; > + nulls[2] = true; > + nulls[3] = true; > + nulls[4] = true; > + nulls[5] = true; > + nulls[6] = true; > + nulls[7] = true; > + > + /* unused for v1.0 callers, but the array is always long enough */ > + nulls[8] = true; I'd probably just memset the entire nulls array to either true or false, instead of doing it one-by-one. 
> + } > + else > + { > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > + nulls[1] = false; > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > + nulls[2] = false; > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > + nulls[3] = false; > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > + nulls[4] = false; > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > + nulls[5] = false; > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > + nulls[6] = false; > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > + nulls[7] = false; Seems like it would end up a lot more readable if you put fctx->record[record_id] into a local variable. Unfortunately that'd probably be best done in one more commit ahead of the rest of this one... > @@ -0,0 +1,28 @@ > +SELECT NOT(pg_numa_available()) AS skip_test \gset > +\if :skip_test > +\quit > +\endif You could avoid the need for an alternative output file if you instead made the queries do something like SELECT NOT pg_numa_available() OR count(*) ... > --- /dev/null > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > @@ -0,0 +1,42 @@ > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > + > +-- Register the new function. > +DROP VIEW pg_buffercache; > +DROP FUNCTION pg_buffercache_pages(); I don't think we can just drop a view in the upgrade script. That will fail if anybody created a view depending on pg_buffercache. (Sorry, ran out of time / energy here, I had originally just wanted to comment on the apt-get thing in the tests) Greetings, Andres Freund
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hello Good morning :) > I think you should remove numa_warn() and numa_error() from 0001. > AFAICS they are dead code (even with all your patches applied), and > furthermore would get you in trouble regarding memory allocation because > src/port is not allowed to use palloc et al. If you wanted to keep them > you'd have to have them in src/common, but looking at the rest of the > code in that patch, ISTM src/port is the right place for it. If in the > future you discover that you do need numa_warn(), you can create a > src/common/ file for it then. Understood, trimmed it out from the patch. I'm also going to respond to Andres' review within minutes, and I'll post a new version (v17) there. > Is pg_buffercache really the best place for these NUMA introspection > routines? I'm not saying that it isn't, maybe we're okay with that > (particularly if we can avoid duplicated code), but it seems a bit weird > to me. I think it is, because, as I understand it, Andres wanted to have observability per single database *page*, and to avoid code duplication we are just putting it there (it's a natural fit). Imagine looking up an 8kB btree root page being hit hard from CPUs on other NUMA nodes (this just gives the ability to see that, but you could of course also build an aggregation to get, e.g., the NUMA node balance for a single relation and so on). -J.
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, Hi Andres, > On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > > setup_additional_packages_script: | > > - #apt-get update > > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > > + apt-get update > > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > > + libnuma1 \ > > + libnuma-dev > > I think libnuma is installed on the relevant platforms, so you shouldn't need > to install it manually. Fixed. Right, you mentioned this earlier, I just didn't know when it went online. > > +# > > +# libnuma > > +# > > +AC_MSG_CHECKING([whether to build with libnuma support]) > > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], > > Most other dependencies say "build with libxyz ..." Done. > > + * pg_numa.h [..] > > +#include "c.h" > > +#include "postgres.h" > > Headers should never include either of those headers. Nor should .c files > include both. Fixed, huh, I've found the explanation: https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us > > @@ -200,6 +200,8 @@ pgxs_empty = [ > > > > 'ICU_LIBS', > > > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > > + > > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > > ] > > Maybe I am missing something, but are you actually defining and using those > LIBNUMA_* vars anywhere? OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)` in configure.ac that would set those *FLAGS. I'm a little bit lost on how to guarantee that meson stays in sync with autoconf as per project requirements - I'm trying to use past commits as a reference, but I could still get something wrong here (especially in src/Makefile.global.in) > > +Size > > +pg_numa_get_pagesize(void) [..] > > Should this have a comment or an assertion that it can only be used after > shared memory startup? Because before that huge_pages_status won't be > meaningful? Added both.
> > +#ifndef FRONTEND > > +/* > > + * XXX: not really tested as there is no way to trigger this in our > > + * current usage of libnuma. > > + * > > + * The libnuma built-in code can be seen here: > > + * https://github.com/numactl/numactl/blob/master/libnuma.c > > + * > > + */ > > +void > > +numa_warn(int num, char *fmt,...) [..] > > I think you would at least have to hold interrupts across this, as > ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump > out of libnuma in case an interrupt has arrived. At Álvaro's request I've removed it, as that code is simply unreachable/untestable, but point taken - I'm planning to re-add it, holding interrupts, in the future when we start using proper numa_interleave() one day. Anyway, please let me know if you still want to keep it as dead code. BTW, for context, why this is dead code is explained in the latter part of the [1] message (TL;DR: unless we use pinning/numa_interleave/local_alloc() we probably never reach those warning/error handlers). "Another question without an easy answer as I never hit this error from numa_move_pages(), one gets invalid stuff in *os_pages_status instead. BUT!: most of our patch just uses things that cannot fail as per libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it) and then still most of numactl stuff works as euid != 0, but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied" or same story with numactl -C 678 date. So unless we start way more heavy use of libnuma (not just for observability) there's like no point in that right now (?) Contrary to that: we can do just do variadic elog() for that, I've put some code, but no idea if that works fine..." > > +Size > > +pg_numa_get_pagesize(void) [..] > > I would probably implement this once, outside of the big ifdef, with one more > ifdef inside, given that you're sharing the same implementation. Done.
> > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Date: Wed, 19 Mar 2025 09:34:56 +0100 > > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > > primary function to separate functions. > > > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > > and get_buffercache_tuple() that help fill result tuplestores based on the > > buffercache contents. This will be used in a follow-up commit that implements > > NUMA observability in pg_buffercache. > > > > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > > --- > > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > > 1 file changed, 178 insertions(+), 139 deletions(-) > > > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > > index 62602af1775..86e0b8afe01 100644 > > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > > @@ -14,7 +14,6 @@ > > #include "storage/buf_internals.h" > > #include "storage/bufmgr.h" > > > > - > > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 > > Independent change. Fixed. > > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > > > -Datum > > -pg_buffercache_pages(PG_FUNCTION_ARGS) > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > > + * memory mapping inquiries to display NUMA node information for each buffer. 
> > + */ > > If it's a helper routine it's probably not identical to > pg_buffercache_numa_pages()? Of course not, fixed by removing that comment. > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * Build buffer cache information for a single buffer. > > + */ > > +static void > > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > > +{ > > This isn't really building a tuple tuple? Seems somewhat confusing, because > get_buffercache_tuple() does actually build one. s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g , unless someone comes up with a better name. > > + BufferDesc *bufHdr; > > + uint32 buf_state; > > + > > + bufHdr = GetBufferDescriptor(record_id); > > + /* Lock each buffer header before inspecting. */ > > + buf_state = LockBufHdr(bufHdr); > > + > > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); > > As above, I think this would be more readable if you put > fctx->record[record_id] into a local var. Done.
> > + */ > > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > > + fctx->record[record_id].isvalid == false) > > + { > > + nulls[1] = true; > > + nulls[2] = true; > > + nulls[3] = true; > > + nulls[4] = true; > > + nulls[5] = true; > > + nulls[6] = true; > > + nulls[7] = true; > > + > > + /* unused for v1.0 callers, but the array is always long enough */ > > + nulls[8] = true; > > I'd probably just memset the entire nulls array to either true or false, > instead of doing it one-by-one. Done. > > + } > > + else > > + { > > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > > + nulls[1] = false; > > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > > + nulls[2] = false; > > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > > + nulls[3] = false; > > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > > + nulls[4] = false; > > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > > + nulls[5] = false; > > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > > + nulls[6] = false; > > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > > + nulls[7] = false; > > Seems like it would end up a lot more readable if you put > fctx->record[record_id] into a local variable. Unfortunately that'd probably > be best done in one more commit ahead of the rest of this one... Done, I've put those refactoring changes into the commit already dedicated to the refactoring. For the record, Bertrand also asked for something like this, but I was somehow afraid to touch Tom's code. > > @@ -0,0 +1,28 @@ > > +SELECT NOT(pg_numa_available()) AS skip_test \gset > > +\if :skip_test > > +\quit > > +\endif > > You could avoid the need for an alternative output file if you instead made > the queries do something like > SELECT NOT pg_numa_available() OR count(*) ... ITEM REMAINING: Is this for the future or can it stay like that?
I don't have a hard opinion on this, but I've already wasted lots of cycles to discover that one can have those ".1" alternative expected result files. > > --- /dev/null > > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > > @@ -0,0 +1,42 @@ > > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > > + > > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > > + > > +-- Register the new function. > > +DROP VIEW pg_buffercache; > > +DROP FUNCTION pg_buffercache_pages(); > > I don't think we can just drop a view in the upgrade script. That will fail if > anybody created a view depending on pg_buffercache. Ugh, fixed, thanks. That must have been some leftover (we later do CREATE OR REPLACE those anyway). > (Sorry, ran out of time / energy here, i had originally just wanted to comment > on the apt-get thing in the tests) Thanks! AIO intensifies ... :) -J. [1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
Attachment
Hi, On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote: > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > > > > +Size > > > +pg_numa_get_pagesize(void) > [..] > > > > Should this have a comment or an assertion that it can only be used after > > shared memory startup? Because before that huge_pages_status won't be > > meaningful? > > Added both. Thanks for the updated version! + Assert(IsUnderPostmaster); I wonder if it would make more sense to add an assertion on huge_pages_status not being HUGE_PAGES_UNKNOWN instead (more or less as it is done in CreateSharedMemoryAndSemaphores()). === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch Once applied, I can see a mention of pg_buffercache_numa_pages() while it only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch. I think pg_buffercache_numa_pages() should not be mentioned before it's actually implemented. === 1 + bufRecord->isvalid == false) { int i; - funcctx = SRF_FIRSTCALL_INIT(); - - /* Switch context when allocating stuff to be used in later calls */ - oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); - - /* Create a user function context for cross-call persistence */ - fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext)); + for (i = 1; i <= 9; i++) + nulls[i] = true; "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM is increased to 10). In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM). It could also make sense to remove the loop and use memset() that way: " memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool)); " instead. It's done that way in some other places (hbafuncs.c for example).
=== 2 - if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM) - TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", - INT4OID, -1, 0); + if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1) + TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", + INT4OID, -1, 0); I think we should not change the "expected_tupledesc->natts" check here until v17-0003 is applied. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Hi, Hi Bertrand, happy to see you back; thanks for the review, and here's v18 attached (an ideal fit for PG18 ;)) > On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote: > > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > +Size > > > > +pg_numa_get_pagesize(void) > > [..] > > > > > > Should this have a comment or an assertion that it can only be used after > > > shared memory startup? Because before that huge_pages_status won't be > > > meaningful? > > > > Added both. > > Thanks for the updated version! > > + Assert(IsUnderPostmaster); > > I wonder if that would make more sense to add an assertion on huge_pages_status > and HUGE_PAGES_UNKNOWN instead (more or less as it is done in > CreateSharedMemoryAndSemaphores()). OK, let's have both just in case (the status is initialized to _UNKNOWN by default, so I hope you didn't have using GetConfigOption() in mind, as that would need guc.h in port?) > === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch [..] > I think pg_buffercache_numa_pages() should not be mentioned before it's actually > implemented. Right, fixed. > === 1 > [..] > "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM > is increased to 10). > > In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM). > > That could also make sense to remove the loop and use memset() that way: > > " > memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool)); > " > > instead. It's done that way in some other places (hbafuncs.c for example). Ouch, good catch.
> === 2 > > - if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM) > - TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", > - INT4OID, -1, 0); > > + if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1) > + TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", > + INT4OID, -1, 0); > > I think we should not change the "expected_tupledesc->natts" check here until > v17-0003 is applied. Right, I've moved that into 003 where it belongs, and now 002 has not a single NUMA reference. I've thrown 0001+0002 alone onto CI and it passed too. -J.
Attachment
Hi Jakub, On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote: > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Hi, > > Hi Bertrand, happy to see you back, thanks for review and here's v18 > attached (an ideal fit for PG18 ;)) Thanks for the new version! === About v18-0002 It looks in good shape to me. The helper's name might still be debatable though. I just have 2 comments: === 1 + if (bufRecord->blocknum == InvalidBlockNumber || + bufRecord->isvalid == false) It seems to me that this check could now fit in one line. === 2 + { SRF_RETURN_DONE(funcctx); + } Extra braces are not needed. === About v18-0003 === 3 I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain only the necessary changes when updating from 1.5. It means that it should "only" create the new objects (views and functions in our case) that come in v18-0003 and grant appropriate privs. Also it should mention: " \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit " and not: " +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit " The already existing pg_buffercache--1.N--1.(N+1).sql are good examples. === 4 + for (i = 0; i < NBuffers; i++) + { + int blk2page = (int) i * pages_per_blk; + I think that should be: int blk2page = (int) (i * pages_per_blk); === About v18-0004 === 5 When running: select c.name, c.size as num_size, s.size as shmem_size from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s where c.name = s.name; I can see: - pg_shmem_numa_allocations reporting the same size many times - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size Do you observe the same? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, I've spent a bit of time reviewing this. In general I haven't found anything I'd call a bug, but here's a couple comments for v18 ... Most of this is in separate "review" commits, with a couple exceptions. 1) Please update the commit messages, with proper formatting, etc. I tried to do that in the attached v19, but please go through that, add relevant details, update list of reviewers, etc. The subject should not be overly long, etc. 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd use just "libnuma support" similar to other libraries. 3) I don't think we need pg_numa.h to have this: extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS); AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why would it be necessary here? It's enough to have a prototype in .c file. 4) Improved .sgml to have acronym/productname in a couple places. 5) I don't think the comment for pg_buffercache_init_entries() is very useful. That it's helper for pg_buffercache_pages() tells me nothing about how to use it, what the arguments are, etc. 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better comment too. I mean, there's no info about what the arguments are, which arguments are input or output, etc. And it only discussed one option (block page < memory page), but what happens in the other case? The formulas with blk2page/blk2pageoff are not quite clear to me (I'm not saying it's wrong). However, it seems rather suspicious that pages_per_blk is calculated as float, and pg_buffercache_numa_prepare_ptrs() then does this: for (size_t j = 0; j < pages_per_blk; j++) { ... } I mean, isn't this vulnerable to rounding errors, which might trigger some weird behavior? If not, it'd be good to add a comment why this is fine, it confuses me a lot. I personally would probably prefer doing just integer arithmetic here. 
7) This para in the docs seems self-contradictory: <para> The <function>pg_buffercache_numa_pages()</function> provides the same information as <function>pg_buffercache_pages()</function> but is slower because it also provides the <acronym>NUMA</acronym> node ID per shared buffer entry. The <structname>pg_buffercache_numa</structname> view wraps the function for convenient use. </para> I mean, "provides the same information, but is slower because it provides different information" is strange. I understand the point, but maybe rephrase somehow? 8) Why is pg_numa_available() marked as volatile? Should not change in a running cluster, no? 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe let's not do that. <sect2 id="pgbuffercache-pg-buffercache_numa"> 10) I think it'd be good to mention/explain why move_pages is used instead of get_mempolicy - more efficient with batching, etc. This would be useful both in the commit message and before the move_pages call (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the pointers like this etc.). 11) This could use UINT64_FORMAT, instead of a cast: elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f", (unsigned long) os_page_count, os_page_size, pages_per_blk); regards -- Tomas Vondra
Attachment
- v19-0001-Add-support-for-basic-NUMA-awareness.patch
- v19-0002-pgindent.patch
- v19-0003-review.patch
- v19-0004-pg_buffercache-split-pg_buffercache_pages-into-p.patch
- v19-0005-review.patch
- v19-0006-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
- v19-0007-review.patch
- v19-0008-Add-pg_shmem_numa_allocations-to-show-NUMA-node.patch
On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,
>
> On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote:
> > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > Hi,
> >
> > Hi Bertrand, happy to see you back, thanks for the review and here's v18
> > attached (an ideal fit for PG18 ;))
>
> Thanks for the new version!

Hi Bertrand, I'll be attaching v20 when responding to the follow-up review by Tomas, to avoid double-sending it to the list, but the review findings from here are fixed in v20.

> === About v18-0002
>
> It looks in good shape to me. The helper's name might still be debatable
> though.
>
> I just have 2 comments:
>
> === 1
>
> +        if (bufRecord->blocknum == InvalidBlockNumber ||
> +            bufRecord->isvalid == false)
>
> It seems to me that this check could now fit in one line.

OK.

> === 2
>
> +        {
>              SRF_RETURN_DONE(funcctx);
> +        }
>
> Extra parentheses are not needed.

This also should be fixed in v20.

> === About v18-0003
>
> === 3
>
> I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain
> only the necessary changes when updating from 1.5. It means that it should "only"
> create the new objects (views and functions in our case) that come in v18-0003
> and grant appropriate privs.
>
> Also it should mention:
>
> "
> \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
> "
>
> and not:
>
> "
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> "
>
> The already existing pg_buffercache--1.N--1.(N+1).sql are good examples.

Hm, good find, it might be a leftover from the first attempts, where we were aiming just for adding a numa_node_id column to pg_buffercache_pages() (rather than adding a new view); all hopefully fixed.
> === 4
>
> +        for (i = 0; i < NBuffers; i++)
> +        {
> +            int        blk2page = (int) i * pages_per_blk;
> +
>
> I think that should be:
>
> int blk2page = (int) (i * pages_per_blk);

OK, but I still fail to grasp why pgindent doesn't fix this stuff on its own... I believe the original indent would fix this on its own?

> === About v18-0004
>
> === 5
>
> When running:
>
> select c.name, c.size as num_size, s.size as shmem_size
> from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
> where c.name = s.name;
>
> I can see:
>
> - pg_shmem_numa_allocations reporting a lot of times the same size
> - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
>
> Do you observe the same?

Yes, it is actually by design: the pg_shmem_numa_allocations.size is the sum of page sizes, not the size of the struct, e.g. with "order by 3 desc":

            name               | num_size | shmem_size
 ------------------------------+----------+------------
  WaitEventCustomCounterData   |     4096 |          8
  Archiver Data                |     4096 |          8
  SerialControlData            |     4096 |         16

I was even wondering if it does make sense to output the "shm pointer address" (or at least the offset) there, to see which shm structures are on the same page, but we have the same in pg_shmem_allocations already.

-J.
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I've spent a bit of time reviewing this. In general I haven't found
> anything I'd call a bug, but here's a couple comments for v18 ... Most
> of this is in separate "review" commits, with a couple exceptions.

Hi, thank you very much for the help on this; I did not anticipate this patch growing organically like that... I've squashed those review findings into v20 and provided answers for the "review:" items.

> 1) Please update the commit messages, with proper formatting, etc. I
> tried to do that in the attached v19, but please go through that, add
> relevant details, update list of reviewers, etc. The subject should not
> be overly long, etc.

Fixed by you.

> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
> use just "libnuma support" similar to other libraries.

Fixed by you.

> 3) I don't think we need pg_numa.h to have this:
>
>     extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>
> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
> would it be necessary here? It's enough to have a prototype in .c file.

Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.

> 4) Improved .sgml to have acronym/productname in a couple places.

Great.

> 5) I don't think the comment for pg_buffercache_init_entries() is very
> useful. That it's helper for pg_buffercache_pages() tells me nothing
> about how to use it, what the arguments are, etc.

I've added an explanation (in 0003 though), so that this is covered. I've always assumed that 'static' functions don't need that much of that(?)

> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
> comment too. I mean, there's no info about what the arguments are, which
> arguments are input or output, etc. And it only discussed one option
> (block page < memory page), but what happens in the other case?
> The formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
> saying it's wrong).
>
> However, it seems rather suspicious that pages_per_blk is calculated as
> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>
>     for (size_t j = 0; j < pages_per_blk; j++)
>     { ... }
>
> I mean, isn't this vulnerable to rounding errors, which might trigger
> some weird behavior? If not, it'd be good to add a comment why this is
> fine, it confuses me a lot. I personally would probably prefer doing
> just integer arithmetic here.

Please bear with me: if you set client_min_messages to debug1, then pg_buffercache_numa will dump:

a) without HP: DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.00
b) with HP (2M): DEBUG: NUMA: os_page_count=64 os_page_size=2097152 pages_per_blk=0.003906

so we need to support both cases, as you mention (BLCKSZ > PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ is 2..32kB and the page size is 4kB..1GB, thus we can get the following sample values in that float:

    BLCKSZ  pagesize
    2kB     4kB                 = 0.5
    2kB     2048kB              = 0.0009765625
    2kB     1024*1024kB  # 1GB  = 0.0000019073486328125  # worst case?
    8kB     4kB                 = 2
    8kB     2048kB              = 0.003906250  # example from above (x86_64, 2MB HP)
    8kB     1024*1024kB  # 1GB  = 0.00000762939453
    32kB    4kB                 = 8
    32kB    2048kB              = 0.0156250
    32kB    1024*1024kB  # 1GB  = 0.000030517578125

So that loop:

    for (size_t j = 0; j < pages_per_blk; j++)

is quite generic and runs in both cases. I've failed to come up with integer-based math and generic code for this (without special cases, which seem to be a no-go here?). So that loop will:

a) run many times to support BLCKSZ > pagesize, that is, when a single DB block spans multiple memory pages
b) run once when BLCKSZ < pagesize (because 0.003906250 > 0 in the example above)

The loop touches and stores addresses into os_page_ptrs[] as input to this one big move_pages(2) query. So we basically ask for all memory pages for NBuffers.
Once we get our NUMA information, we then use blk2page = up_to_NBuffers * pages_per_blk to resolve memory pointers back to Buffers; if anywhere, it could be a problem here.

So let's say we have s_b=4TB (it won't work for sure for other reasons, but let's assume we have it), and let's also assume we have no huge pages (pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824, which multiplied by 2 overflows INT_MAX (integer overflow bug), so I think that int is not big enough there in pg_buffercache_numa_pages() (it should be "size_t blk2page" there, as in pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20).

Another angle is s_b=4TB RAM with 2MB HP, BLCKSZ=8kB => NBuffers=2097152 * 0.003906250 = 8192.0.

OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with

    float f = 2097152 * 0.003906250;

under clang -Weverything I got "implicit conversion increases floating-point precision: 'float' to 'double'", so either:
- we somehow rewrite all of the core arithmetic here to integer?
- or simply go with doubles just to be sure? I went with doubles in v20; comments explaining this are not there yet.

> 7) This para in the docs seems self-contradictory:
>
>     <para>
>      The <function>pg_buffercache_numa_pages()</function> provides the same information
>      as <function>pg_buffercache_pages()</function> but is slower because it also
>      provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
>      The <structname>pg_buffercache_numa</structname> view wraps the function for
>      convenient use.
>     </para>
>
> I mean, "provides the same information, but is slower because it
> provides different information" is strange. I understand the point, but
> maybe rephrase somehow?

Oh my... yes, now it looks way better.

> 8) Why is pg_numa_available() marked as volatile? Should not change in a
> running cluster, no?

No, it shouldn't; good find, made it 's'table.

> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
> let's not do that.
>
>     <sect2 id="pgbuffercache-pg-buffercache_numa">

Fixed.

> 10) I think it'd be good to mention/explain why move_pages is used
> instead of get_mempolicy - more efficient with batching, etc. This would
> be useful both in the commit message and before the move_pages call

Ok, added in 0001.

> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
> pointers like this etc.).

Added a reference to that earlier new comment here too, in 0003.

> 11) This could use UINT64_FORMAT, instead of a cast:
>
>     elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
>          (unsigned long) os_page_count, os_page_size, pages_per_blk);

Done.

12) You have also raised "why not pg_shm_allocations_numa" instead of "pg_shm_numa_allocations".

OPEN_QUESTION: To be honest, I'm not attached to either of those two (or to naming things in general); I can change it if you want.

13) In the patch: "review: What if we get multiple pages per buffer (the default). Could we get multiple nodes per buffer?"

OPEN_QUESTION: Today no, but if we modified pg_buffercache_numa to output multiple rows per single buffer (with "page_no") then we could get this:

    buffer1:..:page0:numanodeID1
    buffer1:..:page1:numanodeID2
    buffer2:..:page0:numanodeID1

Should we add such functionality?

-J.
Attachment
Hi Jakub,

On Wed, Apr 02, 2025 at 04:45:53PM +0200, Jakub Wartak wrote:
> On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > === 4
> >
> > +        for (i = 0; i < NBuffers; i++)
> > +        {
> > +            int        blk2page = (int) i * pages_per_blk;
> > +
> >
> > I think that should be:
> >
> > int blk2page = (int) (i * pages_per_blk);
>
> OK, but I still fail to grasp why pgindent doesn't fix this stuff on
> its own... I believe the original indent would fix this on its own?

My comment was not about indentation but about the fact that I think the casting is not at the right place. I think it's the result of the multiplication that we want cast (the cast operator has higher precedence than the multiplication operator).

> >
> > select c.name, c.size as num_size, s.size as shmem_size
> > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
> > where c.name = s.name;
> >
> > I can see:
> >
> > - pg_shmem_numa_allocations reporting a lot of times the same size
> > - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size
> >
> > Do you observe the same?
>
> Yes, it is actually by design: the pg_shmem_numa_allocations.size is the sum of
> page sizes, not the size of the struct,

Ok, but then does it make sense to see some num_size < shmem_size?

postgres=# select c.name, c.size as num_size, s.size as shmem_size
           from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
           where c.name = s.name and s.size > c.size;
     name      | num_size  | shmem_size
---------------+-----------+------------
 XLOG Ctl      |   4194304 |    4208200
 Buffer Blocks | 134217728 |  134221824
 AioHandleIOV  |   2097152 |    2850816
(3 rows)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 4/2/25 16:46, Jakub Wartak wrote: > On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi, >> >> I've spent a bit of time reviewing this. In general I haven't found >> anything I'd call a bug, but here's a couple comments for v18 ... Most >> of this is in separate "review" commits, with a couple exceptions. > > Hi, thank you very much for help on this, yes I did not anticipate > this patch to organically grow like that... > I've squashed those review findings into v20 and provided answers for > the "review:". > Thanks. >> 1) Please update the commit messages, with proper formatting, etc. I >> tried to do that in the attached v19, but please go through that, add >> relevant details, update list of reviewers, etc. The subject should not >> be overly long, etc. > > Fixed by you. > OK, so you agree the commit messages are complete / correct? >> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd >> use just "libnuma support" similar to other libraries. > > Fixed by you. > OK. FWIW if you disagree with some of my proposed changes, feel free to push back. I'm sure some may be more a matter of personal preference. >> 3) I don't think we need pg_numa.h to have this: >> >> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS); >> >> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why >> would it be necessary here? It's enough to have a prototype in .c file. > > Right, probably the result of ENOTENOUGHCOFFEE and copy/paste. > >> 4) Improved .sgml to have acronym/productname in a couple places. > > Great. > >> 5) I don't think the comment for pg_buffercache_init_entries() is very >> useful. That it's helper for pg_buffercache_pages() tells me nothing >> about how to use it, what the arguments are, etc. > > I've added an explanation (in 0003 though), so that this is covered. > I've always assumed that 'static' functions don't need that much of > that(?) 
> I think that's mostly true - the (non-static) functions that are part of the API for a module need better / more detailed docs. But that doesn't mean static functions shouldn't have at least basic docs (unless the function is trivial / obvious, but I don't think that's the case here). If I happen to work a pg_buffercache patch in a couple months, I'll still need to understand what the function does. It won't save me that I'm working on the same file ... I'm not saying this needs a detailed docs, but "helper for X" adds very little information - I can easily see where it's called from, right? >> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better >> comment too. I mean, there's no info about what the arguments are, which >> arguments are input or output, etc. And it only discussed one option >> (block page < memory page), but what happens in the other case? The >> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not >> saying it's wrong). >> >> However, it seems rather suspicious that pages_per_blk is calculated as >> float, and pg_buffercache_numa_prepare_ptrs() then does this: >> >> for (size_t j = 0; j < pages_per_blk; j++) >> { ... } >> >> I mean, isn't this vulnerable to rounding errors, which might trigger >> some weird behavior? If not, it'd be good to add a comment why this is >> fine, it confuses me a lot. I personally would probably prefer doing >> just integer arithmetic here. > > Please bear with me: If you set client_min_messages to debug1 and then > pg_buffercache_numa will dump: > a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096 > pages_per_blk=2.00 > b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152 > pages_per_blk=0.003906 > > so we need to be agile to support two cases as you mention (BLCKSZ > > PAGESIZE and BLCKSZ < PAGESIZE). 
BLCKSZ are 2..32kB and pagesize are > 4kB..1GB, thus we can get in that float the following sample values: > BLCKSZ pagesize > 2kB 4kB = 0.5 > 2kB 2048kb = .0009765625 > 2kB 1024*1024kb # 1GB = .0000019073486328125 # worst-case? > 8kB 4kB = 2 > 8kB 2048kb = .003906250 # example from above (x86_64, 2M HP) > 8kB 1024*1024kb # 1GB = .00000762939453 > 32kB 4kB = 8 > 32kB 2048kb = .0156250 > 32kB 1024*1024kb # 1GB = .000030517578125 > > So that loop: > for (size_t j = 0; j < pages_per_blk; j++) > is quite generic and launches in both cases. I've somehow failed to > somehow come up with integer-based math and generic code for this > (without special cases which seem to be no-go here?). So, that loop > then will: > a) launch many times to support BLCKSZ > pagesize, that is when single > DB block spans multiple memory pages > b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the > example above) > Hmmm, OK. Maybe it's correct. I still find the float arithmetic really confusing and difficult to reason about ... I agree we don't want special cases for each possible combination of page sizes (I'm not sure we even know all the combinations). What I was thinking about is two branches, one for (block >= page) and another for (block < page). AFAICK both values have to be 2^k, so this would guarantee we have either (block/page) or (page/block) as integer. I wonder if you could even just calculate both, and have one loop that deals with both. > Loop touches && stores addresses into os_page_ptrs[] as input to this > one big move_pages(2) query. So we basically ask for all memory pages > for NBuffers. Once we get our NUMA information we then use blk2page = > up_to_NBuffers * pages_per_blk to resolve memory pointers back to > Buffers, if anywhere it could be a problem here. > IMHO this is the information I'd expect in the function comment. 
> So let's say we have s_b=4TB (it wont work for sure for other reasons, > let's assume we have it), let's also assume we have no huge > pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824 > which multiplied by 2 = INT_MAX (integer overflow bug), so I think > that int is not big enough there in pg_buffercache_numa_pages() (it > should be "size_t blk2page" there as in > pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20) > > Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB => > NBuffers=2097152 * 0.003906250 = 8192.0 . > > OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with > float f = 2097152 * 0.003906250; > under clang -Weverything I got "implicit conversion increases > floating-point precision: 'float' to 'double'", so either it is: > - we somehow rewrite all of the core arithmetics here to integer? > - or simply go with doubles just to be sure? I went with doubles in > v20, comments explaining are not there yet. > When I say "integer arithmetic" I don't mean it should use 32-bit ints, or any other data type. I mean that it works with non-floating point values. It could be int64, Size or whatever is large enough to not overflow. I really don't see how changing stuff to double makes this easier to understand. >> 7) This para in the docs seems self-contradictory: >> >> <para> >> The <function>pg_buffercache_numa_pages()</function> provides the same >> information >> as <function>pg_buffercache_pages()</function> but is slower because >> it also >> provides the <acronym>NUMA</acronym> node ID per shared buffer entry. >> The <structname>pg_buffercache_numa</structname> view wraps the >> function for >> convenient use. >> </para> >> >> I mean, "provides the same information, but is slower because it >> provides different information" is strange. I understand the point, but >> maybe rephrase somehow? > > Oh my... yes, now it looks way better. > >> 8) Why is pg_numa_available() marked as volatile? 
Should not change in a >> running cluster, no? > > No it shouldn't, good find, made it 's'table. > >> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe >> let's not do that. >> >> <sect2 id="pgbuffercache-pg-buffercache_numa"> > > Fixed. > >> 10) I think it'd be good to mention/explain why move_pages is used >> instead of get_mempolicy - more efficient with batching, etc. This would >> be useful both in the commit message and before the move_pages call > > Ok, added in 0001. > >> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the >> pointers like this etc.). > > Added reference to that earlier new comment here too in 0003. > Will take a look in the evening. >> 11) This could use UINT64_FORMAT, instead of a cast: >> >> elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu >> pages_per_blk=%.2f", >> (unsigned long) os_page_count, os_page_size, pages_per_blk); > > Done. > > 12) You have also raised "why not pg_shm_allocations_numa" instead of > "pg_shm_numa_allocations" > > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or > naming things in general), I can change if you want. > Me neither. I wonder if there's some precedent when adding similar variants for other catalogs ... can you check? I've been thinking about pg_stats and pg_stats_ext, but maybe there's a better example? > 13) In the patch: "review: What if we get multiple pages per buffer > (the default). Could we get multiple nodes per buffer?" > > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to > output multiple rows per single buffer (with "page_no") then we could > get this: > buffer1:..:page0:numanodeID1 > buffer1:..:page1:numanodeID2 > buffer2:..:page0:numanodeID1 > > Should we add such functionality? > When you say "today no" does that mean we know all pages will be on the same node, or that there may be pages from different nodes and we can't display that? That'd not be great, IMHO. 
I'm not a huge fan of returning multiple rows per buffer, with one row per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The rest of the fields is for the whole buffer, it'd be wrong to duplicate that for each page. I wonder if we should have a bitmap of nodes for the buffer (but then what if there are multiple pages from the same node?), or maybe just an array of nodes, with one element per page. regards -- Tomas Vondra
On Wed, Apr 2, 2025 at 5:27 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Jakub,

Hi Bertrand,

> > OK, but I still fail to grasp why pgindent doesn't fix this stuff on
> > its own... I believe the original indent would fix this on its own?
>
> My comment was not about indentation but about the fact that I think the
> casting is not at the right place. I think it's the result of the multiplication
> that we want cast (the cast operator has higher precedence than the
> multiplication operator).

Oh! I've missed that, but v21 got a rewrite (still not polished) just to show it can be done without floating point, as Tomas requested.

[..]

> Ok, but then does it make sense to see some num_size < shmem_size?
>
> postgres=# select c.name, c.size as num_size, s.size as shmem_size
>            from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocations s
>            where c.name = s.name and s.size > c.size;
>      name      | num_size  | shmem_size
> ---------------+-----------+------------
>  XLOG Ctl      |   4194304 |    4208200
>  Buffer Blocks | 134217728 |  134221824
>  AioHandleIOV  |   2097152 |    2850816

This was a real bug, fixed in v21; the ent->allocated_size was not properly page aligned. Thanks for the attention to detail.

-J.
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, > OK, so you agree the commit messages are complete / correct? Yes. > OK. FWIW if you disagree with some of my proposed changes, feel free to > push back. I'm sure some may be more a matter of personal preference. No, it's all fine. I will probably have lots of questions about setting proper env for development that cares itself about style, but that's for another day. > [..floats..] > Hmmm, OK. Maybe it's correct. I still find the float arithmetic really > confusing and difficult to reason about ... > > I agree we don't want special cases for each possible combination of > page sizes (I'm not sure we even know all the combinations). What I was > thinking about is two branches, one for (block >= page) and another for > (block < page). AFAICK both values have to be 2^k, so this would > guarantee we have either (block/page) or (page/block) as integer. > > I wonder if you could even just calculate both, and have one loop that > deals with both. > [..] > When I say "integer arithmetic" I don't mean it should use 32-bit ints, > or any other data type. I mean that it works with non-floating point > values. It could be int64, Size or whatever is large enough to not > overflow. I really don't see how changing stuff to double makes this > easier to understand. I hear you, attached v21 / 0003 is free of float/double arithmetics and uses non-float point values. It should be more readable too with those comments. I have not put it into its own function, because now it fits the whole screen, so hopefully one can follow visually. Please let me know if that code solves the doubts or feel free to reformat it. That _numa_prepare_ptrs() is unused and will need to be removed, but we can still move some code there if necessary. 
> > 12) You have also raised "why not pg_shm_allocations_numa" instead of > > "pg_shm_numa_allocations" > > > > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or > > naming things in general), I can change if you want. > > > > Me neither. I wonder if there's some precedent when adding similar > variants for other catalogs ... can you check? I've been thinking about > pg_stats and pg_stats_ext, but maybe there's a better example? Hm, it seems we always go with suffix "_somethingnew": * pg_stat_database -> pg_stat_database_conflicts * pg_stat_subscription -> pg_stat_subscription_stats * even here: pg_buffercache -> pg_buffercache_numa @Bertrand: do you have anything against pg_shm_allocations_numa instead of pg_shm_numa_allocations? I don't mind changing it... > > 13) In the patch: "review: What if we get multiple pages per buffer > > (the default). Could we get multiple nodes per buffer?" > > > > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to > > output multiple rows per single buffer (with "page_no") then we could > > get this: > > buffer1:..:page0:numanodeID1 > > buffer1:..:page1:numanodeID2 > > buffer2:..:page0:numanodeID1 > > > > Should we add such functionality? > > When you say "today no" does that mean we know all pages will be on the > same node, or that there may be pages from different nodes and we can't > display that? That'd not be great, IMHO. > > I'm not a huge fan of returning multiple rows per buffer, with one row > per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The > rest of the fields is for the whole buffer, it'd be wrong to duplicate > that for each page. OPEN_QUESTION: With v21 we have all the information available, we are just unable to display this in pg_buffercache_numa right now. 
We could trim the view so that it has 3 columns (and the user needs to JOIN to pg_buffercache for more details like the relation OID), but then what was the whole refactor (0002) for, if we would just return the buffer ID like below:

    buffer1:page0:numanodeID1
    buffer1:page1:numanodeID2
    buffer2:page0:numanodeID1
    buffer2:page1:numanodeID1

There's also the problem that reading/joining could be inconsistent and even slower.

> I wonder if we should have a bitmap of nodes for the buffer (but then
> what if there are multiple pages from the same node?), or maybe just an
> array of nodes, with one element per page.

AFAIR this was discussed back at the end of January, and the conclusion - on Discord - was more or less that everything sucks (bitmaps, BIT() datatype, arrays, ...) either from the implementation or the user side, but apparently arrays [] would suck the least from the implementation side. So we could probably do something like this, with up to numa_max_node() entries:

    buffer1:..:{0, 2, 0, 0}
    buffer2:..:{0, 1, 0, 1}  # edge case: buffer across 2 NUMA nodes
    buffer3:..:{0, 0, 0, 2}

Another idea is JSON, or even a simple string with numa_node_id<->count:

    buffer1:..:"1=2"
    buffer2:..:"1=1 3=1"  # edge case: buffer across 2 NUMA nodes
    buffer3:..:"3=2"

I find all of those non-user-friendly, and I'm afraid I won't be able to pull that off alone in time...

-J.
Attachment
Hi,

On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi Tomas,
>
> > OK, so you agree the commit messages are complete / correct?
>
> Yes.

Not 100% sure on my side.

=== v21-0002

Says:

"
This introduces three new functions:
- pg_buffercache_init_entries
- pg_buffercache_build_tuple
- get_buffercache_tuple
"

while pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() is).

About v21-0002:

=== 1

I can see that the pg_buffercache_init_entries() helper comments are added in v21-0003, but I think that it would be better to add them in v21-0002 (where the helper is actually created).

About v21-0003:

=== 2

> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values.

+        if (buffers_per_page > 1)
+            os_page_query_count = NBuffers;
+        else
+            os_page_query_count = NBuffers * pages_per_buffer;

Yeah, that's more elegant. I think that it properly handles the relationship between buffer and page sizes without relying on float arithmetic.

=== 3

+        if (buffers_per_page > 1)
+        {

As buffers_per_page does not change, I think I'd put this check outside of the "for (i = 0; i < NBuffers; i++)" loop, something like:

"
if (buffers_per_page > 1)
{
    /* BLCKSZ < PAGESIZE: one page hosts many Buffers */
    for (i = 0; i < NBuffers; i++)
    {
    .
    .
    .
}
else
{
    /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */
    for (i = 0; i < NBuffers; i++)
    {
    .
    .
"

=== 4

> That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.

Yeah, I think that it can simply be removed then.

=== 5

> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...

I think that pg_shm_allocations_numa() is better (given the examples you just shared).

=== 6

> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...
Maybe we could add a few words in the doc to mention the "multiple nodes per buffer" case? And try to improve it for, say, 19?

Also maybe we should just focus on everything up to v21-0003 (and discard v21-0004 for 18).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 4/3/25 09:01, Jakub Wartak wrote: > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi Tomas, > >> OK, so you agree the commit messages are complete / correct? > > Yes. > >> OK. FWIW if you disagree with some of my proposed changes, feel free to >> push back. I'm sure some may be more a matter of personal preference. > > No, it's all fine. I will probably have lots of questions about > setting proper env for development that cares itself about style, but > that's for another day. > >> [..floats..] >> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really >> confusing and difficult to reason about ... >> >> I agree we don't want special cases for each possible combination of >> page sizes (I'm not sure we even know all the combinations). What I was >> thinking about is two branches, one for (block >= page) and another for >> (block < page). AFAICK both values have to be 2^k, so this would >> guarantee we have either (block/page) or (page/block) as integer. >> >> I wonder if you could even just calculate both, and have one loop that >> deals with both. >> > [..] >> When I say "integer arithmetic" I don't mean it should use 32-bit ints, >> or any other data type. I mean that it works with non-floating point >> values. It could be int64, Size or whatever is large enough to not >> overflow. I really don't see how changing stuff to double makes this >> easier to understand. > > I hear you, attached v21 / 0003 is free of float/double arithmetics > and uses non-float point values. It should be more readable too with > those comments. I have not put it into its own function, because now > it fits the whole screen, so hopefully one can follow visually. Please > let me know if that code solves the doubts or feel free to reformat > it. That _numa_prepare_ptrs() is unused and will need to be removed, > but we can still move some code there if necessary. > IMHO the code in v21 is much easier to understand. 
It's not quite clear to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though. >>> 12) You have also raised "why not pg_shm_allocations_numa" instead of >>> "pg_shm_numa_allocations" >>> >>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or >>> naming things in general), I can change if you want. >>> >> >> Me neither. I wonder if there's some precedent when adding similar >> variants for other catalogs ... can you check? I've been thinking about >> pg_stats and pg_stats_ext, but maybe there's a better example? > > Hm, it seems we always go with suffix "_somethingnew": > > * pg_stat_database -> pg_stat_database_conflicts > * pg_stat_subscription -> pg_stat_subscription_stats > * even here: pg_buffercache -> pg_buffercache_numa > > @Bertrand: do you have anything against pg_shm_allocations_numa > instead of pg_shm_numa_allocations? I don't mind changing it... > +1 to pg_shmem_allocations_numa >>> 13) In the patch: "review: What if we get multiple pages per buffer >>> (the default). Could we get multiple nodes per buffer?" >>> >>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to >>> output multiple rows per single buffer (with "page_no") then we could >>> get this: >>> buffer1:..:page0:numanodeID1 >>> buffer1:..:page1:numanodeID2 >>> buffer2:..:page0:numanodeID1 >>> >>> Should we add such functionality? >> >> When you say "today no" does that mean we know all pages will be on the >> same node, or that there may be pages from different nodes and we can't >> display that? That'd not be great, IMHO. >> >> I'm not a huge fan of returning multiple rows per buffer, with one row >> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The >> rest of the fields is for the whole buffer, it'd be wrong to duplicate >> that for each page. > > OPEN_QUESTION: With v21 we have all the information available, we are > just unable to display this in pg_buffercache_numa right now. 
We could > trim the view so that it has 3 columns (and user needs to JOIN to > pg_buffercache for more details like relationoid), but then what the > whole refactor (0002) was for if we would just return bufferId like > below: > > buffer1:page0:numanodeID1 > buffer1:page1:numanodeID2 > buffer2:page0:numanodeID1 > buffer2:page1:numanodeID1 > > There's also the problem that reading/joining could be inconsistent > and even slower. > I think a view with just 3 columns would be a good solution. It's what pg_shmem_allocations_numa already does, so it'd be consistent with that part too. I'm not too worried about the cost of the extra join - it's going to be a couple dozen milliseconds at worst, I guess, and that's negligible in the bigger scheme of things (e.g. compared to how long the move_pages is expected to take). Also, it's not like having everything in the same view is free - people would have to do some sort of post-processing, and that has a cost too. So unless someone can demonstrate a use case where this would matter, I'd not worry about it too much. >> I wonder if we should have a bitmap of nodes for the buffer (but then >> what if there are multiple pages from the same node?), or maybe just an >> array of nodes, with one element per page. > > AFAIR this has been discussed back in end of January, and the > conclusion was more or less - on Discord - that everything sucks > (bitmaps, BIT() datatype, arrays,...) either from implementation or > user side, but apparently arrays [] would suck the least from > implementation side. 
So we could probably do something like up to > node_max_nodes(): > buffer1:..:{0, 2, 0, 0} > buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes > buffer3:..:{0, 0, 0, 2} > > Other idea is JSON or even simple string with numa_node_id<->count: > buffer1:..:"1=2" > buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes > buffer3:..:"3=2" > > I find all of those non-user friendly and I'm afraid I won't be able > to pull that alone in time... I'm -1 on JSON, I don't see how would that solve anything better than e.g. a regular array, and it's going to be harder to work with. So if we don't want to go with the 3-column view proposed earlier, I'd stick to a simple array. I don't think there's a huge difference between those two approaches, it should be easy to convert between those approaches using unnest() and array_agg(). Attached is v22, with some minor review comments: 1) I suggested we should just use "libnuma support" in configure, instead of talking about "NUMA awareness support", and AFAICS you agreed. But I still see the old text in configure ... is that intentional or a bit of forgotten text? 2) I added a couple asserts to pg_buffercache_numa_pages() and comments, and simplified a couple lines (but that's a matter of preference). 3) I don't think it's correct for pg_get_shmem_numa_allocations to just silently ignore nodes outside the valid range. I suggest we simply do elog(ERROR), as it's an internal error we don't expect to happen. regards -- Tomas Vondra
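Stepping back to the integer-only arithmetic settled on earlier in this message: since BLCKSZ and the OS page size are both powers of two, exactly one of the two ratios is a whole number. A minimal standalone sketch of that mapping (illustrative names and structure, not the actual patch code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch of the integer-only buffer-to-page arithmetic.
 * BLCKSZ and the OS page size are assumed to be powers of two, so one
 * of the two ratios is always an exact non-zero integer.
 */
typedef struct PageRatio
{
	size_t		pages_per_buffer;	/* non-zero when BLCKSZ >= page size */
	size_t		buffers_per_page;	/* non-zero when BLCKSZ <  page size */
} PageRatio;

static PageRatio
page_ratio(size_t blcksz, size_t os_page_size)
{
	PageRatio	r = {0, 0};

	if (blcksz >= os_page_size)
		r.pages_per_buffer = blcksz / os_page_size;	/* exact division */
	else
		r.buffers_per_page = os_page_size / blcksz;	/* exact division */
	return r;
}

/* Number of page pointers to hand to move_pages() for nbuffers buffers. */
static size_t
os_page_query_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	PageRatio	r = page_ratio(blcksz, os_page_size);

	/* one page can host many buffers, but each buffer's page is queried once */
	return (r.buffers_per_page > 1) ? nbuffers : nbuffers * r.pages_per_buffer;
}
```

With 8K blocks and 4K pages this queries two pages per buffer; with 2K blocks and 4K pages it queries one page per buffer, matching the two branches discussed above.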
On 4/3/25 10:23, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote: >> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi Tomas, >> >>> OK, so you agree the commit messages are complete / correct? >> >> Yes. > > Not 100% sure on my side. > > === v21-0002 > > Says: > > " > This introduces three new functions: > > - pg_buffercache_init_entries > - pg_buffercache_build_tuple > - get_buffercache_tuple > " > > While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() > is). > Ah, OK. Jakub, can you correct (and double-check) this in the next version of the patch? > About v21-0002: > > === 1 > > I can see that the pg_buffercache_init_entries() helper comments are added in > v21-0003 but I think that it would be better to add them in v21-0002 > (where the helper is actually created). > +1 to that > About v21-0003: > > === 2 > >> I hear you, attached v21 / 0003 is free of float/double arithmetics >> and uses non-float point values. > > + if (buffers_per_page > 1) > + os_page_query_count = NBuffers; > + else > + os_page_query_count = NBuffers * pages_per_buffer; > > yeah, that's more elegant. I think that it properly handles the relationships > between buffer and page sizes without relying on float arithmetic. > In the review I just submitted, I changed this to os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer); but maybe it's less clear. Feel free to undo my change. > === 3 > > + if (buffers_per_page > 1) > + { > > As buffers_per_page does not change, I think I'd put this check outside of the > for (i = 0; i < NBuffers; i++) loop, something like: > > " > if (buffers_per_page > 1) { > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > for (i = 0; i < NBuffers; i++) { > . > . > . > . > } else { > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > for (i = 0; i < NBuffers; i++) { > . > . I don't know. 
It's a matter of opinion, but I find the current code fairly understandable. Maybe if it meaningfully reduced the code nesting, but even with the extra branch we'd still need the for loop. I'm not against doing this differently, but I'd have to see how that looks. Until then I think it's fine to have the code as is. > . > " > > === 4 > >> That _numa_prepare_ptrs() is unused and will need to be removed, >> but we can still move some code there if necessary. > > Yeah I think that it can be simply removed then. > I'm not particularly attached on having the _ptrs() function, but why couldn't it build the os_page_ptrs array as before? > === 5 > >> @Bertrand: do you have anything against pg_shm_allocations_numa >> instead of pg_shm_numa_allocations? I don't mind changing it... > > I think that pg_shm_allocations_numa() is better (given the examples you just > shared). > > === 6 > >> I find all of those non-user friendly and I'm afraid I won't be able >> to pull that alone in time... > > Maybe we could add a few words in the doc to mention the "multiple nodes per > buffer" case? And try to improve it for say 19? Also maybe we should just focus > till v21-0003 (and discard v21-0004 for 18). > IMHO it's not enough to paper this over by mentioning it in the docs. I'm a bit confused about which patch you suggest to leave out. I think 0004 is pg_shmem_allocations_numa, but I think that part is fine, no? We've been discussing how to represent nodes for buffers in 0003. I understand it'd require a bit of coding, but AFAICS adding an array would be fairly trivial amount of code. Something like values[i++] = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID)); would do, I think. But if we decide to do the 3-column view, that'd be even simpler, I think. AFAICS it means we could mostly ditch the pg_buffercache refactoring in patch 0002, and 0003 would get much simpler too I think. If Jakub doesn't have time to work on this, I can take a stab at it later today. 
regards -- Tomas Vondra
On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, Hi Bertrand, > On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote: [..] > === v21-0002 > While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() > is). Fixed > About v21-0002: > > === 1 > > I can see that the pg_buffercache_init_entries() helper comments are added in > v21-0003 but I think that it would be better to add them in v21-0002 > (where the helper is actually created). Moved > About v21-0003: > > === 2 > > > I hear you, attached v21 / 0003 is free of float/double arithmetics > > and uses non-float point values. > > + if (buffers_per_page > 1) > + os_page_query_count = NBuffers; > + else > + os_page_query_count = NBuffers * pages_per_buffer; > > yeah, that's more elegant. I think that it properly handles the relationships > between buffer and page sizes without relying on float arithmetic. Cool > === 3 > > + if (buffers_per_page > 1) > + { > > As buffers_per_page does not change, I think I'd put this check outside of the > for (i = 0; i < NBuffers; i++) loop, something like: > > " > if (buffers_per_page > 1) { > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > for (i = 0; i < NBuffers; i++) { > . > . > . > . > } else { > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > for (i = 0; i < NBuffers; i++) { > . > . > . > " Done. > === 4 > > > That _numa_prepare_ptrs() is unused and will need to be removed, > > but we can still move some code there if necessary. > > Yeah I think that it can be simply removed then. Removed. > === 5 > > > @Bertrand: do you have anything against pg_shm_allocations_numa > > instead of pg_shm_numa_allocations? I don't mind changing it... > > I think that pg_shm_allocations_numa() is better (given the examples you just > shared). OK, let's go with this name then (in v22). > === 6 > > > I find all of those non-user friendly and I'm afraid I won't be able > > to pull that alone in time... 
>
> Maybe we could add a few words in the doc to mention the "multiple nodes per
> buffer" case? And try to improve it for say 19?

Right, we could also put it as a limitation. I would be happy to leave it
as-is since it must be a rare condition, but Tomas stated he's not.

> Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).

Do you mean discard pg_buffercache_numa (0002+0003) and instead go with
pg_shm_allocations_numa (0004)?

BTW: I've noticed that Tomas responded with his v22 to this after I had
solved all of the above in my own v22, so I'll post v23 soon and then let's
continue there.

-J.
Hi Jakub,

On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> Right, we could also put it as a limitation. I would be happy to leave
> it as it must be a rare condition, but Tomas stated he's not.
>
> > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18).
>
> Do you mean discard pg_buffercache_numa (0002+0003) and instead go
> with pg_shm_allocations_numa (0004) ?

No, I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas
is confident enough to also include 0004, that's fine too.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra <tomas@vondra.me> wrote: > Ah, OK. Jakub, can you correct (and double-check) this in the next > version of the patch? Done. > > About v21-0002: > > > > === 1 > > > > I can see that the pg_buffercache_init_entries() helper comments are added in > > v21-0003 but I think that it would be better to add them in v21-0002 > > (where the helper is actually created). > > > > +1 to that Done. > > About v21-0003: > > > > === 2 > > > >> I hear you, attached v21 / 0003 is free of float/double arithmetics > >> and uses non-float point values. > > > > + if (buffers_per_page > 1) > > + os_page_query_count = NBuffers; > > + else > > + os_page_query_count = NBuffers * pages_per_buffer; > > > > yeah, that's more elegant. I think that it properly handles the relationships > > between buffer and page sizes without relying on float arithmetic. > > > > In the review I just submitted, I changed this to > > os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer); > > but maybe it's less clear. Feel free to undo my change. Cool, thanks, will send shortly v23 with this applied. > > > === 3 > > > > + if (buffers_per_page > 1) > > + { > > > > As buffers_per_page does not change, I think I'd put this check outside of the > > for (i = 0; i < NBuffers; i++) loop, something like: > > > > " > > if (buffers_per_page > 1) { > > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > > for (i = 0; i < NBuffers; i++) { > > . > > . > > . > > . > > } else { > > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > > for (i = 0; i < NBuffers; i++) { > > . > > . > > I don't know. It's a matter of opinion, but I find the current code > fairly understandable. Maybe if it meaningfully reduced the code > nesting, but even with the extra branch we'd still need the for loop. > I'm not against doing this differently, but I'd have to see how that > looks. Until then I think it's fine to have the code as is. 
v23 (coming shortly) incorporates Bertrand's idea. No hard feelings, but it's
kind of painful to switch back and forth like that ;)

> > === 4
> >
> >> That _numa_prepare_ptrs() is unused and will need to be removed,
> >> but we can still move some code there if necessary.
> >
> > Yeah I think that it can be simply removed then.
> >
>
> I'm not particularly attached on having the _ptrs() function, but why
> couldn't it build the os_page_ptrs array as before?

I've removed it in v23; the code just didn't flow for me...

> > === 5
> >
> >> @Bertrand: do you have anything against pg_shm_allocations_numa
> >> instead of pg_shm_numa_allocations? I don't mind changing it...
> >
> > I think that pg_shm_allocations_numa() is better (given the examples you just
> > shared).

Done.

> > === 6
> >
> >> I find all of those non-user friendly and I'm afraid I won't be able
> >> to pull that alone in time...
> >
> > Maybe we could add a few words in the doc to mention the "multiple nodes per
> > buffer" case? And try to improve it for say 19? Also maybe we should just focus
> > till v21-0003 (and discard v21-0004 for 18).
> >
>
> IMHO it's not enough to paper this over by mentioning it in the docs.

OK

> I'm a bit confused about which patch you suggest to leave out. I think
> 0004 is pg_shmem_allocations_numa, but I think that part is fine, no?
> We've been discussing how to represent nodes for buffers in 0003.

IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but we
settled on arrays, right?
>
> If Jakub doesn't have time to work on this, I can take a stab at it
> later today.

I won't be able to even start on this today, so if you have cycles for it,
please do so...

-J.
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 4/3/25 09:01, Jakub Wartak wrote:
> > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

Here's v23 attached (I had to rework it because you sent v22 just as I was
about to send mine). Changes include:

- your review should be incorporated
- name change for the shm view
- pg_buffercache_numa refactored a little to move the if outside the loops
  (as Bertrand initially wanted)

So let's continue in this subthread.

> I think a view with just 3 columns would be a good solution. It's what
> pg_shmem_allocations_numa already does, so it'd be consistent with that
> part too.
>
> I'm not too worried about the cost of the extra join - it's going to be
> a couple dozen milliseconds at worst, I guess, and that's negligible in
> the bigger scheme of things (e.g. compared to how long the move_pages is
> expected to take). Also, it's not like having everything in the same
> view is free - people would have to do some sort of post-processing, and
> that has a cost too.
>
> So unless someone can demonstrate a use case where this would matter,
> I'd not worry about it too much.

OK - just 3 cols for pg_buffercache_numa is fine for me; it's just that I
don't have cycles left today and probably lack the skills (I've never dealt
with arrays so far), thus it would be slow to get it right... but I can pick
up anything tomorrow morning.

> I'm -1 on JSON, I don't see how would that solve anything better than
> e.g. a regular array, and it's going to be harder to work with. So if we
> don't want to go with the 3-column view proposed earlier, I'd stick to a
> simple array. I don't think there's a huge difference between those two
> approaches, it should be easy to convert between those approaches using
> unnest() and array_agg().
> > Attached is v22, with some minor review comments: > > 1) I suggested we should just use "libnuma support" in configure, > instead of talking about "NUMA awareness support", and AFAICS you > agreed. But I still see the old text in configure ... is that > intentional or a bit of forgotten text? It was my bad, too many rebases and mishaps with git voodoo.. > 2) I added a couple asserts to pg_buffercache_numa_pages() and comments, > and simplified a couple lines (but that's a matter of preference). Great, thanks. > 3) I don't think it's correct for pg_get_shmem_numa_allocations to just > silently ignore nodes outside the valid range. I suggest we simply do > elog(ERROR), as it's an internal error we don't expect to happen. It's there too. -J.
On 4/3/25 15:12, Jakub Wartak wrote:
> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>
>> ...
>>
>> So unless someone can demonstrate a use case where this would matter,
>> I'd not worry about it too much.
>
> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me,
> it's just that I don't have cycles left today and probably lack skills
> (i've never dealt with arrays so far) thus it would be slow to get it
> right... but I can pick up anything tomorrow morning.
>

OK, I took a stab at reworking/simplifying this the way I proposed. Here's
v24 - needs more polishing, but hopefully enough to show what I had in mind.

It does these changes:

1) Drops 0002 with the pg_buffercache refactoring, because the new view is
not "extending" the existing one.

2) Reworks pg_buffercache_numa to return just three columns: bufferid,
page_num and node_id. page_num is a sequence starting from 0 for each
buffer.

3) It now builds an array of records, with one record per buffer/page.

4) I realized we don't really need to worry about buffers_per_page very
much, except for logging/debugging. There's always "at least one page" per
buffer, even if an incomplete one, so we can do this:

    os_page_count = NBuffers * Max(1, pages_per_buffer);

and then

    for (i = 0; i < NBuffers; i++)
    {
        for (j = 0; j < Max(1, pages_per_buffer); j++)
        {
            ..
        }
    }

and everything just works fine, I think.

Opinions? I personally find this much cleaner / easier to understand.

regards

--
Tomas Vondra
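The Max(1, pages_per_buffer) trick above can be sketched standalone: with integer division, pages_per_buffer is 0 when BLCKSZ is smaller than the OS page size, and the Max() clamp makes a single loop shape cover both regimes. A hedged sketch with illustrative names (not the patch itself):

```c
#include <assert.h>
#include <stddef.h>

#define Max(x, y) ((x) > (y) ? (x) : (y))

/*
 * Sketch of the v24 simplification: pages_per_buffer may truncate to 0
 * when BLCKSZ < os_page_size, but there is always "at least one page"
 * per buffer, even an incomplete one, so Max(1, ...) handles both cases.
 */
static size_t
os_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	size_t		pages_per_buffer = blcksz / os_page_size;	/* may be 0 */

	return nbuffers * Max((size_t) 1, pages_per_buffer);
}

/* The nested loop then visits exactly os_page_count (buffer, page) pairs. */
static size_t
count_entries(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
	size_t		pages_per_buffer = Max((size_t) 1, blcksz / os_page_size);
	size_t		entries = 0;

	for (size_t i = 0; i < nbuffers; i++)
		for (size_t j = 0; j < pages_per_buffer; j++)
			entries++;		/* would record base + i * blcksz + j * os_page_size */
	return entries;
}
```

For 8K blocks on 4K pages each buffer contributes two entries; for 2K blocks on 4K pages each buffer contributes exactly one, without any special-casing.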
Hi, On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: > On 4/3/25 15:12, Jakub Wartak wrote: > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > > >> ... > >> > >> So unless someone can demonstrate a use case where this would matter, > >> I'd not worry about it too much. > > > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, > > it's just that I don't have cycles left today and probably lack skills > > (i've never dealt with arrays so far) thus it would be slow to get it > > right... but I can pick up anything tomorrow morning. > > > > OK, I took a stab at reworking/simplifying this the way I proposed. > Here's v24 - needs more polishing, but hopefully enough to show what I > had in mind. > > It does these changes: > > 1) Drops 0002 with the pg_buffercache refactoring, because the new view > is not "extending" the existing one. I think that makes sense. One would just need to join on the pg_buffercache view to get more information about the buffer if needed. The pg_buffercache_numa_pages() doc needs an update though as I don't think that "+ The <function>pg_buffercache_numa_pages()</function> provides the same information as <function>pg_buffercache_pages()</function>" is still true. > 2) Reworks pg_buffercache_num to return just three columns, bufferid, > page_num and node_id. page_num is a sequence starting from 0 for each > buffer. +1 on the idea > 3) It now builds an array of records, with one record per buffer/page. > > 4) I realized we don't really need to worry about buffers_per_page very > much, except for logging/debugging. There's always "at least one page" > per buffer, even if an incomplete one, so we can do this: > > os_page_count = NBuffers * Max(1, pages_per_buffer); > > and then > > for (i = 0; i < NBuffers; i++) > { > for (j = 0; j < Max(1, pages_per_buffer); j++) That's a nice simplification as we always need to take care of at least one page per buffer. 
> and everything just works fine, I think. I think the same. > Opinions? I personally find this much cleaner / easier to understand. I agree that's easier to understand and that that looks correct. A few random comments: === 1 It looks like that firstNumaTouch is not set to false anymore. === 2 + pfree(os_page_status); + pfree(os_page_ptrs); Not sure that's needed, we should be in a short-lived memory context here (ExprContext or such). === 3 + ro_volatile_var = *(uint64 *)ptr space missing before "ptr"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
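For context on the ro_volatile_var read mentioned in === 3: reading one word from each page faults the pages into the calling process, so a later move_pages(2) query can report a real NUMA node instead of -ENOENT for never-touched pages (the "-2" observation from earlier in this thread). A hedged standalone sketch of such a touch loop, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative sketch (not the patch code): touch the first word of
 * every OS page in [base, base + len) with a volatile read, forcing the
 * kernel to map the pages into this process so that a subsequent
 * move_pages(2) query returns a NUMA node rather than -ENOENT.
 * Returns the number of pages touched.
 */
static size_t
touch_pages(const char *base, size_t len, size_t os_page_size)
{
	volatile uint64_t ro_volatile_var;
	size_t		touched = 0;

	for (size_t off = 0; off < len; off += os_page_size)
	{
		ro_volatile_var = *(const uint64_t *) (base + off);
		(void) ro_volatile_var;
		touched++;
	}
	return touched;
}
```

The volatile qualifier keeps the compiler from optimizing the otherwise-unused reads away, which is the whole point of the loop.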
On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: > > On 4/3/25 15:12, Jakub Wartak wrote: > > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > > > > >> ... > > >> > > >> So unless someone can demonstrate a use case where this would matter, > > >> I'd not worry about it too much. > > > > > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, > > > it's just that I don't have cycles left today and probably lack skills > > > (i've never dealt with arrays so far) thus it would be slow to get it > > > right... but I can pick up anything tomorrow morning. > > > > > > > OK, I took a stab at reworking/simplifying this the way I proposed. > > Here's v24 - needs more polishing, but hopefully enough to show what I > > had in mind. > > > > It does these changes: > > > > 1) Drops 0002 with the pg_buffercache refactoring, because the new view > > is not "extending" the existing one. > > I think that makes sense. One would just need to join on the pg_buffercache > view to get more information about the buffer if needed. > > The pg_buffercache_numa_pages() doc needs an update though as I don't think that > "+ The <function>pg_buffercache_numa_pages()</function> provides the same > information as <function>pg_buffercache_pages()</function>" is still true. > > > 2) Reworks pg_buffercache_num to return just three columns, bufferid, > > page_num and node_id. page_num is a sequence starting from 0 for each > > buffer. > > +1 on the idea > > > 3) It now builds an array of records, with one record per buffer/page. > > > > 4) I realized we don't really need to worry about buffers_per_page very > > much, except for logging/debugging. 
There's always "at least one page" > > per buffer, even if an incomplete one, so we can do this: > > > > os_page_count = NBuffers * Max(1, pages_per_buffer); > > > > and then > > > > for (i = 0; i < NBuffers; i++) > > { > > for (j = 0; j < Max(1, pages_per_buffer); j++) > > That's a nice simplification as we always need to take care of at least one page > per buffer. > > > and everything just works fine, I think. > > I think the same. > > > Opinions? I personally find this much cleaner / easier to understand. > > I agree that's easier to understand and that that looks correct. > > A few random comments: > > === 1 > > It looks like that firstNumaTouch is not set to false anymore. > > === 2 > > + pfree(os_page_status); > + pfree(os_page_ptrs); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). > > === 3 > > + ro_volatile_var = *(uint64 *)ptr > > space missing before "ptr"? > +my feedback as I've noticed that Bertrand already provided a review. Right, the code is now simple , and that Max() is brilliant. I've attached some review findings as .txt 0001 100%LGTM 0002 doc fix + pgident + Tomas, you should take Authored-by yourself there for sure, I couldn't pull this off alone in time! So big thanks! 
0003 fixes elog UINT64_FORMAT for mingw32 (a little bit funny to have NUMA
on mingw32... :))

When started with interleave=all on serious hardware, I'm getting (~5s for
s_b=64GB) from pg_buffercache_numa:

 node_id |  count
---------+---------
       3 | 2097152
       0 | 2097152
       2 | 2097152
       1 | 2097152

so this is a valid result (2097152 buffers * 4 NUMA nodes * 8192 buffer
size / 1024/1024/1024 = 64GB).

Also with pgbench -i -s 20, after ~8s:

select c.relname, n.node_id, count(*) from pg_buffercache_numa n
join pg_buffercache b on (b.bufferid = n.bufferid)
join pg_class c on (c.relfilenode = b.relfilenode)
group by c.relname, n.node_id order by count(*) desc;

                    relname                     | node_id | count
------------------------------------------------+---------+-------
 pgbench_accounts                               |       2 |  8217
 pgbench_accounts                               |       0 |  8190
 pgbench_accounts                               |       3 |  8189
 pgbench_accounts                               |       1 |  8187
 pg_statistic                                   |       2 |    32
 pg_operator                                    |       2 |    14
 pg_depend                                      |       3 |    14
[..]

pg_shm_allocations_numa also looks good.

I think v24 + tiny fixes is good enough to go in.

-J.
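The back-of-the-envelope check above (buffers per node, times NUMA nodes, times block size, equals shared_buffers) is easy to verify mechanically; a trivial sketch using the quoted numbers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sanity arithmetic for the quoted pg_buffercache_numa output:
 * per-node buffer count * number of nodes * block size should equal
 * the configured shared_buffers. The constants come from that output.
 */
static uint64_t
covered_bytes(uint64_t buffers_per_node, uint64_t numa_nodes, uint64_t blcksz)
{
	return buffers_per_node * numa_nodes * blcksz;
}
```

With 2097152 buffers on each of 4 nodes and 8192-byte blocks this comes out to exactly 64GB, matching the reported shared_buffers.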
On 4/4/25 09:35, Jakub Wartak wrote: > On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: >>> On 4/3/25 15:12, Jakub Wartak wrote: >>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> >>>>> ... >>>>> >>>>> So unless someone can demonstrate a use case where this would matter, >>>>> I'd not worry about it too much. >>>> >>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, >>>> it's just that I don't have cycles left today and probably lack skills >>>> (i've never dealt with arrays so far) thus it would be slow to get it >>>> right... but I can pick up anything tomorrow morning. >>>> >>> >>> OK, I took a stab at reworking/simplifying this the way I proposed. >>> Here's v24 - needs more polishing, but hopefully enough to show what I >>> had in mind. >>> >>> It does these changes: >>> >>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view >>> is not "extending" the existing one. >> >> I think that makes sense. One would just need to join on the pg_buffercache >> view to get more information about the buffer if needed. >> >> The pg_buffercache_numa_pages() doc needs an update though as I don't think that >> "+ The <function>pg_buffercache_numa_pages()</function> provides the same >> information as <function>pg_buffercache_pages()</function>" is still true. >> >>> 2) Reworks pg_buffercache_num to return just three columns, bufferid, >>> page_num and node_id. page_num is a sequence starting from 0 for each >>> buffer. >> >> +1 on the idea >> >>> 3) It now builds an array of records, with one record per buffer/page. >>> >>> 4) I realized we don't really need to worry about buffers_per_page very >>> much, except for logging/debugging. 
There's always "at least one page" >>> per buffer, even if an incomplete one, so we can do this: >>> >>> os_page_count = NBuffers * Max(1, pages_per_buffer); >>> >>> and then >>> >>> for (i = 0; i < NBuffers; i++) >>> { >>> for (j = 0; j < Max(1, pages_per_buffer); j++) >> >> That's a nice simplification as we always need to take care of at least one page >> per buffer. >> >>> and everything just works fine, I think. >> >> I think the same. >> >>> Opinions? I personally find this much cleaner / easier to understand. >> >> I agree that's easier to understand and that that looks correct. >> >> A few random comments: >> >> === 1 >> >> It looks like that firstNumaTouch is not set to false anymore. >> >> === 2 >> >> + pfree(os_page_status); >> + pfree(os_page_ptrs); >> >> Not sure that's needed, we should be in a short-lived memory context here >> (ExprContext or such). >> >> === 3 >> >> + ro_volatile_var = *(uint64 *)ptr >> >> space missing before "ptr"? >> > > +my feedback as I've noticed that Bertrand already provided a review. > > Right, the code is now simple , and that Max() is brilliant. I've > attached some review findings as .txt > > 0001 100%LGTM > 0002 doc fix + pgident + Tomas, you should take Authored-by yourself > there for sure, I couldn't pull this off alone in time! So big thanks! 
> 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have > NUMA on ming32...:)) > OK > When started with interleave=all on serious hardware, I'm getting (~5s > for s_b=64GB) from pg_buffercache_numa > > node_id | count > ---------+--------- > 3 | 2097152 > 0 | 2097152 > 2 | 2097152 > 1 | 2097152 > > so this is valid result (2097152*4 numa nodes*8192 buffer > size/1024/1024/1024 = 64GB) > > Also with pgbench -i -s 20 , after ~8s: > select c.relname, n.node_id, count(*) from pg_buffercache_numa n > join pg_buffercache b on (b.bufferid = n.bufferid) > join pg_class c on (c.relfilenode = b.relfilenode) > group by c.relname, n.node_id order by count(*) desc; > relname | node_id | count > -----------------------------------------------+---------+------- > pgbench_accounts | 2 | 8217 > pgbench_accounts | 0 | 8190 > pgbench_accounts | 3 | 8189 > pgbench_accounts | 1 | 8187 > pg_statistic | 2 | 32 > pg_operator | 2 | 14 > pg_depend | 3 | 14 > [..] > > pg_shm_allocations_numa also looks good. > > I think v24+tiny fixes is good enough to go in. > OK. Do you have any suggestions regarding the column names in the new view? I'm not sure I like node_id and page_num. regards -- Tomas Vondra
On 4/4/25 08:50, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: >> On 4/3/25 15:12, Jakub Wartak wrote: >>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: >>> >>>> ... >>>> >>>> So unless someone can demonstrate a use case where this would matter, >>>> I'd not worry about it too much. >>> >>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, >>> it's just that I don't have cycles left today and probably lack skills >>> (i've never dealt with arrays so far) thus it would be slow to get it >>> right... but I can pick up anything tomorrow morning. >>> >> >> OK, I took a stab at reworking/simplifying this the way I proposed. >> Here's v24 - needs more polishing, but hopefully enough to show what I >> had in mind. >> >> It does these changes: >> >> 1) Drops 0002 with the pg_buffercache refactoring, because the new view >> is not "extending" the existing one. > > I think that makes sense. One would just need to join on the pg_buffercache > view to get more information about the buffer if needed. > > The pg_buffercache_numa_pages() doc needs an update though as I don't think that > "+ The <function>pg_buffercache_numa_pages()</function> provides the same > information as <function>pg_buffercache_pages()</function>" is still true. > Right, thanks for checking the docs. >> 2) Reworks pg_buffercache_num to return just three columns, bufferid, >> page_num and node_id. page_num is a sequence starting from 0 for each >> buffer. > > +1 on the idea > >> 3) It now builds an array of records, with one record per buffer/page. >> >> 4) I realized we don't really need to worry about buffers_per_page very >> much, except for logging/debugging. 
There's always "at least one page" >> per buffer, even if an incomplete one, so we can do this: >> >> os_page_count = NBuffers * Max(1, pages_per_buffer); >> >> and then >> >> for (i = 0; i < NBuffers; i++) >> { >> for (j = 0; j < Max(1, pages_per_buffer); j++) > > That's a nice simplification as we always need to take care of at least one page > per buffer. > OK. I think I'll consider moving some of this code "building" the entries into a separate function, to keep the main function easier to understand. >> and everything just works fine, I think. > > I think the same. > >> Opinions? I personally find this much cleaner / easier to understand. > > I agree that's easier to understand and that it looks correct. > > A few random comments: > > === 1 > > It looks like firstNumaTouch is not set to false anymore. > Damn, my mistake. > === 2 > > + pfree(os_page_status); > + pfree(os_page_ptrs); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). > Yeah, maybe. It's not allocated in the multi-call context, but I wasn't sure. Will check. > === 3 > > + ro_volatile_var = *(uint64 *)ptr > > space missing before "ptr"? > Interesting that pgindent didn't tweak this. regards -- Tomas Vondra
On Fri, Apr 4, 2025 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, > Do you have any suggestions regarding the column names in the new view? > I'm not sure I like node_id and page_num. They actually look good to me. We've discussed earlier dropping s/numa_//g for column names (after all views contain it already) so they are fine in this regard. There's also the question of consistency: (bufferid, page_num, node_id) -- maybe should just drop "_" and that's it? Well I would even possibly consider page_num -> ospagenumber, but that's ugly. -J.
OK, here's v25 after going through the patches once more, fixing the issues mentioned by Bertrand, etc. I think 0001 and 0002 are fine, I have a couple minor questions about 0003. 0002 ---- - Adds the new types to typedefs.list, to make pgindent happy. - Improves comment for pg_buffercache_numa_pages - Minor formatting tweaks. - I was wondering if maybe we should have some "global ID" of memory page, so that with large memory pages it's clear which buffers are on the same memory page. Right now each buffer starts page_num from 0, but it should not be very hard to have a global counter. Opinions? 0003 ---- - Minor formatting tweaks, comment improvements. - Isn't this comment a bit confusing / misleading? /* Get number of OS aligned pages */ AFAICS the point is to adjust the allocated_size to be a multiple of os_page_size, to get "all" memory pages the segment uses. But that's not what I understand by "aligned page" (which is about where the page is expected to start). Or did I get this wrong? - There's a comment at the end which talks about "ignored segments". IMHO that type of information should be in the function comment, but I'm also not quite sure I understand what "output shared memory" is ... regards -- Tomas Vondra
Hi, On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote: > OK, > > here's v25 after going through the patches once more, fixing the issues > mentioned by Bertrand, etc. Thanks! > I think 0001 and 0002 are fine, Agree, I just have some cosmetic nit comments: please find them in nit-bertrand-0002.txt attached. > I have a > couple minor questions about 0003. > > - I was wondering if maybe we should have some "global ID" of memory > page, so that with large memory pages it's indicated the buffers are on > the same memory page. Right now each buffer starts page_num from 0, but > it should not be very hard to have a global counter. Opinions? I think that's a good idea. We could then add a new column (say os_page_id) that would help identify which buffers are sharing the same "physical" page. > 0003 > ---- > - Minor formatting tweaks, comment improvements. > - Isn't this comment a bit confusing / misleading? > > /* Get number of OS aligned pages */ > > AFAICS the point is to adjust the allocated_size to be a multiple of > os-page_size, to get "all" memory pages the segment uses. But that's not > what I understand by "aligned page" (which is about there the page is > expected to start). Agree, what about? " Align the start of the allocated size to an OS page size boundary and then get the total number of OS pages used by this segment" " > - There's a comment at the end which talks about "ignored segments". > IMHO that type of information should be in the function comment, > but I'm > also not quite sure I understand what "output shared memory" is ... I think that comes from the comments that are already in pg_get_shmem_allocations(). I think that those are located here and worded that way to make it easier to understand what is not included in pg_get_shmem_allocations_numa() if one looks at both functions. That said, I'm +1 to putting this kind of comment in the function comment.
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 4/5/25 11:37, Bertrand Drouvot wrote: > Hi, > > On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote: >> OK, >> >> here's v25 after going through the patches once more, fixing the issues >> mentioned by Bertrand, etc. > > Thanks! > >> I think 0001 and 0002 are fine, > > Agree, I just have some cosmetic nits comments: please find them in > nit-bertrand-0002.txt attached. > >> I have a >> couple minor questions about 0003. >> >> - I was wondering if maybe we should have some "global ID" of memory >> page, so that with large memory pages it's indicated the buffers are on >> the same memory page. Right now each buffer starts page_num from 0, but >> it should not be very hard to have a global counter. Opinions? > > I think that's a good idea. We could then add a new column (say os_page_id) that > would help identify which buffers are sharing the same "physical" page. > I was thinking we'd change the definition of the existing page_num column, i.e. it wouldn't be 0..N sequence for each buffer, but a global page ID. But I don't know if this would be useful in practice. >> 0003 >> ---- >> - Minor formatting tweaks, comment improvements. >> - Isn't this comment a bit confusing / misleading? >> >> /* Get number of OS aligned pages */ >> >> AFAICS the point is to adjust the allocated_size to be a multiple of >> os-page_size, to get "all" memory pages the segment uses. But that's not >> what I understand by "aligned page" (which is about there the page is >> expected to start). > > Agree, what about? > > " > Align the start of the allocated size to an OS page size boundary and then get > the total number of OS pages used by this segment" > " > Something like that. But I think it should be "align the size of ...", we're not aligning the start. >> - There's a comment at the end which talks about "ignored segments". >> IMHO that type of information should be in the function comment, >> but I'm >> also not quite sure I understand what "output shared memory" is ... 
> > I think that comes from the comments that are already in > pg_get_shmem_allocations(). > > I think that those are located here and worded that way to ease to understand > what is not in with pg_get_shmem_allocations_numa() if one look at both > functions. That said, I'm +1 to put this kind of comments in the function comment. > OK. But I'm still not sure what "output shared memory" is about. Can you explain what shmem segments are not included? regards -- Tomas Vondra
On 4/5/25 15:23, Tomas Vondra wrote: > On 4/5/25 11:37, Bertrand Drouvot wrote: >> Hi, >> >> On Fri, Apr 04, 2025 at 09:25:57PM +0200, Tomas Vondra wrote: >>> OK, >>> >>> here's v25 after going through the patches once more, fixing the issues >>> mentioned by Bertrand, etc. >> >> Thanks! >> >>> I think 0001 and 0002 are fine, >> >> Agree, I just have some cosmetic nits comments: please find them in >> nit-bertrand-0002.txt attached. >> >>> I have a >>> couple minor questions about 0003. >>> >>> - I was wondering if maybe we should have some "global ID" of memory >>> page, so that with large memory pages it's indicated the buffers are on >>> the same memory page. Right now each buffer starts page_num from 0, but >>> it should not be very hard to have a global counter. Opinions? >> >> I think that's a good idea. We could then add a new column (say os_page_id) that >> would help identify which buffers are sharing the same "physical" page. >> > > I was thinking we'd change the definition of the existing page_num > column, i.e. it wouldn't be 0..N sequence for each buffer, but a global > page ID. But I don't know if this would be useful in practice. > See the attached v25 with a draft of this in patch 0003. While working on this, I realized it's probably wrong to use TYPEALIGN() to calculate the OS page pointer. The code did this: os_page_ptrs[idx] = (char *) TYPEALIGN(os_page_size, buffptr + (os_page_size * j)); but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB memory pages, and that the first buffer is aligned to 4kB (i.e. it starts right at the beginning of a memory page). Then we expect to get page_num sequence: 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ... with 4 buffers per memory page. But we get this: 0, 1, 1, 1, 1, 2, 2, 2, 2, ... So I've changed this to TYPEALIGN_DOWN(), which fixes the result. The pg_shmem_allocations_numa had a variant of this issue when calculating the number of pages, I think. 
I believe the shmem segment may start half-way through a page, and the allocated_size may not be a multiple of os_page_size (otherwise why use TYPEALIGN). In which case we might skip the first page, I think. The one question I have about this is whether we know the pointer returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or before the first shared buffer) ... regards -- Tomas Vondra
Hi, On Sat, Apr 05, 2025 at 04:33:28PM +0200, Tomas Vondra wrote: > On 4/5/25 15:23, Tomas Vondra wrote: > > I was thinking we'd change the definition of the existing page_num > > column, i.e. it wouldn't be 0..N sequence for each buffer, but a global > > page ID. But I don't know if this would be useful in practice. > > > > See the attached v25 with a draft of this in patch 0003. I see, thanks for sharing. I think that's useful because that could help identify which buffers share the same OS page. > While working on this, I realized it's probably wrong to use TYPEALIGN() > to calculate the OS page pointer. The code did this: > > os_page_ptrs[idx] > = (char *) TYPEALIGN(os_page_size, > buffptr + (os_page_size * j)); > > but TYPEALIGN() rounds "up". Let's assume we have 1KB buffers and 4KB > memory pages, and that the first buffer is aligned to 4kB (i.e. it > starts right at the beginning of a memory page). Then we expect to get > page_num sequence: > > 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ... > > with 4 buffers per memory page. But we get this: > > 0, 1, 1, 1, 1, 2, 2, 2, 2, ... > > So I've changed this to TYPEALIGN_DOWN(), which fixes the result. Good catch, that fully makes sense. But now I can see some page_num < 0: postgres=# select page_num,node_id,count(*) from pg_buffercache_numa group by page_num,node_id order by 1 limit 4; page_num | node_id | count ----------+---------+------- -1 | 0 | 386 0 | 1 | 1024 1 | 0 | 1024 2 | 1 | 1024 I think that can be solved this way: - startptr = (char *) BufferGetBlock(1); + startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1)); so that startptr is aligned to the same boundaries. But I guess that we'll have the same question as the following one: > The one question I have about this is whether we know the pointer > returned by TYPEALIGN_DOWN() is valid. It's before ent->location (or > before the first shared buffer) ... Yeah, I'm not 100% sure about that...
Maybe for safety we could use TYPEALIGN_DOWN() for the reporting and use the actual buffer address when pg_numa_touch_mem_if_required() is called? (to avoid touching "invalid" memory). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Sat, Apr 05, 2025 at 03:23:38PM +0200, Tomas Vondra wrote: > Something like that. But I think it should be "align the size of ...", > we're not aligning the start. > > >> - There's a comment at the end which talks about "ignored segments". > >> IMHO that type of information should be in the function comment, > >> but I'm > >> also not quite sure I understand what "output shared memory" is ... > > > > I think that comes from the comments that are already in > > pg_get_shmem_allocations(). > > > > I think that those are located here and worded that way to ease to understand > > what is not in with pg_get_shmem_allocations_numa() if one look at both > > functions. That said, I'm +1 to put this kind of comments in the function comment. > > OK. But I'm still not sure what "output shared memory" is about. Can you > explain what shmem segments are not included? Looking at pg_get_shmem_allocations() and the pg_shmem_allocations view documentation, I would say the wording is linked to "anonymous allocations" and "unused memory" (i.e. the ones reported with <anonymous> or NULL as name in the pg_shmem_allocations view). Using "output" in this comment sounds confusing, while it makes sense in pg_get_shmem_allocations() because it really reports those. I think that we could just mention in the function comment that pg_get_shmem_allocations_numa() does not handle anonymous allocations and unused memory. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, I just played around with this for a bit. As noted somewhere further down, pg_buffercache_numa.page_num ends up wonky in different ways for the different pages. I think one thing that the docs should mention is that calling the numa functions/views will force the pages to be allocated, even if they're currently unused. Newly started server, with s_b of 32GB and 2MB huge pages: grep ^Huge /proc/meminfo HugePages_Total: 34802 HugePages_Free: 34448 HugePages_Rsvd: 16437 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 76517376 kB run SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; Now the pages that previously were marked as reserved are actually allocated: grep ^Huge /proc/meminfo HugePages_Total: 34802 HugePages_Free: 18012 HugePages_Rsvd: 1 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 76517376 kB I don't see how we can avoid that right now, but at the very least we ought to document it. On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote: > The libnuma library is not available on 32-bit builds (there's no shared > object for i386), so we disable it in that case. The i386 is very memory > limited anyway, even with PAE, so NUMA is mostly irrelevant. Hm? libnuma1:i386 installs just fine for me on debian and contains the shared library. > +############################################################### > +# Library: libnuma > +############################################################### > + > +libnumaopt = get_option('libnuma') > +if not libnumaopt.disabled() > + # via pkg-config > + libnuma = dependency('numa', required: libnumaopt) > + if not libnuma.found() > + libnuma = cc.find_library('numa', required: libnumaopt) > + endif This fallback isn't going to work if -dlibnuma=enabled is used, as dependency() will error out, due to not finding a required dependency.
It's linux only and a new dependency, so just requiring a .pc file seems like it'd be fine? > +#ifdef USE_LIBNUMA > + > +/* > + * This is required on Linux, before pg_numa_query_pages() as we > + * need to page-fault before move_pages(2) syscall returns valid results. > + */ > +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \ > + ro_volatile_var = *(uint64 *) ptr Does it really work that way? A volatile variable to assign the result of dereferencing ptr ensures that the *store* isn't removed by the compiler, but it doesn't at all guarantee that the *load* isn't removed, since that memory isn't marked as volatile. I think you'd need to cast the source pointer to a volatile uint64* to ensure the load happens. > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit > + > +-- Register the new functions. > +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages() > +RETURNS SETOF RECORD > +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages' > +LANGUAGE C PARALLEL SAFE; > + > +-- Create a view for convenient access. > +CREATE OR REPLACE VIEW pg_buffercache_numa AS > + SELECT P.* FROM pg_buffercache_numa_pages() AS P > + (bufferid integer, page_num int4, node_id int4); Why CREATE OR REPLACE? > +Datum > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > +{ > ... > + > + /* > + * To smoothly support upgrades from version 1.0 of this extension > + * transparently handle the (non-)existence of the pinning_backends > + * column. We unfortunately have to get the result type for that... - > + * we can't use the result type determined by the function definition > + * without potentially crashing when somebody uses the old (or even > + * wrong) function definition though. 
> + */ > + if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) > + elog(ERROR, "return type must be a row type"); > + Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view? > + > + if (firstNumaTouch) > + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); Over the patchseries the related code is duplicated. Seems like it'd be good to put it into pg_numa instead? This seems like the thing that's good to abstract away in one central spot. > + /* > + * Scan through all the buffers, saving the relevant fields in the > + * fctx->record structure. > + * > + * We don't hold the partition locks, so we don't get a consistent > + * snapshot across all buffers, but we do grab the buffer header > + * locks, so the information of each buffer is self-consistent. > + * > + * This loop touches and stores addresses into os_page_ptrs[] as input > + * to one big big move_pages(2) inquiry system call. Basically we ask > + * for all memory pages for NBuffers. > + */ > + idx = 0; > + for (i = 0; i < NBuffers; i++) > + { > + BufferDesc *bufHdr; > + uint32 buf_state; > + uint32 bufferid; > + > + CHECK_FOR_INTERRUPTS(); > + > + bufHdr = GetBufferDescriptor(i); > + > + /* Lock each buffer header before inspecting. */ > + buf_state = LockBufHdr(bufHdr); > + bufferid = BufferDescriptorGetBuffer(bufHdr); > + > + UnlockBufHdr(bufHdr, buf_state); Given that the only thing you're getting here is the buffer id, it's probably not worth getting the spinlock, a single 4 byte read is always atomic on our platforms. > + /* > + * If we have multiple OS pages per buffer, fill those in too. We > + * always want at least one OS page, even if there are multiple > + * buffers per page. > + * > + * Altough we could query just once per each OS page, we do it > + * repeatably for each Buffer and hit the same address as > + * move_pages(2) requires page aligment. This also simplifies > + * retrieval code later on. 
Also NBuffers starts from 1. > + */ > + for (j = 0; j < Max(1, pages_per_buffer); j++) > + { > + char *buffptr = (char *) BufferGetBlock(i + 1); > + > + fctx->record[idx].bufferid = bufferid; > + fctx->record[idx].numa_page = j; > + > + os_page_ptrs[idx] > + = (char *) TYPEALIGN(os_page_size, > + buffptr + (os_page_size * j)); FWIW, this bit here is the most expensive part of the function itself, as the compiler has no choice than to implement it as an actual division, as os_page_size is runtime variable. It'd be fine to leave it like that, the call to numa_move_pages() is way more expensive. But it shouldn't be too hard to do that alignment once, rather than having to do it over and over. FWIW, neither this definition of numa_page, nor the one from "adjust page_num" works quite right for me. This definition afaict is always 0 when using huge pages and just 0 and 1 for 4k pages. But my understanding of numa_page is that it's the "id" of the numa pages, which isn't that? With "adjust page_num" I get a number that starts at -1 and then increments from there. More correct, but doesn't quite seem right either. > + <tbody> > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>bufferid</structfield> <type>integer</type> > + </para> > + <para> > + ID, in the range 1..<varname>shared_buffers</varname> > + </para></entry> > + </row> > + > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>page_num</structfield> <type>int</type> > + </para> > + <para> > + number of OS memory page for this buffer > + </para></entry> > + </row> "page_num" doesn't really seem very descriptive for "number of OS memory page for this buffer". To me "number of" sounds like it's counting the number of associated pages, but it's really just a "page id" or something like that. Maybe rename page_num to "os_page_id" and rephrase the comment a bit? 
> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001 > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > Date: Fri, 21 Feb 2025 14:20:18 +0100 > Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view > > Introduce new pg_shmem_alloctions_numa view with information about how > shared memory is distributed across NUMA nodes. Nice. > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > Reviewed-by: Andres Freund <andres@anarazel.de> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > Reviewed-by: Tomas Vondra <tomas@vondra.me> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > --- > doc/src/sgml/system-views.sgml | 79 ++++++++++++ > src/backend/catalog/system_views.sql | 8 ++ > src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++ > src/include/catalog/pg_proc.dat | 8 ++ > src/test/regress/expected/numa.out | 12 ++ > src/test/regress/expected/numa_1.out | 3 + > src/test/regress/expected/privileges.out | 16 ++- > src/test/regress/expected/rules.out | 4 + > src/test/regress/parallel_schedule | 2 +- > src/test/regress/sql/numa.sql | 9 ++ > src/test/regress/sql/privileges.sql | 6 +- > 11 files changed, 294 insertions(+), 5 deletions(-) > create mode 100644 src/test/regress/expected/numa.out > create mode 100644 src/test/regress/expected/numa_1.out > create mode 100644 src/test/regress/sql/numa.sql > > diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml > index 4f336ee0adf..a83365ae24a 100644 > --- a/doc/src/sgml/system-views.sgml > +++ b/doc/src/sgml/system-views.sgml > @@ -181,6 +181,11 @@ > <entry>shared memory allocations</entry> > </row> > > + <row> > + <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry> > + <entry>NUMA node mappings for shared memory allocations</entry> > + </row> > + > <row> > <entry><link 
linkend="view-pg-stats"><structname>pg_stats</structname></link></entry> > <entry>planner statistics</entry> > @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx > </para> > </sect1> > > + <sect1 id="view-pg-shmem-allocations-numa"> > + <title><structname>pg_shmem_allocations_numa</structname></title> > + > + <indexterm zone="view-pg-shmem-allocations-numa"> > + <primary>pg_shmem_allocations_numa</primary> > + </indexterm> > + > + <para> > + The <structname>pg_shmem_allocations_numa</structname> shows how shared > + memory allocations in the server's main shared memory segment are distributed > + across NUMA nodes. This includes both memory allocated by > + <productname>PostgreSQL</productname> itself and memory allocated > + by extensions using the mechanisms detailed in > + <xref linkend="xfunc-shared-addin" />. > + </para> I think it'd be good to describe that the view will include multiple rows for each name if spread across multiple numa nodes. Perhaps also that querying this view is expensive and that pg_shmem_allocations should be used if numa information isn't needed? > + /* > + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while > + * the OS may have different memory page sizes. > + * > + * To correctly map between them, we need to: 1. Determine the OS memory > + * page size 2. Calculate how many OS pages are used by all buffer blocks > + * 3. Calculate how many OS pages are contained within each database > + * block. > + * > + * This information is needed before calling move_pages() for NUMA memory > + * node inquiry. > + */ > + os_page_size = pg_numa_get_pagesize(); > + > + /* > + * Allocate memory for page pointers and status based on total shared > + * memory size. This simplified approach allocates enough space for all > + * pages in shared memory rather than calculating the exact requirements > + * for each segment. > + * > + * XXX Isn't this wasteful? 
But there probably is one large segment of > + * shared memory, much larger than the rest anyway. > + */ > + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; > + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); > + pages_status = palloc(sizeof(int) * shm_total_page_count); There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), could more be moved to a shared helper function? > + hash_seq_init(&hstat, ShmemIndex); > + > + /* output all allocated entries */ > + memset(nulls, 0, sizeof(nulls)); > + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) > + { One thing that's interesting with using ShmemIndex is that this won't get anonymous allocations. I think that's ok for now, it'd be complicated to figure out the unmapped "regions". But I guess it'd be good to mention it in the docs? > + int i; > + > + /* XXX I assume we use TYPEALIGN as a way to round to whole pages. > + * It's a bit misleading to call that "aligned", no? */ > + > + /* Get number of OS aligned pages */ > + shm_ent_page_count > + = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size; > + > + /* > + * If we get ever 0xff back from kernel inquiry, then we probably have > + * bug in our buffers to OS page mapping code here. > + */ > + memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count); There's obviously no guarantee that ent->allocated_size is a multiple of os_page_size. I think it'd be interesting to show in the view when one shmem allocation shares a page with the prior allocation - that can contribute a bit to contention. What about showing a start_os_page_id and end_os_page_id or something? That could be a feature for later though.
> +SELECT NOT(pg_numa_available()) AS skip_test \gset > +\if :skip_test > +\quit > +\endif > +-- switch to superuser > +\c - > +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa; > + ok > +---- > + t > +(1 row) Could it be worthwhile to run the test if !pg_numa_available(), to test that we do the right thing in that case? We need an alternative output anyway, so that might be fine? Greetings, Andres Freund
Hi, On 2025-04-05 18:29:22 -0400, Andres Freund wrote: > I think one thing that the docs should mention is that calling the numa > functions/views will force the pages to be allocated, even if they're > currently unused. > > Newly started server, with s_b of 32GB an 2MB huge pages: > > grep ^Huge /proc/meminfo > HugePages_Total: 34802 > HugePages_Free: 34448 > HugePages_Rsvd: 16437 > HugePages_Surp: 0 > Hugepagesize: 2048 kB > Hugetlb: 76517376 kB > > run > SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; > > Now the pages that previously were marked as reserved are actually allocated: > > grep ^Huge /proc/meminfo > HugePages_Total: 34802 > HugePages_Free: 18012 > HugePages_Rsvd: 1 > HugePages_Surp: 0 > Hugepagesize: 2048 kB > Hugetlb: 76517376 kB > > > I don't see how we can avoid that right now, but at the very least we ought to > document it. The only allocation where that really matters is shared_buffers. I wonder if we could special case the logic for that, by only probing if at least one of the buffers in the range is valid. Then we could treat a page status of -ENOENT as "page is not mapped" and display NULL for the node_id? Of course that would mean that we'd always need to pg_numa_touch_mem_if_required(), not just the first time round, because we previously might not have for a page that is now valid. But compared to the cost of actually allocating pages, the cost for that seems small. Greetings, Andres Freund
On 4/6/25 00:29, Andres Freund wrote: > Hi, > > I just played around with this for a bit. As noted somewhere further down, > pg_buffercache_numa.page_num ends up wonky in different ways for the different > pages. > > I think one thing that the docs should mention is that calling the numa > functions/views will force the pages to be allocated, even if they're > currently unused. > > Newly started server, with s_b of 32GB an 2MB huge pages: > > grep ^Huge /proc/meminfo > HugePages_Total: 34802 > HugePages_Free: 34448 > HugePages_Rsvd: 16437 > HugePages_Surp: 0 > Hugepagesize: 2048 kB > Hugetlb: 76517376 kB > > run > SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; > > Now the pages that previously were marked as reserved are actually allocated: > > grep ^Huge /proc/meminfo > HugePages_Total: 34802 > HugePages_Free: 18012 > HugePages_Rsvd: 1 > HugePages_Surp: 0 > Hugepagesize: 2048 kB > Hugetlb: 76517376 kB > > > I don't see how we can avoid that right now, but at the very least we ought to > document it. > +1 to documenting this > > On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote: >> The libnuma library is not available on 32-bit builds (there's no shared >> object for i386), so we disable it in that case. The i386 is very memory >> limited anyway, even with PAE, so NUMA is mostly irrelevant. > > Hm? libnuma1:i386 installs just fine to me on debian and contains the shared > library. > > >> +############################################################### >> +# Library: libnuma >> +############################################################### >> + >> +libnumaopt = get_option('libnuma') >> +if not libnumaopt.disabled() >> + # via pkg-config >> + libnuma = dependency('numa', required: libnumaopt) >> + if not libnuma.found() >> + libnuma = cc.find_library('numa', required: libnumaopt) >> + endif > > This fallback isn't going to work if -dlibnuma=enabled is used, as > dependency() will error out, due to not finding a required dependency. 
You'd > need to use required: false there. > > Do we actually need a non-dependency() fallback here? It's linux only and a > new dependency, so just requiring a .pc file seems like it'd be fine? > No idea. > >> +#ifdef USE_LIBNUMA >> + >> +/* >> + * This is required on Linux, before pg_numa_query_pages() as we >> + * need to page-fault before move_pages(2) syscall returns valid results. >> + */ >> +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \ >> + ro_volatile_var = *(uint64 *) ptr > > Does it really work that way? A volatile variable to assign the result of > dereferencing ptr ensures that the *store* isn't removed by the compiler, but > it doesn't at all guarantee that the *load* isn't removed, since that memory > isn't marked as volatile. > > I think you'd need to cast the source pointer to a volatile uint64* to ensure > the load happens. > > >> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ >> + >> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION >> +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit >> + >> +-- Register the new functions. >> +CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages() >> +RETURNS SETOF RECORD >> +AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages' >> +LANGUAGE C PARALLEL SAFE; >> + >> +-- Create a view for convenient access. >> +CREATE OR REPLACE VIEW pg_buffercache_numa AS >> + SELECT P.* FROM pg_buffercache_numa_pages() AS P >> + (bufferid integer, page_num int4, node_id int4); > > > Why CREATE OR REPLACE? > I think this is simply due to copy-pasting the code, a plain CREATE would be enough here. > >> +Datum >> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) >> +{ >> ... >> + >> + /* >> + * To smoothly support upgrades from version 1.0 of this extension >> + * transparently handle the (non-)existence of the pinning_backends >> + * column. We unfortunately have to get the result type for that... 
- >> + * we can't use the result type determined by the function definition >> + * without potentially crashing when somebody uses the old (or even >> + * wrong) function definition though. >> + */ >> + if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) >> + elog(ERROR, "return type must be a row type"); >> + > > Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view? > Yes, good catch. > >> + >> + if (firstNumaTouch) >> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); > > Over the patchseries the related code is duplicated. Seems like it'd be good > to put it into pg_numa instead? This seems like the thing that's good to > abstract away in one central spot. > Abstract away which part, exactly? I thought about moving some of the code to port/pg_numa.c, but it didn't seem worth it. > >> + /* >> + * Scan through all the buffers, saving the relevant fields in the >> + * fctx->record structure. >> + * >> + * We don't hold the partition locks, so we don't get a consistent >> + * snapshot across all buffers, but we do grab the buffer header >> + * locks, so the information of each buffer is self-consistent. >> + * >> + * This loop touches and stores addresses into os_page_ptrs[] as input >> + * to one big big move_pages(2) inquiry system call. Basically we ask >> + * for all memory pages for NBuffers. >> + */ >> + idx = 0; >> + for (i = 0; i < NBuffers; i++) >> + { >> + BufferDesc *bufHdr; >> + uint32 buf_state; >> + uint32 bufferid; >> + >> + CHECK_FOR_INTERRUPTS(); >> + >> + bufHdr = GetBufferDescriptor(i); >> + >> + /* Lock each buffer header before inspecting. */ >> + buf_state = LockBufHdr(bufHdr); >> + bufferid = BufferDescriptorGetBuffer(bufHdr); >> + >> + UnlockBufHdr(bufHdr, buf_state); > > Given that the only thing you're getting here is the buffer id, it's probably > not worth getting the spinlock, a single 4 byte read is always atomic on our > platforms. > Fine with me. 
I don't expect this to make a measurable difference so I kept the spinlock, but if we want to remove it, I won't object. > >> + /* >> + * If we have multiple OS pages per buffer, fill those in too. We >> + * always want at least one OS page, even if there are multiple >> + * buffers per page. >> + * >> + * Altough we could query just once per each OS page, we do it >> + * repeatably for each Buffer and hit the same address as >> + * move_pages(2) requires page aligment. This also simplifies >> + * retrieval code later on. Also NBuffers starts from 1. >> + */ >> + for (j = 0; j < Max(1, pages_per_buffer); j++) >> + { >> + char *buffptr = (char *) BufferGetBlock(i + 1); >> + >> + fctx->record[idx].bufferid = bufferid; >> + fctx->record[idx].numa_page = j; >> + >> + os_page_ptrs[idx] >> + = (char *) TYPEALIGN(os_page_size, >> + buffptr + (os_page_size * j)); > > FWIW, this bit here is the most expensive part of the function itself, as the > compiler has no choice than to implement it as an actual division, as > os_page_size is runtime variable. > > It'd be fine to leave it like that, the call to numa_move_pages() is way more > expensive. But it shouldn't be too hard to do that alignment once, rather than > having to do it over and over. > Division? It's entirely possible I'm missing something obvious, but I don't see any divisions in this code. You're right, however, that we could get rid of most of this, because we could get the buffer pointer once (it's a bit silly we get it for each page), align that, and then simply add the page size. Something like this: /* align to start of OS page, determine pointer to end of buffer */ char *buffptr = (char *) BufferGetBlock(i + 1); char *ptr = buffptr - ((uintptr_t) buffptr % os_page_size); char *endptr = buffptr + BLCKSZ; while (ptr < endptr) { os_page_ptrs[idx] = ptr; ... ptr += os_page_size; } This also made me think a bit more about how the blocks and pages might align / overlap.
AFAIK the buffers are aligned to PG_IO_ALIGN_SIZE, which on x86 is 4K, i.e. the same as the OS page size. But let's say the OS page size is larger, say 1MB. AFAIK it could happen that a buffer spans multiple larger memory pages. For example, an 8K buffer could start 4K before the 1MB page boundary, and use 4K from the next memory page. This would mean the current formulas for buffer_per_page and pages_per_buffer can be off by 1. This would complicate calculating os_page_count a bit, because only some buffers would actually need the +1 (in the array / view output). Or what do I miss? Is there something that guarantees this won't happen? > > FWIW, neither this definition of numa_page, nor the one from "adjust page_num" > works quite right for me. > > This definition afaict is always 0 when using huge pages and just 0 and 1 for > 4k pages. But my understanding of numa_page is that it's the "id" of the numa > pages, which isn't that? > > With "adjust page_num" I get a number that starts at -1 and then increments > from there. More correct, but doesn't quite seem right either. > > >> + <tbody> >> + <row> >> + <entry role="catalog_table_entry"><para role="column_definition"> >> + <structfield>bufferid</structfield> <type>integer</type> >> + </para> >> + <para> >> + ID, in the range 1..<varname>shared_buffers</varname> >> + </para></entry> >> + </row> >> + >> + <row> >> + <entry role="catalog_table_entry"><para role="column_definition"> >> + <structfield>page_num</structfield> <type>int</type> >> + </para> >> + <para> >> + number of OS memory page for this buffer >> + </para></entry> >> + </row> > > "page_num" doesn't really seem very descriptive for "number of OS memory page > for this buffer". To me "number of" sounds like it's counting the number of > associated pages, but it's really just a "page id" or something like that. > > Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> Yeah, I haven't updated the docs in 0003 when adjusting the page_num definition. It was more an experiment to see if others like this change. > > > >> From 03d24af540f8235ad9ca9537db0a1ba5dbcf6ccb Mon Sep 17 00:00:00 2001 >> From: Jakub Wartak <jakub.wartak@enterprisedb.com> >> Date: Fri, 21 Feb 2025 14:20:18 +0100 >> Subject: [PATCH v25 4/5] Introduce pg_shmem_allocations_numa view >> >> Introduce new pg_shmem_alloctions_numa view with information about how >> shared memory is distributed across NUMA nodes. > > Nice. > > >> Author: Jakub Wartak <jakub.wartak@enterprisedb.com> >> Reviewed-by: Andres Freund <andres@anarazel.de> >> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> >> Reviewed-by: Tomas Vondra <tomas@vondra.me> >> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com >> --- >> doc/src/sgml/system-views.sgml | 79 ++++++++++++ >> src/backend/catalog/system_views.sql | 8 ++ >> src/backend/storage/ipc/shmem.c | 152 +++++++++++++++++++++++ >> src/include/catalog/pg_proc.dat | 8 ++ >> src/test/regress/expected/numa.out | 12 ++ >> src/test/regress/expected/numa_1.out | 3 + >> src/test/regress/expected/privileges.out | 16 ++- >> src/test/regress/expected/rules.out | 4 + >> src/test/regress/parallel_schedule | 2 +- >> src/test/regress/sql/numa.sql | 9 ++ >> src/test/regress/sql/privileges.sql | 6 +- >> 11 files changed, 294 insertions(+), 5 deletions(-) >> create mode 100644 src/test/regress/expected/numa.out >> create mode 100644 src/test/regress/expected/numa_1.out >> create mode 100644 src/test/regress/sql/numa.sql >> >> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml >> index 4f336ee0adf..a83365ae24a 100644 >> --- a/doc/src/sgml/system-views.sgml >> +++ b/doc/src/sgml/system-views.sgml >> @@ -181,6 +181,11 @@ >> <entry>shared memory allocations</entry> >> </row> >> >> + <row> >> + <entry><link 
linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry> >> + <entry>NUMA node mappings for shared memory allocations</entry> >> + </row> >> + >> <row> >> <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry> >> <entry>planner statistics</entry> >> @@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx >> </para> >> </sect1> >> >> + <sect1 id="view-pg-shmem-allocations-numa"> >> + <title><structname>pg_shmem_allocations_numa</structname></title> >> + >> + <indexterm zone="view-pg-shmem-allocations-numa"> >> + <primary>pg_shmem_allocations_numa</primary> >> + </indexterm> >> + >> + <para> >> + The <structname>pg_shmem_allocations_numa</structname> shows how shared >> + memory allocations in the server's main shared memory segment are distributed >> + across NUMA nodes. This includes both memory allocated by >> + <productname>PostgreSQL</productname> itself and memory allocated >> + by extensions using the mechanisms detailed in >> + <xref linkend="xfunc-shared-addin" />. >> + </para> > > I think it'd be good to describe that the view will include multiple rows for > each name if spread across multiple numa nodes. > > Perhaps also that querying this view is expensive and that > pg_shmem_allocations should be used if numa information isn't needed? > > >> + /* >> + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while >> + * the OS may have different memory page sizes. >> + * >> + * To correctly map between them, we need to: 1. Determine the OS memory >> + * page size 2. Calculate how many OS pages are used by all buffer blocks >> + * 3. Calculate how many OS pages are contained within each database >> + * block. >> + * >> + * This information is needed before calling move_pages() for NUMA memory >> + * node inquiry. 
>> + */ >> + os_page_size = pg_numa_get_pagesize(); >> + >> + /* >> + * Allocate memory for page pointers and status based on total shared >> + * memory size. This simplified approach allocates enough space for all >> + * pages in shared memory rather than calculating the exact requirements >> + * for each segment. >> + * >> + * XXX Isn't this wasteful? But there probably is one large segment of >> + * shared memory, much larger than the rest anyway. >> + */ >> + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; >> + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); >> + pages_status = palloc(sizeof(int) * shm_total_page_count); > > There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), > could more be moved to a shared helper function? > > >> + hash_seq_init(&hstat, ShmemIndex); >> + >> + /* output all allocated entries */ >> + memset(nulls, 0, sizeof(nulls)); >> + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) >> + { > > One thing that's interesting with using ShmemIndex is that this won't get > anonymous allocations. I think that's ok for now, it'd be complicated to > figure out the unmapped "regions". But I guess it' be good to mention it in > the ocs? > Agreed. > > > >> + int i; >> + >> + /* XXX I assume we use TYPEALIGN as a way to round to whole pages. >> + * It's a bit misleading to call that "aligned", no? */ >> + >> + /* Get number of OS aligned pages */ >> + shm_ent_page_count >> + = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size; >> + >> + /* >> + * If we get ever 0xff back from kernel inquiry, then we probably have >> + * bug in our buffers to OS page mapping code here. >> + */ >> + memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count); > > There's obviously no guarantee that shm_ent_page_count is a multiple of > os_page_size. 
I think it'd be interesting to show in the view when one shmem > allocation shares a page with the prior allocation - that can contribute a bit > to contention. What about showing a start_os_page_id and end_os_page_id or > something? That could be a feature for later though. > Yeah, adding first/last page might be interesting. > >> +SELECT NOT(pg_numa_available()) AS skip_test \gset >> +\if :skip_test >> +\quit >> +\endif >> +-- switch to superuser >> +\c - >> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa; >> + ok >> +---- >> + t >> +(1 row) > > Could it be worthwhile to run the test if !pg_numa_available(), to test that > we do the right thing in that case? We need an alternative output anyway, so > that might be fine? > +1 regards -- Tomas Vondra
On 4/6/25 01:00, Andres Freund wrote: > Hi, > > On 2025-04-05 18:29:22 -0400, Andres Freund wrote: >> I think one thing that the docs should mention is that calling the numa >> functions/views will force the pages to be allocated, even if they're >> currently unused. >> >> Newly started server, with s_b of 32GB an 2MB huge pages: >> >> grep ^Huge /proc/meminfo >> HugePages_Total: 34802 >> HugePages_Free: 34448 >> HugePages_Rsvd: 16437 >> HugePages_Surp: 0 >> Hugepagesize: 2048 kB >> Hugetlb: 76517376 kB >> >> run >> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; >> >> Now the pages that previously were marked as reserved are actually allocated: >> >> grep ^Huge /proc/meminfo >> HugePages_Total: 34802 >> HugePages_Free: 18012 >> HugePages_Rsvd: 1 >> HugePages_Surp: 0 >> Hugepagesize: 2048 kB >> Hugetlb: 76517376 kB >> >> >> I don't see how we can avoid that right now, but at the very least we ought to >> document it. > > The only allocation where that really matters is shared_buffers. I wonder if > we could special case the logic for that, by only probing if at least one of > the buffers in the range is valid. > > Then we could treat a page status of -ENOENT as "page is not mapped" and > display NULL for the node_id? > > Of course that would mean that we'd always need to > pg_numa_touch_mem_if_required(), not just the first time round, because we > previously might not have for a page that is now valid. But compared to the > cost of actually allocating pages, the cost for that seems small. > I don't think this would be a good trade off. The buffers already have a NUMA node, and users would be interested in that. It's just that we don't have the memory mapped in the current backend, so I'd bet people would not be happy with NULL, and would proceed to force the allocation in some other way (say, a large query of some sort). Which obviously causes a lot of other problems. 
I can imagine having a flag that makes the allocation optional, but there's no convenient way to pass that to a view, and I think most people want the allocation anyway. Especially for monitoring purposes, which usually happens in a new connection, so the backend has little opportunity to allocate the pages "naturally." regards -- Tomas Vondra
On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, Hi Andres/Tomas, I've noticed that Tomas responded to this while I was writing this, so I'm attaching git-am patches based on his v25 (no squash) and there's only one new one (the last one contains fixes based on this review) + a slight commit amendment to 0001. > I just played around with this for a bit. As noted somewhere further down, > pg_buffercache_numa.page_num ends up wonky in different ways for the different > pages. I think page_num is still heavily work in progress... I'm still not sure whether it's worth exposing this (is it worth the hassle?). If we scratch it, it won't be perfect, but we have everything else; otherwise we risk this feature, as we are going into feature freeze literally tomorrow. > I think one thing that the docs should mention is that calling the numa > functions/views will force the pages to be allocated, even if they're > currently unused. [..] > I don't see how we can avoid that right now, but at the very least we ought to > document it. Added a statement about this. > On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote: > > The libnuma library is not available on 32-bit builds (there's no shared > > object for i386), so we disable it in that case. The i386 is very memory > > limited anyway, even with PAE, so NUMA is mostly irrelevant. > > Hm? libnuma1:i386 installs just fine to me on debian and contains the shared > library. OK, removed from the commit message, as Google confirms it really exists (somehow I couldn't find it back then, at least here)...
> > +############################################################### > > +# Library: libnuma > > +############################################################### > > + > > +libnumaopt = get_option('libnuma') > > +if not libnumaopt.disabled() > > + # via pkg-config > > + libnuma = dependency('numa', required: libnumaopt) > > + if not libnuma.found() > > + libnuma = cc.find_library('numa', required: libnumaopt) > > + endif > > This fallback isn't going to work if -dlibnuma=enabled is used, as > dependency() will error out, due to not finding a required dependency. You'd > need to use required: false there. > > Do we actually need a non-dependency() fallback here? It's linux only and a > new dependency, so just requiring a .pc file seems like it'd be fine? I'm not sure pkg-config is present everywhere, but I'm no expert and AFAIR we are not consistent in how various libs are handled there. It's quite late, but for now I've just followed your recommendation for dependency() with required: false. > > +#ifdef USE_LIBNUMA > > + > > +/* > > + * This is required on Linux, before pg_numa_query_pages() as we > > + * need to page-fault before move_pages(2) syscall returns valid results. > > + */ > > +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \ > > + ro_volatile_var = *(uint64 *) ptr > > Does it really work that way? A volatile variable to assign the result of > dereferencing ptr ensures that the *store* isn't removed by the compiler, but > it doesn't at all guarantee that the *load* isn't removed, since that memory > isn't marked as volatile. > > I think you'd need to cast the source pointer to a volatile uint64* to ensure > the load happens. OK, thanks, good finding, I was not aware the compiler could bypass it like that. [..] > > +CREATE OR REPLACE VIEW pg_buffercache_numa AS > > Why CREATE OR REPLACE? Fixed. > > +Datum > > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > > +{ > > ...
> > + > > + /* > > + * To smoothly support upgrades from version 1.0 of this extension > > + * transparently handle the (non-)existence of the pinning_backends > > + * column. We unfortunately have to get the result type for that... - > > + * we can't use the result type determined by the function definition > > + * without potentially crashing when somebody uses the old (or even > > + * wrong) function definition though. > > + */ > > + if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) > > + elog(ERROR, "return type must be a row type"); > > + > > Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view? Removed. > > + > > + if (firstNumaTouch) > > + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); > > Over the patchseries the related code is duplicated. Seems like it'd be good > to put it into pg_numa instead? This seems like the thing that's good to > abstract away in one central spot. I hear you, but we are using those statistics for per-localized-code (shm.c's firstTouch != pgbuffercache.c's firstTouch). > > + /* > > + * Scan through all the buffers, saving the relevant fields in the > > + * fctx->record structure. > > + * > > + * We don't hold the partition locks, so we don't get a consistent > > + * snapshot across all buffers, but we do grab the buffer header > > + * locks, so the information of each buffer is self-consistent. > > + * > > + * This loop touches and stores addresses into os_page_ptrs[] as input > > + * to one big big move_pages(2) inquiry system call. Basically we ask > > + * for all memory pages for NBuffers. > > + */ > > + idx = 0; > > + for (i = 0; i < NBuffers; i++) > > + { > > + BufferDesc *bufHdr; > > + uint32 buf_state; > > + uint32 bufferid; > > + > > + CHECK_FOR_INTERRUPTS(); > > + > > + bufHdr = GetBufferDescriptor(i); > > + > > + /* Lock each buffer header before inspecting. 
*/ > > + buf_state = LockBufHdr(bufHdr); > > + bufferid = BufferDescriptorGetBuffer(bufHdr); > > + > > + UnlockBufHdr(bufHdr, buf_state); > > Given that the only thing you're getting here is the buffer id, it's probably > not worth getting the spinlock, a single 4 byte read is always atomic on our > platforms. Well, I think this is just copy&pasted from the original function, so we follow the pattern for consistency. > > + /* > > + * If we have multiple OS pages per buffer, fill those in too. We > > + * always want at least one OS page, even if there are multiple > > + * buffers per page. > > + * > > + * Altough we could query just once per each OS page, we do it > > + * repeatably for each Buffer and hit the same address as > > + * move_pages(2) requires page aligment. This also simplifies > > + * retrieval code later on. Also NBuffers starts from 1. > > + */ > > + for (j = 0; j < Max(1, pages_per_buffer); j++) > > + { > > + char *buffptr = (char *) BufferGetBlock(i + 1); > > + > > + fctx->record[idx].bufferid = bufferid; > > + fctx->record[idx].numa_page = j; > > + > > + os_page_ptrs[idx] > > + = (char *) TYPEALIGN(os_page_size, > > + buffptr + (os_page_size * j)); > > FWIW, this bit here is the most expensive part of the function itself, as the > compiler has no choice than to implement it as an actual division, as > os_page_size is runtime variable. > > It'd be fine to leave it like that, the call to numa_move_pages() is way more > expensive. But it shouldn't be too hard to do that alignment once, rather than > having to do it over and over. TBH, I don't think we should spend a lot of time optimizing; after all, it's a debugging view (at the start I was actually considering making it a developer-only compile-time option, but with the shm view it is actually usable for others too and well... we want to have it as a foundation for real NUMA optimizations) > FWIW, neither this definition of numa_page, nor the one from "adjust page_num" > works quite right for me.
> > This definition afaict is always 0 when using huge pages and just 0 and 1 for > 4k pages. But my understanding of numa_page is that it's the "id" of the numa > pages, which isn't that? > > With "adjust page_num" I get a number that starts at -1 and then increments > from there. More correct, but doesn't quite seem right either. Apparently handling the edge cases of split buffers was a Pandora's box ;) Tomas, what do we do about it? Does it have a chance to get in before the freeze? > > + <tbody> > > + <row> > > + <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>bufferid</structfield> <type>integer</type> > > + </para> > > + <para> > > + ID, in the range 1..<varname>shared_buffers</varname> > > + </para></entry> > > + </row> > > + > > + <row> > > + <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>page_num</structfield> <type>int</type> > > + </para> > > + <para> > > + number of OS memory page for this buffer > > + </para></entry> > > + </row> > > "page_num" doesn't really seem very descriptive for "number of OS memory page > for this buffer". To me "number of" sounds like it's counting the number of > associated pages, but it's really just a "page id" or something like that. > > Maybe rename page_num to "os_page_id" and rephrase the comment a bit? Tomas, are you good with the rename? I think I would also prefer os_page_id. > > diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml [..] > > + <para> > > + The <structname>pg_shmem_allocations_numa</structname> shows how shared > > + memory allocations in the server's main shared memory segment are distributed > > + across NUMA nodes. This includes both memory allocated by > > + <productname>PostgreSQL</productname> itself and memory allocated > > + by extensions using the mechanisms detailed in > > + <xref linkend="xfunc-shared-addin" />.
> > + </para> > > I think it'd be good to describe that the view will include multiple rows for > each name if spread across multiple numa nodes. Added. > Perhaps also that querying this view is expensive and that > pg_shmem_allocations should be used if numa information isn't needed? Already covered by 1st finding fix. > > + /* > > + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while > > + * the OS may have different memory page sizes. > > + * > > + * To correctly map between them, we need to: 1. Determine the OS memory > > + * page size 2. Calculate how many OS pages are used by all buffer blocks > > + * 3. Calculate how many OS pages are contained within each database > > + * block. > > + * > > + * This information is needed before calling move_pages() for NUMA memory > > + * node inquiry. > > + */ > > + os_page_size = pg_numa_get_pagesize(); > > + > > + /* > > + * Allocate memory for page pointers and status based on total shared > > + * memory size. This simplified approach allocates enough space for all > > + * pages in shared memory rather than calculating the exact requirements > > + * for each segment. > > + * > > + * XXX Isn't this wasteful? But there probably is one large segment of > > + * shared memory, much larger than the rest anyway. > > + */ > > + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; > > + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); > > + pages_status = palloc(sizeof(int) * shm_total_page_count); > > There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), > could more be moved to a shared helper function? -> Tomas? > > + hash_seq_init(&hstat, ShmemIndex); > > + > > + /* output all allocated entries */ > > + memset(nulls, 0, sizeof(nulls)); > > + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) > > + { > > One thing that's interesting with using ShmemIndex is that this won't get > anonymous allocations. 
I think that's ok for now, it'd be complicated to > figure out the unmapped "regions". But I guess it' be good to mention it in > the ocs? Added. > > + int i; > > + > > + /* XXX I assume we use TYPEALIGN as a way to round to whole pages. > > + * It's a bit misleading to call that "aligned", no? */ > > + > > + /* Get number of OS aligned pages */ > > + shm_ent_page_count > > + = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size; > > + > > + /* > > + * If we get ever 0xff back from kernel inquiry, then we probably have > > + * bug in our buffers to OS page mapping code here. > > + */ > > + memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count); > > There's obviously no guarantee that shm_ent_page_count is a multiple of > os_page_size. I think it'd be interesting to show in the view when one shmem > allocation shares a page with the prior allocation - that can contribute a bit > to contention. What about showing a start_os_page_id and end_os_page_id or > something? That could be a feature for later though. I was thinking about it, but it could be done when analyzing this together with data from pg_shmem_allocations(?) My worry is timing :( Anyway, we could extend this view in future revisions. > > +SELECT NOT(pg_numa_available()) AS skip_test \gset > > +\if :skip_test > > +\quit > > +\endif > > +-- switch to superuser > > +\c - > > +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa; > > + ok > > +---- > > + t > > +(1 row) > > Could it be worthwhile to run the test if !pg_numa_available(), to test that > we do the right thing in that case? We need an alternative output anyway, so > that might be fine? Added. the meson test passes, but I'm sending it as fast as possible to avoid a clash with Tomas. -J.
On 4/6/25 14:57, Jakub Wartak wrote: > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote: >> >> Hi, > > Hi Andres/Tomas, > > I've noticed that Tomas responded to this while writing this, so I'm > attaching git-am patches based on his v25 (no squash) and there's only > one new (last one contains fixes based on this review) + slight commit > amendment to 0001. > I'm not working on this at the moment. I may have a bit of time in the evening, but more likely I'll get back to this on Monday. >> I just played around with this for a bit. As noted somewhere further down, >> pg_buffercache_numa.page_num ends up wonky in different ways for the different >> pages. > > I think page_num is under heavy work in progress... I'm still not sure > is it worth exposing this (is it worth the hassle). If we scratch it > it won't be perfect, but we have everything , otherwise we risk this > feature as we are going into a feature freeze literally tomorrow. > IMHO it's not difficult to change the definition of page_num this way, it's pretty much a one-line change. It's more a question of whether we actually want to expose this. [snip] >>> + >>> + if (firstNumaTouch) >>> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); >> >> Over the patchseries the related code is duplicated. Seems like it'd be good >> to put it into pg_numa instead? This seems like the thing that's good to >> abstract away in one central spot. > > I hear you, but we are using those statistics for per-localized-code > (shm.c's firstTouch != pgbuffercache.c's firstTouch). > Yeah, I don't think moving this is quite possible / useful. We could pass the flag somewhere, but we still need to check & update it in local code. >>> + /* >>> + * If we have multiple OS pages per buffer, fill those in too. We >>> + * always want at least one OS page, even if there are multiple >>> + * buffers per page.
>>> + * >>> + * Altough we could query just once per each OS page, we do it >>> + * repeatably for each Buffer and hit the same address as >>> + * move_pages(2) requires page aligment. This also simplifies >>> + * retrieval code later on. Also NBuffers starts from 1. >>> + */ >>> + for (j = 0; j < Max(1, pages_per_buffer); j++) >>> + { >>> + char *buffptr = (char *) BufferGetBlock(i + 1); >>> + >>> + fctx->record[idx].bufferid = bufferid; >>> + fctx->record[idx].numa_page = j; >>> + >>> + os_page_ptrs[idx] >>> + = (char *) TYPEALIGN(os_page_size, >>> + buffptr + (os_page_size * j)); >> >> FWIW, this bit here is the most expensive part of the function itself, as the >> compiler has no choice than to implement it as an actual division, as >> os_page_size is runtime variable. >> >> It'd be fine to leave it like that, the call to numa_move_pages() is way more >> expensive. But it shouldn't be too hard to do that alignment once, rather than >> having to do it over and over. > > TBH, I don't think we should spend lot of time optimizing, after all > it's debugging view (at the start I was actually considering putting > it as developer-only compile time option, but with shm view it is > actually usuable for others too and well... we want to have it as > foundation for real NUMA optimizations) > I agree with this. >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num" >> works quite right for me. >> >> This definition afaict is always 0 when using huge pages and just 0 and 1 for >> 4k pages. But my understanding of numa_page is that it's the "id" of the numa >> pages, which isn't that? >> >> With "adjust page_num" I get a number that starts at -1 and then increments >> from there. More correct, but doesn't quite seem right either. > > Apparently handling this special case of splitted buffers edge-cases > was Pandora box ;) Tomas what do we do about it? Does that has chance > to get in before freeze? 
> I don't think the split buffers are pandora box on their own, it's more that it made us notice other issues / questions. I don't think handling it is particularly complex - the most difficult part seems to be figuring out how many rows we'll return, and mapping them to pages. But that's not very difficult, IMO. The bigger question is whether it's safe to do the TYPEALIGN_DOWN(), which may return a pointer from before the first buffer. But I guess that's OK, thanks to how memory is allocated - at least, that's what all the move_pages() examples I found do, so unless those are all broken, that's OK. >>> + <tbody> >>> + <row> >>> + <entry role="catalog_table_entry"><para role="column_definition"> >>> + <structfield>bufferid</structfield> <type>integer</type> >>> + </para> >>> + <para> >>> + ID, in the range 1..<varname>shared_buffers</varname> >>> + </para></entry> >>> + </row> >>> + >>> + <row> >>> + <entry role="catalog_table_entry"><para role="column_definition"> >>> + <structfield>page_num</structfield> <type>int</type> >>> + </para> >>> + <para> >>> + number of OS memory page for this buffer >>> + </para></entry> >>> + </row> >> >> "page_num" doesn't really seem very descriptive for "number of OS memory page >> for this buffer". To me "number of" sounds like it's counting the number of >> associated pages, but it's really just a "page id" or something like that. >> >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit? > > Tomas, are you good with rename? I think I've would also prefer os_page_id. > Fine with me. >>> diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml > [..] >>> + <para> >>> + The <structname>pg_shmem_allocations_numa</structname> shows how shared >>> + memory allocations in the server's main shared memory segment are distributed >>> + across NUMA nodes. 
This includes both memory allocated by >>> + <productname>PostgreSQL</productname> itself and memory allocated >>> + by extensions using the mechanisms detailed in >>> + <xref linkend="xfunc-shared-addin" />. >>> + </para> >> >> I think it'd be good to describe that the view will include multiple rows for >> each name if spread across multiple numa nodes. > > Added. > >> Perhaps also that querying this view is expensive and that >> pg_shmem_allocations should be used if numa information isn't needed? > > Already covered by 1st finding fix. > >>> + /* >>> + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while >>> + * the OS may have different memory page sizes. >>> + * >>> + * To correctly map between them, we need to: 1. Determine the OS memory >>> + * page size 2. Calculate how many OS pages are used by all buffer blocks >>> + * 3. Calculate how many OS pages are contained within each database >>> + * block. >>> + * >>> + * This information is needed before calling move_pages() for NUMA memory >>> + * node inquiry. >>> + */ >>> + os_page_size = pg_numa_get_pagesize(); >>> + >>> + /* >>> + * Allocate memory for page pointers and status based on total shared >>> + * memory size. This simplified approach allocates enough space for all >>> + * pages in shared memory rather than calculating the exact requirements >>> + * for each segment. >>> + * >>> + * XXX Isn't this wasteful? But there probably is one large segment of >>> + * shared memory, much larger than the rest anyway. >>> + */ >>> + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; >>> + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); >>> + pages_status = palloc(sizeof(int) * shm_total_page_count); >> >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), >> could more be moved to a shared helper function? > > -> Tomas? > I'm not against that in principle, but when I tried it didn't quite help that much. 
But maybe it's better with the current patch version. >>> + hash_seq_init(&hstat, ShmemIndex); >>> + >>> + /* output all allocated entries */ >>> + memset(nulls, 0, sizeof(nulls)); >>> + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) >>> + { >> >> One thing that's interesting with using ShmemIndex is that this won't get >> anonymous allocations. I think that's ok for now, it'd be complicated to >> figure out the unmapped "regions". But I guess it'd be good to mention it in >> the docs? > > Added. > >>> + int i; >>> + >>> + /* XXX I assume we use TYPEALIGN as a way to round to whole pages. >>> + * It's a bit misleading to call that "aligned", no? */ >>> + >>> + /* Get number of OS aligned pages */ >>> + shm_ent_page_count >>> + = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size; >>> + >>> + /* >>> + * If we ever get 0xff back from the kernel inquiry, then we probably have >>> + * a bug in our buffers-to-OS-pages mapping code here. >>> + */ >>> + memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count); >> >> There's obviously no guarantee that shm_ent_page_count is a multiple of >> os_page_size. I think it'd be interesting to show in the view when one shmem >> allocation shares a page with the prior allocation - that can contribute a bit >> to contention. What about showing a start_os_page_id and end_os_page_id or >> something? That could be a feature for later though. > > I was thinking about it, but it could be done when analyzing this > together with data from pg_shmem_allocations(?) My worry is timing :( > Anyway, we could extend this view in future revisions. > I'd leave this out for now. It's not difficult, but let's focus on the other issues.
>>> +SELECT NOT(pg_numa_available()) AS skip_test \gset >>> +\if :skip_test >>> +\quit >>> +\endif >>> +-- switch to superuser >>> +\c - >>> +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa; >>> + ok >>> +---- >>> + t >>> +(1 row) >> >> Could it be worthwhile to run the test if !pg_numa_available(), to test that >> we do the right thing in that case? We need an alternative output anyway, so >> that might be fine? > > Added. The meson test passes, but I'm sending it as fast as possible > to avoid a clash with Tomas. > Please keep working on this. I may have a bit of time in the evening, but in the worst case I'll merge it into your patch. regards -- Tomas Vondra
On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 4/6/25 14:57, Jakub Wartak wrote: > > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres@anarazel.de> wrote: > >> > >> Hi, > > > > Hi Andres/Tomas, > > > > I've noticed that Tomas responded to this while writing this, so I'm > > attaching git-am patches based on his v25 (no squash) and there's only > > one new (last one contains fixes based on this review) + slight commit > > amendment to 0001. > > > > I'm not working on this at the moment. I may have a bit of time in the > evening, but more likely I'll get back to this on Monday. OK, tried to fix all outstanding issues (except maybe some tiny code refactors for beautification) > >> I just played around with this for a bit. As noted somewhere further down, > >> pg_buffercache_numa.page_num ends up wonky in different ways for the different > >> pages. > > > > I think page_num is under heavy work in progress... I'm still not sure > > it's worth exposing this (is it worth the hassle?). If we scrap it, > > it won't be perfect, but we have everything; otherwise we risk this > > feature as we are going into a feature freeze literally tomorrow. > > > > IMHO it's not difficult to change the definition of page_num this way, > it's pretty much a one line change. It's more a question of whether we > actually want to expose this. Bertrand noticed this first in https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal : - startptr = (char *) BufferGetBlock(1); + startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1)); With the above I'm also not getting wonky (-1) results anymore. The rest of the reply assumes we are using this. > [snip] > > >>> + > >>> + if (firstNumaTouch) > >>> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); > >> > >> Over the patchseries the related code is duplicated. Seems like it'd be good > >> to put it into pg_numa instead?
This seems like the thing that's good to > >> abstract away in one central spot. > > > > I hear you, but we are using those statistics for per-localized-code > > (shm.c's firstTouch != pgbuffercache.c's firstTouch). > > > > Yeah, I don't think moving this is quite possible / useful. We could pass the > flag somewhere, but we still need to check & update it in local code. No idea how such code should look, but it would be more code? Are you thinking about some code still touching local &firstNumaTouch ? > >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num" > >> works quite right for me. > >> > >> This definition afaict is always 0 when using huge pages and just 0 and 1 for > >> 4k pages. But my understanding of numa_page is that it's the "id" of the numa > >> pages, which isn't that? > >> > >> With "adjust page_num" I get a number that starts at -1 and then increments > >> from there. More correct, but doesn't quite seem right either. > > > > Apparently handling this special case of split-buffer edge cases > > was a Pandora's box ;) Tomas, what do we do about it? Does that have a chance > > to get in before freeze? > > > > I don't think the split buffers are a Pandora's box on their own, it's more > that it made us notice other issues / questions. I don't think handling > it is particularly complex - the most difficult part seems to be > figuring out how many rows we'll return, and mapping them to pages. > But that's not very difficult, IMO. OK, with a fresh week, a fresh mind and a different name (ospageid) it looks better to me now. > The bigger question is whether it's safe to do the TYPEALIGN_DOWN(), > which may return a pointer from before the first buffer. But I guess > that's OK, thanks to how memory is allocated - at least, that's what all > the move_pages() examples I found do, so unless those are all broken, > that's OK. I agree.
This took some more time, but in case a) the pg_buffercache_numa view with HP=off shouldn't access a pointer below the buffer cache, because my understanding is that shm memory would be page aligned anyway as per BufferManagerShmemInit(), which uses TYPEALIGN(PG_IO_ALIGN_SIZE) for it. So when you query pg_buffercache_numa, one gets the following pages: # strace -fp 14364 -e move_pages strace: Process 14364 attached move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,... (gdb) print BufferBlocks $1 = 0x7f8bfe4b5000 "" while BufferBlocks actually starts there, so we are good without HP. With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs (gdb) print BufferBlocks => $1 = 0x7ff33c879000 so we are actually accessing an earlier pointer (!), but Buffer Blocks is like the 15th structure there (and there's always going to be something earlier, to my understanding): select row_number() over (order by off),* from pg_shmem_allocations order by off asc limit 15; row_number | name | off | size | allocated_size ------------+------------------------+---------+-----------+---------------- [..] 13 | Shared MultiXact State | 5735936 | 1148 | 1152 14 | Buffer Descriptors | 5737088 | 1048576 | 1048576 15 | Buffer Blocks | 6785664 | 134221824 | 134221824 To me this is a finding in itself, with HP shared_buffers being located on a page with something else... b) in the pg_shmem_allocations_numa view without HP we don't even get to as low as ShmemBase for move_pages(), but with HP=on we hit as low as ShmemBase but not lower, so we are good IMHO. Wouldn't the buildfarm actually tell us this, as an insurance policy, if that's bad?
> >>> + <tbody> > >>> + <row> > >>> + <entry role="catalog_table_entry"><para role="column_definition"> > >>> + <structfield>bufferid</structfield> <type>integer</type> > >>> + </para> > >>> + <para> > >>> + ID, in the range 1..<varname>shared_buffers</varname> > >>> + </para></entry> > >>> + </row> > >>> + > >>> + <row> > >>> + <entry role="catalog_table_entry"><para role="column_definition"> > >>> + <structfield>page_num</structfield> <type>int</type> > >>> + </para> > >>> + <para> > >>> + number of OS memory page for this buffer > >>> + </para></entry> > >>> + </row> > >> > >> "page_num" doesn't really seem very descriptive for "number of OS memory page > >> for this buffer". To me "number of" sounds like it's counting the number of > >> associated pages, but it's really just a "page id" or something like that. > >> > >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit? > > > > Tomas, are you good with the rename? I think I would also prefer os_page_id. > > > > Fine with me. Done, s/page_num/ospageid/g as the whole pg_buffercache does not use "_" anywhere, so let's stick to that. Done that for nodeid too. > >>> + /* > >>> + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while > >>> + * the OS may have different memory page sizes. > >>> + * > >>> + * To correctly map between them, we need to: 1. Determine the OS memory > >>> + * page size 2. Calculate how many OS pages are used by all buffer blocks > >>> + * 3. Calculate how many OS pages are contained within each database > >>> + * block. > >>> + * > >>> + * This information is needed before calling move_pages() for NUMA memory > >>> + * node inquiry. > >>> + */ > >>> + os_page_size = pg_numa_get_pagesize(); > >>> + > >>> + /* > >>> + * Allocate memory for page pointers and status based on total shared > >>> + * memory size.
This simplified approach allocates enough space for all > >>> + * pages in shared memory rather than calculating the exact requirements > >>> + * for each segment. > >>> + * > >>> + * XXX Isn't this wasteful? But there probably is one large segment of > >>> + * shared memory, much larger than the rest anyway. > >>> + */ > >>> + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size; > >>> + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count); > >>> + pages_status = palloc(sizeof(int) * shm_total_page_count); > >> > >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(), > >> could more be moved to a shared helper function? > > > > -> Tomas? > > > > I'm not against that in principle, but when I tried it didn't quite help > that much. But maybe it's better with the current patch version. > > >>> + hash_seq_init(&hstat, ShmemIndex); > >>> + > >>> + /* output all allocated entries */ > >>> + memset(nulls, 0, sizeof(nulls)); > >>> + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) > >>> + { > >> > >> One thing that's interesting with using ShmemIndex is that this won't get > >> anonymous allocations. I think that's ok for now, it'd be complicated to > >> figure out the unmapped "regions". But I guess it'd be good to mention it in > >> the docs? > > > > Added. > > BTW: it's also in the function comment now. > > [..], but I'm sending it as fast as possible > > to avoid a clash with Tomas. > > > > Please keep working on this. I may have a bit of time in the evening, > but in the worst case I'll merge it into your patch. So, attached is still patchset v25, still with one-off patches (please git am most of them, but just `patch -p1 < file` the last two and squash if you are happy. I've intentionally not squashed them, to preserve the changelog). This LGTM. -J.
Attachment
- v25-0005-adjust-page-alignment.patch
- v25-0004-Introduce-pg_shmem_allocations_numa-view.patch
- v25-0001-Add-support-for-basic-NUMA-awareness.patch
- v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
- v25-0003-adjust-page_num.patch
- v25-0006-fixes-for-review-by-Andres.patch
- v25-0007-fix-remaining-outstanding-issues-from-Sunday.patch
Hi, On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote: > Bertrand noticed this first in > https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal > : > > - startptr = (char *) BufferGetBlock(1); > + startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char > *) BufferGetBlock(1)); > > With the above I'm also not getting wonky (-1) results anymore. The > rest of reply assumes we are using this. yeah, I can see that you added it in v25-0007. In the same message I mentioned to "use the actual buffer address when pg_numa_touch_mem_if_required() is called?" So, to be extra cautious we could do something like: @@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) /* Only need to touch memory once per backend process lifetime */ if (firstNumaTouch) - pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]); + pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size)); what do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Apr 7, 2025 at 11:53 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Mon, Apr 07, 2025 at 10:09:26AM +0200, Jakub Wartak wrote: > > Bertrand noticed this first in > > https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal > > : > > > > - startptr = (char *) BufferGetBlock(1); > > + startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char > > *) BufferGetBlock(1)); > > > > With the above I'm also not getting wonky (-1) results anymore. The > > rest of reply assumes we are using this. > > yeah, I can see that you added it in v25-0007. In the same message I mentioned > to "use the actual buffer address when pg_numa_touch_mem_if_required() > is called?" > > So, to be extra cautious we could do something like: > > @@ -474,7 +474,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > > /* Only need to touch memory once per backend process lifetime */ > if (firstNumaTouch) > - pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]); > + pg_numa_touch_mem_if_required(touch, buffptr + (j * os_page_size)); > > > what do you think? Yeah, I think we could include this too as it looks safer (sorry, I've missed that one). Attached v25 as it was, with this little tweak. -J.
Attachment
Hi, Here's a v26 of this patch series, merging the various fixup patches. I've also reordered the patches so that the pg_buffercache part is last. The two other patches are ready to go, and it seems better to push the built-in catalog before the pg_buffercache contrib module. For 0001 and 0002, I only have some minor tweaks - comment rewordings etc. I kept them in separate patches to make it obvious, but I think it's fine. The one specific tweak is to account for the +1 page, which might happen if the buffers/pages are shifted in some way. For 0003, I made some more serious changes - I reworked how buffers and pages are mapped. I concluded that relying on pages_per_buffer and buffer_per_pages is way too fiddly, if we can't guarantee the buffers are not "shifted" in some way. Which I think we can't. So I reworked that so that os_page_ptrs always points to actual pages without duplicate pointers. And then for each buffer we calculate the first page / last page, and iterate over those. A comment suggested the old code made the retrieval simpler, but I find the new approach much easier. And it doesn't have to worry about how the pages/buffers are shifted and so on. The one caveat is that we can't know how many entries the function will produce until after going through the buffers. We can only calculate the maximum number of entries. I think that's fine - we may allocate a bit more space than needed, but we also save space in os_page_status. I intend to push 0001 and 0002 shortly, and 0003 after a bit more review and testing, unless I hear objections. regards -- Tomas Vondra
Attachment
Hi, On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote: > On 4/6/25 01:00, Andres Freund wrote: > > On 2025-04-05 18:29:22 -0400, Andres Freund wrote: > >> I think one thing that the docs should mention is that calling the numa > >> functions/views will force the pages to be allocated, even if they're > >> currently unused. > >> > >> Newly started server, with s_b of 32GB an 2MB huge pages: > >> > >> grep ^Huge /proc/meminfo > >> HugePages_Total: 34802 > >> HugePages_Free: 34448 > >> HugePages_Rsvd: 16437 > >> HugePages_Surp: 0 > >> Hugepagesize: 2048 kB > >> Hugetlb: 76517376 kB > >> > >> run > >> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; > >> > >> Now the pages that previously were marked as reserved are actually allocated: > >> > >> grep ^Huge /proc/meminfo > >> HugePages_Total: 34802 > >> HugePages_Free: 18012 > >> HugePages_Rsvd: 1 > >> HugePages_Surp: 0 > >> Hugepagesize: 2048 kB > >> Hugetlb: 76517376 kB > >> > >> > >> I don't see how we can avoid that right now, but at the very least we ought to > >> document it. > > > > The only allocation where that really matters is shared_buffers. I wonder if > > we could special case the logic for that, by only probing if at least one of > > the buffers in the range is valid. > > > > Then we could treat a page status of -ENOENT as "page is not mapped" and > > display NULL for the node_id? > > > > Of course that would mean that we'd always need to > > pg_numa_touch_mem_if_required(), not just the first time round, because we > > previously might not have for a page that is now valid. But compared to the > > cost of actually allocating pages, the cost for that seems small. > > > > I don't think this would be a good trade off. The buffers already have a > NUMA node, and users would be interested in that. The thing is that the buffer might *NOT* have a numa node. That's e.g. the case in the above example - otherwise we wouldn't initially have seen the large HugePages_Rsvd. 
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required() itself wouldn't be too bad - in fact I'd rather like to have an explicit way of doing that. The problem is that that leads to all those allocations happening on the *current* numa node (unless you have started postgres with numactl --interleave=all or such), rather than the node where the normal first use would have allocated it. > It's just that we don't have the memory mapped in the current backend, so > I'd bet people would not be happy with NULL, and would proceed to force the > allocation in some other way (say, a large query of some sort). Which > obviously causes a lot of other problems. I don't think that really would be the case with what I proposed? If any buffer in the region were valid, we would force the allocation to become known to the current backend. Greetings, Andres Freund
Hi, On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote: > On 4/6/25 00:29, Andres Freund wrote: > >> + > >> + if (firstNumaTouch) > >> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts"); > > > > Over the patchseries the related code is duplicated. Seems like it'd be good > > to put it into pg_numa instead? This seems like the thing that's good to > > abstract away in one central spot. > > > > Abstract away which part, exactly? I thought about moving some of the > code to port/pg_numa.c, but it didn't seem worth it. The easiest would be to just have a single function that does this for the whole shared memory allocation, without having to integrate it with the per-allocation or per-buffer loop. > > > >> + /* > >> + * If we have multiple OS pages per buffer, fill those in too. We > >> + * always want at least one OS page, even if there are multiple > >> + * buffers per page. > >> + * > >> + * Altough we could query just once per each OS page, we do it > >> + * repeatably for each Buffer and hit the same address as > >> + * move_pages(2) requires page aligment. This also simplifies > >> + * retrieval code later on. Also NBuffers starts from 1. > >> + */ > >> + for (j = 0; j < Max(1, pages_per_buffer); j++) > >> + { > >> + char *buffptr = (char *) BufferGetBlock(i + 1); > >> + > >> + fctx->record[idx].bufferid = bufferid; > >> + fctx->record[idx].numa_page = j; > >> + > >> + os_page_ptrs[idx] > >> + = (char *) TYPEALIGN(os_page_size, > >> + buffptr + (os_page_size * j)); > > > > FWIW, this bit here is the most expensive part of the function itself, as the > > compiler has no choice than to implement it as an actual division, as > > os_page_size is runtime variable. > > > > It'd be fine to leave it like that, the call to numa_move_pages() is way more > > expensive. But it shouldn't be too hard to do that alignment once, rather than > > having to do it over and over. > > > > Division? 
It's entirely possible I'm missing something obvious, but I > don't see any divisions in this code. Oops. The division was only added in a subsequent patch, not the quoted code. At the time it was: + /* calculate ID of the OS memory page */ + fctx->record[idx].numa_page + = ((char *) os_page_ptrs[idx] - startptr) / os_page_size; Greetings, Andres Freund
On 4/7/25 17:51, Andres Freund wrote: > Hi, > > On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote: >> On 4/6/25 01:00, Andres Freund wrote: >>> On 2025-04-05 18:29:22 -0400, Andres Freund wrote: >>>> I think one thing that the docs should mention is that calling the numa >>>> functions/views will force the pages to be allocated, even if they're >>>> currently unused. >>>> >>>> Newly started server, with s_b of 32GB an 2MB huge pages: >>>> >>>> grep ^Huge /proc/meminfo >>>> HugePages_Total: 34802 >>>> HugePages_Free: 34448 >>>> HugePages_Rsvd: 16437 >>>> HugePages_Surp: 0 >>>> Hugepagesize: 2048 kB >>>> Hugetlb: 76517376 kB >>>> >>>> run >>>> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id; >>>> >>>> Now the pages that previously were marked as reserved are actually allocated: >>>> >>>> grep ^Huge /proc/meminfo >>>> HugePages_Total: 34802 >>>> HugePages_Free: 18012 >>>> HugePages_Rsvd: 1 >>>> HugePages_Surp: 0 >>>> Hugepagesize: 2048 kB >>>> Hugetlb: 76517376 kB >>>> >>>> >>>> I don't see how we can avoid that right now, but at the very least we ought to >>>> document it. >>> >>> The only allocation where that really matters is shared_buffers. I wonder if >>> we could special case the logic for that, by only probing if at least one of >>> the buffers in the range is valid. >>> >>> Then we could treat a page status of -ENOENT as "page is not mapped" and >>> display NULL for the node_id? >>> >>> Of course that would mean that we'd always need to >>> pg_numa_touch_mem_if_required(), not just the first time round, because we >>> previously might not have for a page that is now valid. But compared to the >>> cost of actually allocating pages, the cost for that seems small. >>> >> >> I don't think this would be a good trade off. The buffers already have a >> NUMA node, and users would be interested in that. > > The thing is that the buffer might *NOT* have a numa node. That's e.g. 
the > case in the above example - otherwise we wouldn't initially have seen the > large HugePages_Rsvd. > > Forcing all those pages to be allocated via pg_numa_touch_mem_if_required() > itself wouldn't be too bad - in fact I'd rather like to have an explicit way > of doing that. The problem is that that leads to all those allocations to > happen on the *current* numa node (unless you have started postgres with > numactl --interleave=all or such), rather than the node where the normal first > use woul have allocated it. > I agree, forcing those allocations to happen on a single node seems rather unfortunate. But really, how likely is it that someone will run this function on a cluster that hasn't already allocated this memory? I'm not saying it can't happen, but we already have this issue if you start and do a warmup from a single connection ... > >> It's just that we don't have the memory mapped in the current backend, so >> I'd bet people would not be happy with NULL, and would proceed to force the >> allocation in some other way (say, a large query of some sort). Which >> obviously causes a lot of other problems. > > I don't think that really would be the case with what I proposed? If any > buffer in the region were valid, we would force the allocation to become known > to the current backend. > It's not quite clear to me what exactly are you proposing :-( I believe you're referring to this: > The only allocation where that really matters is shared_buffers. I wonder if > we could special case the logic for that, by only probing if at least one of > the buffers in the range is valid. > > Then we could treat a page status of -ENOENT as "page is not mapped" and > display NULL for the node_id? > > Of course that would mean that we'd always need to > pg_numa_touch_mem_if_required(), not just the first time round, because we > previously might not have for a page that is now valid. But compared to the > cost of actually allocating pages, the cost for that seems small. 
I suppose by "range" you mean buffers on a given memory page, and "valid" means BufferIsValid. Yeah, that probably means the memory page is allocated. But if the buffer is invalid, it does not mean the memory is not allocated, right? So does it make the buffer not interesting? I'd find this ambiguity rather confusing, i.e. we'd never know if NULL means just "invalid buffer" or "not allocated". Maybe we should simply return rows only for valid buffers, to make it more explicit that we say nothing about NUMA nodes for the invalid ones. I think we need to decide whether the current patches are good enough for PG18, with the current behavior, and then maybe improve that in PG19. Or whether this is so serious we have to leave all of it for PG19. I'd go with the former, but perhaps I'm wrong. I don't feel like I want to be reworking this less than a day before the feature freeze. Attached is v27, which I planned to push, but I'll hold off. regards -- Tomas Vondra
Attachment
Hi, On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote: > > Forcing all those pages to be allocated via pg_numa_touch_mem_if_required() > > itself wouldn't be too bad - in fact I'd rather like to have an explicit way > > of doing that. The problem is that that leads to all those allocations to > > happen on the *current* numa node (unless you have started postgres with > > numactl --interleave=all or such), rather than the node where the normal first > > use woul have allocated it. > > > > I agree, forcing those allocations to happen on a single node seems > rather unfortunate. But really, how likely is it that someone will run > this function on a cluster that hasn't already allocated this memory? I think it's not at all unlikely to have parts of shared buffers unused at the start of a benchmark, e.g. because the table sizes grow over time. > I'm not saying it can't happen, but we already have this issue if you > start and do a warmup from a single connection ... Indeed! We really need to fix this... > > > >> It's just that we don't have the memory mapped in the current backend, so > >> I'd bet people would not be happy with NULL, and would proceed to force the > >> allocation in some other way (say, a large query of some sort). Which > >> obviously causes a lot of other problems. > > > > I don't think that really would be the case with what I proposed? If any > > buffer in the region were valid, we would force the allocation to become known > > to the current backend. > > > > It's not quite clear to me what exactly are you proposing :-( > > I believe you're referring to this: > > > The only allocation where that really matters is shared_buffers. I wonder if > > we could special case the logic for that, by only probing if at least one of > > the buffers in the range is valid. > > > > Then we could treat a page status of -ENOENT as "page is not mapped" and > > display NULL for the node_id? 
> > > > Of course that would mean that we'd always need to > > pg_numa_touch_mem_if_required(), not just the first time round, because we > > previously might not have for a page that is now valid. But compared to the > > cost of actually allocating pages, the cost for that seems small. > > I suppose by "range" you mean buffers on a given memory page Correct. > and "valid" means BufferIsValid. I was thinking of checking if the BufferDesc indicates BM_VALID or BM_TAG_VALID. BufferIsValid() just does a range check :(. > Yeah, that probably means the memory page is allocated. But if the buffer is > invalid, it does not mean the memory is not allocated, right? So does it > make the buffer not interesting? Well, if you don't have contents in it, it can't really affect performance. But yea, I agree, it's not perfect either. > I think we need to decide whether the current patches are good enough > for PG18, with the current behavior, and then maybe improve that in > PG19. I think as long as the docs mention this with <note> or <warning> it's ok for now. Greetings, Andres Freund
On 4/7/25 18:42, Andres Freund wrote: > ... >>> >>> Of course that would mean that we'd always need to >>> pg_numa_touch_mem_if_required(), not just the first time round, because we >>> previously might not have for a page that is now valid. But compared to the >>> cost of actually allocating pages, the cost for that seems small. >> >> I suppose by "range" you mean buffers on a given memory page > > Correct. > >> and "valid" means BufferIsValid. > > I was thinking of checking if the BufferDesc indicates BM_VALID or > BM_TAG_VALID. > > BufferIsValid() just does a range check :(. > Well, I guess BufferIsValid() seems a tad confusing ... > >> Yeah, that probably means the memory page is allocated. But if the buffer is >> invalid, it does not mean the memory is not allocated, right? So does it >> make the buffer not interesting? > > Well, you don't have contents in it it can't really affect performance. But > yea, I agree, it's not perfect either. > > >> I think we need to decide whether the current patches are good enough >> for PG18, with the current behavior, and then maybe improve that in >> PG19. > > I think as long as the docs mention this with <note> or <warning> it's ok for > now. > OK, I'll add a warning explaining this. regards -- Tomas Vondra
On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote: > They actually look good to me. We've discussed earlier dropping > s/numa_//g for column names (after all views contain it already) so > they are fine in this regard. > There's also the question of consistency: (bufferid, page_num, > node_id) -- maybe should just drop "_" and that's it? > Well I would even possibly consider page_num -> ospagenumber, but that's ugly. I'd go for os_page_num.
On 4/7/25 19:24, Andres Freund wrote: > On 2025-04-04 19:07:12 +0200, Jakub Wartak wrote: >> They actually look good to me. We've discussed earlier dropping >> s/numa_//g for column names (after all views contain it already) so >> they are fine in this regard. >> There's also the question of consistency: (bufferid, page_num, >> node_id) -- maybe should just drop "_" and that's it? >> Well I would even possibly consider page_num -> ospagenumber, but that's ugly. > > I'd go for os_page_num. WFM. I've renamed "ospageid" to "os_page_num" in 0003, and I've also renamed "node_id" to "numa_node" in 0002+0003, to make it clearer what kind of node this is. This reminds me whether it's fine to have "os_page_num" as int. Should we make it bigint, perhaps? Attached is v28, with the commit messages updated, added <warning> about allocation of the memory, etc. I'll let the CI run the tests on it, and then will push, unless someone has more comments. regards -- Tomas Vondra
Attachment
Hi, On 2025-04-07 19:59:59 +0200, Tomas Vondra wrote: > This reminds me whether it's fine to have "os_page_num" as int. Should > we make it bigint, perhaps? Yes, that's better. Seems very unlikely anybody will encounter this in the next few years, but it's basically free to use the larger range here. Greetings, Andres Freund
Hi, On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote: > Hi, > > On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote: > > I was thinking of checking if the BufferDesc indicates BM_VALID or > BM_TAG_VALID. Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy()) and I think that would make sense as future improvement. > > > > I think we need to decide whether the current patches are good enough > > for PG18, with the current behavior, and then maybe improve that in > > PG19. > > I think as long as the docs mention this with <note> or <warning> it's ok for > now. +1 A few comments on v27: === 1 pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa() reports it as node_id. Maybe we should use the same "naming" in both. === 2 postgres=# select count(*) from pg_buffercache; count ------- 65536 (1 row) but postgres=# select count(*) from pg_buffercache_numa; count ------- 64 (1 row) with: postgres=# show block_size; block_size ------------ 2048 and Hugepagesize: 2048 kB. and postgres=# show shared_buffers; shared_buffers ---------------- 128MB (1 row) And even if for testing I set: - funcctx->max_calls = idx; + funcctx->max_calls = 65536; then I start to see weird results: postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache); count ------- 65472 (1 row) So it looks like that the new way to iterate on the buffers that has been introduced in v26/v27 has some issue? [1]: https://www.postgresql.org/message-id/Z64Pr8CTG0RTrGR3%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 4/7/25 20:11, Bertrand Drouvot wrote: > Hi, > > On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote: >> Hi, >> >> On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote: >> >> I was thinking of checking if the BufferDesc indicates BM_VALID or >> BM_TAG_VALID. > > Yeah, that's what I did propose in [1] (when we were speaking about get_mempolicy()) > and I think that would make sense as future improvement. > >> >> >>> I think we need to decide whether the current patches are good enough >>> for PG18, with the current behavior, and then maybe improve that in >>> PG19. >> >> I think as long as the docs mention this with <note> or <warning> it's ok for >> now. > > +1 > > A few comments on v27: > > === 1 > > pg_buffercache_numa() reports the node ID as "nodeid" while pg_shmem_allocations_numa() > reports it as node_id. Maybe we should use the same "naming" in both. > This was renamed in v28 to "numa_node" in both parts. > === 2 > > postgres=# select count(*) from pg_buffercache; > count > ------- > 65536 > (1 row) > > but > > postgres=# select count(*) from pg_buffercache_numa; > count > ------- > 64 > (1 row) > > with: > > postgres=# show block_size; > block_size > ------------ > 2048 > > and Hugepagesize: 2048 kB. > > and > > postgres=# show shared_buffers; > shared_buffers > ---------------- > 128MB > (1 row) > > And even if for testing I set: > > - funcctx->max_calls = idx; > + funcctx->max_calls = 65536; > > then I start to see weird results: > > postgres=# select count(*) from pg_buffercache_numa where bufferid not in (select bufferid from pg_buffercache); > count > ------- > 65472 > (1 row) > > So it looks like that the new way to iterate on the buffers that has been introduced > in v26/v27 has some issue? > Yeah, the calculations of the end pointers were wrong - we need to round up (using TYPEALIGN()) when calculating number of pages, and just add BLCKSZ (without any rounding) when calculating end of buffer. 
The 0004 fixes this for me (I tried this with various blocksizes / page sizes). Thanks for noticing this! regards -- Tomas Vondra
Attachment
On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote: > > So it looks like that the new way to iterate on the buffers that has been introduced > > in v26/v27 has some issue? > > > > Yeah, the calculations of the end pointers were wrong - we need to round > up (using TYPEALIGN()) when calculating number of pages, and just add > BLCKSZ (without any rounding) when calculating end of buffer. The 0004 > fixes this for me (I tried this with various blocksizes / page sizes). > > Thanks for noticing this! Hi, v28-0001 LGTM v28-0002 got this warning Andres was talking about, so LGTM v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite some time we have 2nd bug there, but it appears that PG never properly aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume count(*) pg_buffercache_numa == count(*) pg_buffercache. So before anybody else reports this as bug about duplicate bufferids: # select * from pg_buffercache_numa where os_page_num <= 2; bufferid | os_page_num | numa_node ----------+-------------+----------- [..] 195 | 0 | 0 196 | 0 | 0 <-- duplicate? 196 | 1 | 0 <-- duplicate? 197 | 1 | 0 198 | 1 | 0 That is strange because on first look one could assume we get 257x 8192 blocks per os_page (2^21) that way, which is impossible. Exercises in pointers show this: # select * from pg_buffercache_numa where os_page_num <= 2; DEBUG: NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152 DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts -- custom elog(DEBUG1) DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real buffptr=0x7f8661079000 [..] 
DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real buffptr=0x7f86611fb000 DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real buffptr=0x7f86611fd000 DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real buffptr=0x7f86611ff000 (!) DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real buffptr=0x7f86611ff000 (!) DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real buffptr=0x7f8661201000 DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real buffptr=0x7f8661203000 so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB (and ends at 0x7f8661201000), while we also have an HP that hosts it between 0x7f8661000000 and 0x7f8661200000. So buffer 196 spans 2 hugepages. An open question for another day (of course outside of this $thread) is whether we shouldn't align s_b to the HP size. As per above, even bufferid=1 has 0x7f8661079000 while the page starts at 0x7f8661000000 (that's 495616 bytes difference). -J.
Hi, I've pushed all three parts of v29, with some additional corrections (picked lower OIDs, bumped catversion, fixed commit messages). On 4/7/25 23:01, Jakub Wartak wrote: > On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote: > >>> So it looks like that the new way to iterate on the buffers that has been introduced >>> in v26/v27 has some issue? >>> >> >> Yeah, the calculations of the end pointers were wrong - we need to round >> up (using TYPEALIGN()) when calculating number of pages, and just add >> BLCKSZ (without any rounding) when calculating end of buffer. The 0004 >> fixes this for me (I tried this with various blocksizes / page sizes). >> >> Thanks for noticing this! > > Hi, > > v28-0001 LGTM > v28-0002 got this warning Andres was talking about, so LGTM > v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite > some time we have 2nd bug there, but it appears that PG never properly > aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume > count(*) pg_buffercache_numa == count(*) pg_buffercache. > AFAIK v29 fixed this, the end pointer calculations were wrong. With that it passed for me with/without THP, different block sizes etc. We don't align buffers to os_page_size, we align them to PG_IO_ALIGN_SIZE, which is 4kB or so. And it's determined at compile time, while THP is determined when starting the cluster.
> Exercises in pointers show this: >> # select * from pg_buffercache_numa where os_page_num <= 2; > DEBUG: NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152 > DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts > -- custom elog(DEBUG1) > DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 > endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real > buffptr=0x7f8661079000 > [..] > DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 > endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real > buffptr=0x7f86611fb000 > DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 > endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real > buffptr=0x7f86611fd000 > DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 > endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real > buffptr=0x7f86611ff000 (!) > DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661000000 > endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real > buffptr=0x7f86611ff000 (!) > DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 > endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real > buffptr=0x7f8661201000 > DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 > endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real > buffptr=0x7f8661203000 > > so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB big > (and ends up at 0x7f8661201000), while we also have HP that hosts it > between 0x7f8661000000 and 0x7f8661200000. So Buffer 196 spans 2 > hugepages. Open question for another day is shouldn't (of course > outside of this $thread) align s_b to HP size or not? As per above > even bufferid=1 has 0x7f8661079000 while page starts on 0x7f8661000000 > (that's 495616 bytes difference). > Right, this is because that's where the THP boundary happens to be. And that one "duplicate" entry is for a buffer that happens to span two pages. 
This is *exactly* the misalignment of blocks and pages that I was wondering about earlier, and with the fixed endptr calculation we handle that just fine. No opinion on the alignment - maybe we should do that, but it's not something this patch needs to worry about. regards -- Tomas Vondra
On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > I've pushed all three parts of v29, with some additional corrections > (picked lower OIDs, bumped catversion, fixed commit messages). Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for such incredible support on the last mile of this and also to Bertrand (for persistence!), Andres and Alvaro for lots of babysitting. > AFAIK v29 fixed this, the end pointer calculations were wrong. With that > it passed for me with/without THP, different block sizes etc. Yeah, that was a typo, I've started writing about v28, but then in the middle of that v29 landed and I still was chasing that finding, I've just forgotten to bump this. > We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE, > which is 4kB or so. And it's determined at compile time, while THP is > determined when starting the cluster. [..] > Right, this is because that's where the THP boundary happens to be. And > that one "duplicate" entry is for a buffer that happens to span two > pages. This is *exactly* the misalignment of blocks and pages that I was > wondering about earlier, and with the fixed endptr calculation we handle > that just fine. > > No opinion on the alignment - maybe we should do that, but it's not > something this patch needs to worry about. Agreed. I was even wondering if there are other drawbacks of the situation, but other than not reporting duplicates here in this pg_buffercache view, I cannot identify anything worthwhile. -J.
On 4/7/25 23:50, Jakub Wartak wrote: > On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi, >> >> I've pushed all three parts of v29, with some additional corrections >> (picked lower OIDs, bumped catversion, fixed commit messages). > > Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for > such incredible support on the last mile of this and also to Bertrand > (for persistence!), Andres and Alvaro for lots of babysitting. > Glad I could help, thanks for the patch. >> AFAIK v29 fixed this, the end pointer calculations were wrong. With that >> it passed for me with/without THP, different block sizes etc. > > Yeah, that was a typo, I've started writing about v28, but then in the > middle of that v29 landed and I still was chasing that finding, I've > just forgotten to bump this. > >> We don't align buffers to os_page_size, we align them PG_IO_ALIGN_SIZE, >> which is 4kB or so. And it's determined at compile time, while THP is >> determined when starting the cluster. > [..] >> Right, this is because that's where the THP boundary happens to be. And >> that one "duplicate" entry is for a buffer that happens to span two >> pages. This is *exactly* the misalignment of blocks and pages that I was >> wondering about earlier, and with the fixed endptr calculation we handle >> that just fine. >> >> No opinion on the aligment - maybe we should do that, but it's not >> something this patch needs to worry about. > > Agreed.I was wondering even if there are other drawbacks of the > situation, but other than not reporting duplicates here in this > pg_buffercache view, I cannot identify anything worthwhile. > Well, the drawback is that accessing the buffer may require hitting two different NUMA nodes. I'm not 100% sure it can actually happen, though. The buffer should be initialized as a whole, so it should go to the same node. But maybe it could be "split" by THP migration, or something like that.
In any case, that's not caused by this patch, and it's less serious with huge pages - it only affects buffers on the boundaries. But with the small 4K pages it can happen for *every* buffer. regards -- Tomas Vondra
Hi, Thanks for developing this great feature. The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'. The attached small patch fixes the manual. Regards, Noriyoshi Shinoda -----Original Message----- From: Tomas Vondra <tomas@vondra.me> Sent: Tuesday, April 8, 2025 6:59 AM To: Jakub Wartak <jakub.wartak@enterprisedb.com> Cc: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>; Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Nazir Bilal Yavuz <byavuz81@gmail.com>; PostgreSQL Hackers <pgsql-hackers@postgresql.org> Subject: Re: Draft for basic NUMA observability On 4/7/25 23:50, Jakub Wartak wrote: > On Mon, Apr 7, 2025 at 11:27 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi, >> >> I've pushed all three parts of v29, with some additional corrections >> (picked lower OIDs, bumped catversion, fixed commit messages). > > Hi Tomas, great, awesome! (this is an awesome feeling)! Thank You for > such incredible support on the last mile of this and also to Bertrand > (for persistence!), Andres and Alvaro for lots of babysitting. > Glad I could help, thanks for the patch. >> AFAIK v29 fixed this, the end pointer calculations were wrong. With >> that it passed for me with/without THP, different blocks sizes etc. > > Yeah, that was a typo, I've started writing about v28, but then in the > middle of that v29 landed and I still was chasing that finding, I've > just forgotten to bump this. > >> We don't align buffers to os_page_size, we align them >> PG_IO_ALIGN_SIZE, which is 4kB or so. And it's determined at compile >> time, while THP is determined when starting the cluster. > [..] >> Right, this is because that's where the THP boundary happens to be. >> And that one "duplicate" entry is for a buffer that happens to span >> two pages.
This is *exactly* the misalignment of blocks and pages >> that I was wondering about earlier, and with the fixed endptr >> calculation we handle that just fine. >> >> No opinion on the alignment - maybe we should do that, but it's not >> something this patch needs to worry about. > > Agreed. I was even wondering if there are other drawbacks of the > situation, but other than not reporting duplicates here in this > pg_buffercache view, I cannot identify anything worthwhile. > Well, the drawback is that accessing the buffer may require hitting two different NUMA nodes. I'm not 100% sure it can actually happen, though. The buffer should be initialized as a whole, so it should go to the same node. But maybe it could be "split" by THP migration, or something like that. In any case, that's not caused by this patch, and it's less serious with huge pages - it only affects buffers on the boundaries. But with the small 4K pages it can happen for *every* buffer. regards -- Tomas Vondra
Attachment
On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote: > Hi, > > Thanks for developing this great feature. > The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'. > The attached small patch fixes the manual. > Thank you for noticing this and for the fix! Pushed. This also reminded me we agreed to change page_num to bigint, which I forgot to change before commit. So I adjusted that too, separately. regards -- Tomas Vondra
On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: > I'll let the CI run the tests on it, and > then will push, unless someone has more comments. > Hi! I noticed a strange failure after this commit[0]. Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48: In file included from ../pgsql/src/include/postgres.h:49, from ../pgsql/src/port/pg_numa.c:16: ../pgsql/src/include/utils/elog.h:79:10: fatal error: utils/errcodes.h: No such file or directory 79 | #include "utils/errcodes.h" | ^~~~~~~~~~~~~~~~~~ compilation terminated. [0] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dogfish&dt=2025-04-08%2011%3A25%3A11 -- Best regards, Kirill Reshke
Hi, On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: > On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: > > I'll let the CI run the tests on it, and > > then will push, unless someone has more comments. > > > > > Hi! I noticed strange failure after this commit[0] > > Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 > > In file included from ../pgsql/src/include/postgres.h:49, > from ../pgsql/src/port/pg_numa.c:16: > ../pgsql/src/include/utils/elog.h:79:10: > fatal error: utils/errcodes.h: No such file or > directory > 79 | #include "utils/errcodes.h" > | ^~~~~~~~~~~~~~~~~~ > compilation terminated. $ ninja -t missingdeps Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) Processed 2384 nodes. Error: There are 2 missing dependency paths. 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. I think it's not right that something in src/port defines an SQL callable function. The set of headers for that pulls in a lot of things. Since the file relies on things like GUCs, I suspect this should be in src/backend/port or such instead. Greetings, Andres Freund
On 4/8/25 15:06, Andres Freund wrote: > Hi, > > On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: >> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: >>> I'll let the CI run the tests on it, and >>> then will push, unless someone has more comments. >>> >> >> >> Hi! I noticed strange failure after this commit[0] >> >> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 >> >> In file included from ../pgsql/src/include/postgres.h:49, >> from ../pgsql/src/port/pg_numa.c:16: >> ../pgsql/src/include/utils/elog.h:79:10: >> fatal error: utils/errcodes.h: No such file or >> directory >> 79 | #include "utils/errcodes.h" >> | ^~~~~~~~~~~~~~~~~~ >> compilation terminated. > > $ ninja -t missingdeps > Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Processed 2384 nodes. > Error: There are 2 missing dependency paths. > 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. > There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. > > > I think it's not right that something in src/port defines an SQL callable > function. The set of headers for that pull in a lot of things. > > Since the file relies on things like GUCs, I suspect this should be in > src/backend/port or such instead. > Yeah, I think you're right, src/backend/port seems like a better place for this. I'll look into moving that in the evening. regards -- Tomas Vondra
Hi, On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote: >On 4/8/25 15:06, Andres Freund wrote: >> Hi, >> >> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: >>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: >>>> I'll let the CI run the tests on it, and >>>> then will push, unless someone has more comments. >>>> >>> >>> >>> Hi! I noticed strange failure after this commit[0] >>> >>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 >>> >>> In file included from ../pgsql/src/include/postgres.h:49, >>> from ../pgsql/src/port/pg_numa.c:16: >>> ../pgsql/src/include/utils/elog.h:79:10: >>> fatal error: utils/errcodes.h: No such file or >>> directory >>> 79 | #include "utils/errcodes.h" >>> | ^~~~~~~~~~~~~~~~~~ >>> compilation terminated. >> >> $ ninja -t missingdeps >> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) >> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) >> Processed 2384 nodes. >> Error: There are 2 missing dependency paths. >> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. >> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. >> >> >> I think it's not right that something in src/port defines an SQL callable >> function. The set of headers for that pull in a lot of things. >> >> Since the file relies on things like GUCs, I suspect this should be in >> src/backend/port or such instead. >> > >Yeah, I think you're right, src/backend/port seems like a better place >for this. I'll look into moving that in the evening. On a second look I wonder if just the SQL function and perhaps the page size function should be moved.
There are FE programs that could potentially benefit from NUMA awareness (e.g. pgbench). Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi, On 2025-04-08 09:35:37 -0400, Andres Freund wrote: > On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote: > >On 4/8/25 15:06, Andres Freund wrote: > >> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: > >> I think it's not right that something in src/port defines an SQL callable > >> function. The set of headers for that pull in a lot of things. > >> > >> Since the file relies on things like GUCs, I suspect this should be in > >> src/backend/port or such instead. > >> > > > >Yeah, I think you're right, src/backend/port seems like a better place > >for this. I'll look into moving that in the evening. > > On a second look I wonder if just the SQL function and perhaps the page size > function should be moved. There are FE programs that could potentially > benefit from NUMA awareness (e.g. pgbench). I would move pg_numa_available() to something like src/backend/storage/ipc/shmem.c. I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name; it's not actually directly NUMA related. If it were pg_get_pagesize() it'd fit in with shmem.c or such. Or we could just make it work in FE code by making this part Assert(IsUnderPostmaster); Assert(huge_pages_status != HUGE_PAGES_UNKNOWN); if (huge_pages_status == HUGE_PAGES_ON) GetHugePageSize(&os_page_size, NULL); #ifndef FRONTEND - we don't currently support using huge pages in FE programs after all. But querying the page size might still be useful. Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is needed? Greetings, Andres Freund
On 4/8/25 16:59, Andres Freund wrote: > Hi, > > On 2025-04-08 09:35:37 -0400, Andres Freund wrote: >> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote: >>> On 4/8/25 15:06, Andres Freund wrote: >>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: >>>> I think it's not right that something in src/port defines an SQL callable >>>> function. The set of headers for that pull in a lot of things. >>>> >>>> Since the file relies on things like GUCs, I suspect this should be in >>>> src/backend/port or such instead. >>>> >>> >>> Yeah, I think you're right, src/backend/port seems like a better place >>> for this. I'll look into moving that in the evening. >> >> On a second look I wonder if just the SQL function and perhaps the page size >> function should be moved. There are FE programs that could potentially >> benefit from num a awareness (e.g. pgbench). > > I would move pg_numa_available() to something like > src/backend/storage/ipc/shmem.c. > Makes sense, done in the attached patch. > I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's > not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in > with shmem.c or such. > True. It's true it's not really "NUMA page", but page size for the shmem segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c, same as pg_numa_available(). The rest of pg_numa.c is moved to src/backend/port > Or we could just make it work in FE code by making this part > > Assert(IsUnderPostmaster); > Assert(huge_pages_status != HUGE_PAGES_UNKNOWN); > > if (huge_pages_status == HUGE_PAGES_ON) > GetHugePageSize(&os_page_size, NULL); > > #ifndef FRONTEND - we don't currently support using huge pages in FE programs > after all. But querying the page size might still be useful. > I don't really like this. Why shouldn't the FE program simply call sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also verify huge page status. 
> > Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is > needed? > Right, that was left there after moving the prototype into the .c file. regards -- Tomas Vondra
Attachment
On 4/8/25 15:06, Andres Freund wrote: > Hi, > > On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: >> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: >>> I'll let the CI run the tests on it, and >>> then will push, unless someone has more comments. >>> >> >> >> Hi! I noticed strange failure after this commit[0] >> >> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 >> >> In file included from ../pgsql/src/include/postgres.h:49, >> from ../pgsql/src/port/pg_numa.c:16: >> ../pgsql/src/include/utils/elog.h:79:10: >> fatal error: utils/errcodes.h: No such file or >> directory >> 79 | #include "utils/errcodes.h" >> | ^~~~~~~~~~~~~~~~~~ >> compilation terminated. > > $ ninja -t missingdeps > Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Processed 2384 nodes. > Error: There are 2 missing dependency paths. > 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. > There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. > Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I ran those tests many times, and had it failed at least once, I'd have fixed it before commit. regards -- Tomas Vondra
Hi, On 2025-04-09 00:47:59 +0200, Tomas Vondra wrote: > On 4/8/25 16:59, Andres Freund wrote: > > Hi, > > > > On 2025-04-08 09:35:37 -0400, Andres Freund wrote: > >> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote: > >>> On 4/8/25 15:06, Andres Freund wrote: > >>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: > >>>> I think it's not right that something in src/port defines an SQL callable > >>>> function. The set of headers for that pull in a lot of things. > >>>> > >>>> Since the file relies on things like GUCs, I suspect this should be in > >>>> src/backend/port or such instead. > >>>> > >>> > >>> Yeah, I think you're right, src/backend/port seems like a better place > >>> for this. I'll look into moving that in the evening. > >> > >> On a second look I wonder if just the SQL function and perhaps the page size > >> function should be moved. There are FE programs that could potentially > >> benefit from num a awareness (e.g. pgbench). > > > > I would move pg_numa_available() to something like > > src/backend/storage/ipc/shmem.c. > > > > Makes sense, done in the attached patch. > > > I wonder if pg_numa_get_pagesize() should loose the _numa_ in the name, it's > > not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in > > with shmem.c or such. > > > > True. It's true it's not really "NUMA page", but page size for the shmem > segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c, > same as pg_numa_available(). Cool. > The rest of pg_numa.c is moved to src/backend/port Why move the remainder? It shouldn't have any dependency on postgres.h afterwards? And I think it's not backend-specific. Greetings, Andres Freund
Hi, On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote: > On 4/8/25 15:06, Andres Freund wrote: > > Hi, > > > > On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: > >> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: > >>> I'll let the CI run the tests on it, and > >>> then will push, unless someone has more comments. > >>> > >> > >> > >> Hi! I noticed a strange failure after this commit[0] > >> > >> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 > >> > >> In file included from ../pgsql/src/include/postgres.h:49, > >> from ../pgsql/src/port/pg_numa.c:16: > >> ../pgsql/src/include/utils/elog.h:79:10: > >> fatal error: utils/errcodes.h: No such file or > >> directory > >> 79 | #include "utils/errcodes.h" > >> | ^~~~~~~~~~~~~~~~~~ > >> compilation terminated. > > > > $ ninja -t missingdeps > > Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > > Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > > Processed 2384 nodes. > > Error: There are 2 missing dependency paths. > > 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. > > There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. > > > > Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I > ran those tests many times, and had it failed at least once I'd have > fixed it before commit. Yes, we should. It's a somewhat newer feature, so we originally couldn't. There was only a very clunky and slow python script when I was doing most of the meson work. I was actually thinking that it might make sense as a meson-registered test, that way one can quickly find the issue both locally and on CI. 
Greetings, Andres Freund
On Wed, Apr 9, 2025 at 12:48 AM Tomas Vondra <tomas@vondra.me> wrote: > On 4/8/25 16:59, Andres Freund wrote: > > Hi, > > > > On 2025-04-08 09:35:37 -0400, Andres Freund wrote: > >> On April 8, 2025 9:21:57 AM EDT, Tomas Vondra <tomas@vondra.me> wrote: > >>> On 4/8/25 15:06, Andres Freund wrote: > >>>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: > >>>> I think it's not right that something in src/port defines an SQL callable > >>>> function. The set of headers for that pull in a lot of things. > >>>> > >>>> Since the file relies on things like GUCs, I suspect this should be in > >>>> src/backend/port or such instead. > >>>> > >>> > >>> Yeah, I think you're right, src/backend/port seems like a better place > >>> for this. I'll look into moving that in the evening. > >> > >> On a second look I wonder if just the SQL function and perhaps the page size > >> function should be moved. There are FE programs that could potentially > >> benefit from numa awareness (e.g. pgbench). > > > > I would move pg_numa_available() to something like > > src/backend/storage/ipc/shmem.c. > > > > Makes sense, done in the attached patch. > > > I wonder if pg_numa_get_pagesize() should lose the _numa_ in the name, it's > > not actually directly NUMA related? If it were pg_get_pagesize() it'd fit in > > with shmem.c or such. > > > > True. It's true it's not really "NUMA page", but page size for the shmem > segment. So renamed to pg_get_shmem_pagesize() and moved to shmem.c, > same as pg_numa_available(). > > The rest of pg_numa.c is moved to src/backend/port Hi Tomas, sorry for this. I've tested the numa-fixes.patch, the CI looks OK to me (but it was always like that), meson test also looks good here, and some other quick use attempts are OK. I can confirm that `ninja -t missingdeps` is clean with this (without it, it really wasn't; +1 to adding this check to CI). 
> > Or we could just make it work in FE code by making this part > > > > Assert(IsUnderPostmaster); > > Assert(huge_pages_status != HUGE_PAGES_UNKNOWN); > > > > if (huge_pages_status == HUGE_PAGES_ON) > > GetHugePageSize(&os_page_size, NULL); > > > > #ifndef FRONTEND - we don't currently support using huge pages in FE programs > > after all. But querying the page size might still be useful. > > > > I don't really like this. Why shouldn't the FE program simply call > sysconf(_SC_PAGESIZE)? It'd be just confusing if in backend it'd also > verify huge page status. True, the pg_get_shmem_pagesize() looks like a great middle ground (not in NUMA but still for shmem w/ huge pages). > > > > Regardless of all of that, I don't think the include of fmgr.h in pg_numa.h is > > needed? > > > > Right, that was left there after moving the prototype into the .c file. -J.
On 4/9/25 01:29, Andres Freund wrote: > Hi, > > On 2025-04-09 01:10:09 +0200, Tomas Vondra wrote: >> On 4/8/25 15:06, Andres Freund wrote: >>> Hi, >>> >>> On 2025-04-08 17:44:19 +0500, Kirill Reshke wrote: >>>> On Mon, 7 Apr 2025 at 23:00, Tomas Vondra <tomas@vondra.me> wrote: >>>>> I'll let the CI run the tests on it, and >>>>> then will push, unless someone has more comments. >>>>> >>>> >>>> >>>> Hi! I noticed strange failure after this commit[0] >>>> >>>> Looks like it is related to 65c298f61fc70f2f960437c05649f71b862e2c48 >>>> >>>> In file included from [01m [K../pgsql/src/include/postgres.h:49 [m [K, >>>> from [01m [K../pgsql/src/port/pg_numa.c:16 [m [K: >>>> [01m [K../pgsql/src/include/utils/elog.h:79:10: [m [K >>>> [01;31m [Kfatal error: [m [Kutils/errcodes.h: No such file or >>>> directory >>>> 79 | #include [01;31m [K"utils/errcodes.h" [m [K >>>> | [01;31m [K^~~~~~~~~~~~~~~~~~ [m [K >>>> compilation terminated. >>> >>> $ ninja -t missingdeps >>> Missing dep: src/port/libpgport.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) >>> Missing dep: src/port/libpgport_shlib.a.p/pg_numa.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) >>> Processed 2384 nodes. >>> Error: There are 2 missing dependency paths. >>> 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to thegenerator. >>> There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean outputdirectory. >>> >> >> Wouldn't it be good to add this (ninja -t missingdeps) to the CI task? I >> ran those tests many times, and had it failed at least once I'd have >> fixed it before commit. > > Yes, we should. It's a somewhat newer feature, so we originally couldn't. > There only was a very clunky and slow python script when I was doing most of > the meson work. 
> > I was actually thinking that it might make sense as a meson-registered test, > that way one quickly can find the issue both locally and on CI. > OK, here are two patches, where 0001 adds the missingdeps check to the Debian meson build. It just adds that to the build script. 0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to src/backend/port). It still needs to include c.h because of PGDLLIMPORT, but I think that's fine. regards -- Tomas Vondra
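For reference, the 0001 change described above amounts to something along these lines in the Cirrus CI task definition (a rough sketch only; the task layout and the `su postgres` wrapper are assumptions about how the tasks are structured, and the actual patch may differ):

```yaml
# Hypothetical sketch of a .cirrus.yml build_script addition:
build_script: |
  su postgres -c 'ninja -C build'
  # 'ninja -t missingdeps' exits non-zero if any target reads a
  # generated file without a build-graph path to its generator,
  # so appending it here fails the CI task on such bugs.
  su postgres -c 'ninja -C build -t missingdeps'
```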
On 4/9/25 14:07, Tomas Vondra wrote: > ... > > OK, here are two patches, where 0001 adds the missingdeps check to the > Debian meson build. It just adds that to the build script. > > 0002 leaves the NUMA stuff in src/port (i.e. it's no longer moved to > src/backend/port). It still needs to include c.h because of PGDLLIMPORT, > but I think that's fine. > Forgot to attach the patches ... -- Tomas Vondra
Updated patches with proper commit messages etc. -- Tomas Vondra
Hi, On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote: > From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas@vondra.me> > Date: Tue, 8 Apr 2025 23:31:29 +0200 > Subject: [PATCH 1/2] Cleanup of pg_numa.c > > This moves/renames some of the functions defined in pg_numa.c: > > * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and > moved to src/backend/storage/ipc/shmem.c. The new name better reflects > that the page size is not related to NUMA, and it's specifically about > the page size used for the main shared memory segment. > > * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into > the backend (which is more appropriate for functions callable from SQL). > While at it, improve the comment to explain what page size it returns. > > * remove unnecessary includes from src/port/pg_numa.c, which added > unnecessary dependencies (src/port should be suitable for frontend). > These were leftovers from earlier patch versions. I don't think the includes in src/port/pg_numa.c were leftovers? Just the one in pg_numa.h, right? I'd mention that the includes of postgres.h/fmgr.h are what caused missing build-time dependencies and, via that, failures on buildfarm member dogfish. > diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c > index 5e2523cf798..63dff799436 100644 > --- a/src/port/pg_numa.c > +++ b/src/port/pg_numa.c > @@ -13,17 +13,14 @@ > *------------------------------------------------------------------------- > */ > > -#include "postgres.h" > +#include "c.h" > #include <unistd.h> > > #ifdef WIN32 > #include <windows.h> > #endif I think this may not be needed anymore, that was just there for GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new location of pg_numa_get_pagesize()? 
> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas@vondra.me> > Date: Wed, 9 Apr 2025 13:29:31 +0200 > Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build > > Extends the meson build on Debian to also check for missing dependencies > by executing > > ninja -t missingdeps > > right after the build. This highlights unintended dependencies. > > Reviewed-by: Andres Freund <andres@anarazel.de> > https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to have it just in CI until we have that. I would however also add it to the windows job, as that's the most "different" type of build / source of missed dependencies that wouldn't show up on our development systems. Greetings, Andres Freund
On 4/9/25 17:14, Andres Freund wrote: > Hi, > > On 2025-04-09 16:33:14 +0200, Tomas Vondra wrote: >> From e1f093d091610d70fba72b2848f25ff44899ea8e Mon Sep 17 00:00:00 2001 >> From: Tomas Vondra <tomas@vondra.me> >> Date: Tue, 8 Apr 2025 23:31:29 +0200 >> Subject: [PATCH 1/2] Cleanup of pg_numa.c >> >> This moves/renames some of the functions defined in pg_numa.c: >> >> * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and >> moved to src/backend/storage/ipc/shmem.c. The new name better reflects >> that the page size is not related to NUMA, and it's specifically about >> the page size used for the main shared memory segment. >> >> * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into >> the backend (which is more appropriate for functions callable from SQL). >> While at it, improve the comment to explain what page size it returns. >> >> * remove unnecessary includes from src/port/pg_numa.c, which added >> unnecessary dependencies (src/port should be suitable for frontend). >> These were leftovers from earlier patch versions. > > I don't think the includes in src/port/pg_numa.c were leftovers? Just the one in > pg_numa.h, right? > Right, that wasn't quite accurate. The miscadmin.h and pg_shmem.h are unnecessary thanks to moving stuff to shmem.c. > I'd mention that the includes of postgres.h/fmgr.h are what caused missing > build-time dependencies and via that failures on buildfarm member dogfish. > Not really, I also need to include "c.h" instead of "postgres.h" (which is also causing the same failure). 
> >> diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c >> index 5e2523cf798..63dff799436 100644 >> --- a/src/port/pg_numa.c >> +++ b/src/port/pg_numa.c >> @@ -13,17 +13,14 @@ >> *------------------------------------------------------------------------- >> */ >> >> -#include "postgres.h" >> +#include "c.h" >> #include <unistd.h> >> >> #ifdef WIN32 >> #include <windows.h> >> #endif > > I think this may not be needed anymore, that was just there for > GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new > location of pg_numa_get_pagesize()? > Good question. But if it's needed there, shouldn't it have failed on CI? > >> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001 >> From: Tomas Vondra <tomas@vondra.me> >> Date: Wed, 9 Apr 2025 13:29:31 +0200 >> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build >> >> Extends the meson build on Debian to also check for missing dependencies >> by executing >> >> ninja -t missingdeps >> >> right after the build. This highlights unintended dependencies. >> >> Reviewed-by: Andres Freund <andres@anarazel.de> >> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com > > FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to > have it just in CI until we have that. I would however also add it to the > windows job, as that's the most "different" type of build / source of missed > dependencies that wouldn't show up on our development systems. > We can add it as a meson.build test, sure. I was going for the CI first, because then it fires no matter what build I do locally (I'm kinda still used to autotools). If you agree adding it to build_script is the right way to do that, I'll do the same thing for the windows job. regards -- Tomas Vondra
Hi, On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote: > On 4/9/25 17:14, Andres Freund wrote: > > I'd mention that the includes of postgres.h/fmgr.h are what caused missing > > build-time dependencies and via that failures on buildfarm member dogfish. > > > > Not really, I also need to include "c.h" instead of "postgres.h" (which > is also causing the same failure). I did mention postgres.h :) > > I think this may not be needed anymore, that was just there for > > GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new > > location of pg_numa_get_pagesize()? > > > > Good question. But if it's needed there, shouldn't it have failed on CI? Oh. No. It shouldn't have - because that include was completely unnecessary. We always include windows.h on windows. > >> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001 > >> From: Tomas Vondra <tomas@vondra.me> > >> Date: Wed, 9 Apr 2025 13:29:31 +0200 > >> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build > >> > >> Extends the meson build on Debian to also check for missing dependencies > >> by executing > >> > >> ninja -t missingdeps > >> > >> right after the build. This highlights unintended dependencies. > >> > >> Reviewed-by: Andres Freund <andres@anarazel.de> > >> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com > > > > FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to > > have it just in CI until we have that. I would however also add it to the > > windows job, as that's the most "different" type of build / source of missed > > dependencies that wouldn't show up on our development systems. > > > > We can add it as a meson.build test, sure. I was going for the CI first, > because then it fires no matter what build I do locally (I'm kinda still > used to autotools). A meson test would do the same thing, it'd fail while running the tests, no? 
> If you agree adding it to build_script is the right way to do that, I'll > do the same thing for the windows job. WFM. Greetings, Andres Freund
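The meson-registered variant discussed in the last two messages could look roughly like this in a top-level meson.build (a hypothetical sketch only; the test name, suite, and guard are assumptions, not from any posted patch):

```meson
# Hypothetical: run 'ninja -t missingdeps' as part of 'meson test', so a
# missing build-graph edge fails both local test runs and CI.
ninja_prog = find_program('ninja', required: false)
if ninja_prog.found()
  test('missingdeps',
       ninja_prog,
       args: ['-C', meson.project_build_root(), '-t', 'missingdeps'],
       suite: 'build')
endif
```

Registering it as a test() rather than a bare CI step is what makes it fire for developers who run `meson test` locally, which is the point Andres raises above.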
On 4/9/25 17:51, Andres Freund wrote: > Hi, > > On 2025-04-09 17:28:31 +0200, Tomas Vondra wrote: >> On 4/9/25 17:14, Andres Freund wrote: >>> I'd mention that the includes of postgres.h/fmgr.h are what caused missing >>> build-time dependencies and via that failures on buildfarm member dogfish. >>> >> >> Not really, I also need to include "c.h" instead of "postgres.h" (which >> is also causing the same failure). > > I did mention postgres.h :) > D'oh, I missed that. I was focused on the fmgr one. > > >>> I think this may not be needed anymore, that was just there for >>> GetSystemInfo(), right? Conversely, I suspect it may now be needed in the new >>> location of pg_numa_get_pagesize()? >>> >> >> Good question. But if it's needed there, shouldn't it have failed on CI? > > Oh. No. It shouldn't have - because that include was completely > unnecessary. We always include windows.h on windows. > Makes sense. I'll get rid of the windows.h include. > > >>>> From 201f8be652e9344dfa247b035a66e52025afa149 Mon Sep 17 00:00:00 2001 >>>> From: Tomas Vondra <tomas@vondra.me> >>>> Date: Wed, 9 Apr 2025 13:29:31 +0200 >>>> Subject: [PATCH 2/2] ci: Check for missing dependencies in meson build >>>> >>>> Extends the meson build on Debian to also check for missing dependencies >>>> by executing >>>> >>>> ninja -t missingdeps >>>> >>>> right after the build. This highlights unintended dependencies. >>>> >>>> Reviewed-by: Andres Freund <andres@anarazel.de> >>>> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com >>> >>> FWIW, while I'd prefer it as a meson.build visible test(), I think it's ok to >>> have it just in CI until we have that. I would however also add it to the >>> windows job, as that's the most "different" type of build / source of missed >>> dependencies that wouldn't show up on our development systems. >>> >> >> We can add it as a meson.build test, sure. 
I was going for the CI first, >> because then it fires no matter what build I do locally (I'm kinda still >> used to autotools). > > A meson test would do the same thing, it'd fail while running the tests, no? > Sure, but only if you use meson. Which I still mostly don't, so I've been thinking about the CI first, because I use that very consistently before pushing something. > >> If you agree adding it to build_script is the right way to do that, I'll >> do the same thing for the windows job. > > WFM. Thanks. I'll polish this a bit more and push. regards -- Tomas Vondra
Hi, On Tue, Apr 08, 2025 at 12:46:16PM +0200, Tomas Vondra wrote: > > > On 4/8/25 01:26, Shinoda, Noriyoshi (SXD Japan FSI) wrote: > > Hi, > > > > Thanks for developing this great feature. > > The manual says that the 'size' column of the pg_shmem_allocations_numa view is 'int4', but the implementation is 'int8'. > > The attached small patch fixes the manual. > > > > Thank you for noticing this and for the fix! Pushed. > > This also reminded me we agreed to change page_num to bigint, which I > forgot to change before commit. So I adjusted that too, separately. I was doing some extra testing and just realized (did not think of it during the review) that maybe we could add a pg_buffercache_numa usage example (like it's already done for pg_buffercache). That might sound obvious but OTOH I think that would not hurt. Something like in the attached? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
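For reference, such a documentation usage example might look like the following (a sketch only, modeled on the queries shown earlier in the thread; it assumes the view exposes a numa_node column, and the actual attached doc patch may of course word it differently):

```sql
-- Hypothetical sketch of a pg_buffercache_numa usage example:
-- show how shared buffers are distributed across NUMA nodes.
SELECT numa_node, count(*)
FROM pg_buffercache_numa
GROUP BY numa_node
ORDER BY numa_node;
```

A NULL numa_node here would correspond to buffers whose backing pages are not (yet) mapped, analogous to the NULL numa_zone_id rows in the early draft output quoted at the top of the thread.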