Thread: Re: Draft for basic NUMA observability
Hi, On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote: > As I have promised to Andres on the Discord hacking server some time > ago, I'm attaching the very brief (and potentially way too rushed) > draft of the first step into NUMA observability on PostgreSQL that was > based on his presentation [0]. It might be rough, but it is to get us > started. The patches were not really even basically tested, they are > more like input for discussion - rather than solid code - to shake out > what should be the proper form of this. > > Right now it gives: > > postgres=# select numa_zone_id, count(*) from pg_buffercache group by > numa_zone_id; > NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000 > numa_zone_id | count > --------------+------- > | 16127 > 6 | 256 > 1 | 1 Thanks for the patch! Not doing a code review but sharing some experimentation. First, I had to: @@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) Size os_page_size; void **os_page_ptrs; int *os_pages_status; - int os_page_count; + uint64 os_page_count; and - os_page_count = (NBuffers * BLCKSZ) / os_page_size; + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; to make it work with non tiny shared_buffers. Observations: when using 2 sessions: Session 1 first loads buffers (e.g., by querying a relation) and then runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;' I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear as unmapped when viewed from a process that hasn't accessed them, even though those same pages appear as allocated on a NUMA node in session 1. To double check, I created a function pg_buffercache_pages_from_pid() that is exactly the same as pg_buffercache_pages() (with your patch) except that it takes a pid as input and uses it in move_pages(<pid>, …). Let me show the results: In session 1 (that "accessed/loaded" the ~65K buffers): postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 5177310 0 | 65192 -2 | 378 (3 rows) postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 1662580 In session 2: postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 5177301 0 | 85 -2 | 65494 (3 rows) ^ postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 5177301 0 | 90 -2 | 65489 (3 rows) But when session's 1 pid is used: postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id; NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 5177301 0 | 65195 -2 | 384 (3 rows) Results show: Correct NUMA distribution in session 1 Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid() with the pid of session 1 as a parameter (the session that actually accessed the buffers) Which makes me wondering if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though. The initial idea that you shared on discord was to use get_mempolicy() but as Andres stated: " One annoying thing about get_mempolicy() is this: If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread had performed a read (load) access to that address, and return the ID of the node where that page was allocated. Forcing the allocation to happen inside a monitoring function is decidedly not great. " The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>") while using get_mempolicy(). But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi Jakub, On Mon, Feb 17, 2025 at 01:02:04PM +0100, Jakub Wartak wrote: > On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Hi Bertrand, > > Thanks for playing with this! > > > Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observethe same behavior though. > > You are correct, I'm observing identical behaviour, please see attached. Thanks for confirming! > > We probably would need to split it to some separate and new view > within the pg_buffercache extension, but that is going to be slow, yet > still provide valid results. Yup. > In the previous approach that > get_mempolicy() was allocating on 1st access, but it was slow not only > because it was allocating but also because it was just 1 syscall per > 1x addr (yikes!). I somehow struggle to imagine how e.g. scanning > (really allocating) a 128GB buffer cache in future won't cause issues > - that's like 16-17mln (* 2) syscalls to be issued when not using > move_pages(2) Yeah, get_mempolicy() not working on a range is not great. > > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)),thoughts? > > Different perspective: I wanted to use the same approach in the new > pg_shmemallocations_numa, but that won't cut it there. The other idea > that came to my mind is to issue move_pages() from the backend that > has already used all of those pages. That literally mean on of the > below ideas: > 1. from somewhere like checkpointer / bgwriter? > 2. add touching memory on backend startup like always (sic!) > 3. or just attempt to read/touch memory addr just before calling > move_pages(). E.g. this last options is just two lines: > > if(os_page_ptrs[blk2page+j] == 0) { > + volatile uint64 touch pg_attribute_unused(); > os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + > (os_page_size*j); > + touch = *(uint64 *)os_page_ptrs[blk2page+j]; > } > > and it seems to work while still issuing much less syscalls with > move_pages() across backends, well at least here. One of the main issue I see with 1. and 2. is that we would not get accurate results should the kernel decides to migrate the pages. Indeed, the process doing the move_pages() call needs to have accessed the pages more recently than any kernel migrations to see accurate locations. OTOH, one of the main issue that I see with 3. is that the monitoring could probably influence the kernel's decision to start pages migration (I'm not 100% sure but I could imagine it may influence the kernel's decision due to having to read/touch the pages). But I'm thinking: do we really need to know the page location of every single page? I think what we want to see is if the pages are "equally" distributed on all the nodes or are somehow "stuck" to one (or more) nodes. In that case what about using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer or contiguous chunks). We could create a new function that would accept a "sampling distance" as parameter for example, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi Bertrand, TL;DR; the main problem seems choosing which way to page-fault the shared memory before the backend is going to use numa_move_pages() as the memory mappings (fresh after fork()/CoW) seem to be not ready for numa_move_pages() inquiry. On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > We probably would need to split it to some separate and new view > > within the pg_buffercache extension, but that is going to be slow, yet > > still provide valid results. > > Yup. OK so I've made that NUMA inquiry (now with that "volatile touch" to get valid results for not used memory) into a new and separate pg_buffercache_numa view. This avoids the problem that somebody would automatically run into this slow path when using pg_buffercache. > > > But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)),thoughts? > > > > Different perspective: I wanted to use the same approach in the new > > pg_shmemallocations_numa, but that won't cut it there. The other idea > > that came to my mind is to issue move_pages() from the backend that > > has already used all of those pages. That literally mean on of the > > below ideas: > > 1. from somewhere like checkpointer / bgwriter? > > 2. add touching memory on backend startup like always (sic!) > > 3. or just attempt to read/touch memory addr just before calling > > move_pages(). E.g. this last options is just two lines: > > > > if(os_page_ptrs[blk2page+j] == 0) { > > + volatile uint64 touch pg_attribute_unused(); > > os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + > > (os_page_size*j); > > + touch = *(uint64 *)os_page_ptrs[blk2page+j]; > > } > > > > and it seems to work while still issuing much less syscalls with > > move_pages() across backends, well at least here. > > One of the main issue I see with 1. and 2. is that we would not get accurate > results should the kernel decides to migrate the pages. Indeed, the process doing > the move_pages() call needs to have accessed the pages more recently than any > kernel migrations to see accurate locations. We never get fully accurate state as the zone memory migration might be happening as we query it, but in theory we could add something to e.g. checkpointer/bgwriter that would inquiry it on demand and report it back somewhat through shared memory (?), but I'm somehow afraid because as stated at the end of email, it might take some time (well we probably wouldn't need to "touch memory" then after all, as all of it is active), but that's still impact to those bgworkers. Somehow I feel safer if that code is NOT part of bgworker. > OTOH, one of the main issue that I see with 3. is that the monitoring could > probably influence the kernel's decision to start pages migration (I'm not 100% > sure but I could imagine it may influence the kernel's decision due to having to > read/touch the pages). > > But I'm thinking: do we really need to know the page location of every single page? > I think what we want to see is if the pages are "equally" distributed on all > the nodes or are somehow "stuck" to one (or more) nodes. In that case what about > using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer > or contiguous chunks). We could create a new function that would accept a > "sampling distance" as parameter for example, thoughts? The way I envision it (and I think what Andres wanted, not sure, still yet to see him comment on all of this) is to give PG devs a way to quickly spot NUMA imbalances, even for single relation. Probably some DBA in the wild could also query it to see how PG/kernel distributes memory from time to time. It seems to be more debugging and coding aid for future NUMA optimizations, rather than being used by some monitoring constantly. I would even dare to say it would require --enable-debug (or some other developer-only toggle), but apparently there's no need to hide it like that if those are separate views. Changes since previous version: 0. rebase due the recent OAuth commit introducing libcurl 1. cast uint64 for NBuffers as You found out 2. put stuff into pg_buffercache_numa 3. 0003 adds pg_shmem_numa_allocations Or should we rather call it pg_shmem_numa_zones or maybe just pg_shm_numa ? If there would be agreement that this is the way we want to have it (from the backend and not from checkpointer), here's what's left on the table to be done here: a. isn't there something quicker for touching / page-faulting memory ? If not then maybe add CHECKS_FOR_INTERRUPTS() there? BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't help (it probably only works for parent//postmaster). I've also tried MADV_POPULATE_READ (5.14+ kernels only) and that seems to work too: + rc = madvise(ShmemBase, ShmemSegHdr->totalsize, MADV_POPULATE_READ); + if(rc != 0) { + elog(NOTICE, "madvice() failed"); + } [..] - volatile uint64 touch pg_attribute_unused(); os_page_ptrs[i] = (char *)ent->location + (i * os_page_size); - touch = *(uint64 *)os_page_ptrs[i]; with volatile touching memory or MADV_POPULATE_READ the result seems to reliable (s_b 128MB here): postgres@postgres:1234 : 14442 # select * from pg_shmem_numa_allocations order by numa_size desc; name | numa_zone_id | numa_size ------------------------------------------------+--------------+----------- Buffer Blocks | 0 | 134221824 XLOG Ctl | 0 | 4206592 Buffer Descriptors | 0 | 1048576 transaction | 0 | 528384 Checkpointer Data | 0 | 524288 Checkpoint BufferIds | 0 | 327680 Shared Memory Stats | 0 | 311296 [..] without at least one of those two, new backend reports complete garbage: name | numa_zone_id | numa_size ------------------------------------------------+--------------+----------- Buffer Blocks | 0 | 995328 Shared Memory Stats | 0 | 245760 shmInvalBuffer | 0 | 65536 Buffer Descriptors | 0 | 65536 Backend Status Array | 0 | 61440 serializable | 0 | 57344 [..] b. refactor shared code so that it goes into src/port (but with Linux-only support so far) c. should we use MemoryContext in pg_get_shmem_numa_allocations or not? d. fix tests, indent it, docs, make cfbot happy As for the sampling, dunno, fine for me. As an optional argument? but wouldn't it be better to find a way to actually for it to be quick? OK, so here's larger test, on 512GB with 8x NUMA nodes and s_b set to 128GB with numactl --interleave=all pg_ctl start: postgres=# select * from pg_shmem_numa_allocations ; name | numa_zone_id | numa_size ------------------------------------------------+--------------+------------- [..] Buffer Blocks | 0 | 17179869184 Buffer Blocks | 1 | 17179869184 Buffer Blocks | 2 | 17179869184 Buffer Blocks | 3 | 17179869184 Buffer Blocks | 4 | 17179869184 Buffer Blocks | 5 | 17179869184 Buffer Blocks | 6 | 17179869184 Buffer Blocks | 7 | 17179869184 Buffer IO Condition Variables | 0 | 33554432 Buffer IO Condition Variables | 1 | 33554432 Buffer IO Condition Variables | 2 | 33554432 [..] but it takes 23s. Yes it takes 23s to just gather that info with memory touch, but that's ~128GB of memory and is hardly responsible (lack of C_F_I()). By default without numactl's interleave=all, you get clear picture of lack of NUMA awareness in PG shared segment (just as Andres presented, but now it is evident; well it is subject to autobalancing of course): postgres=# select * from pg_shmem_numa_allocations ; name | numa_zone_id | numa_size ------------------------------------------------+--------------+------------- [..] commit_timestamp | 0 | 2097152 commit_timestamp | 1 | 6291456 commit_timestamp | 2 | 0 commit_timestamp | 3 | 0 commit_timestamp | 4 | 0 [..] transaction | 0 | 14680064 transaction | 1 | 0 transaction | 2 | 0 transaction | 3 | 0 transaction | 4 | 2097152 [..] Somehow without interleave it is very quick too. -J.
Attachment
Hi, On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote: > On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > OTOH, one of the main issue that I see with 3. is that the monitoring could > > probably influence the kernel's decision to start pages migration (I'm not 100% > > sure but I could imagine it may influence the kernel's decision due to having to > > read/touch the pages). > > > > But I'm thinking: do we really need to know the page location of every single page? > > I think what we want to see is if the pages are "equally" distributed on all > > the nodes or are somehow "stuck" to one (or more) nodes. In that case what about > > using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer > > or contiguous chunks). We could create a new function that would accept a > > "sampling distance" as parameter for example, thoughts? > > The way I envision it (and I think what Andres wanted, not sure, still > yet to see him comment on all of this) is to give PG devs a way to > quickly spot NUMA imbalances, even for single relation. Yea. E.g. for some benchmark workloads the difference whether the root btree page is on the same NUMA node as the workload or not makes a roughly 2x perf difference. It's really hard to determine that today. > If there would be agreement that this is the way we want to have it > (from the backend and not from checkpointer), here's what's left on > the table to be done here: > a. isn't there something quicker for touching / page-faulting memory ? If you actually fault in a page the kernel actually has to allocate memory and then zero it out. That rather severely limits the throughput... > If not then maybe add CHECKS_FOR_INTERRUPTS() there? Should definitely be there. > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't > help (it probably only works for parent//postmaster). Yes, needs to be in postmaster. Does the issue with "new" backends seeing pages as not present exist both with and without huge pages? FWIW, what you posted fails on CI: https://cirrus-ci.com/task/5114213770723328 Probably some ifdefs are missing. The sanity-check task configures with minimal dependencies, which is why you're seeing this even on linux. > b. refactor shared code so that it goes into src/port (but with > Linux-only support so far) > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not? You mean a specific context instead of CurrentMemoryContext? > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml > index 91b51142d2e..e3b7554d9e8 100644 > --- a/.cirrus.tasks.yml > +++ b/.cirrus.tasks.yml > @@ -436,6 +436,10 @@ task: > SANITIZER_FLAGS: -fsanitize=address > PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range > > + > + # FIXME: use or not the libnuma? > + # --with-libnuma \ > + # > # Normally, the "relation segment" code basically has no coverage in our > # tests, because we (quite reasonably) don't generate tables large > # enough in tests. We've had plenty bugs that we didn't notice due > the I don't see why not. > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > new file mode 100644 > index 00000000000..e5b3d1f7dd2 > --- /dev/null > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > @@ -0,0 +1,30 @@ > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > + > +-- Register the function. > +DROP FUNCTION pg_buffercache_pages() CASCADE; Why? I think that's going to cause problems, as the pg_buffercache view depends on it, and user views might turn in depend on pg_buffercache. I think CASCADE is rarely, if ever, ok to use in an extension scripot. > +CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean) > +RETURNS SETOF RECORD > +AS 'MODULE_PATHNAME', 'pg_buffercache_pages' > +LANGUAGE C PARALLEL SAFE; > > +-- Create a view for convenient access. > +CREATE OR REPLACE VIEW pg_buffercache AS > + SELECT P.* FROM pg_buffercache_pages(false) AS P > + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, > + relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, > + pinning_backends int4); > + > +CREATE OR REPLACE VIEW pg_buffercache_numa AS > + SELECT P.* FROM pg_buffercache_pages(true) AS P > + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, > + relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, > + pinning_backends int4, numa_zone_id int4); > + > +-- Don't want these to be available to public. > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC; > +REVOKE ALL ON pg_buffercache FROM PUBLIC; > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC; We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same for _numa? > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) > else > fctx->record[i].isvalid = false; > > +#ifdef USE_LIBNUMA > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */ > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) > + blk2page = (int) i * pages_per_blk; BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here. > + j = 0; > + do { > + /* > + * Many buffers can point to the same page, but we want to > + * query just first address. > + * > + * In order to get reliable results we also need to touch memory pages > + * so that inquiry about NUMA zone doesn't return -2. > + */ > + if(os_page_ptrs[blk2page+j] == 0) { > + volatile uint64 touch pg_attribute_unused(); > + os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j); > + touch = *(uint64 *)os_page_ptrs[blk2page+j]; > + } > + j++; > + } while(j < (int)pages_per_blk); > +#endif > + Why is this done before we even have gotten -2 back? Even if we need it, it seems like we ought to defer this until necessary. > +#ifdef USE_LIBNUMA > + if(query_numa) { > + /* According to numa(3) it is required to initialize library even if that's no-op. */ > + if(numa_available() == -1) { > + pg_buffercache_mark_numa_invalid(fctx, NBuffers); > + elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");; > + } else { > + /* Amortize the number of pages we need to query about */ > + if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) { > + elog(ERROR, "failed NUMA pages inquiry status"); > + } I wonder if we ought to override numa_error() so we can display more useful errors. > + > + LWLockAcquire(ShmemIndexLock, LW_SHARED); Doing multiple memory allocations while holding an lwlock is probably not a great idea, even if the lock normally isn't contended. > + os_page_ptrs = palloc(sizeof(void *) * os_page_count); > + os_pages_status = palloc(sizeof(int) * os_page_count); Why do this in very loop iteration? Greetings, Andres Freund
Hi, On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote: > Does the issue with "new" backends seeing pages as not present exist both with > and without huge pages? That's a good point and from what I can see it's correct with huge pages being used (it means all processes see the same NUMA node assignment regardless of access patterns). That said, wouldn't that be too strong to impose a restriction that huge_pages must be enabled? Jakub, thanks for the new patch version! FWIW, I did not look closely to the code yet (just did the minor changes already shared to have valid result with non tiny shared buffer size). I'll look closely at the code for sure once we all agree on the design part of it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote: Hi Andres, thanks for your review! OK first sane version attached with new src/port/pg_numa.c boilerplate in 0001. Fixed some bugs too, there is one remaining optimization to be done (see that `static` question later). Docs/tests are still missing. QQ: I'm still wondering if we there's better way of exposing multiple pg's shma entries pointing to the same page (think of something hot: PROCLOCK or ProcArray), so wouldn't it make sense (in some future thread/patch) to expose this info somehow via additional column (pg_get_shmem_numa_allocations.shared_pages bool?) ? I'm thinking of an easy way of showing that a potential NUMA auto balancing could lead to TLB NUMA shootdowns (not that it is happening or counting , just identifying it as a problem in allocation). Or that stuff doesn't make sense as we already have pg_shm_allocations.{off|size} and we could derive such info from it (after all it is for devs?)? postgres@postgres:1234 : 18843 # select name,off,off+allocated_size,allocated_size from pg_shmem_allocations order by off; name | off | ?column? | allocated_size ------------------------------------------------+-----------+-----------+---------------- [..] Proc Header | 147114112 | 147114240 | 128 Proc Array | 147274752 | 147275392 | 640 KnownAssignedXids | 147275392 | 147310848 | 35456 KnownAssignedXidsValid | 147310848 | 147319808 | 8960 Backend Status Array | 147319808 | 147381248 | 61440 postgres@postgres:1234 : 18924 # select * from pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array', 'KnownAssignedXids', '..') order by name; name | numa_zone_id | numa_size -------------------+--------------+----------- KnownAssignedXids | 0 | 2097152 Proc Array | 0 | 2097152 Proc Header | 0 | 2097152 I.e. ProcArray ends and right afterwards KnownAssignedXids start, both are hot , but on the same HP and NUMA node > > If there would be agreement that this is the way we want to have it > > (from the backend and not from checkpointer), here's what's left on > > the table to be done here: > > > a. isn't there something quicker for touching / page-faulting memory ? > > If you actually fault in a page the kernel actually has to allocate memory and > then zero it out. That rather severely limits the throughput... OK, no comments about that madvise(MADV_POPULATE_READ), so I'm sticking to pointers. > > If not then maybe add CHECKS_FOR_INTERRUPTS() there? > > Should definitely be there. Added. > > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't > > help (it probably only works for parent//postmaster). > > Yes, needs to be in postmaster. > > Does the issue with "new" backends seeing pages as not present exist both with > and without huge pages? Please see attached file for more verbose results, but in short it is like below: patch(-touchpages) hugepages=off INVALID RESULTS (-2) patch(-touchpages) hugepages=on INVALID RESULTS (-2) patch(touchpages) hugepages=off CORRECT RESULT patch(touchpages) hugepages=on CORRECT RESULT patch(-touchpages)+MAP_POPULATE hugepages=off INVALID RESULTS (-2) patch(-touchpages)+MAP_POPULATE hugepages=on INVALID RESULTS (-2) IMHVO, the only other thing that could work here (but still page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are welcome, maybe it might be kernel version dependent. BTW: and yes you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE during startup, it seems that for our case child backends that don't come-up with pre-faulted page attachments across fork() apparently. > FWIW, what you posted fails on CI: > https://cirrus-ci.com/task/5114213770723328 > > Probably some ifdefs are missing. The sanity-check task configures with > minimal dependencies, which is why you're seeing this even on linux. Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all of this CI stuff... > > b. refactor shared code so that it goes into src/port (but with > > Linux-only support so far) Done. > > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not? > > You mean a specific context instead of CurrentMemoryContext? Yes, I had doubts earlier, but for now I'm going to leave it as it is. > > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml > > index 91b51142d2e..e3b7554d9e8 100644 > > --- a/.cirrus.tasks.yml > > +++ b/.cirrus.tasks.yml > > @@ -436,6 +436,10 @@ task: > > SANITIZER_FLAGS: -fsanitize=address > > PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range > > > > + > > + # FIXME: use or not the libnuma? > > + # --with-libnuma \ > > + # > > # Normally, the "relation segment" code basically has no coverage in our > > # tests, because we (quite reasonably) don't generate tables large > > # enough in tests. We've had plenty bugs that we didn't notice due > > the > > I don't see why not. Fixed. > > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > > new file mode 100644 > > index 00000000000..e5b3d1f7dd2 > > --- /dev/null > > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > > @@ -0,0 +1,30 @@ > > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > > + > > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > > + > > +-- Register the function. > > +DROP FUNCTION pg_buffercache_pages() CASCADE; > > Why? I think that's going to cause problems, as the pg_buffercache view > depends on it, and user views might turn in depend on pg_buffercache. I think > CASCADE is rarely, if ever, ok to use in an extension scripot. ... it's just me cutting corners :^), fixed now. [..] > > +-- Don't want these to be available to public. > > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC; > > +REVOKE ALL ON pg_buffercache FROM PUBLIC; > > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC; > > We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same > for _numa? Yup, fixed. > > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) > > else > > fctx->record[i].isvalid = false; > > > > +#ifdef USE_LIBNUMA > > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */ > > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) > > + blk2page = (int) i * pages_per_blk; > > BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here. Fixed, thanks I was looking for something like this! Is that +1 in v4 good? > > + j = 0; > > + do { > > + /* > > + * Many buffers can point to the same page, but we want to > > + * query just first address. > > + * > > + * In order to get reliable results we also need to touch memory pages > > + * so that inquiry about NUMA zone doesn't return -2. > > + */ > > + if(os_page_ptrs[blk2page+j] == 0) { > > + volatile uint64 touch pg_attribute_unused(); > > + os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j); > > + touch = *(uint64 *)os_page_ptrs[blk2page+j]; > > + } > > + j++; > > + } while(j < (int)pages_per_blk); > > +#endif > > + > > Why is this done before we even have gotten -2 back? Even if we need it, it > seems like we ought to defer this until necessary. Not fixed yet: maybe we could even do a `static` with `has_this_run_earlier` and just perform this work only once during the first time? > > +#ifdef USE_LIBNUMA > > + if(query_numa) { > > + /* According to numa(3) it is required to initialize library even if that's no-op. */ > > + if(numa_available() == -1) { > > + pg_buffercache_mark_numa_invalid(fctx, NBuffers); > > + elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");; > > + } else { > > + /* Amortize the number of pages we need to query about */ > > + if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) { > > + elog(ERROR, "failed NUMA pages inquiry status"); > > + } > > I wonder if we ought to override numa_error() so we can display more useful > errors. Another question without an easy answer as I never hit this error from numa_move_pages(), one gets invalid stuff in *os_pages_status instead. BUT!: most of our patch just uses things that cannot fail as per libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it) and then still most of numactl stuff works as euid != 0, but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied" or same story with numactl -C 678 date. So unless we start way more heavy use of libnuma (not just for observability) there's like no point in that right now (?) Contrary to that: we can do just do variadic elog() for that, I've put some code, but no idea if that works fine... [..] > Doing multiple memory allocations while holding an lwlock is probably not a > great idea, even if the lock normally isn't contended. [..] > Why do this in very loop iteration? Both fixed. -J.
Attachment
On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote: > > Does the issue with "new" backends seeing pages as not present exist both with > > and without huge pages? > > That's a good point and from what I can see it's correct with huge pages being > used (it means all processes see the same NUMA node assignment regardless of > access patterns). Hi Bertrand , please see that nearby thread. I've got quite the opposite results. I need page-fault memory or I get invalid results ("-2"). What kernel version are you using ? (I've tried it on two 6.10.x series kernels , virtualized in both cases; one was EPYC [real NUMA, but not VM so not real hardware]). > That said, wouldn't that be too strong to impose a restriction that huge_pages > must be enabled? > > Jakub, thanks for the new patch version! FWIW, I did not look closely to the > code yet (just did the minor changes already shared to have valid result with non > tiny shared buffer size). I'll look closely at the code for sure once we all agree > on the design part of it. Cool, I think we are pretty close actually, but others might have different perspective. -J.
Hi, On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > FWIW, what you posted fails on CI: > > https://cirrus-ci.com/task/5114213770723328 > > > > Probably some ifdefs are missing. The sanity-check task configures with > > minimal dependencies, which is why you're seeing this even on linux. > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > of this CI stuff... FYI, you can enable CI on a github repo, to see results without posting to the list: https://github.com/postgres/postgres/blob/master/src/tools/ci/README Greetings, Andres Freund
On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > > FWIW, what you posted fails on CI: > > > https://cirrus-ci.com/task/5114213770723328 > > > > > > Probably some ifdefs are missing. The sanity-check task configures with > > > minimal dependencies, which is why you're seeing this even on linux. > > > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > > of this CI stuff... > > FYI, you can enable CI on a github repo, to see results without posting to the > list: > https://github.com/postgres/postgres/blob/master/src/tools/ci/README Thanks, I'll take a look into it. Meanwhile v5 is attached with slight changes to try to make cfbot happy: 1. fixed tests and added tiny copy-cat basic tests for pg_buffercache_numa and pg_shm_numa_allocations views 2. win32 doesn't have sysconf() No docs yet. -J.
Attachment
Hi, On Wed, Feb 26, 2025 at 02:05:59PM +0100, Jakub Wartak wrote: > On Wed, Feb 26, 2025 at 10:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2025-02-26 09:38:20 +0100, Jakub Wartak wrote: > > > > FWIW, what you posted fails on CI: > > > > https://cirrus-ci.com/task/5114213770723328 > > > > > > > > Probably some ifdefs are missing. The sanity-check task configures with > > > > minimal dependencies, which is why you're seeing this even on linux. > > > > > > Hopefully fixed, we'll see what cfbot tells, I'm flying blind with all > > > of this CI stuff... > > > > FYI, you can enable CI on a github repo, to see results without posting to the > > list: > > https://github.com/postgres/postgres/blob/master/src/tools/ci/README > > Thanks, I'll take a look into it. > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: Thanks for the updated version! FWIW, I had to do a few changes to get an error free compiling experience with autoconf/or meson and both with or without the libnuma configure option. Sharing here as .txt files: v5-0004-configure-changes.txt: changes in configure + add a test on numa.h availability and a call to numa_available. v5-0005-pg_numa.c-changes.txt: moving the <unistd.h> outside of USE_LIBNUMA because the file is still using sysconf() in the non-NUMA code path. Also, removed a ";" in "#endif;" in the non-NUMA code path. v5-0006-meson.build-changes.txt. Those apply on top of your v5. Also the pg_buffercache test fails without the libnuma configure option. Maybe some tests should depend of the libnuma configure option. Still did not look closely to the code. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi Jakub, On Wed, Feb 26, 2025 at 09:48:41AM +0100, Jakub Wartak wrote: > On Mon, Feb 24, 2025 at 5:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Mon, Feb 24, 2025 at 09:06:20AM -0500, Andres Freund wrote: > > > Does the issue with "new" backends seeing pages as not present exist both with > > > and without huge pages? > > > > That's a good point and from what I can see it's correct with huge pages being > > used (it means all processes see the same NUMA node assignment regardless of > > access patterns). > > Hi Bertrand , please see that nearby thread. I've got quite the > opposite results. I need page-fault memory or I get invalid results > ("-2"). What kernel version are you using ? (I've tried it on two > 6.10.x series kernels , virtualized in both cases; one was EPYC [real > NUMA, but not VM so not real hardware]) Thanks for sharing your numbers! It looks like that with hp enabled then the shared_buffers plays a role. 1. With hp, shared_buffers 4GB: huge_pages_status ------------------- on (1 row) shared_buffers ---------------- 4GB (1 row) NOTICE: os_page_count=2048 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+-------- | 507618 0 | 1054 -2 | 15616 (3 rows) 2. With hp, shared_buffers 23GB: huge_pages_status ------------------- on (1 row) shared_buffers ---------------- 23GB (1 row) NOTICE: os_page_count=11776 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+--------- | 2997974 0 | 16682 (2 rows) 3. no hp, shared_buffers 23GB: huge_pages_status ------------------- off (1 row) shared_buffers ---------------- 23GB (1 row) ERROR: extension "pg_buffercache" already exists NOTICE: os_page_count=6029312 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count --------------+--------- | 2997975 -2 | 16482 1 | 199 (3 rows) Maybe the kernel is taking some decisions based on the HugePages_Rsvd, I've no ideas. Anyway there is little than we can do and the "touchpages" patch seems to provide "accurate" results. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: [..] > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: > > Thanks for the updated version! > > FWIW, I had to do a few changes to get an error free compiling experience with > autoconf/or meson and both with or without the libnuma configure option. > > Sharing here as .txt files: > > Also the pg_buffercache test fails without the libnuma configure option. Maybe > some tests should depend of the libnuma configure option. [..] Thank you so much for this Bertrand ! I've applied those , played a little bit with configure and meson and reproduced the test error and fixed it by silencing that NOTICE in tests. So v6 is attached even before I get a chance to start using that CI. Still waiting for some input and tests regarding that earlier touchpages attempt, docs are still missing... -J.
Attachment
Hi, On Thu, Feb 27, 2025 at 10:05:46AM +0100, Jakub Wartak wrote: > On Wed, Feb 26, 2025 at 6:13 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > [..] > > > Meanwhile v5 is attached with slight changes to try to make cfbot happy: > > > > Thanks for the updated version! > > > > FWIW, I had to do a few changes to get an error free compiling experience with > > autoconf/or meson and both with or without the libnuma configure option. > > > > Sharing here as .txt files: > > > > Also the pg_buffercache test fails without the libnuma configure option. Maybe > > some tests should depend of the libnuma configure option. > [..] > > Thank you so much for this Bertrand ! > > I've applied those , played a little bit with configure and meson and > reproduced the test error and fixed it by silencing that NOTICE in > tests. So v6 is attached even before I get a chance to start using > that CI. Still waiting for some input and tests regarding that earlier > touchpages attempt, docs are still missing... Thanks for the new version! I did some tests and it looks like it's giving correct results. I don't see -2 anymore and every backend reports correct distribution (with or without hp, with "small" or "large" shared buffer). A few random comments: === 1 + /* + * This is for gathering some NUMA statistics. We might be using + * various DB block sizes (4kB, 8kB , .. 32kB) that end up being + * allocated in various different OS memory pages sizes, so first we + * need to understand the OS memory page size before calling + * move_pages() + */ + os_page_size = pg_numa_get_pagesize(); + os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size; + pages_per_blk = (float) BLCKSZ / os_page_size; + + elog(DEBUG1, "NUMA os_page_count=%d os_page_size=%ld pages_per_blk=%f", + os_page_count, os_page_size, pages_per_blk); + + os_page_ptrs = palloc(sizeof(void *) * os_page_count); + os_pages_status = palloc(sizeof(int) * os_page_count); + memset(os_page_ptrs, 0, sizeof(void *) * os_page_count); + + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page mapping code here + */ + memset(os_pages_status, 0xff, sizeof(int) * os_page_count); I think that if (query_numa) check should also wrap that entire section of code. === 2 + if (query_numa) + { + blk2page = (int) i * pages_per_blk; + j = 0; + do + { This check is done for every page. I wonder if it would not make sense to create a brand new function for pg_buffercache_numa and just let the current pg_buffercache_pages() as it is. That said it would be great to avoid code duplication as much a possible though, maybe using a shared populate_buffercache_entry() or such helper function? === 3 +#define ONE_GIGABYTE 1024*1024*1024 + if ((i * os_page_size) % ONE_GIGABYTE == 0) + CHECK_FOR_INTERRUPTS(); + } Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS() for every page instead? (I don't see with a 30GB shared buffer). I've the feeling that we could get rid of the "ONE_GIGABYTE" check. === 4 + pfree(os_page_ptrs); + pfree(os_pages_status); Not sure that's needed, we should be in a short-lived memory context here (ExprContext or such). === 5 +/* SQL SRF showing NUMA zones for allocated shared memory */ +Datum +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS) +{ That's a good idea. + for (i = 0; i < shm_ent_page_count; i++) + { + /* + * In order to get reliable results we also need to touch memory + * pages so that inquiry about NUMA zone doesn't return -2. + */ + volatile uint64 touch pg_attribute_unused(); + + page_ptrs[i] = (char *) ent->location + (i * os_page_size); + pg_numa_touch_mem_if_required(touch, page_ptrs[i]); That sounds right. Could we also avoid some code duplication with pg_get_shmem_allocations()? Also same remarks about pfree() and ONE_GIGABYTE as above. A few other things: ==== 6 +++ b/src/backend/storage/ipc/shmem.c @@ -73,6 +73,7 @@ #include "storage/shmem.h" #include "storage/spin.h" #include "utils/builtins.h" +#include "port/pg_numa.h" Not at the right position, should be between those 2: #include "miscadmin.h" #include "storage/lwlock.h" ==== 7 +/*------------------------------------------------------------------------- + * + * pg_numa.h + * Miscellaneous functions for bit-wise operations. description is not correct. Also the "Copyright (c) 2019-2025" might be "Copyright (c) 2025" instead. === 8 +++ b/src/port/pg_numa.c @@ -0,0 +1,150 @@ +/*------------------------------------------------------------------------- + * + * numa.c + * Basic NUMA portability routines s/numa.c/pg_numa.c/ ? === 9 +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -6,6 +6,7 @@ * contrib/pg_buffercache/pg_buffercache_pages.c *------------------------------------------------------------------------- */ +#include "pg_config.h" #include "postgres.h" Is this new include needed? #include "access/htup_details.h" @@ -13,10 +14,12 @@ #include "funcapi.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "port/pg_numa.h" +#include "storage/pg_shmem.h" not in the right order. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote: > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > Why is this done before we even have gotten -2 back? Even if we need it, it > > seems like we ought to defer this until necessary. > > Not fixed yet: maybe we could even do a `static` with > `has_this_run_earlier` and just perform this work only once during the > first time? Not sure I get your idea, could you share what the code would look like? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi! On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > I did some tests and it looks like it's giving correct results. I don't see -2 > anymore and every backend reports correct distribution (with or without hp, > with "small" or "large" shared buffer). Cool! Attached is v7 that is fully green on cirrus CI that Andres recommended, we will see how cfbot reacts to this. BTW docs are still missing. When started with proper interleave=all with s_b=64GB, hugepages and 4 NUMA nodes (1socket with 4 CCDs) after small pgbench: postgres=# select buffers_used, buffers_unused from pg_buffercache_summary(); buffers_used | buffers_unused --------------+---------------- 170853 | 8217755 ( ostgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906 numa_zone_id | count --------------+--------- 0 | 42752 1 | 42752 2 | 42752 3 | 42597 | 8217755 Time: 5828.100 ms (00:05.828) postgres=# select 3*42752+42597; ?column? ---------- 170853 postgres=# select * from pg_shmem_numa_allocations order by numa_size desc limit 12; DEBUG: NUMA: page-faulting shared memory segments for proper NUMA readouts name | numa_zone_id | numa_size --------------------+--------------+------------- Buffer Blocks | 0 | 17179869184 Buffer Blocks | 1 | 17179869184 Buffer Blocks | 3 | 17179869184 Buffer Blocks | 2 | 17179869184 Buffer Descriptors | 2 | 134217728 Buffer Descriptors | 1 | 134217728 Buffer Descriptors | 0 | 134217728 Buffer Descriptors | 3 | 134217728 Checkpointer Data | 1 | 67108864 Checkpointer Data | 0 | 67108864 Checkpointer Data | 2 | 67108864 Checkpointer Data | 3 | 67108864 Time: 68.579 ms > A few random comments: > > === 1 > [..] > I think that the if (query_numa) check should also wrap that entire section of code. Done. > === 2 > > + if (query_numa) > + { > + blk2page = (int) i * pages_per_blk; > + j = 0; > + do > + { > > This check is done for every page. I wonder if it would not make sense > to create a brand new function for pg_buffercache_numa and just let the > current pg_buffercache_pages() as it is. That said it would be great to avoid > code duplication as much a possible though, maybe using a shared > populate_buffercache_entry() or such helper function? Well, I've made query_numa a parameter there simply to avoid that code duplication in the first place, look at those TupleDescInitEntry()... IMHO rarely anybody uses pg_buffercache, but we could add unlikely() there maybe to hint compiler with some smaller routine to reduce complexity of that main routine? (assuming NUMA inquiry is going to be rare). > === 3 > > +#define ONE_GIGABYTE 1024*1024*1024 > + if ((i * os_page_size) % ONE_GIGABYTE == 0) > + CHECK_FOR_INTERRUPTS(); > + } > > Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS() > for every page instead? (I don't see with a 30GB shared buffer). I've the > feeling that we could get rid of the "ONE_GIGABYTE" check. You are right and no it was simply my premature optimization attempt, as apparently CFI on closer looks seems to be already having unlikely() and looks really cheap, so yeah I've removed that. > === 4 > > + pfree(os_page_ptrs); > + pfree(os_pages_status); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). Yes, I wanted to have it just for illustrative and stylish purposes, but right, removed. > === 5 > > +/* SQL SRF showing NUMA zones for allocated shared memory */ > +Datum > +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS) > +{ [..] > > + for (i = 0; i < shm_ent_page_count; i++) > + { > + /* > + * In order to get reliable results we also need to touch memory > + * pages so that inquiry about NUMA zone doesn't return -2. > + */ > + volatile uint64 touch pg_attribute_unused(); > + > + page_ptrs[i] = (char *) ent->location + (i * os_page_size); > + pg_numa_touch_mem_if_required(touch, page_ptrs[i]); > > That sounds right. > > Could we also avoid some code duplication with pg_get_shmem_allocations()? Not sure I understand do you want to avoid code duplication pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa = true) ? > > Also same remarks about pfree() and ONE_GIGABYTE as above. Fixed. > A few other things: > > ==== 6 > > +#include "port/pg_numa.h" > Not at the right position, should be between those 2: > > #include "miscadmin.h" > #include "storage/lwlock.h" Fixed. > ==== 7 > > +/*------------------------------------------------------------------------- > + * > + * pg_numa.h > + * Miscellaneous functions for bit-wise operations. > > description is not correct. Also the "Copyright (c) 2019-2025" might be > "Copyright (c) 2025" instead. Fixed. > === 8 > > +++ b/src/port/pg_numa.c > @@ -0,0 +1,150 @@ > +/*------------------------------------------------------------------------- > + * > + * numa.c > + * Basic NUMA portability routines > > s/numa.c/pg_numa.c/ ? Fixed. > === 9 > > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > @@ -6,6 +6,7 @@ > * contrib/pg_buffercache/pg_buffercache_pages.c > *------------------------------------------------------------------------- > */ > +#include "pg_config.h" > #include "postgres.h" > > Is this new include needed? Removed, don't remember how it arrived here, most have been some artifact of earlier attempts. > #include "access/htup_details.h" > @@ -13,10 +14,12 @@ > #include "funcapi.h" > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > +#include "port/pg_numa.h" > +#include "storage/pg_shmem.h" > > not in the right order. Fixed. And also those from nearby message: On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote: > > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Why is this done before we even have gotten -2 back? Even if we need it, it > > > seems like we ought to defer this until necessary. > > > > Not fixed yet: maybe we could even do a `static` with > > `has_this_run_earlier` and just perform this work only once during the > > first time? > > Not sure I get your idea, could you share what the code would look like? Please see pg_buffercache_pages I've just added static bool firstUseInBackend: postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000 DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts [..] postgres@postgres:1234 : 25103 # select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id; DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000 numa_zone_id | count [..] Same was done to the pg_get_shmem_numa_allocations. Also CFbot/cirrus was getting: > [11:01:05.027] In file included from ../../src/include/postgres.h:49, > [11:01:05.027] from pg_buffercache_pages.c:10: > [11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’: > [11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 hastype ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=] Fixed with %zu (for size_t) instead of %ld. > Linux - Debian Bookworm - Autoconf got: > [10:42:59.216] checking numa.h usability... no > [10:42:59.268] checking numa.h presence... no > [10:42:59.286] checking for numa.h... no > [10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma I've added libnuma1 to cirrus in a similar vein like libcurl to avoid this. > [13:50:47.449] gcc -m32 @src/backend/postgres.rsp > [13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format I've also got an error in 32-bit build as libnuma.h is there, but apparently libnuma provides only x86_64. Anyway 32-bit(even with PAE) and NUMA doesn't seem to make sense so I've added -Dlibnuma=off for such build. -J.
Attachment
Hi, On Tue, Mar 04, 2025 at 11:48:31AM +0100, Jakub Wartak wrote: > Hi! > > On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > I did some tests and it looks like it's giving correct results. I don't see -2 > > anymore and every backend reports correct distribution (with or without hp, > > with "small" or "large" shared buffer). > > Cool! Attached is v7 Thanks for the new version! > > === 2 > > > > + if (query_numa) > > + { > > + blk2page = (int) i * pages_per_blk; > > + j = 0; > > + do > > + { > > > > This check is done for every page. I wonder if it would not make sense > > to create a brand new function for pg_buffercache_numa and just let the > > current pg_buffercache_pages() as it is. That said it would be great to avoid > > code duplication as much a possible though, maybe using a shared > > populate_buffercache_entry() or such helper function? > > Well, I've made query_numa a parameter there simply to avoid that code > duplication in the first place, look at those TupleDescInitEntry()... Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() or such function: to put the "duplicated" code in it and then use this shared function in pg_buffercache_pages() and in the new numa related one. > IMHO rarely anybody uses pg_buffercache, but we could add unlikely() I think unlikely() should be used for optimization based on code path likelihood, not based on how often users might use a feature. > > === 5 > > > > Could we also avoid some code duplication with pg_get_shmem_allocations()? > > Not sure I understand do you want to avoid code duplication > pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or > pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa = > true) ? I meant to say avoid code duplication between pg_get_shmem_allocations() and pg_get_shmem_numa_allocations(). It might be possible to create a shared function for them too. That said, it looks like that the savings (if any), would not be that much, so maybe just forget about it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Mar 4, 2025 at 5:02 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Cool! Attached is v7 > Thanks for the new version! ... and another one: 7b ;) > > > === 2 [..] > > Well, I've made query_numa a parameter there simply to avoid that code > > duplication in the first place, look at those TupleDescInitEntry()... > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > or such function: to put the "duplicated" code in it and then use this > shared function in pg_buffercache_pages() and in the new numa related one. OK, so hastily attempted that in 7b , I had to do a larger refactor there to avoid code duplication between those two. I don't know which attempt is better though (7 vs 7b).. > > IMHO rarely anybody uses pg_buffercache, but we could add unlikely() > > I think unlikely() should be used for optimization based on code path likelihood, > not based on how often users might use a feature. In 7b I've removed the unlikely() - For a moment I was thinking that you are concerned about this loop for NBuffers to be as much optimized as it can and that's the reason for splitting the routines. > > > === 5 [..] > I meant to say avoid code duplication between pg_get_shmem_allocations() and > pg_get_shmem_numa_allocations(). It might be possible to create a shared > function for them too. That said, it looks like that the savings (if any), would > not be that much, so maybe just forget about it. Yeah, OK, so let's leave it at that. -J.
Attachment
Hi, On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: >Hi, > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > or such function: to put the "duplicated" code in it and then use this > > shared function in pg_buffercache_pages() and in the new numa related one. > > OK, so hastily attempted that in 7b , I had to do a larger refactor > there to avoid code duplication between those two. I don't know which > attempt is better though (7 vs 7b).. > I'm attaching basically the earlier stuff (v7b) as v8 with the following minor changes: - docs are included - changed int8 to int4 in one function definition for numa_zone_id -J.
Attachment
On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > Hi, > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > >Hi, > > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > > or such function: to put the "duplicated" code in it and then use this > > > shared function in pg_buffercache_pages() and in the new numa related one. > > > > OK, so hastily attempted that in 7b , I had to do a larger refactor > > there to avoid code duplication between those two. I don't know which > > attempt is better though (7 vs 7b).. > > > > I'm attaching basically the earlier stuff (v7b) as v8 with the > following minor changes: > - docs are included > - changed int8 to int4 in one function definition for numa_zone_id .. and v9 attached because cfbot partially complained about .cirrus.tasks.yml being adjusted recently (it seems master is hot these days). -J.
Attachment
Hi, On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote: > On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > > > Hi, > > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak > > <jakub.wartak@enterprisedb.com> wrote: > > >Hi, > > > > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry() > > > > or such function: to put the "duplicated" code in it and then use this > > > > shared function in pg_buffercache_pages() and in the new numa related one. > > > > > > OK, so hastily attempted that in 7b , I had to do a larger refactor > > > there to avoid code duplication between those two. I don't know which > > > attempt is better though (7 vs 7b).. > > > > > > > I'm attaching basically the earlier stuff (v7b) as v8 with the > > following minor changes: > > - docs are included > > - changed int8 to int4 in one function definition for numa_zone_id > > .. and v9 attached because cfbot partially complained about > .cirrus.tasks.yml being adjusted recently (it seems master is hot > these days). > Thanks for the new version! Some random comments on 0001: === 1 It does not compiles "alone". It's missing: +++ b/src/include/storage/pg_shmem.h @@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */ extern PGDLLIMPORT int shared_memory_type; extern PGDLLIMPORT int huge_pages; extern PGDLLIMPORT int huge_page_size; +extern PGDLLIMPORT int huge_pages_status; and --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -563,7 +563,7 @@ static int ssl_renegotiation_limit; */ int huge_pages = HUGE_PAGES_TRY; int huge_page_size; -static int huge_pages_status = HUGE_PAGES_UNKNOWN; +int huge_pages_status = HUGE_PAGES_UNKNOWN; That come with 0002. So it has to be in 0001 instead. === 2 +else + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 +fi Maybe we should add the same test (checking for numa.h) for meson? === 3 +# FIXME: filter-out / with/without with_libnuma? +LIBS += $(LIBNUMA_LIBS) It looks to me that we can remove those 2 lines. === 4 + Only supported on Linux. s/on Linux/on platforms for which the libnuma library is implemented/? I did a quick grep on "Only supported on" and it looks like that could be a more consistent wording. === 5 +#include "c.h" +#include "postgres.h" +#include "port/pg_numa.h" +#include "storage/pg_shmem.h" +#include <unistd.h> I had a closer look to other header files and it looks like it "should" be: #include "c.h" #include "postgres.h" #include <unistd.h> #ifdef WIN32 #include <windows.h> #endif #include "port/pg_numa.h" #include "storage/pg_shmem.h" And is "#include "c.h"" really needed? === 6 +/* FIXME not tested, might crash */ That's a bit scary. === 7 + * XXX: for now we issue just WARNING, but long-term that might depend on + * numa_set_strict() here s/here/here./ Did not look carefully at all the comments in 0001, 0002 and 0003 though. A few random comments regarding 0002: === 8 # create extension pg_buffercache; ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() RETURNS SETOF RECORD AS '$libdir/pg_buffercache', 'pg_buffercache_pages' LANGUAGE C PARALLEL SAFE" extension script file "pg_buffercache--1.2.sql", near line 7 While that's ok if 0003 is applied. I think that each individual patch should "fully" work individually. === 9 + do + { + if (os_page_ptrs[blk2page + j] == 0) blk2page + j will be repeated multiple times, better to store it in a local variable instead? === 10 + if (firstUseInBackend == true) if (firstUseInBackend) instead? === 11 + int j = 0, + blk2page = (int) i * pages_per_blk; I wonder if size_t is more appropriate for blk2page: size_t blk2page = (size_t)(i * pages_per_blk) maybe? === 12 as we know that we'll iterate until pages_per_blk then would a for loop be more appropriate here, something like? " for (size_t j = 0; j < pages_per_blk; j++) { if (os_page_ptrs[blk2page + j] == 0) { " === 13 + if (buf_state & BM_DIRTY) + fctx->record[i].isdirty = true; + else + fctx->record[i].isdirty = false; fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe? === 14 + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) + fctx->record[i].isvalid = true; + else + fctx->record[i].isvalid = false; fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) maybe? === 15 +populate_buffercache_entry(int i, BufferCachePagesContext *fctx) I wonder if "i" should get a more descriptive name? === 16 s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent with pg_stat_io_build_tuples for example). I now realize that I did propose populate_buffercache_entry() up-thread, sorry for changing my mind. === 17 + static bool firstUseInBackend = true; maybe we should give it a more descriptive name? Also I need to think more about how firstUseInBackend is used, for example: ==== 17.1 would it be better to define it as a file scope variable? (at least that would avoid to pass it as an extra parameter in some functions). === 17.2 what happens if firstUseInBackend is set to false and later on the pages are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is called again by a backend that already had set firstUseInBackend to false, would that provide accurate information? === 18 +Datum +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) +{ + FuncCallContext *funcctx; + BufferCachePagesContext *fctx; /* User function context. */ + bool query_numa = true; I don't think we need query_numa anymore in pg_buffercache_numa_pages(). I think that we can just ERROR (or NOTICE and return) here: + if (pg_numa_init() == -1) + { + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable."); + query_numa = false; and fully get rid of query_numa. === 19 And then here: for (i = 0; i < NBuffers; i++) { populate_buffercache_entry(i, fctx); if (query_numa) pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend); } if (query_numa) { if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1) elog(ERROR, "failed NUMA pages inquiry: %m"); for (i = 0; i < NBuffers; i++) { int blk2page = (int) i * pages_per_blk; /* * Technically we can get errors too here and pass that to * user. Also we could somehow report single DB block spanning * more than one NUMA zone, but it should be rare. */ fctx->record[i].numa_zone_id = os_pages_status[blk2page]; } maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"? A few random comments regarding 0003: === 20 + <indexterm zone="view-pg-shmem-numa-allocations"> + <primary>pg_shmem_numa_allocations</primary> + </indexterm> + + <para> + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes s/pg_shmem_allocations/pg_shmem_numa_allocations/? === 21 + /* FIXME: should we release LWlock here ? */ + elog(ERROR, "failed NUMA pages inquiry status: %m"); There is no need, see src/backend/storage/lmgr/README. === 22 +#define MAX_NUMA_ZONES 32 /* FIXME? */ + Size zones[MAX_NUMA_ZONES]; could we rely on pg_numa_get_max_node() instead? === 23 + if (s >= 0) + zones[s]++; should we also check that s is below a limit? === 24 Regarding how we make use of pg_numa_get_max_node(), are we sure there is no possible holes? I mean could a system have node 0,1 and 3 but not 2? Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I did a little in 0001 though) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Thanks for the new version! v10 is attached with most fixes after review and one new thing introduced: pg_numa_available() for run-time decision inside tests which was needed after simplifying code a little bit as you wanted. I've also fixed -Dlibnuma=disabled as it was not properly implemented. There are couple open items (minor/decision things), but most is fixed or answered: > Some random comments on 0001: > > === 1 > > It does not compiles "alone". It's missing: > [..] > +extern PGDLLIMPORT int huge_pages_status; [..] > -static int huge_pages_status = HUGE_PAGES_UNKNOWN; > +int huge_pages_status = HUGE_PAGES_UNKNOWN; > > That come with 0002. So it has to be in 0001 instead. Ugh, good catch, I haven't thought about it in isolation, they are separate to just ease review, but should be committed together. Fixed. > === 2 > > +else > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > +fi > > Maybe we should add the same test (checking for numa.h) for meson? TBH, I have no idea, libnuma.h may exist but it may not link e.g. when cross-compiling 32-bit on 64-bit. Or is this more about keeping sync between meson and autoconf? > === 3 > > +# FIXME: filter-out / with/without with_libnuma? > +LIBS += $(LIBNUMA_LIBS) > > It looks to me that we can remove those 2 lines. Done. > === 4 > > + Only supported on Linux. > > s/on Linux/on platforms for which the libnuma library is implemented/? > > I did a quick grep on "Only supported on" and it looks like that could be > a more consistent wording. Fixed. > === 5 > > +#include "c.h" > +#include "postgres.h" > +#include "port/pg_numa.h" > +#include "storage/pg_shmem.h" > +#include <unistd.h> > > I had a closer look to other header files and it looks like it "should" be: > > #include "c.h" > #include "postgres.h" > > #include <unistd.h> > #ifdef WIN32 > #include <windows.h> > #endif > > #include "port/pg_numa.h" > #include "storage/pg_shmem.h" > > And is "#include "c.h"" really needed? Fixed both. It seems to compile without c.h. > === 6 > > +/* FIXME not tested, might crash */ > > That's a bit scary. When you are in support for long enough it is becoming the norm ;) But on serious note Andres wanted have numa error/warning handlers (used by libnuma), but current code has no realistic code-path to hit it from numa_available(3), numa_move_pages(3) or numa_max_node(3). The situation won't change until one day in future (I hope!) we start using some more advanced libnuma functionality for interleaving memory, please see my earlier reply: https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com E.g. numa_available(3) is tiny wrapper , see https://github.com/numactl/numactl/blob/master/libnuma.c#L872 For now, I've adjusted that FIXME to XXX, but still don't know we could inject failure to see this triggered... > === 7 > > + * XXX: for now we issue just WARNING, but long-term that might depend on > + * numa_set_strict() here > > s/here/here./ Done. > Did not look carefully at all the comments in 0001, 0002 and 0003 though. > > A few random comments regarding 0002: > > === 8 > > # create extension pg_buffercache; > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > RETURNS SETOF RECORD > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > LANGUAGE C PARALLEL SAFE" > extension script file "pg_buffercache--1.2.sql", near line 7 > > While that's ok if 0003 is applied. I think that each individual patch should > "fully" work individually. STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + 0002 or 0001 + 0003, but here 0002 is complaining about lack of pg_numa_query_pages() which is part of 0001 (?) Should I merge those patches or keep them separate? > === 9 > > + do > + { > + if (os_page_ptrs[blk2page + j] == 0) > > blk2page + j will be repeated multiple times, better to store it in a local > variable instead? Done. > === 10 > > + if (firstUseInBackend == true) > > > if (firstUseInBackend) instead? Done everywhere. > === 11 > > + int j = 0, > + blk2page = (int) i * pages_per_blk; > > I wonder if size_t is more appropriate for blk2page: > > size_t blk2page = (size_t)(i * pages_per_blk) maybe? Sure, done. > === 12 > > as we know that we'll iterate until pages_per_blk then would a for loop be more > appropriate here, something like? > > " > for (size_t j = 0; j < pages_per_blk; j++) > { > if (os_page_ptrs[blk2page + j] == 0) > { > " Sure. > === 13 > > + if (buf_state & BM_DIRTY) > + fctx->record[i].isdirty = true; > + else > + fctx->record[i].isdirty = false; > > fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe? > > === 14 > > + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > + fctx->record[i].isvalid = true; > + else > + fctx->record[i].isvalid = false; > > fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > maybe? It is coming from the original pg_buffercache and it is less readable to me, so I don't want to touch that too much because that might open refactoring doors too wide. > === 15 > > +populate_buffercache_entry(int i, BufferCachePagesContext *fctx) > > I wonder if "i" should get a more descriptive name? Done, buffer_id. > === 16 > > s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent > with pg_stat_io_build_tuples for example). > > I now realize that I did propose populate_buffercache_entry() up-thread, > sorry for changing my mind. OK, I've tried to create a better name, but all my ideas were too long, but pg_buffercache_build_tuple sounds nice, so lets use that. > === 17 > > + static bool firstUseInBackend = true; > > maybe we should give it a more descriptive name? I couldn't come up with anything that wouldn't look too long, so instead I've added a comment explaining the meaning behind this variable, hope that's good enough. > Also I need to think more about how firstUseInBackend is used, for example: > > ==== 17.1 > > would it be better to define it as a file scope variable? (at least that > would avoid to pass it as an extra parameter in some functions). I have no strong opinion on this, but I have one doubt (for future): isn't creating global variables making life harder for upcoming multithreading guys ? > === 17.2 > > what happens if firstUseInBackend is set to false and later on the pages > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is > called again by a backend that already had set firstUseInBackend to false, > would that provide accurate information? It is still the correct result. That "touching" (paging-in) is only necessary probably to properly resolve PTEs as the fork() does not seem to carry them over from parent: postgres=# select 'create table foo' || s || ' as select generate_series(1, 100000);' from generate_series(1, 4) s; postgres=# \gexec SELECT 100000 SELECT 100000 SELECT 100000 SELECT 100000 postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id; -- before: numa_zone_id | count --------------+--------- 0 | 256 1 | 4131 | 8384221 postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 1451 -- now use another root (!) session to "migratepages(8)" from numactl will also shift shm segment: # migratepages 1451 1 3 -- while above will be in progress for a lot of time , but the outcome is visible much faster in that backend (pg is functioning): postgres=# select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id order by numa_zone_id; numa_zone_id | count --------------+--------- 0 | 256 3 | 4480 | 8383872 So the above clearly shows that initial touching of shm is required, but just once and it stays valid afterwards. BTW: migratepages(8) was stuck for 1-2 minutes there on "__wait_rcu_gp" according to it's wchan, without any sign of activity on the OS and then out of blue completed just fine, s_b=64GB,HP=on. > === 18 > > +Datum > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > +{ > + FuncCallContext *funcctx; > + BufferCachePagesContext *fctx; /* User function context. */ > + bool query_numa = true; > > I don't think we need query_numa anymore in pg_buffercache_numa_pages(). > > I think that we can just ERROR (or NOTICE and return) here: > > + if (pg_numa_init() == -1) > + { > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable."); > + query_numa = false; > > and fully get rid of query_numa. Right... fixed it here, but it made the tests blow up, so I've had to find a way to conditionally launch tests based on NUMA availability and that's how pg_numa_available() was born. It's in ipc/shmem.c because I couldn't find a better place for it... > === 19 > > And then here: > > for (i = 0; i < NBuffers; i++) > { > populate_buffercache_entry(i, fctx); > if (query_numa) > pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend); > } > > if (query_numa) > { > if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1) > elog(ERROR, "failed NUMA pages inquiry: %m"); > > for (i = 0; i < NBuffers; i++) > { > int blk2page = (int) i * pages_per_blk; > > /* > * Technically we can get errors too here and pass that to > * user. Also we could somehow report single DB block spanning > * more than one NUMA zone, but it should be rare. > */ > fctx->record[i].numa_zone_id = os_pages_status[blk2page]; > } > > maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"? Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper that prepares **os_page_ptrs, which is then used by pg_numa_query_pages() and then we fill the data. But now after removing query_numa , it reads smoother anyway. Can you please take a look again on this, is this up to the project standards? > A few random comments regarding 0003: > > === 20 > > + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes > > s/pg_shmem_allocations/pg_shmem_numa_allocations/? Fixed. > === 21 > > + /* FIXME: should we release LWlock here ? */ > + elog(ERROR, "failed NUMA pages inquiry status: %m"); > > There is no need, see src/backend/storage/lmgr/README. Thanks, fixed. > === 22 > > +#define MAX_NUMA_ZONES 32 /* FIXME? */ > + Size zones[MAX_NUMA_ZONES]; > > could we rely on pg_numa_get_max_node() instead? Sure, done. > === 23 > > + if (s >= 0) > + zones[s]++; > > should we also check that s is below a limit? STILL OPEN QUESTION: I'm not sure it would give us value to report e.g. -2 on per shm entry/per numa node entry, would it? If it would we could somehow overallocate that array and remember negative ones too. > === 24 > > Regarding how we make use of pg_numa_get_max_node(), are we sure there is > no possible holes? I mean could a system have node 0,1 and 3 but not 2? I have never seen a hole in numbering as it is established during boot, and the only way that could get close to adjusting it could be making processor books (CPU and RAM together) offline. Even *if* someone would be doing some preventive hardware maintenance, that still wouldn't hurt, as we are just using the Id of the zone to display it -- the max would be already higher. I mean, technically one could use lsmem(1) (it mentions removable flag, after let's pages and processes migrated away and from there) and then use chcpu(1) to --disable CPUs on that zone (to prevent new ones coming there and allocating new local memory there) and then offlining that memory region via chmem(1) --disable. Even with all of that, that still shouldn't cause issues for this code I think, because `numa_max_node()` says it `returns the >> highest node number <<< available on the current system.` > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I > did a little in 0001 though) That was an attempt to say "Thank You", OK, I've aligned it that way, so you are still referenced in 0001. I wouldn't find motivation to work on this if you won't respond to those emails ;) There is one known issue, CI returned for numa.out test, that I need to get bottom to (it does not reproduce for me locally) inside numa.out/sql: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 -J.
Attachment
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for the new version! > > v10 is attached with most fixes after review and one new thing > introduced: pg_numa_available() for run-time decision inside tests > which was needed after simplifying code a little bit as you wanted. > I've also fixed -Dlibnuma=disabled as it was not properly implemented. > There are couple open items (minor/decision things), but most is fixed > or answered: > > > Some random comments on 0001: > > > > === 1 > > > > It does not compiles "alone". It's missing: > > > [..] > > +extern PGDLLIMPORT int huge_pages_status; > [..] > > -static int huge_pages_status = HUGE_PAGES_UNKNOWN; > > +int huge_pages_status = HUGE_PAGES_UNKNOWN; > > > > That come with 0002. So it has to be in 0001 instead. > > Ugh, good catch, I haven't thought about it in isolation, they are > separate to just ease review, but should be committed together. Fixed. > > > === 2 > > > > +else > > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > > +fi > > > > Maybe we should add the same test (checking for numa.h) for meson? > > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync > between meson and autoconf? > > > === 3 > > > > +# FIXME: filter-out / with/without with_libnuma? > > +LIBS += $(LIBNUMA_LIBS) > > > > It looks to me that we can remove those 2 lines. > > Done. > > > === 4 > > > > + Only supported on Linux. > > > > s/on Linux/on platforms for which the libnuma library is implemented/? > > > > I did a quick grep on "Only supported on" and it looks like that could be > > a more consistent wording. > > Fixed. > > > === 5 > > > > +#include "c.h" > > +#include "postgres.h" > > +#include "port/pg_numa.h" > > +#include "storage/pg_shmem.h" > > +#include <unistd.h> > > > > I had a closer look to other header files and it looks like it "should" be: > > > > #include "c.h" > > #include "postgres.h" > > > > #include <unistd.h> > > #ifdef WIN32 > > #include <windows.h> > > #endif > > > > #include "port/pg_numa.h" > > #include "storage/pg_shmem.h" > > > > And is "#include "c.h"" really needed? > > Fixed both. It seems to compile without c.h. > > > === 6 > > > > +/* FIXME not tested, might crash */ > > > > That's a bit scary. > > When you are in support for long enough it is becoming the norm ;) But > on serious note Andres wanted have numa error/warning handlers (used > by libnuma), but current code has no realistic code-path to hit it > from numa_available(3), numa_move_pages(3) or numa_max_node(3). The > situation won't change until one day in future (I hope!) we start > using some more advanced libnuma functionality for interleaving > memory, please see my earlier reply: > https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com > E.g. numa_available(3) is tiny wrapper , see > https://github.com/numactl/numactl/blob/master/libnuma.c#L872 > > For now, I've adjusted that FIXME to XXX, but still don't know we > could inject failure to see this triggered... > > > === 7 > > > > + * XXX: for now we issue just WARNING, but long-term that might depend on > > + * numa_set_strict() here > > > > s/here/here./ > > Done. > > > Did not look carefully at all the comments in 0001, 0002 and 0003 though. > > > > A few random comments regarding 0002: > > > > === 8 > > > > # create extension pg_buffercache; > > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > > RETURNS SETOF RECORD > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > > LANGUAGE C PARALLEL SAFE" > > extension script file "pg_buffercache--1.2.sql", near line 7 > > > > While that's ok if 0003 is applied. I think that each individual patch should > > "fully" work individually. > > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + > 0002 or 0001 + 0003, but here 0002 is complaining about lack of > pg_numa_query_pages() which is part of 0001 (?) Should I merge those > patches or keep them separate? > > > === 9 > > > > + do > > + { > > + if (os_page_ptrs[blk2page + j] == 0) > > > > blk2page + j will be repeated multiple times, better to store it in a local > > variable instead? > > Done. > > > === 10 > > > > + if (firstUseInBackend == true) > > > > > > if (firstUseInBackend) instead? > > Done everywhere. > > > === 11 > > > > + int j = 0, > > + blk2page = (int) i * pages_per_blk; > > > > I wonder if size_t is more appropriate for blk2page: > > > > size_t blk2page = (size_t)(i * pages_per_blk) maybe? > > Sure, done. > > > === 12 > > > > as we know that we'll iterate until pages_per_blk then would a for loop be more > > appropriate here, something like? > > > > " > > for (size_t j = 0; j < pages_per_blk; j++) > > { > > if (os_page_ptrs[blk2page + j] == 0) > > { > > " > > Sure. > > > === 13 > > > > + if (buf_state & BM_DIRTY) > > + fctx->record[i].isdirty = true; > > + else > > + fctx->record[i].isdirty = false; > > > > fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe? > > > > === 14 > > > > + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > > + fctx->record[i].isvalid = true; > > + else > > + fctx->record[i].isvalid = false; > > > > fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) > > maybe? > > It is coming from the original pg_buffercache and it is less readable > to me, so I don't want to touch that too much because that might open > refactoring doors too wide. > > > === 15 > > > > +populate_buffercache_entry(int i, BufferCachePagesContext *fctx) > > > > I wonder if "i" should get a more descriptive name? > > Done, buffer_id. > > > === 16 > > > > s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent > > with pg_stat_io_build_tuples for example). > > > > I now realize that I did propose populate_buffercache_entry() up-thread, > > sorry for changing my mind. > > OK, I've tried to create a better name, but all my ideas were too > long, but pg_buffercache_build_tuple sounds nice, so lets use that. > > > === 17 > > > > + static bool firstUseInBackend = true; > > > > maybe we should give it a more descriptive name? > > I couldn't come up with anything that wouldn't look too long, so > instead I've added a comment explaining the meaning behind this > variable, hope that's good enough. > > > Also I need to think more about how firstUseInBackend is used, for example: > > > > ==== 17.1 > > > > would it be better to define it as a file scope variable? (at least that > > would avoid to pass it as an extra parameter in some functions). > > I have no strong opinion on this, but I have one doubt (for future): > isn't creating global variables making life harder for upcoming > multithreading guys ? > > > === 17.2 > > > > what happens if firstUseInBackend is set to false and later on the pages > > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is > > called again by a backend that already had set firstUseInBackend to false, > > would that provide accurate information? > > It is still the correct result. That "touching" (paging-in) is only > necessary probably to properly resolve PTEs as the fork() does not > seem to carry them over from parent: > > postgres=# select 'create table foo' || s || ' as select > generate_series(1, 100000);' from generate_series(1, 4) s; > postgres=# \gexec > SELECT 100000 > SELECT 100000 > SELECT 100000 > SELECT 100000 > postgres=# select numa_zone_id, count(*) from pg_buffercache_numa > group by numa_zone_id order by numa_zone_id; -- before: > numa_zone_id | count > --------------+--------- > 0 | 256 > 1 | 4131 > | 8384221 > postgres=# select pg_backend_pid(); > pg_backend_pid > ---------------- > 1451 > > -- now use another root (!) session to "migratepages(8)" from numactl > will also shift shm segment: > # migratepages 1451 1 3 > > -- while above will be in progress for a lot of time , but the outcome > is visible much faster in that backend (pg is functioning): > postgres=# select numa_zone_id, count(*) from pg_buffercache_numa > group by numa_zone_id order by numa_zone_id; > numa_zone_id | count > --------------+--------- > 0 | 256 > 3 | 4480 > | 8383872 > > So the above clearly shows that initial touching of shm is required, > but just once and it stays valid afterwards. > > BTW: migratepages(8) was stuck for 1-2 minutes there on > "__wait_rcu_gp" according to it's wchan, without any sign of activity > on the OS and then out of blue completed just fine, s_b=64GB,HP=on. > > > === 18 > > > > +Datum > > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS) > > +{ > > + FuncCallContext *funcctx; > > + BufferCachePagesContext *fctx; /* User function context. */ > > + bool query_numa = true; > > > > I don't think we need query_numa anymore in pg_buffercache_numa_pages(). > > > > I think that we can just ERROR (or NOTICE and return) here: > > > > + if (pg_numa_init() == -1) > > + { > > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMAdata might be unavailable."); > > + query_numa = false; > > > > and fully get rid of query_numa. > > Right... fixed it here, but it made the tests blow up, so I've had to > find a way to conditionally launch tests based on NUMA availability > and that's how pg_numa_available() was born. It's in ipc/shmem.c > because I couldn't find a better place for it... > > > === 19 > > > > And then here: > > > > for (i = 0; i < NBuffers; i++) > > { > > populate_buffercache_entry(i, fctx); > > if (query_numa) > > pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend); > > } > > > > if (query_numa) > > { > > if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1) > > elog(ERROR, "failed NUMA pages inquiry: %m"); > > > > for (i = 0; i < NBuffers; i++) > > { > > int blk2page = (int) i * pages_per_blk; > > > > /* > > * Technically we can get errors too here and pass that to > > * user. Also we could somehow report single DB block spanning > > * more than one NUMA zone, but it should be rare. > > */ > > fctx->record[i].numa_zone_id = os_pages_status[blk2page]; > > } > > > > maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"? > > Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper > that prepares **os_page_ptrs, which is then used by > pg_numa_query_pages() and then we fill the data. But now after > removing query_numa , it reads smoother anyway. Can you please take a > look again on this, is this up to the project standards? > > > A few random comments regarding 0003: > > > > === 20 > > > > + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes > > > > s/pg_shmem_allocations/pg_shmem_numa_allocations/? > > Fixed. > > > === 21 > > > > + /* FIXME: should we release LWlock here ? */ > > + elog(ERROR, "failed NUMA pages inquiry status: %m"); > > > > There is no need, see src/backend/storage/lmgr/README. > > Thanks, fixed. > > > === 22 > > > > +#define MAX_NUMA_ZONES 32 /* FIXME? */ > > + Size zones[MAX_NUMA_ZONES]; > > > > could we rely on pg_numa_get_max_node() instead? > > Sure, done. > > > === 23 > > > > + if (s >= 0) > > + zones[s]++; > > > > should we also check that s is below a limit? > > STILL OPEN QUESTION: I'm not sure it would give us value to report > e.g. -2 on per shm entry/per numa node entry, would it? If it would we > could somehow overallocate that array and remember negative ones too. > > > === 24 > > > > Regarding how we make use of pg_numa_get_max_node(), are we sure there is > > no possible holes? I mean could a system have node 0,1 and 3 but not 2? > > I have never seen a hole in numbering as it is established during > boot, and the only way that could get close to adjusting it could be > making processor books (CPU and RAM together) offline. Even *if* > someone would be doing some preventive hardware maintenance, that > still wouldn't hurt, as we are just using the Id of the zone to > display it -- the max would be already higher. I mean, technically one > could use lsmem(1) (it mentions removable flag, after let's pages and > processes migrated away and from there) and then use chcpu(1) to > --disable CPUs on that zone (to prevent new ones coming there and > allocating new local memory there) and then offlining that memory > region via chmem(1) --disable. Even with all of that, that still > shouldn't cause issues for this code I think, because > `numa_max_node()` says it `returns the >> highest node number <<< > available on the current system.` > > > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I > > did a little in 0001 though) > > That was an attempt to say "Thank You", OK, I've aligned it that way, > so you are still referenced in 0001. I wouldn't find motivation to > work on this if you won't respond to those emails ;) > > There is one known issue, CI returned for numa.out test, that I need > to get bottom to (it does not reproduce for me locally) inside > numa.out/sql: > > SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90 > Hi, ok, so v11 should fix this last problem and make cfbot happy. -J.
Attachment
Hi, On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote: > On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for the new version! > > v10 is attached with most fixes after review and one new thing > introduced: pg_numa_available() for run-time decision inside tests > which was needed after simplifying code a little bit as you wanted. > I've also fixed -Dlibnuma=disabled as it was not properly implemented. > There are couple open items (minor/decision things), but most is fixed > or answered: Thanks for the new version! > > === 2 > > > > +else > > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > > +fi > > > > Maybe we should add the same test (checking for numa.h) for meson? > > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync > between meson and autoconf? Yeah, idea was to have both in sync. > > === 8 > > > > # create extension pg_buffercache; > > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > > RETURNS SETOF RECORD > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > > LANGUAGE C PARALLEL SAFE" > > extension script file "pg_buffercache--1.2.sql", near line 7 > > > > While that's ok if 0003 is applied. I think that each individual patch should > > "fully" work individually. > > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + > 0002 or 0001 + 0003, but here 0002 is complaining about lack of > pg_numa_query_pages() which is part of 0001 (?) Should I merge those > patches or keep them separate? Applying 0001 + 0002 only does produce the issue above. I think that each sub-patch once applied should pass make check-world, that means: 0001 : should pass 0001 + 0002 : should pass 0001 + 0002 + 0003 : should pass > > Also I need to think more about how firstUseInBackend is used, for example: > > > > ==== 17.1 > > > > would it be better to define it as a file scope variable? (at least that > > would avoid to pass it as an extra parameter in some functions). > > I have no strong opinion on this, but I have one doubt (for future): > isn't creating global variables making life harder for upcoming > multithreading guys ? That would be "just" one more. I think it's better to use it to avoid the "current" code using "useless" function parameters. > > === 17.2 > > > > what happens if firstUseInBackend is set to false and later on the pages > > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is > > called again by a backend that already had set firstUseInBackend to false, > > would that provide accurate information? > > It is still the correct result. That "touching" (paging-in) is only > necessary probably to properly resolve PTEs as the fork() does not > seem to carry them over from parent: > > So the above clearly shows that initial touching of shm is required, > but just once and it stays valid afterwards. Great, thanks for the demo and the explanation! > > === 19 > > > > Can you please take a look again on this Sure, will do. > > === 23 > > > > + if (s >= 0) > > + zones[s]++; > > > > should we also check that s is below a limit? > > STILL OPEN QUESTION: I'm not sure it would give us value to report > e.g. -2 on per shm entry/per numa node entry, would it? If it would we > could somehow overallocate that array and remember negative ones too. I meant to say, ensure that it is below the max node number. > > === 24 > > > > Regarding how we make use of pg_numa_get_max_node(), are we sure there is > > no possible holes? I mean could a system have node 0,1 and 3 but not 2? > > I have never seen a hole in numbering as it is established during > boot, and the only way that could get close to adjusting it could be > making processor books (CPU and RAM together) offline. Yeah probably. > Even with all of that, that still > shouldn't cause issues for this code I think, because > `numa_max_node()` says it `returns the >> highest node number <<< > available on the current system.` Yeah, I agree, thanks for clearing my doubts on it. > > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I > > did a little in 0001 though) > > That was an attempt to say "Thank You", Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-) I'll have a look at v11 soon. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Mar 13, 2025 at 02:15:14PM +0000, Bertrand Drouvot wrote: > > > === 19 > > > > > > > Can you please take a look again on this > > Sure, will do. > I'll have a look at v11 soon. About 0001: === 1 git am produces: .git/rebase-apply/patch:378: new blank line at EOF. + .git/rebase-apply/patch:411: new blank line at EOF. + warning: 2 lines add whitespace errors. === 2 + Returns true if a <acronym>NUMA</acronym> support is available. What about "Returns true if the server has been compiled with NUMA support"? === 3 +Datum +pg_numa_available(PG_FUNCTION_ARGS) +{ + if(pg_numa_init() == -1) + PG_RETURN_BOOL(false); + PG_RETURN_BOOL(true); +} What about PG_RETURN_BOOL(pg_numa_init() != -1)? Also I wonder if pg_numa.c would not be a better place for it. === 4 +{ oid => '5102', descr => 'Is NUMA compilation available?', + proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool', + proargtypes => '', prosrc => 'pg_numa_available' }, + Not sure that 5102 is a good choice. src/include/catalog/unused_oids is telling me: Best practice is to start with a random choice in the range 8000-9999. Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here) So maybe use 9685 instead? === 5 @@ -12477,3 +12481,4 @@ prosrc => 'gist_stratnum_common' }, ] + garbage? === 6 Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c and src/port/pg_numa.c. === 7 +Size +pg_numa_get_pagesize(void) +{ + Size os_page_size = sysconf(_SC_PAGESIZE); + if (huge_pages_status == HUGE_PAGES_ON) + GetHugePageSize(&os_page_size, NULL); + return os_page_size; +} I think that's more usual to: Size pg_numa_get_pagesize(void) { Size os_page_size = sysconf(_SC_PAGESIZE); if (huge_pages_status == HUGE_PAGES_ON) GetHugePageSize(&os_page_size, NULL); return os_page_size; } I think that makes sense to check huge_pages_status as you did. === 8 +extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3); +extern void numa_error(char *where); I wonder if that would make sense to add a comment mentioning github.com/numactl/numactl/blob/master/libnuma.c here. I still need to look at 0002 and 0003. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: Hi, Thank you very much for the review! I'm answering to both reviews in one go and the results is attached v12, seems it all should be solved now: > > > === 2 > > > > > > +else > > > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5 > > > +fi > > > > > > Maybe we should add the same test (checking for numa.h) for meson? > > > > TBH, I have no idea, libnuma.h may exist but it may not link e.g. when > > cross-compiling 32-bit on 64-bit. Or is this more about keeping sync > > between meson and autoconf? > > Yeah, idea was to have both in sync. Added. > > > === 8 > > > > > > # create extension pg_buffercache; > > > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so:undefined symbol: pg_numa_query_pages > > > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages() > > > RETURNS SETOF RECORD > > > AS '$libdir/pg_buffercache', 'pg_buffercache_pages' > > > LANGUAGE C PARALLEL SAFE" > > > extension script file "pg_buffercache--1.2.sql", near line 7 > > > > > > While that's ok if 0003 is applied. I think that each individual patch should > > > "fully" work individually. > > > > STILL OPEN QUESTION: Not sure I understand: you need to have 0001 + > > 0002 or 0001 + 0003, but here 0002 is complaining about lack of > > pg_numa_query_pages() which is part of 0001 (?) Should I merge those > > patches or keep them separate? > > Applying 0001 + 0002 only does produce the issue above. I think that each > sub-patch once applied should pass make check-world, that means: > > 0001 : should pass > 0001 + 0002 : should pass > 0001 + 0002 + 0003 : should pass OK, I've retested v11 for all three of them. It worked fine (I think in v10 I've moved one function to 0001, but pg_numa_query_pages() as per error message above was always in the 0001). > > > Also I need to think more about how firstUseInBackend is used, for example: > > > > > > ==== 17.1 > > > > > > would it be better to define it as a file scope variable? (at least that > > > would avoid to pass it as an extra parameter in some functions). > > > > I have no strong opinion on this, but I have one doubt (for future): > > isn't creating global variables making life harder for upcoming > > multithreading guys ? > > That would be "just" one more. I think it's better to use it to avoid the "current" > code using "useless" function parameters. Done. > > > === 23 > > > > > > + if (s >= 0) > > > + zones[s]++; > > > > > > should we also check that s is below a limit? > > > > STILL OPEN QUESTION: I'm not sure it would give us value to report > > e.g. -2 on per shm entry/per numa node entry, would it? If it would we > > could somehow overallocate that array and remember negative ones too. > > I meant to say, ensure that it is below the max node number. Done, but I doubt the kernel would return a value higher than numa_max_nodes(), but who knows. Additional defense for this array is now there. > SECOND REVIEW//v11-0001 review > > === 1 > > git am produces: > > .git/rebase-apply/patch:378: new blank line at EOF. > + > .git/rebase-apply/patch:411: new blank line at EOF. > + > warning: 2 lines add whitespace errors. Should be gone, but in at least one case (0003/numa.out) we need to have empty EOF because otherwise expected tests don't pass (even if numa.sql doesnt have EOF in numa.sql) > === 2 > > + Returns true if a <acronym>NUMA</acronym> support is available. > > What about "Returns true if the server has been compiled with NUMA support"? Done. > === 3 > > +Datum > +pg_numa_available(PG_FUNCTION_ARGS) > +{ > + if(pg_numa_init() == -1) > + PG_RETURN_BOOL(false); > + PG_RETURN_BOOL(true); > +} > > What about PG_RETURN_BOOL(pg_numa_init() != -1)? > > Also I wonder if pg_numa.c would not be a better place for it. Both done. > === 4 > > +{ oid => '5102', descr => 'Is NUMA compilation available?', > + proname => 'pg_numa_available', provolatile => 'v', prorettype => 'bool', > + proargtypes => '', prosrc => 'pg_numa_available' }, > + > > Not sure that 5102 is a good choice. > > src/include/catalog/unused_oids is telling me: > > Best practice is to start with a random choice in the range 8000-9999. > Suggested random unused OID: 9685 (23 consecutive OID(s) available starting here) > > So maybe use 9685 instead? It's because i've got 5101 there earlier (for pg_shm_numa_allocs view), but I've aligned both (5102@0001 and 5101@0003) to 968[56]. > === 5 > > @@ -12477,3 +12481,4 @@ > prosrc => 'gist_stratnum_common' }, > > ] > + > > garbage? Yea, fixed. > === 6 > > Run pgindent as it looks like it's finding some things to do in src/backend/storage/ipc/shmem.c > and src/port/pg_numa.c. Fixed. > === 7 > > +Size > +pg_numa_get_pagesize(void) > +{ > + Size os_page_size = sysconf(_SC_PAGESIZE); > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + return os_page_size; > +} > > I think that's more usual to: > > Size > pg_numa_get_pagesize(void) > { > Size os_page_size = sysconf(_SC_PAGESIZE); > > if (huge_pages_status == HUGE_PAGES_ON) > GetHugePageSize(&os_page_size, NULL); > > return os_page_size; > } > > I think that makes sense to check huge_pages_status as you did. Done. > === 8 > > +extern void numa_warn(int num, char *fmt,...) pg_attribute_printf(2, 3); > +extern void numa_error(char *where); > > I wonder if that would make sense to add a comment mentioning > github.com/numactl/numactl/blob/master/libnuma.c here. Sure thing, added. -J.
Attachment
Hi, On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote: > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > Thank you very much for the review! I'm answering to both reviews in > one go and the results is attached v12, seems it all should be solved > now: Thanks for v12! I'll review 0001 and 0003 later, but want to share what I've done for 0002. I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply on top of v11 0002 (I just rebased it a bit so that it now applies on top of v12 0002). The main changes are: === 1 Some doc wording changes. === 2 Some comments, pgindent and friends changes. === 3 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, - pinning_backends int4, numa_zone_id int4); + pinning_backends int4, zone_id int4); I changed numa_zone_id to zone_id as we are already in "numa" related functions and/or views. === 4 - CHECK_FOR_INTERRUPTS(); } + + CHECK_FOR_INTERRUPTS(); I think that it's better to put the CFI outside of the if condition, so that it can be called for every page. === 5 -pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx) +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) for clarity. === 6 -get_buffercache_tuple(int i, BufferCachePagesContext *fctx) +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) for clarity. === 7 - int os_page_count = 0; + uint64 os_page_count = 0; I think that's better. === 8 But then, here, we need to change its format to %lu and and to cast to (unsigned long) to make the CI CompilerWarnings happy. That's fine because we don't provide NUMA support for 32 bits anyway. - elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f", - os_page_count, os_page_size, pages_per_blk); + elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f", + (unsigned long) os_page_count, os_page_size, pages_per_blk); === 9 -static bool firstUseInBackend = true; +static bool firstNumaTouch = true; Looks better to me but still not 100% convinced by the name. === 10 static BufferCachePagesContext * -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS) +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo) as PG_FUNCTION_ARGS is usually used for fmgr-compatible function and there is a lof of examples in the code that make use of "FunctionCallInfo" for non fmgr-compatible function. and also: === 11 I don't like the fact that we iterate 2 times over NBuffers in pg_buffercache_numa_pages(). But I'm having having hard time finding a better approach given the fact that pg_numa_query_pages() needs all the pointers "prepared" before it can be called... Those 2 loops are probably the best approach, unless someone has a better idea. === 12 Upthread you asked "Can you please take a look again on this, is this up to the project standards?" Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper? What do you think? The comments, doc and code changes are just proposals and are fully open to discussion. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote: > > On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > Thank you very much for the review! I'm answering to both reviews in > > one go and the results is attached v12, seems it all should be solved > > now: > > Thanks for v12! > > I'll review 0001 and 0003 later, but want to share what I've done for 0002. > > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply > on top of v11 0002 (I just rebased it a bit so that it now applies on top of > v12 0002). Hey Bertrand, all LGTM (good ideas), so here's v13 attached with applied all of that (rebased, tested). BTW: I'm sending to make cfbot as it still tried to apply that .patch (on my side it .patch, not .txt) > === 9 > > -static bool firstUseInBackend = true; > +static bool firstNumaTouch = true; > > Looks better to me but still not 100% convinced by the name. IMHO, Yes, it looks much better. > === 10 > > static BufferCachePagesContext * > -pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS) > +pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo) > > as PG_FUNCTION_ARGS is usually used for fmgr-compatible function and there is > a lof of examples in the code that make use of "FunctionCallInfo" for non > fmgr-compatible function. Cool, thanks. > and also: > > === 11 > > I don't like the fact that we iterate 2 times over NBuffers in > pg_buffercache_numa_pages(). > > But I'm having having hard time finding a better approach given the fact that > pg_numa_query_pages() needs all the pointers "prepared" before it can be called... > > Those 2 loops are probably the best approach, unless someone has a better idea. IMHO, it doesn't hurt and I've also not been able to think of any better idea. > === 12 > > Upthread you asked "Can you please take a look again on this, is this up to the > project standards?" > > Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined wrapper? Yes, this was for an earlier doubt regarding question "19" about reviewing the code after removal of `query_numa` variable. This is the same code for 2 loops, IMHO it is good now. > What do you think? The comments, doc and code changes are just proposals and are > fully open to discussion. They are great, thank You! -J.
Attachment
Hi, On Mon, Mar 17, 2025 at 08:28:46AM +0100, Jakub Wartak wrote: > On Fri, Mar 14, 2025 at 1:08 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > I did prepare a patch file (attached as .txt to not disturb the cfbot) to apply > > on top of v11 0002 (I just rebased it a bit so that it now applies on top of > > v12 0002). > > Hey Bertrand, > > all LGTM (good ideas), so here's v13 attached with applied all of that > (rebased, tested). Thanks for v13! Looking at 0003: === 1 + <entry>NUMA mappings for shared memory allocations</entry> s/NUMA mappings/NUMA node mappings/ maybe? === 2 + <para> + The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes + assigned allocations made from the server's main shared memory segment. What about? " shows how shared memory allocations in the server's main shared memory segment are distributed across NUMA nodes" ? === 3 + <structfield>numa_zone_id</structfield> <type>int4</type> s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in 0002. BTW, I wonder if "node_id" would be better (to match the descriptions...). If so, would also need to be done in 0002. === 4 + ID of NUMA node <acronym>NUMA</acronym> node ID? (to be consistent with 0002). === 5 +static bool firstUseInBackend = true; Let's use firstNumaTouch to be consistent with 0002. === 6 + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");; There is 2 ";" + I think that we should used the same wording as in pg_buffercache_numa_pages(). === 7 What about using ERROR instead? (like in pg_buffercache_numa_pages()) === 8 + /* + * This is for gathering some NUMA statistics. We might be using various + * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in + * various different OS memory pages sizes, so first we need to understand + * the OS memory page size before calling move_pages() + */ + os_page_size = pg_numa_get_pagesize(); Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling pg_numa_get_pagesize()? === 9 + max_zones = pg_numa_get_max_node(); I think we are mixing "zone" and "node". I think we should standardize on one and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to vote for node, but zone is fine too if you prefer. === 10 + /* + * Preallocate memory all at once without going into details which shared + * memory segment is the biggest (technically min s_b can be as low as + * 16xBLCKSZ) + */ What about? " Allocate memory for page pointers and status based on total shared memory size. This simplified approach allocates enough space for all pages in shared memory rather than calculating the exact requirements for each segment. " instead? === 11 + int shm_total_page_count, + shm_ent_page_count, I think those 2 should be uint64. === 12 + /* + * XXX: We are ignoring in NUMA version reporting of the following regions + * (compare to pg_get_shmem_allocations() case): 1. output shared memory + * allocated but not counted via the shmem index 2. output as-of-yet + * unused shared memory + */ why XXX? what about? " We are ignoring the following memory regions (as compared to pg_get_shmem_allocations()).... === 13 + page_ptrs = palloc(sizeof(void *) * shm_total_page_count); + memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count); maybe we could use palloc0() here? === 14 and I realize that we could probably use it in 0002 for os_page_ptrs. === 15 I think there is still some multi-lines comments that are missing a period. I probably also missed some in 0002 during the previous review. I think that's worth another check. === 16 + * In order to get reliable results we also need to touch memory + * pages so that inquiry about NUMA zone doesn't return -2. + */ maybe use the same wording as in 0002? === 17 The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count but (as for 0002) it looks like we can not avoid it (or at least I don't see a way to avoid it). I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is updated. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Thanks for v13! Rebased and fixes inside in the attached v14 (it passes CI too): > Looking at 0003: > > === 1 > > + <entry>NUMA mappings for shared memory allocations</entry> > > s/NUMA mappings/NUMA node mappings/ maybe? Done. > === 2 > > + <para> > + The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes > + assigned allocations made from the server's main shared memory segment. > > What about? > > " > shows how shared memory allocations in the server's main shared memory segment > are distributed across NUMA nodes" ? Done. > === 3 > > + <structfield>numa_zone_id</structfield> <type>int4</type> > > s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in > 0002. > > BTW, I wonder if "node_id" would be better (to match the descriptions...). > If so, would also need to be done in 0002. Somewhat duplicate, please see answer for #9 > === 4 > > + ID of NUMA node > > <acronym>NUMA</acronym> node ID? (to be consistent with 0002). > > === 5 > > +static bool firstUseInBackend = true; > > Let's use firstNumaTouch to be consistent with 0002. Done. > === 6 > > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");; > > There is 2 ";" + I think that we should used the same wording as in > pg_buffercache_numa_pages(). > > === 7 > > What about using ERROR instead? (like in pg_buffercache_numa_pages()) Both are synced now. > === 8 > > + /* > + * This is for gathering some NUMA statistics. We might be using various > + * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in > + * various different OS memory pages sizes, so first we need to understand > + * the OS memory page size before calling move_pages() > + */ > + os_page_size = pg_numa_get_pagesize(); > > Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling > pg_numa_get_pagesize()? Done, improved style of the comment there and synced pg_buffercache one to shmem.c one. > === 9 > > + max_zones = pg_numa_get_max_node(); > > I think we are mixing "zone" and "node". I think we should standardize on one > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to > vote for node, but zone is fine too if you prefer. Given that numa(7) does not use "zone" keyword at all and both /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split into zones, we can conclude that "zone" is simply a subdivision within a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without internal information about zones), therefore we should stay away from using "zone" within the patch at all, as we are just fetching NUMA node info. My bad, it's a terminology error on my side from start - I've probably saw "zone" info in some command output back then when we had that workshop and started using it and somehow it propagated through the patchset up to this day... I've adjusted it all and settled on "numa_node_id" column name. > === 10 > > + /* > + * Preallocate memory all at once without going into details which shared > + * memory segment is the biggest (technically min s_b can be as low as > + * 16xBLCKSZ) > + */ > > What about? > > " > Allocate memory for page pointers and status based on total shared memory size. > This simplified approach allocates enough space for all pages in shared memory > rather than calculating the exact requirements for each segment. > " instead? Done. > === 11 > > + int shm_total_page_count, > + shm_ent_page_count, > > I think those 2 should be uint64. Right... > === 12 > > + /* > + * XXX: We are ignoring in NUMA version reporting of the following regions > + * (compare to pg_get_shmem_allocations() case): 1. output shared memory > + * allocated but not counted via the shmem index 2. output as-of-yet > + * unused shared memory > + */ > > why XXX? > > what about? > > " > We are ignoring the following memory regions (as compared to > pg_get_shmem_allocations()).... Fixed , it was apparently leftover of when I was thinking if we should still report it. > === 13 > > + page_ptrs = palloc(sizeof(void *) * shm_total_page_count); > + memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count); > > maybe we could use palloc0() here? Of course!++ > === 14 > > and I realize that we could probably use it in 0002 for os_page_ptrs. Of course!++ > === 15 > > I think there is still some multi-lines comments that are missing a period. I > probably also missed some in 0002 during the previous review. I think that's > worth another check. Please do such a check, I've tried pgident on all .c files, but I'm apparently blind to such issues. BTW if patch has anything left that causes pgident to fix, that is not picked by CI but it is picked by buildfarm?? > === 16 > > + * In order to get reliable results we also need to touch memory > + * pages so that inquiry about NUMA zone doesn't return -2. > + */ > > maybe use the same wording as in 0002? But 0002 used: "In order to get reliable results we also need to touch memory pages, so that inquiry about NUMA zone doesn't return -2 (which indicates unmapped/unallocated pages)" or are you looking at something different? > === 17 > > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count > but (as for 0002) it looks like we can not avoid it (or at least I don't see > a way to avoid it). Hm, it's literally debug code. Why would we care so much if it is 2 loops rather than 1? (as stated earlier we need to pack ptrs and then analyze it) > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is > updated. Cool, thanks in advance. -J.
Attachment
Hi, On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote: > On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Thanks for v13! > > Rebased and fixes inside in the attached v14 (it passes CI too): Thanks! > > === 9 > > > > + max_zones = pg_numa_get_max_node(); > > > > I think we are mixing "zone" and "node". I think we should standardize on one > > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to > > vote for node, but zone is fine too if you prefer. > > Given that numa(7) does not use "zone" keyword at all and both > /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split > into zones, we can conclude that "zone" is simply a subdivision within > a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA, > ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without > internal information about zones), therefore we should stay away from > using "zone" within the patch at all, as we are just fetching NUMA > node info. My bad, it's a terminology error on my side from start - > I've probably saw "zone" info in some command output back then when we > had that workshop and started using it and somehow it propagated > through the patchset up to this day... Thanks for the explanation. > I've adjusted it all and settled on "numa_node_id" column name. Yeah, I can see, things like: + <para> + The definitions of the columns exposed are identical to the + <structname>pg_buffercache</structname> view, except that this one includes + one additional <structfield>numa_node_id</structfield> column as defined in + <xref linkend="pgbuffercache-numa-columns"/>. and like: + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>numa_size</structfield> <type>int4</type> + </para> I think that you re-introduced the "numa_" in the column(s) name that we get rid (or agreed to) of previously. I think that we can get rid of the "numa_" stuff in column(s) name as the column(s) are part of "numa" relation views/output anyway. I think "node_id", "size" as column(s) name should be enough. Or maybe that re-adding "numa_" was intentional? > > === 15 > > > > I think there is still some multi-lines comments that are missing a period. I > > probably also missed some in 0002 during the previous review. I think that's > > worth another check. > > Please do such a check, Found much more: + * In order to get reliable results we also need to touch memory pages, so that + * inquiry about NUMA memory node doesn't return -2 (which indicates + * unmapped/unallocated pages) is missing the period and + /* + * Switch context when allocating stuff to be used in later calls + */ should be as before, meaning on current HEAD: /* Switch context when allocating stuff to be used in later calls */ and + /* + * Return to original context when allocating transient memory + */ should be as before, meaning on current HEAD: /* Return to original context when allocating transient memory */ and + /* + * Note if the buffer is valid, and has storage created + */ should be as before, meaning on current HEAD: /* Note if the buffer is valid, and has storage created */ and + /* + * unused for v1.0 callers, but the array is always long enough + */ should be as before, meaning on current HEAD: /* unused for v1.0 callers, but the array is always long enough */ and +/* + * This is almost identical to the above, but performs + * NUMA inuqiry about memory mappings is missing the period and + * To correctly map between them, we need to: - Determine the OS + * memory page size - Calculate how many OS pages are used by all + * buffer blocks - Calculate how many OS pages are contained within + * each database block is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * If we ever get 0xff back from kernel inquiry, then we probably have + * bug in our buffers to OS page mapping code here is missing the period (2 times as this comment appears in 0002 and 0003) and + /* + * We are ignoring the following memory regions (as compared to + * pg_get_shmem_allocations()): 1. output shared memory allocated but not + * counted via the shmem index 2. output as-of-yet unused shared memory is missing the period > I've tried pgident on all .c files, but I'm > apparently blind to such issues. I don't think pgident would report missing period. > BTW if patch has anything left that > causes pgident to fix, that is not picked by CI but it is picked by > buildfarm?? I think it has to be done manually before each commit and that this is anyway done at least once per release cycle. > > === 16 > > > > + * In order to get reliable results we also need to touch memory > > + * pages so that inquiry about NUMA zone doesn't return -2. > > + */ > > > > maybe use the same wording as in 0002? > > But 0002 used: > > "In order to get reliable results we also need to touch memory pages, so that > inquiry about NUMA zone doesn't return -2 (which indicates > unmapped/unallocated > pages)" > > or are you looking at something different? Nope, I meant to say that it could make sense to have the exact same comment. > > === 17 > > > > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count > > but (as for 0002) it looks like we can not avoid it (or at least I don't see > > a way to avoid it). > > Hm, it's literally debug code. Why would we care so much if it is 2 > loops rather than 1? (as stated earlier we need to pack ptrs and then > analyze it) Yeah, but if we could just loop one time I'm pretty sure we'd have done that. > > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is > > updated. > :w > Cool, thanks in advance. 0001 looks in a good shape from my point of view. For 0002: === 1 I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could deserve their own patch. That would ease the review for the "real" numa stuff. Maybe something like: 0001 as it is 0002 introduces (and uses) pg_buffercache_init_entries() and pg_buffercache_build_tuple() 0003 current 0002 attached minus 0002 above We did it that way in c2a50ac678e and ff7c40d7fd6 for example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: Hi! v15 attached, rebased, CI-tested, all fixed incorporated. > > I've adjusted it all and settled on "numa_node_id" column name. > [...] > I think that we can get rid of the "numa_" stuff in column(s) name as > the column(s) are part of "numa" relation views/output anyway. [...] Done, you are probably right (it was done to keep consistency between those two views probably), I'm just not that strongly attached to the naming things. > > Please do such a check, > > Found much more: > [.. 9 issues with missing dots at the end of sentences in comments + fixes to comment structure in relation to HEAD..] All fixed. > > BTW if patch has anything left that > > causes pgident to fix, that is not picked by CI but it is picked by > > buildfarm?? > > I think it has to be done manually before each commit and that this is anyway > done at least once per release cycle. OK, thanks for clarification. [..] > > > > But 0002 used: > > > > "In order to get reliable results we also need to touch memory pages, so that > > inquiry about NUMA zone doesn't return -2 (which indicates > > unmapped/unallocated > > pages)" > > > > or are you looking at something different? > > Nope, I meant to say that it could make sense to have the exact same comment. Synced those two. [..] > > 0001 looks in a good shape from my point of view. Cool! > For 0002: > > === 1 > > I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could > deserve their own patch. That would ease the review for the "real" numa stuff. > Done, 0001+0002 alone passes the meson test. -J.
Attachment
Hi, Thank you for working on this! On Wed, 19 Mar 2025 at 12:06, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote: > > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Hi! v15 attached, rebased, CI-tested, all fixed incorporated. This needs to be rebased after 8eadd5c73c. -- Regards, Nazir Bilal Yavuz Microsoft
On Thu, Mar 27, 2025 at 12:31 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Hi, > > Thank you for working on this! > > On Wed, 19 Mar 2025 at 12:06, Jakub Wartak > <jakub.wartak@enterprisedb.com> wrote: > > > > On Tue, Mar 18, 2025 at 3:29 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi! v15 attached, rebased, CI-tested, all fixed incorporated. > > This needs to be rebased after 8eadd5c73c. Hi Nazir, thanks for spotting! I've not connected the dots with AIO going in and my libnuma dependency blowing up... Attached is rebased v16 that passed my CI run. -J.
Attachment
Hello I think you should remove numa_warn() and numa_error() from 0001. AFAICS they are dead code (even with all your patches applied), and furthermore would get you in trouble regarding memory allocation because src/port is not allowed to use palloc et al. If you wanted to keep them you'd have to have them in src/common, but looking at the rest of the code in that patch, ISTM src/port is the right place for it. If in the future you discover that you do need numa_warn(), you can create a src/common/ file for it then. Is pg_buffercache really the best place for these NUMA introspection routines? I'm not saying that it isn't, maybe we're okay with that (particularly if we can avoid duplicated code), but it seems a bit weird to me. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto, no me acuerdo." (Augusto Pinochet a una corte de justicia)
Hi, On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > setup_additional_packages_script: | > - #apt-get update > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > + apt-get update > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > + libnuma1 \ > + libnuma-dev I think libnuma is installed on the relevant platforms, so you shouldn't need to install it manually. > +# > +# libnuma > +# > +AC_MSG_CHECKING([whether to build with libnuma support]) > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], Most other dependencies say "build with libxyz ..." > +/*------------------------------------------------------------------------- > + * > + * pg_numa.h > + * Basic NUMA portability routines > + * > + * > + * Copyright (c) 2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/include/port/pg_numa.h > + * > + *------------------------------------------------------------------------- > + */ > +#ifndef PG_NUMA_H > +#define PG_NUMA_H > + > +#include "c.h" > +#include "postgres.h" Headers should never include either of those headers. Nor should .c files include both. > @@ -200,6 +200,8 @@ pgxs_empty = [ > > 'ICU_LIBS', > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > + > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > ] Maybe I am missing something, but are you actually defining and using those LIBNUMA_* vars anywhere? > +Size > +pg_numa_get_pagesize(void) > +{ > + Size os_page_size = sysconf(_SC_PAGESIZE); > + > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + > + return os_page_size; > +} Should this have a comment or an assertion that it can only be used after shared memory startup? Because before that huge_pages_status won't be meaningful? > +#ifndef FRONTEND > +/* > + * XXX: not really tested as there is no way to trigger this in our > + * current usage of libnuma. > + * > + * The libnuma built-in code can be seen here: > + * https://github.com/numactl/numactl/blob/master/libnuma.c > + * > + */ > +void > +numa_warn(int num, char *fmt,...) > +{ > + va_list ap; > + int olde = errno; > + int needed; > + StringInfoData msg; > + > + initStringInfo(&msg); > + > + va_start(ap, fmt); > + needed = appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + if (needed > 0) > + { > + enlargeStringInfo(&msg, needed); > + va_start(ap, fmt); > + appendStringInfoVA(&msg, fmt, ap); > + va_end(ap); > + } > + > + ereport(WARNING, > + (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION), > + errmsg_internal("libnuma: WARNING: %s", msg.data))); I think you would at least have to hold interrupts across this, as ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump out of libnuma in case an interrupt has arrived. > +Size > +pg_numa_get_pagesize(void) > +{ > +#ifndef WIN32 > + Size os_page_size = sysconf(_SC_PAGESIZE); > +#else > + Size os_page_size; > + SYSTEM_INFO sysinfo; > + > + GetSystemInfo(&sysinfo); > + os_page_size = sysinfo.dwPageSize; > +#endif > + if (huge_pages_status == HUGE_PAGES_ON) > + GetHugePageSize(&os_page_size, NULL); > + return os_page_size; > +} I would probably implement this once, outside of the big ifdef, with one more ifdef inside, given that you're sharing the same implementation. > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > Date: Wed, 19 Mar 2025 09:34:56 +0100 > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > primary function to separate functions. > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > and get_buffercache_tuple() that help fill result tuplestores based on the > buffercache contents. This will be used in a follow-up commit that implements > NUMA observability in pg_buffercache. > > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > --- > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > 1 file changed, 178 insertions(+), 139 deletions(-) > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > index 62602af1775..86e0b8afe01 100644 > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > @@ -14,7 +14,6 @@ > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > > - > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 Independent change. > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > -Datum > -pg_buffercache_pages(PG_FUNCTION_ARGS) > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > + * memory mapping inquiries to display NUMA node information for each buffer. > + */ If it's a helper routine it's probably not identical to pg_buffercache_numa_pages()? > +/* > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > + * > + * Build buffer cache information for a single buffer. > + */ > +static void > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > +{ This isn't really building a tuple tuple? Seems somewhat confusing, because get_buffercache_tuple() does actually build one. > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + bufHdr = GetBufferDescriptor(record_id); > + /* Lock each buffer header before inspecting. */ > + buf_state = LockBufHdr(bufHdr); > + > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); As above, I think this would be more readable if you put fctx->record[record_id] into a local var. > +static Datum > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) > +{ > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; > + HeapTuple tuple; > + > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid); > + nulls[0] = false; > + > + /* > + * Set all fields except the bufferid to null if the buffer is unused or > + * not valid. > + */ > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > + fctx->record[record_id].isvalid == false) > + { > + nulls[1] = true; > + nulls[2] = true; > + nulls[3] = true; > + nulls[4] = true; > + nulls[5] = true; > + nulls[6] = true; > + nulls[7] = true; > + > + /* unused for v1.0 callers, but the array is always long enough */ > + nulls[8] = true; I'd probably just memset the entire nulls array to either true or false, instead of doing it one-by-one. > + } > + else > + { > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > + nulls[1] = false; > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > + nulls[2] = false; > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > + nulls[3] = false; > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > + nulls[4] = false; > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > + nulls[5] = false; > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > + nulls[6] = false; > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > + nulls[7] = false; Seems like it would end up a lot more readable if you put fctx->record[record_id] into a local variable. Unfortunately that'd probably be best done in one more commit ahead of the rest of the this one... > @@ -0,0 +1,28 @@ > +SELECT NOT(pg_numa_available()) AS skip_test \gset > +\if :skip_test > +\quit > +\endif You could avoid the need for an alternative output file if you instead made the queries do something like SELECT NOT pg_numa_available() OR count(*) ... > --- /dev/null > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > @@ -0,0 +1,42 @@ > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > + > +-- Register the new function. > +DROP VIEW pg_buffercache; > +DROP FUNCTION pg_buffercache_pages(); I don't think we can just drop a view in the upgrade script. That will fail if anybody created a view depending on pg_buffercache. (Sorry, ran out of time / energy here, i had originally just wanted to comment on the apt-get thing in the tests) Greetings, Andres Freund
On Thu, Mar 27, 2025 at 2:15 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hello Good morning :) > I think you should remove numa_warn() and numa_error() from 0001. > AFAICS they are dead code (even with all your patches applied), and > furthermore would get you in trouble regarding memory allocation because > src/port is not allowed to use palloc et al. If you wanted to keep them > you'd have to have them in src/common, but looking at the rest of the > code in that patch, ISTM src/port is the right place for it. If in the > future you discover that you do need numa_warn(), you can create a > src/common/ file for it then. Understood, trimmed it out from the patch. I'm going to respond also within minutes to Andres' review and I'm going to post a new version (v17) there. > Is pg_buffercache really the best place for these NUMA introspection > routines? I'm not saying that it isn't, maybe we're okay with that > (particularly if we can avoid duplicated code), but it seems a bit weird > to me. I think it is, because as I understand, Andres wanted to have observability per single database *page* and to avoid code duplication we are just putting it there (it's natural fit). Imagine looking up an 8kB root btree memory page being hit hard from CPUs on other NUMA nodes (this just gives ability to see that, but you could of course also get aggregation to get e.g. NUMA node balance for single relation and so on). -J.
On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, Hi Andres, > On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote: > > setup_additional_packages_script: | > > - #apt-get update > > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ... > > + apt-get update > > + DEBIAN_FRONTEND=noninteractive apt-get -y install \ > > + libnuma1 \ > > + libnuma-dev > > I think libnuma is installed on the relevant platforms, so you shouldn't need > to install it manually. Fixed. Right, you mentioned this earlier, I just didnt know when it went online. > > +# > > +# libnuma > > +# > > +AC_MSG_CHECKING([whether to build with libnuma support]) > > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness], > > Most other dependencies say "build with libxyz ..." Done. > > + * pg_numa.h [..] > > +#include "c.h" > > +#include "postgres.h" > > Headers should never include either of those headers. Nor should .c files > include both. Fixed, huh, I've found explanation: https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us > > @@ -200,6 +200,8 @@ pgxs_empty = [ > > > > 'ICU_LIBS', > > > > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS', > > + > > 'LIBURING_CFLAGS', 'LIBURING_LIBS', > > ] > > Maybe I am missing something, but are you actually defining and using those > LIBNUMA_* vars anywhere? OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)` in configure.ac that would set those *FLAGS. I'm little bit loss dependent in how to gurantee that meson is synced with autoconf as per project requirements - trying to use past commits as reference, but I still could get something wrong here (especially in src/Makefile.global.in) > > +Size > > +pg_numa_get_pagesize(void) [..] > > Should this have a comment or an assertion that it can only be used after > shared memory startup? Because before that huge_pages_status won't be > meaningful? Added both. > > +#ifndef FRONTEND > > +/* > > + * XXX: not really tested as there is no way to trigger this in our > > + * current usage of libnuma. > > + * > > + * The libnuma built-in code can be seen here: > > + * https://github.com/numactl/numactl/blob/master/libnuma.c > > + * > > + */ > > +void > > +numa_warn(int num, char *fmt,...) [..] > > I think you would at least have to hold interrupts across this, as > ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump > out of libnuma in case an interrupt has arrived. On request by Alvaro I've removed it as that code is simply unreachable/untestable, but point taken - I'm planning to re-add this with holding interrupting in future when we start using proper numa_interleave() one day. Anyway, please let me know if you want still to keep it as deadcode. BTW for context , why this is deadcode is explained in the latter part of [1] message (TL;DR; unless we use pining/numa_interleave/local_alloc() we probably never reach that warnings/error handlers). "Another question without an easy answer as I never hit this error from numa_move_pages(), one gets invalid stuff in *os_pages_status instead. BUT!: most of our patch just uses things that cannot fail as per libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700 /sys` (because it's hard to unmount it) and then still most of numactl stuff works as euid != 0, but numactl --hardware gets at least "libnuma: Warning: Cannot parse distance information in sysfs: Permission denied" or same story with numactl -C 678 date. So unless we start way more heavy use of libnuma (not just for observability) there's like no point in that right now (?) Contrary to that: we can do just do variadic elog() for that, I've put some code, but no idea if that works fine..." > > +Size > > +pg_numa_get_pagesize(void) [..] > > I would probably implement this once, outside of the big ifdef, with one more > ifdef inside, given that you're sharing the same implementation. Done. > > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001 > > From: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Date: Wed, 19 Mar 2025 09:34:56 +0100 > > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's > > primary function to separate functions. > > > > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple() > > and get_buffercache_tuple() that help fill result tuplestores based on the > > buffercache contents. This will be used in a follow-up commit that implements > > NUMA observability in pg_buffercache. > > > > Author: Jakub Wartak <jakub.wartak@enterprisedb.com> > > Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> > > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com > > --- > > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++-------- > > 1 file changed, 178 insertions(+), 139 deletions(-) > > > > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c > > index 62602af1775..86e0b8afe01 100644 > > --- a/contrib/pg_buffercache/pg_buffercache_pages.c > > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c > > @@ -14,7 +14,6 @@ > > #include "storage/buf_internals.h" > > #include "storage/bufmgr.h" > > > > - > > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 > > Independent change. Fixed. > > #define NUM_BUFFERCACHE_PAGES_ELEM 9 > > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5 > > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary); > > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts); > > PG_FUNCTION_INFO_V1(pg_buffercache_evict); > > > > -Datum > > -pg_buffercache_pages(PG_FUNCTION_ARGS) > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs > > + * memory mapping inquiries to display NUMA node information for each buffer. > > + */ > > If it's a helper routine it's probably not identical to > pg_buffercache_numa_pages()? Of course not, fixed by removing that comment. > > +/* > > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages(). > > + * > > + * Build buffer cache information for a single buffer. > > + */ > > +static void > > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx) > > +{ > > This isn't really building a tuple tuple? Seems somewhat confusing, because > get_buffercache_tuple() does actually build one. s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g , unless someone wants to come with better name. > > + BufferDesc *bufHdr; > > + uint32 buf_state; > > + > > + bufHdr = GetBufferDescriptor(record_id); > > + /* Lock each buffer header before inspecting. */ > > + buf_state = LockBufHdr(bufHdr); > > + > > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr); > > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag); > > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid; > > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid; > > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag); > > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum; > > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); > > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); > > As above, I think this would be more readable if you put > fctx->record[record_id] into a local var. Done. > > +static Datum > > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx) > > +{ > > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; > > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; > > + HeapTuple tuple; > > + > > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid); > > + nulls[0] = false; > > + > > + /* > > + * Set all fields except the bufferid to null if the buffer is unused or > > + * not valid. > > + */ > > + if (fctx->record[record_id].blocknum == InvalidBlockNumber || > > + fctx->record[record_id].isvalid == false) > > + { > > + nulls[1] = true; > > + nulls[2] = true; > > + nulls[3] = true; > > + nulls[4] = true; > > + nulls[5] = true; > > + nulls[6] = true; > > + nulls[7] = true; > > + > > + /* unused for v1.0 callers, but the array is always long enough */ > > + nulls[8] = true; > > I'd probably just memset the entire nulls array to either true or false, > instead of doing it one-by-one. Done. > > + } > > + else > > + { > > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber); > > + nulls[1] = false; > > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace); > > + nulls[2] = false; > > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase); > > + nulls[3] = false; > > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum); > > + nulls[4] = false; > > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum); > > + nulls[5] = false; > > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty); > > + nulls[6] = false; > > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount); > > + nulls[7] = false; > > Seems like it would end up a lot more readable if you put > fctx->record[record_id] into a local variable. Unfortunately that'd probably > be best done in one more commit ahead of the rest of the this one... Done, i've put it those refactorig changes into the commit already dedicated only for a refactor. For the record Bertrand also asked for something about this, but I was somehow afraid to touch Tom's code. > > @@ -0,0 +1,28 @@ > > +SELECT NOT(pg_numa_available()) AS skip_test \gset > > +\if :skip_test > > +\quit > > +\endif > > You could avoid the need for an alternative output file if you instead made > the queries do something like > SELECT NOT pg_numa_available() OR count(*) ... ITEM REMAINING: Is this for the future or can it stay like that? I don't have a hard opinion on this, but I've already wasted lots of cycles to discover that one can have those ".1" alternative expected result files. > > --- /dev/null > > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > > @@ -0,0 +1,42 @@ > > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > > + > > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > > + > > +-- Register the new function. > > +DROP VIEW pg_buffercache; > > +DROP FUNCTION pg_buffercache_pages(); > > I don't think we can just drop a view in the upgrade script. That will fail if > anybody created a view depending on pg_buffercache. Ugh, fixed, thanks. That must have been some leftover (we later do CREATE OR REPLACE those anyway). > (Sorry, ran out of time / energy here, i had originally just wanted to comment > on the apt-get thing in the tests) Thanks! AIO intensifies ... :) -J. [1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
Attachment
Hi, On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote: > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > > > > +Size > > > +pg_numa_get_pagesize(void) > [..] > > > > Should this have a comment or an assertion that it can only be used after > > shared memory startup? Because before that huge_pages_status won't be > > meaningful? > > Added both. Thanks for the updated version! + Assert(IsUnderPostmaster); I wonder if that would make more sense to add an assertion on huge_pages_status and HUGE_PAGES_UNKNOWN instead (more or less as it is done in CreateSharedMemoryAndSemaphores()). === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch Once applied I can see mention to pg_buffercache_numa_pages() while it only comes in v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch. I think pg_buffercache_numa_pages() should not be mentioned before it's actually implemented. === 1 + bufRecord->isvalid == false) { int i; - funcctx = SRF_FIRSTCALL_INIT(); - - /* Switch context when allocating stuff to be used in later calls */ - oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); - - /* Create a user function context for cross-call persistence */ - fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext)); + for (i = 1; i <= 9; i++) + nulls[i] = true; "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM is increased to 10). In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM). That could also make sense to remove the loop and use memset() that way: " memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool)); " instead. It's done that way in some other places (hbafuncs.c for example). === 2 - if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM) - TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", - INT4OID, -1, 0); + if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1) + TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", + INT4OID, -1, 0); I think we should not change the "expected_tupledesc->natts" check here until v17-0003 is applied. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Hi, Hi Bertrand, happy to see you back, thanks for review and here's v18 attached (an ideal fit for PG18 ;)) > On Mon, Mar 31, 2025 at 11:27:50AM +0200, Jakub Wartak wrote: > > On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > +Size > > > > +pg_numa_get_pagesize(void) > > [..] > > > > > > Should this have a comment or an assertion that it can only be used after > > > shared memory startup? Because before that huge_pages_status won't be > > > meaningful? > > > > Added both. > > Thanks for the updated version! > > + Assert(IsUnderPostmaster); > > I wonder if that would make more sense to add an assertion on huge_pages_status > and HUGE_PAGES_UNKNOWN instead (more or less as it is done in > CreateSharedMemoryAndSemaphores()). Ok, let's have both just in case (this status is by default initialized to _UNKNOWN, so I hope you haven't had in mind using GetConfigOption() as this would need guc.h in port?) > === About v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch [..] > I think pg_buffercache_numa_pages() should not be mentioned before it's actually > implemented. Right, fixed. > === 1 > [..] > "i <= 9" will be correct only once v17-0003 is applied (when NUM_BUFFERCACHE_PAGES_ELEM > is increased to 10). > > In v17-0002 that should be i < 9 (even better i < NUM_BUFFERCACHE_PAGES_ELEM). > > That could also make sense to remove the loop and use memset() that way: > > " > memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool)); > " > > instead. It's done that way in some other places (hbafuncs.c for example). Ouch, good catch. > === 2 > > - if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM) > - TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", > - INT4OID, -1, 0); > > + if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1) > + TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends", > + INT4OID, -1, 0); > > I think we should not change the "expected_tupledesc->natts" check here until > v17-0003 is applied. Right, I've moved that into 003 where it belongs and now 002 has no single NUMA reference. I've thrown 0001+0002 alone onto CI and it passed too. -J.
Attachment
Hi Jakub, On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote: > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > Hi, > > Hi Bertrand, happy to see you back, thanks for review and here's v18 > attached (an ideal fit for PG18 ;)) Thanks for the new version! === About v18-0002 It looks in a good shape to me. The helper's name might still be debatable though. I just have 2 comments: === 1 + if (bufRecord->blocknum == InvalidBlockNumber || + bufRecord->isvalid == false) It seems to me that this check could now fit in one line. === 2 + { SRF_RETURN_DONE(funcctx); + } Extra parentheses are not needed. === About v18-0003 === 3 I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain only the necessary changes when updating from 1.5. It means that it should "only" create the new objects (views and functions in our case) that come in v18-0003 and grant appropriate privs. Also it should mention: " \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit " and not: " +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit " The already existing pg_buffercache--1.N--1.(N+1).sql are good examples. === 4 + for (i = 0; i < NBuffers; i++) + { + int blk2page = (int) i * pages_per_blk; + I think that should be: int blk2page = (int) (i * pages_per_blk); === About v18-0004 === 5 When running: select c.name, c.size as num_size, s.size as shmem_size from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocationss where c.name = s.name; I can see: - pg_shmem_numa_allocations reporting a lot of times the same size - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size Do you observe the same? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, I've spent a bit of time reviewing this. In general I haven't found anything I'd call a bug, but here's a couple comments for v18 ... Most of this is in separate "review" commits, with a couple exceptions. 1) Please update the commit messages, with proper formatting, etc. I tried to do that in the attached v19, but please go through that, add relevant details, update list of reviewers, etc. The subject should not be overly long, etc. 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd use just "libnuma support" similar to other libraries. 3) I don't think we need pg_numa.h to have this: extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS); AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why would it be necessary here? It's enough to have a prototype in .c file. 4) Improved .sgml to have acronym/productname in a couple places. 5) I don't think the comment for pg_buffercache_init_entries() is very useful. That it's helper for pg_buffercache_pages() tells me nothing about how to use it, what the arguments are, etc. 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better comment too. I mean, there's no info about what the arguments are, which arguments are input or output, etc. And it only discussed one option (block page < memory page), but what happens in the other case? The formulas with blk2page/blk2pageoff are not quite clear to me (I'm not saying it's wrong). However, it seems rather suspicious that pages_per_blk is calculated as float, and pg_buffercache_numa_prepare_ptrs() then does this: for (size_t j = 0; j < pages_per_blk; j++) { ... } I mean, isn't this vulnerable to rounding errors, which might trigger some weird behavior? If not, it'd be good to add a comment why this is fine, it confuses me a lot. I personally would probably prefer doing just integer arithmetic here. 7) This para in the docs seems self-contradictory: <para> The <function>pg_buffercache_numa_pages()</function> provides the same information as <function>pg_buffercache_pages()</function> but is slower because it also provides the <acronym>NUMA</acronym> node ID per shared buffer entry. The <structname>pg_buffercache_numa</structname> view wraps the function for convenient use. </para> I mean, "provides the same information, but is slower because it provides different information" is strange. I understand the point, but maybe rephrase somehow? 8) Why is pg_numa_available() marked as volatile? Should not change in a running cluster, no? 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe let's not do that. <sect2 id="pgbuffercache-pg-buffercache_numa"> 10) I think it'd be good to mention/explain why move_pages is used instead of get_mempolicy - more efficient with batching, etc. This would be useful both in the commit message and before the move_pages call (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the pointers like this etc.). 11) This could use UINT64_FORMAT, instead of a cast: elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f", (unsigned long) os_page_count, os_page_size, pages_per_blk); regards -- Tomas Vondra
Attachment
- v19-0001-Add-support-for-basic-NUMA-awareness.patch
- v19-0002-pgindent.patch
- v19-0003-review.patch
- v19-0004-pg_buffercache-split-pg_buffercache_pages-into-p.patch
- v19-0005-review.patch
- v19-0006-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch
- v19-0007-review.patch
- v19-0008-Add-pg_shmem_numa_allocations-to-show-NUMA-node.patch
On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi Jakub, > > On Tue, Apr 01, 2025 at 12:56:06PM +0200, Jakub Wartak wrote: > > On Mon, Mar 31, 2025 at 4:59 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > Hi, > > > > Hi Bertrand, happy to see you back, thanks for review and here's v18 > > attached (an ideal fit for PG18 ;)) > > Thanks for the new version! > Hi Bertrand, I'll be attaching v20 when responding to the follow-up review by Tomas to avoid double-sending this in the list, but review findings from here are fixed in v20. > === About v18-0002 > > It looks in a good shape to me. The helper's name might still be debatable > though. > > I just have 2 comments: > > === 1 > > + if (bufRecord->blocknum == InvalidBlockNumber || > + bufRecord->isvalid == false) > > It seems to me that this check could now fit in one line. OK. > === 2 > > + { > SRF_RETURN_DONE(funcctx); > + } > > Extra parentheses are not needed. This also should be fixed in v20. > > === About v18-0003 > > === 3 > > I think that pg_buffercache--1.5--1.6.sql is not correct. It should contain > only the necessary changes when updating from 1.5. It means that it should "only" > create the new objects (views and functions in our case) that come in v18-0003 > and grant appropriate privs. > > Also it should mention: > > " > \echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit > " > and not: > > " > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > " > > The already existing pg_buffercache--1.N--1.(N+1).sql are good examples. Hm, good find, it might be leftover from first attempts where we were aiming just for adding column numa_node_id to pg_buffercache_pages() (rather than adding new view), all hopefully fixed. > === 4 > > + for (i = 0; i < NBuffers; i++) > + { > + int blk2page = (int) i * pages_per_blk; > + > > I think that should be: > > int blk2page = (int) (i * pages_per_blk); OK, but I still fail to grasp why pg_indent doesnt fix this stuff on it's own... I believe orginal ident, would fix this on it's own? > === About v18-0004 > > === 5 > > When running: > > select c.name, c.size as num_size, s.size as shmem_size > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocationss > where c.name = s.name; > > I can see: > > - pg_shmem_numa_allocations reporting a lot of times the same size > - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size > > Do you observe the same? > Yes, it is actually by design: the pg_shmem_allocations.size is sum of page sizes not size of struct, e.g. with "order by 3 desc": name | num_size | shmem_size ------------------------------------------------+-----------+------------ WaitEventCustomCounterData | 4096 | 8 Archiver Data | 4096 | 8 SerialControlData | 4096 | 16 I was even wondering if it does make sense to output "shm pointer address" (or at least offset) there to see which shm structures are on the same page, but we have the same pg_shm_allocations already. -J.
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > I've spent a bit of time reviewing this. In general I haven't found > anything I'd call a bug, but here's a couple comments for v18 ... Most > of this is in separate "review" commits, with a couple exceptions. Hi, thank you very much for help on this, yes I did not anticipate this patch to organically grow like that... I've squashed those review findings into v20 and provided answers for the "review:". > 1) Please update the commit messages, with proper formatting, etc. I > tried to do that in the attached v19, but please go through that, add > relevant details, update list of reviewers, etc. The subject should not > be overly long, etc. Fixed by you. > 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd > use just "libnuma support" similar to other libraries. Fixed by you. > 3) I don't think we need pg_numa.h to have this: > > extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS); > > AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why > would it be necessary here? It's enough to have a prototype in .c file. Right, probably the result of ENOTENOUGHCOFFEE and copy/paste. > 4) Improved .sgml to have acronym/productname in a couple places. Great. > 5) I don't think the comment for pg_buffercache_init_entries() is very > useful. That it's helper for pg_buffercache_pages() tells me nothing > about how to use it, what the arguments are, etc. I've added an explanation (in 0003 though), so that this is covered. I've always assumed that 'static' functions don't need that much of that(?) > 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better > comment too. I mean, there's no info about what the arguments are, which > arguments are input or output, etc. And it only discussed one option > (block page < memory page), but what happens in the other case? The > formulas with blk2page/blk2pageoff are not quite clear to me (I'm not > saying it's wrong). > > However, it seems rather suspicious that pages_per_blk is calculated as > float, and pg_buffercache_numa_prepare_ptrs() then does this: > > for (size_t j = 0; j < pages_per_blk; j++) > { ... } > > I mean, isn't this vulnerable to rounding errors, which might trigger > some weird behavior? If not, it'd be good to add a comment why this is > fine, it confuses me a lot. I personally would probably prefer doing > just integer arithmetic here. Please bear with me: If you set client_min_messages to debug1 and then pg_buffercache_numa will dump: a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.00 b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152 pages_per_blk=0.003906 so we need to be agile to support two cases as you mention (BLCKSZ > PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are 4kB..1GB, thus we can get in that float the following sample values: BLCKSZ pagesize 2kB 4kB = 0.5 2kB 2048kb = .0009765625 2kB 1024*1024kb # 1GB = .0000019073486328125 # worst-case? 8kB 4kB = 2 8kB 2048kb = .003906250 # example from above (x86_64, 2M HP) 8kB 1024*1024kb # 1GB = .00000762939453 32kB 4kB = 8 32kB 2048kb = .0156250 32kB 1024*1024kb # 1GB = .000030517578125 So that loop: for (size_t j = 0; j < pages_per_blk; j++) is quite generic and launches in both cases. I've somehow failed to somehow come up with integer-based math and generic code for this (without special cases which seem to be no-go here?). So, that loop then will: a) launch many times to support BLCKSZ > pagesize, that is when single DB block spans multiple memory pages b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the example above) Loop touches && stores addresses into os_page_ptrs[] as input to this one big move_pages(2) query. So we basically ask for all memory pages for NBuffers. Once we get our NUMA information we then use blk2page = up_to_NBuffers * pages_per_blk to resolve memory pointers back to Buffers, if anywhere it could be a problem here. So let's say we have s_b=4TB (it wont work for sure for other reasons, let's assume we have it), let's also assume we have no huge pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824 which multiplied by 2 = INT_MAX (integer overflow bug), so I think that int is not big enough there in pg_buffercache_numa_pages() (it should be "size_t blk2page" there as in pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20) Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB => NBuffers=2097152 * 0.003906250 = 8192.0 . OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with float f = 2097152 * 0.003906250; under clang -Weverything I got "implicit conversion increases floating-point precision: 'float' to 'double'", so either it is: - we somehow rewrite all of the core arithmetics here to integer? - or simply go with doubles just to be sure? I went with doubles in v20, comments explaining are not there yet. > 7) This para in the docs seems self-contradictory: > > <para> > The <function>pg_buffercache_numa_pages()</function> provides the same > information > as <function>pg_buffercache_pages()</function> but is slower because > it also > provides the <acronym>NUMA</acronym> node ID per shared buffer entry. > The <structname>pg_buffercache_numa</structname> view wraps the > function for > convenient use. > </para> > > I mean, "provides the same information, but is slower because it > provides different information" is strange. I understand the point, but > maybe rephrase somehow? Oh my... yes, now it looks way better. > 8) Why is pg_numa_available() marked as volatile? Should not change in a > running cluster, no? No it shouldn't, good find, made it 's'table. > 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe > let's not do that. > > <sect2 id="pgbuffercache-pg-buffercache_numa"> Fixed. > 10) I think it'd be good to mention/explain why move_pages is used > instead of get_mempolicy - more efficient with batching, etc. This would > be useful both in the commit message and before the move_pages call Ok, added in 0001. > (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the > pointers like this etc.). Added reference to that earlier new comment here too in 0003. > 11) This could use UINT64_FORMAT, instead of a cast: > > elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu > pages_per_blk=%.2f", > (unsigned long) os_page_count, os_page_size, pages_per_blk); Done. 12) You have also raised "why not pg_shm_allocations_numa" instead of "pg_shm_numa_allocations" OPEN_QUESTION: To be honest, I'm not attached to any of those two (or naming things in general), I can change if you want. 13) In the patch: "review: What if we get multiple pages per buffer (the default). Could we get multiple nodes per buffer?" OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to output multiple rows per single buffer (with "page_no") then we could get this: buffer1:..:page0:numanodeID1 buffer1:..:page1:numanodeID2 buffer2:..:page0:numanodeID1 Should we add such functionality? -J.
Attachment
Hi Jakub, On Wed, Apr 02, 2025 at 04:45:53PM +0200, Jakub Wartak wrote: > On Tue, Apr 1, 2025 at 5:13 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > === 4 > > > > + for (i = 0; i < NBuffers; i++) > > + { > > + int blk2page = (int) i * pages_per_blk; > > + > > > > I think that should be: > > > > int blk2page = (int) (i * pages_per_blk); > > OK, but I still fail to grasp why pg_indent doesnt fix this stuff on > it's own... I believe orginal ident, would fix this on it's own? My comment was not about indention but about the fact that I think that the casting is not a the right place. I think that's the result of the multiplication that we want to be casted (cast operator has higher precedence than Multiplication operator). > > > > select c.name, c.size as num_size, s.size as shmem_size > > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocationss > > where c.name = s.name; > > > > I can see: > > > > - pg_shmem_numa_allocations reporting a lot of times the same size > > - pg_shmem_numa_allocations and pg_shmem_allocations not reporting the same size > > > > Do you observe the same? > > > > Yes, it is actually by design: the pg_shmem_allocations.size is sum of > page sizes not size of struct, Ok, but then does it make sense to see some num_size < shmem_size? postgres=# select c.name, c.size as num_size, s.size as shmem_size from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocationss where c.name = s.name and s.size > c.size; name | num_size | shmem_size ---------------+-----------+------------ XLOG Ctl | 4194304 | 4208200 Buffer Blocks | 134217728 | 134221824 AioHandleIOV | 2097152 | 2850816 (3 rows) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 4/2/25 16:46, Jakub Wartak wrote: > On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi, >> >> I've spent a bit of time reviewing this. In general I haven't found >> anything I'd call a bug, but here's a couple comments for v18 ... Most >> of this is in separate "review" commits, with a couple exceptions. > > Hi, thank you very much for help on this, yes I did not anticipate > this patch to organically grow like that... > I've squashed those review findings into v20 and provided answers for > the "review:". > Thanks. >> 1) Please update the commit messages, with proper formatting, etc. I >> tried to do that in the attached v19, but please go through that, add >> relevant details, update list of reviewers, etc. The subject should not >> be overly long, etc. > > Fixed by you. > OK, so you agree the commit messages are complete / correct? >> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd >> use just "libnuma support" similar to other libraries. > > Fixed by you. > OK. FWIW if you disagree with some of my proposed changes, feel free to push back. I'm sure some may be more a matter of personal preference. >> 3) I don't think we need pg_numa.h to have this: >> >> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS); >> >> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why >> would it be necessary here? It's enough to have a prototype in .c file. > > Right, probably the result of ENOTENOUGHCOFFEE and copy/paste. > >> 4) Improved .sgml to have acronym/productname in a couple places. > > Great. > >> 5) I don't think the comment for pg_buffercache_init_entries() is very >> useful. That it's helper for pg_buffercache_pages() tells me nothing >> about how to use it, what the arguments are, etc. > > I've added an explanation (in 0003 though), so that this is covered. > I've always assumed that 'static' functions don't need that much of > that(?) > I think that's mostly true - the (non-static) functions that are part of the API for a module need better / more detailed docs. But that doesn't mean static functions shouldn't have at least basic docs (unless the function is trivial / obvious, but I don't think that's the case here). If I happen to work a pg_buffercache patch in a couple months, I'll still need to understand what the function does. It won't save me that I'm working on the same file ... I'm not saying this needs a detailed docs, but "helper for X" adds very little information - I can easily see where it's called from, right? >> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better >> comment too. I mean, there's no info about what the arguments are, which >> arguments are input or output, etc. And it only discussed one option >> (block page < memory page), but what happens in the other case? The >> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not >> saying it's wrong). >> >> However, it seems rather suspicious that pages_per_blk is calculated as >> float, and pg_buffercache_numa_prepare_ptrs() then does this: >> >> for (size_t j = 0; j < pages_per_blk; j++) >> { ... } >> >> I mean, isn't this vulnerable to rounding errors, which might trigger >> some weird behavior? If not, it'd be good to add a comment why this is >> fine, it confuses me a lot. I personally would probably prefer doing >> just integer arithmetic here. > > Please bear with me: If you set client_min_messages to debug1 and then > pg_buffercache_numa will dump: > a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096 > pages_per_blk=2.00 > b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152 > pages_per_blk=0.003906 > > so we need to be agile to support two cases as you mention (BLCKSZ > > PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are > 4kB..1GB, thus we can get in that float the following sample values: > BLCKSZ pagesize > 2kB 4kB = 0.5 > 2kB 2048kb = .0009765625 > 2kB 1024*1024kb # 1GB = .0000019073486328125 # worst-case? > 8kB 4kB = 2 > 8kB 2048kb = .003906250 # example from above (x86_64, 2M HP) > 8kB 1024*1024kb # 1GB = .00000762939453 > 32kB 4kB = 8 > 32kB 2048kb = .0156250 > 32kB 1024*1024kb # 1GB = .000030517578125 > > So that loop: > for (size_t j = 0; j < pages_per_blk; j++) > is quite generic and launches in both cases. I've somehow failed to > somehow come up with integer-based math and generic code for this > (without special cases which seem to be no-go here?). So, that loop > then will: > a) launch many times to support BLCKSZ > pagesize, that is when single > DB block spans multiple memory pages > b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the > example above) > Hmmm, OK. Maybe it's correct. I still find the float arithmetic really confusing and difficult to reason about ... I agree we don't want special cases for each possible combination of page sizes (I'm not sure we even know all the combinations). What I was thinking about is two branches, one for (block >= page) and another for (block < page). AFAICK both values have to be 2^k, so this would guarantee we have either (block/page) or (page/block) as integer. I wonder if you could even just calculate both, and have one loop that deals with both. > Loop touches && stores addresses into os_page_ptrs[] as input to this > one big move_pages(2) query. So we basically ask for all memory pages > for NBuffers. Once we get our NUMA information we then use blk2page = > up_to_NBuffers * pages_per_blk to resolve memory pointers back to > Buffers, if anywhere it could be a problem here. > IMHO this is the information I'd expect in the function comment. > So let's say we have s_b=4TB (it wont work for sure for other reasons, > let's assume we have it), let's also assume we have no huge > pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824 > which multiplied by 2 = INT_MAX (integer overflow bug), so I think > that int is not big enough there in pg_buffercache_numa_pages() (it > should be "size_t blk2page" there as in > pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20) > > Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB => > NBuffers=2097152 * 0.003906250 = 8192.0 . > > OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with > float f = 2097152 * 0.003906250; > under clang -Weverything I got "implicit conversion increases > floating-point precision: 'float' to 'double'", so either it is: > - we somehow rewrite all of the core arithmetics here to integer? > - or simply go with doubles just to be sure? I went with doubles in > v20, comments explaining are not there yet. > When I say "integer arithmetic" I don't mean it should use 32-bit ints, or any other data type. I mean that it works with non-floating point values. It could be int64, Size or whatever is large enough to not overflow. I really don't see how changing stuff to double makes this easier to understand. >> 7) This para in the docs seems self-contradictory: >> >> <para> >> The <function>pg_buffercache_numa_pages()</function> provides the same >> information >> as <function>pg_buffercache_pages()</function> but is slower because >> it also >> provides the <acronym>NUMA</acronym> node ID per shared buffer entry. >> The <structname>pg_buffercache_numa</structname> view wraps the >> function for >> convenient use. >> </para> >> >> I mean, "provides the same information, but is slower because it >> provides different information" is strange. I understand the point, but >> maybe rephrase somehow? > > Oh my... yes, now it looks way better. > >> 8) Why is pg_numa_available() marked as volatile? Should not change in a >> running cluster, no? > > No it shouldn't, good find, made it 's'table. > >> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe >> let's not do that. >> >> <sect2 id="pgbuffercache-pg-buffercache_numa"> > > Fixed. > >> 10) I think it'd be good to mention/explain why move_pages is used >> instead of get_mempolicy - more efficient with batching, etc. This would >> be useful both in the commit message and before the move_pages call > > Ok, added in 0001. > >> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the >> pointers like this etc.). > > Added reference to that earlier new comment here too in 0003. > Will take a look in the evening. >> 11) This could use UINT64_FORMAT, instead of a cast: >> >> elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu >> pages_per_blk=%.2f", >> (unsigned long) os_page_count, os_page_size, pages_per_blk); > > Done. > > 12) You have also raised "why not pg_shm_allocations_numa" instead of > "pg_shm_numa_allocations" > > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or > naming things in general), I can change if you want. > Me neither. I wonder if there's some precedent when adding similar variants for other catalogs ... can you check? I've been thinking about pg_stats and pg_stats_ext, but maybe there's a better example? > 13) In the patch: "review: What if we get multiple pages per buffer > (the default). Could we get multiple nodes per buffer?" > > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to > output multiple rows per single buffer (with "page_no") then we could > get this: > buffer1:..:page0:numanodeID1 > buffer1:..:page1:numanodeID2 > buffer2:..:page0:numanodeID1 > > Should we add such functionality? > When you say "today no" does that mean we know all pages will be on the same node, or that there may be pages from different nodes and we can't display that? That'd not be great, IMHO. I'm not a huge fan of returning multiple rows per buffer, with one row per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The rest of the fields is for the whole buffer, it'd be wrong to duplicate that for each page. I wonder if we should have a bitmap of nodes for the buffer (but then what if there are multiple pages from the same node?), or maybe just an array of nodes, with one element per page. regards -- Tomas Vondra
On Wed, Apr 2, 2025 at 5:27 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi Jakub, Hi Bertrand, > > OK, but I still fail to grasp why pg_indent doesnt fix this stuff on > > it's own... I believe orginal ident, would fix this on it's own? > > My comment was not about indention but about the fact that I think that the > casting is not a the right place. I think that's the result of the multiplication > that we want to be casted (cast operator has higher precedence than Multiplication > operator). Oh! I've missed that, but v21 got a rewrite (still not polished) just to show it can be done without float points as Tomas requested. [..] > Ok, but then does it make sense to see some num_size < shmem_size? > > postgres=# select c.name, c.size as num_size, s.size as shmem_size > from (select n.name as name, sum(n.size) as size from pg_shmem_numa_allocations n group by n.name) c, pg_shmem_allocationss > where c.name = s.name and s.size > c.size; > name | num_size | shmem_size > ---------------+-----------+------------ > XLOG Ctl | 4194304 | 4208200 > Buffer Blocks | 134217728 | 134221824 > AioHandleIOV | 2097152 | 2850816 This was a real bug, fixed in v21 , the ent->allocated_size was not properly page aligned. Thanks for the attention to detail. -J.
On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, > OK, so you agree the commit messages are complete / correct? Yes. > OK. FWIW if you disagree with some of my proposed changes, feel free to > push back. I'm sure some may be more a matter of personal preference. No, it's all fine. I will probably have lots of questions about setting proper env for development that cares itself about style, but that's for another day. > [..floats..] > Hmmm, OK. Maybe it's correct. I still find the float arithmetic really > confusing and difficult to reason about ... > > I agree we don't want special cases for each possible combination of > page sizes (I'm not sure we even know all the combinations). What I was > thinking about is two branches, one for (block >= page) and another for > (block < page). AFAICK both values have to be 2^k, so this would > guarantee we have either (block/page) or (page/block) as integer. > > I wonder if you could even just calculate both, and have one loop that > deals with both. > [..] > When I say "integer arithmetic" I don't mean it should use 32-bit ints, > or any other data type. I mean that it works with non-floating point > values. It could be int64, Size or whatever is large enough to not > overflow. I really don't see how changing stuff to double makes this > easier to understand. I hear you, attached v21 / 0003 is free of float/double arithmetics and uses non-float point values. It should be more readable too with those comments. I have not put it into its own function, because now it fits the whole screen, so hopefully one can follow visually. Please let me know if that code solves the doubts or feel free to reformat it. That _numa_prepare_ptrs() is unused and will need to be removed, but we can still move some code there if necessary. > > 12) You have also raised "why not pg_shm_allocations_numa" instead of > > "pg_shm_numa_allocations" > > > > OPEN_QUESTION: To be honest, I'm not attached to any of those two (or > > naming things in general), I can change if you want. > > > > Me neither. I wonder if there's some precedent when adding similar > variants for other catalogs ... can you check? I've been thinking about > pg_stats and pg_stats_ext, but maybe there's a better example? Hm, it seems we always go with suffix "_somethingnew": * pg_stat_database -> pg_stat_database_conflicts * pg_stat_subscription -> pg_stat_subscription_stats * even here: pg_buffercache -> pg_buffercache_numa @Bertrand: do you have anything against pg_shm_allocations_numa instead of pg_shm_numa_allocations? I don't mind changing it... > > 13) In the patch: "review: What if we get multiple pages per buffer > > (the default). Could we get multiple nodes per buffer?" > > > > OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to > > output multiple rows per single buffer (with "page_no") then we could > > get this: > > buffer1:..:page0:numanodeID1 > > buffer1:..:page1:numanodeID2 > > buffer2:..:page0:numanodeID1 > > > > Should we add such functionality? > > When you say "today no" does that mean we know all pages will be on the > same node, or that there may be pages from different nodes and we can't > display that? That'd not be great, IMHO. > > I'm not a huge fan of returning multiple rows per buffer, with one row > per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The > rest of the fields is for the whole buffer, it'd be wrong to duplicate > that for each page. OPEN_QUESTION: With v21 we have all the information available, we are just unable to display this in pg_buffercache_numa right now. We could trim the view so that it has 3 columns (and user needs to JOIN to pg_buffercache for more details like relationoid), but then what the whole refactor (0002) was for if we would just return bufferId like below: buffer1:page0:numanodeID1 buffer1:page1:numanodeID2 buffer2:page0:numanodeID1 buffer2:page1:numanodeID1 There's also the problem that reading/joining could be inconsistent and even slower. > I wonder if we should have a bitmap of nodes for the buffer (but then > what if there are multiple pages from the same node?), or maybe just an > array of nodes, with one element per page. AFAIR this has been discussed back in end of January, and the conclusion was more or less - on Discord - that everything sucks (bitmaps, BIT() datatype, arrays,...) either from implementation or user side, but apparently arrays [] would suck the least from implementation side. So we could probably do something like up to node_max_nodes(): buffer1:..:{0, 2, 0, 0} buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes buffer3:..:{0, 0, 0, 2} Other idea is JSON or even simple string with numa_node_id<->count: buffer1:..:"1=2" buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes buffer3:..:"3=2" I find all of those non-user friendly and I'm afraid I won't be able to pull that alone in time... -J.
Attachment
Hi, On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote: > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi Tomas, > > > OK, so you agree the commit messages are complete / correct? > > Yes. Not 100% sure on my side. === v21-0002 Says: " This introduces three new functions: - pg_buffercache_init_entries - pg_buffercache_build_tuple - get_buffercache_tuple " While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() is). About v21-0002: === 1 I can see that the pg_buffercache_init_entries() helper comments are added in v21-0003 but I think that it would be better to add them in v21-0002 (where the helper is actually created). About v21-0003: === 2 > I hear you, attached v21 / 0003 is free of float/double arithmetics > and uses non-float point values. + if (buffers_per_page > 1) + os_page_query_count = NBuffers; + else + os_page_query_count = NBuffers * pages_per_buffer; yeah, that's more elegant. I think that it properly handles the relationships between buffer and page sizes without relying on float arithmetic. === 3 + if (buffers_per_page > 1) + { As buffers_per_page does not change, I think I'd put this check outside of the for (i = 0; i < NBuffers; i++) loop, something like: " if (buffers_per_page > 1) { /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ for (i = 0; i < NBuffers; i++) { . . . . } else { /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ for (i = 0; i < NBuffers; i++) { . . . " === 4 > That _numa_prepare_ptrs() is unused and will need to be removed, > but we can still move some code there if necessary. Yeah I think that it can be simply removed then. === 5 > @Bertrand: do you have anything against pg_shm_allocations_numa > instead of pg_shm_numa_allocations? I don't mind changing it... I think that pg_shm_allocations_numa() is better (given the examples you just shared). === 6 > I find all of those non-user friendly and I'm afraid I won't be able > to pull that alone in time... Maybe we could add a few words in the doc to mention the "multiple nodes per buffer" case? And try to improve it for say 19? Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 4/3/25 09:01, Jakub Wartak wrote: > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi Tomas, > >> OK, so you agree the commit messages are complete / correct? > > Yes. > >> OK. FWIW if you disagree with some of my proposed changes, feel free to >> push back. I'm sure some may be more a matter of personal preference. > > No, it's all fine. I will probably have lots of questions about > setting proper env for development that cares itself about style, but > that's for another day. > >> [..floats..] >> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really >> confusing and difficult to reason about ... >> >> I agree we don't want special cases for each possible combination of >> page sizes (I'm not sure we even know all the combinations). What I was >> thinking about is two branches, one for (block >= page) and another for >> (block < page). AFAICK both values have to be 2^k, so this would >> guarantee we have either (block/page) or (page/block) as integer. >> >> I wonder if you could even just calculate both, and have one loop that >> deals with both. >> > [..] >> When I say "integer arithmetic" I don't mean it should use 32-bit ints, >> or any other data type. I mean that it works with non-floating point >> values. It could be int64, Size or whatever is large enough to not >> overflow. I really don't see how changing stuff to double makes this >> easier to understand. > > I hear you, attached v21 / 0003 is free of float/double arithmetics > and uses non-float point values. It should be more readable too with > those comments. I have not put it into its own function, because now > it fits the whole screen, so hopefully one can follow visually. Please > let me know if that code solves the doubts or feel free to reformat > it. That _numa_prepare_ptrs() is unused and will need to be removed, > but we can still move some code there if necessary. > IMHO the code in v21 is much easier to understand. It's not quite clear to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though. >>> 12) You have also raised "why not pg_shm_allocations_numa" instead of >>> "pg_shm_numa_allocations" >>> >>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or >>> naming things in general), I can change if you want. >>> >> >> Me neither. I wonder if there's some precedent when adding similar >> variants for other catalogs ... can you check? I've been thinking about >> pg_stats and pg_stats_ext, but maybe there's a better example? > > Hm, it seems we always go with suffix "_somethingnew": > > * pg_stat_database -> pg_stat_database_conflicts > * pg_stat_subscription -> pg_stat_subscription_stats > * even here: pg_buffercache -> pg_buffercache_numa > > @Bertrand: do you have anything against pg_shm_allocations_numa > instead of pg_shm_numa_allocations? I don't mind changing it... > +1 to pg_shmem_allocations_numa >>> 13) In the patch: "review: What if we get multiple pages per buffer >>> (the default). Could we get multiple nodes per buffer?" >>> >>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to >>> output multiple rows per single buffer (with "page_no") then we could >>> get this: >>> buffer1:..:page0:numanodeID1 >>> buffer1:..:page1:numanodeID2 >>> buffer2:..:page0:numanodeID1 >>> >>> Should we add such functionality? >> >> When you say "today no" does that mean we know all pages will be on the >> same node, or that there may be pages from different nodes and we can't >> display that? That'd not be great, IMHO. >> >> I'm not a huge fan of returning multiple rows per buffer, with one row >> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The >> rest of the fields is for the whole buffer, it'd be wrong to duplicate >> that for each page. > > OPEN_QUESTION: With v21 we have all the information available, we are > just unable to display this in pg_buffercache_numa right now. We could > trim the view so that it has 3 columns (and user needs to JOIN to > pg_buffercache for more details like relationoid), but then what the > whole refactor (0002) was for if we would just return bufferId like > below: > > buffer1:page0:numanodeID1 > buffer1:page1:numanodeID2 > buffer2:page0:numanodeID1 > buffer2:page1:numanodeID1 > > There's also the problem that reading/joining could be inconsistent > and even slower. > I think a view with just 3 columns would be a good solution. It's what pg_shmem_allocations_numa already does, so it'd be consistent with that part too. I'm not too worried about the cost of the extra join - it's going to be a couple dozen milliseconds at worst, I guess, and that's negligible in the bigger scheme of things (e.g. compared to how long the move_pages is expected to take). Also, it's not like having everything in the same view is free - people would have to do some sort of post-processing, and that has a cost too. So unless someone can demonstrate a use case where this would matter, I'd not worry about it too much. >> I wonder if we should have a bitmap of nodes for the buffer (but then >> what if there are multiple pages from the same node?), or maybe just an >> array of nodes, with one element per page. > > AFAIR this has been discussed back in end of January, and the > conclusion was more or less - on Discord - that everything sucks > (bitmaps, BIT() datatype, arrays,...) either from implementation or > user side, but apparently arrays [] would suck the least from > implementation side. So we could probably do something like up to > node_max_nodes(): > buffer1:..:{0, 2, 0, 0} > buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes > buffer3:..:{0, 0, 0, 2} > > Other idea is JSON or even simple string with numa_node_id<->count: > buffer1:..:"1=2" > buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes > buffer3:..:"3=2" > > I find all of those non-user friendly and I'm afraid I won't be able > to pull that alone in time... I'm -1 on JSON, I don't see how would that solve anything better than e.g. a regular array, and it's going to be harder to work with. So if we don't want to go with the 3-column view proposed earlier, I'd stick to a simple array. I don't think there's a huge difference between those two approaches, it should be easy to convert between those approaches using unnest() and array_agg(). Attached is v22, with some minor review comments: 1) I suggested we should just use "libnuma support" in configure, instead of talking about "NUMA awareness support", and AFAICS you agreed. But I still see the old text in configure ... is that intentional or a bit of forgotten text? 2) I added a couple asserts to pg_buffercache_numa_pages() and comments, and simplified a couple lines (but that's a matter of preference). 3) I don't think it's correct for pg_get_shmem_numa_allocations to just silently ignore nodes outside the valid range. I suggest we simply do elog(ERROR), as it's an internal error we don't expect to happen. regards -- Tomas Vondra
Attachment
On 4/3/25 10:23, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote: >> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> Hi Tomas, >> >>> OK, so you agree the commit messages are complete / correct? >> >> Yes. > > Not 100% sure on my side. > > === v21-0002 > > Says: > > " > This introduces three new functions: > > - pg_buffercache_init_entries > - pg_buffercache_build_tuple > - get_buffercache_tuple > " > > While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() > is). > Ah, OK. Jakub, can you correct (and double-check) this in the next version of the patch? > About v21-0002: > > === 1 > > I can see that the pg_buffercache_init_entries() helper comments are added in > v21-0003 but I think that it would be better to add them in v21-0002 > (where the helper is actually created). > +1 to that > About v21-0003: > > === 2 > >> I hear you, attached v21 / 0003 is free of float/double arithmetics >> and uses non-float point values. > > + if (buffers_per_page > 1) > + os_page_query_count = NBuffers; > + else > + os_page_query_count = NBuffers * pages_per_buffer; > > yeah, that's more elegant. I think that it properly handles the relationships > between buffer and page sizes without relying on float arithmetic. > In the review I just submitted, I changed this to os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer); but maybe it's less clear. Feel free to undo my change. > === 3 > > + if (buffers_per_page > 1) > + { > > As buffers_per_page does not change, I think I'd put this check outside of the > for (i = 0; i < NBuffers; i++) loop, something like: > > " > if (buffers_per_page > 1) { > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > for (i = 0; i < NBuffers; i++) { > . > . > . > . > } else { > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > for (i = 0; i < NBuffers; i++) { > . > . I don't know. It's a matter of opinion, but I find the current code fairly understandable. Maybe if it meaningfully reduced the code nesting, but even with the extra branch we'd still need the for loop. I'm not against doing this differently, but I'd have to see how that looks. Until then I think it's fine to have the code as is. > . > " > > === 4 > >> That _numa_prepare_ptrs() is unused and will need to be removed, >> but we can still move some code there if necessary. > > Yeah I think that it can be simply removed then. > I'm not particularly attached on having the _ptrs() function, but why couldn't it build the os_page_ptrs array as before? > === 5 > >> @Bertrand: do you have anything against pg_shm_allocations_numa >> instead of pg_shm_numa_allocations? I don't mind changing it... > > I think that pg_shm_allocations_numa() is better (given the examples you just > shared). > > === 6 > >> I find all of those non-user friendly and I'm afraid I won't be able >> to pull that alone in time... > > Maybe we could add a few words in the doc to mention the "multiple nodes per > buffer" case? And try to improve it for say 19? Also maybe we should just focus > till v21-0003 (and discard v21-0004 for 18). > IMHO it's not enough to paper this over by mentioning it in the docs. I'm a bit confused about which patch you suggest to leave out. I think 0004 is pg_shmem_allocations_numa, but I think that part is fine, no? We've been discussing how to represent nodes for buffers in 0003. I understand it'd require a bit of coding, but AFAICS adding an array would be fairly trivial amount of code. Something like values[i++] = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID)); would do, I think. But if we decide to do the 3-column view, that'd be even simpler, I think. AFAICS it means we could mostly ditch the pg_buffercache refactoring in patch 0002, and 0003 would get much simpler too I think. If Jakub doesn't have time to work on this, I can take a stab at it later today. regards -- Tomas Vondra
On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, Hi Bertrand, > On Thu, Apr 03, 2025 at 09:01:43AM +0200, Jakub Wartak wrote: [..] > === v21-0002 > While pg_buffercache_build_tuple() is not added (pg_buffercache_save_tuple() > is). Fixed > About v21-0002: > > === 1 > > I can see that the pg_buffercache_init_entries() helper comments are added in > v21-0003 but I think that it would be better to add them in v21-0002 > (where the helper is actually created). Moved > About v21-0003: > > === 2 > > > I hear you, attached v21 / 0003 is free of float/double arithmetics > > and uses non-float point values. > > + if (buffers_per_page > 1) > + os_page_query_count = NBuffers; > + else > + os_page_query_count = NBuffers * pages_per_buffer; > > yeah, that's more elegant. I think that it properly handles the relationships > between buffer and page sizes without relying on float arithmetic. Cool > === 3 > > + if (buffers_per_page > 1) > + { > > As buffers_per_page does not change, I think I'd put this check outside of the > for (i = 0; i < NBuffers; i++) loop, something like: > > " > if (buffers_per_page > 1) { > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > for (i = 0; i < NBuffers; i++) { > . > . > . > . > } else { > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > for (i = 0; i < NBuffers; i++) { > . > . > . > " Done. > === 4 > > > That _numa_prepare_ptrs() is unused and will need to be removed, > > but we can still move some code there if necessary. > > Yeah I think that it can be simply removed then. Removed. > === 5 > > > @Bertrand: do you have anything against pg_shm_allocations_numa > > instead of pg_shm_numa_allocations? I don't mind changing it... > > I think that pg_shm_allocations_numa() is better (given the examples you just > shared). OK, let's go with this name then (in v22). > === 6 > > > I find all of those non-user friendly and I'm afraid I won't be able > > to pull that alone in time... > > Maybe we could add a few words in the doc to mention the "multiple nodes per > buffer" case? And try to improve it for say 19? Right, we could also put it as a limitation. I would be happy to leave it as it must be a rare condition, but Tomas stated he's not. > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18). Do you mean discard pg_buffercache_numa (0002+0003) and instead go with pg_shm_allocations_numa (0004) ? BTW: I've noticed that Tomas responded with his v22 to this after I've solved all of the above in mine v22, so I'll drop v23 soon and then let's continue there. -J.
Hi Jakub, On Thu, Apr 03, 2025 at 02:36:57PM +0200, Jakub Wartak wrote: > On Thu, Apr 3, 2025 at 10:23 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > Right, we could also put it as a limitation. I would be happy to leave > it as it must be a rare condition, but Tomas stated he's not. > > > Also maybe we should just focus till v21-0003 (and discard v21-0004 for 18). > > Do you mean discard pg_buffercache_numa (0002+0003) and instead go > with pg_shm_allocations_numa (0004) ? No I meant the opposite: focus on 0001, 0002 and 0003 for 18. But if Tomas is confident enough to also focus in addition to 0004, that's fine too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Apr 3, 2025 at 2:15 PM Tomas Vondra <tomas@vondra.me> wrote: > Ah, OK. Jakub, can you correct (and double-check) this in the next > version of the patch? Done. > > About v21-0002: > > > > === 1 > > > > I can see that the pg_buffercache_init_entries() helper comments are added in > > v21-0003 but I think that it would be better to add them in v21-0002 > > (where the helper is actually created). > > > > +1 to that Done. > > About v21-0003: > > > > === 2 > > > >> I hear you, attached v21 / 0003 is free of float/double arithmetics > >> and uses non-float point values. > > > > + if (buffers_per_page > 1) > > + os_page_query_count = NBuffers; > > + else > > + os_page_query_count = NBuffers * pages_per_buffer; > > > > yeah, that's more elegant. I think that it properly handles the relationships > > between buffer and page sizes without relying on float arithmetic. > > > > In the review I just submitted, I changed this to > > os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer); > > but maybe it's less clear. Feel free to undo my change. Cool, thanks, will send shortly v23 with this applied. > > > === 3 > > > > + if (buffers_per_page > 1) > > + { > > > > As buffers_per_page does not change, I think I'd put this check outside of the > > for (i = 0; i < NBuffers; i++) loop, something like: > > > > " > > if (buffers_per_page > 1) { > > /* BLCKSZ < PAGESIZE: one page hosts many Buffers */ > > for (i = 0; i < NBuffers; i++) { > > . > > . > > . > > . > > } else { > > /* BLCKSZ >= PAGESIZE: Buffer occupies more than one OS page */ > > for (i = 0; i < NBuffers; i++) { > > . > > . > > I don't know. It's a matter of opinion, but I find the current code > fairly understandable. Maybe if it meaningfully reduced the code > nesting, but even with the extra branch we'd still need the for loop. > I'm not against doing this differently, but I'd have to see how that > looks. Until then I think it's fine to have the code as is. v23 will have incorporated Bertrand's idea soon. No hard feelings, but it's kind of painful to switch like that ;) > > === 4 > > > >> That _numa_prepare_ptrs() is unused and will need to be removed, > >> but we can still move some code there if necessary. > > > > Yeah I think that it can be simply removed then. > > > > I'm not particularly attached on having the _ptrs() function, but why > couldn't it build the os_page_ptrs array as before? I've removed it in v23, the code for me just didn't have flow... > > === 5 > > > >> @Bertrand: do you have anything against pg_shm_allocations_numa > >> instead of pg_shm_numa_allocations? I don't mind changing it... > > > > I think that pg_shm_allocations_numa() is better (given the examples you just > > shared). Done. > > === 6 > > > >> I find all of those non-user friendly and I'm afraid I won't be able > >> to pull that alone in time... > > > > Maybe we could add a few words in the doc to mention the "multiple nodes per > > buffer" case? And try to improve it for say 19? Also maybe we should just focus > > till v21-0003 (and discard v21-0004 for 18). > > > > IMHO it's not enough to paper this over by mentioning it in the docs. OK > I'm a bit confused about which patch you suggest to leave out. I think > 0004 is pg_shmem_allocations_numa, but I think that part is fine, no? > We've been discussing how to represent nodes for buffers in 0003. IMHO 0001 + 0004 is good. 0003 is probably the last troublemaker, but we settled on arrays right? > I understand it'd require a bit of coding, but AFAICS adding an array > would be fairly trivial amount of code. Something like > > values[i++] > = PointerGetDatum(construct_array_builtin(nodes, nodecnt, INT4OID)); > > would do, I think. But if we decide to do the 3-column view, that'd be > even simpler, I think. AFAICS it means we could mostly ditch the > pg_buffercache refactoring in patch 0002, and 0003 would get much > simpler too I think. > > If Jakub doesn't have time to work on this, I can take a stab at it > later today. I won't be able to even start on this today, so if you have cycles for please do so... -J.
On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > On 4/3/25 09:01, Jakub Wartak wrote: > > On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, Here's v23 attached (had to rework it because the you sent v22 just the moment you I wanted to send it) Change include: - your review should be incorporated - name change for shm view - pg_buffercache_numa refactor a little to keep out the if outside loops (as Bertrand wanted initially) So let's continue in this subthread. > I think a view with just 3 columns would be a good solution. It's what > pg_shmem_allocations_numa already does, so it'd be consistent with that > part too. > > I'm not too worried about the cost of the extra join - it's going to be > a couple dozen milliseconds at worst, I guess, and that's negligible in > the bigger scheme of things (e.g. compared to how long the move_pages is > expected to take). Also, it's not like having everything in the same > view is free - people would have to do some sort of post-processing, and > that has a cost too. > > So unless someone can demonstrate a use case where this would matter, > I'd not worry about it too much. OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, it's just that I don't have cycles left today and probably lack skills (i've never dealt with arrays so far) thus it would be slow to get it right... but I can pick up anything tomorrow morning. > I'm -1 on JSON, I don't see how would that solve anything better than > e.g. a regular array, and it's going to be harder to work with. So if we > don't want to go with the 3-column view proposed earlier, I'd stick to a > simple array. I don't think there's a huge difference between those two > approaches, it should be easy to convert between those approaches using > unnest() and array_agg(). > > Attached is v22, with some minor review comments: > > 1) I suggested we should just use "libnuma support" in configure, > instead of talking about "NUMA awareness support", and AFAICS you > agreed. But I still see the old text in configure ... is that > intentional or a bit of forgotten text? It was my bad, too many rebases and mishaps with git voodoo.. > 2) I added a couple asserts to pg_buffercache_numa_pages() and comments, > and simplified a couple lines (but that's a matter of preference). Great, thanks. > 3) I don't think it's correct for pg_get_shmem_numa_allocations to just > silently ignore nodes outside the valid range. I suggest we simply do > elog(ERROR), as it's an internal error we don't expect to happen. It's there too. -J.
Attachment
On 4/3/25 15:12, Jakub Wartak wrote: > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > >> ... >> >> So unless someone can demonstrate a use case where this would matter, >> I'd not worry about it too much. > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, > it's just that I don't have cycles left today and probably lack skills > (i've never dealt with arrays so far) thus it would be slow to get it > right... but I can pick up anything tomorrow morning. > OK, I took a stab at reworking/simplifying this the way I proposed. Here's v24 - needs more polishing, but hopefully enough to show what I had in mind. It does these changes: 1) Drops 0002 with the pg_buffercache refactoring, because the new view is not "extending" the existing one. 2) Reworks pg_buffercache_num to return just three columns, bufferid, page_num and node_id. page_num is a sequence starting from 0 for each buffer. 3) It now builds an array of records, with one record per buffer/page. 4) I realized we don't really need to worry about buffers_per_page very much, except for logging/debugging. There's always "at least one page" per buffer, even if an incomplete one, so we can do this: os_page_count = NBuffers * Max(1, pages_per_buffer); and then for (i = 0; i < NBuffers; i++) { for (j = 0; j < Max(1, pages_per_buffer); j++) { .. } } and everything just works fine, I think. Opinions? I personally find this much cleaner / easier to understand. regards -- Tomas Vondra
Attachment
Hi, On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: > On 4/3/25 15:12, Jakub Wartak wrote: > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > > >> ... > >> > >> So unless someone can demonstrate a use case where this would matter, > >> I'd not worry about it too much. > > > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, > > it's just that I don't have cycles left today and probably lack skills > > (i've never dealt with arrays so far) thus it would be slow to get it > > right... but I can pick up anything tomorrow morning. > > > > OK, I took a stab at reworking/simplifying this the way I proposed. > Here's v24 - needs more polishing, but hopefully enough to show what I > had in mind. > > It does these changes: > > 1) Drops 0002 with the pg_buffercache refactoring, because the new view > is not "extending" the existing one. I think that makes sense. One would just need to join on the pg_buffercache view to get more information about the buffer if needed. The pg_buffercache_numa_pages() doc needs an update though as I don't think that "+ The <function>pg_buffercache_numa_pages()</function> provides the same information as <function>pg_buffercache_pages()</function>" is still true. > 2) Reworks pg_buffercache_num to return just three columns, bufferid, > page_num and node_id. page_num is a sequence starting from 0 for each > buffer. +1 on the idea > 3) It now builds an array of records, with one record per buffer/page. > > 4) I realized we don't really need to worry about buffers_per_page very > much, except for logging/debugging. There's always "at least one page" > per buffer, even if an incomplete one, so we can do this: > > os_page_count = NBuffers * Max(1, pages_per_buffer); > > and then > > for (i = 0; i < NBuffers; i++) > { > for (j = 0; j < Max(1, pages_per_buffer); j++) That's a nice simplification as we always need to take care of at least one page per buffer. > and everything just works fine, I think. I think the same. > Opinions? I personally find this much cleaner / easier to understand. I agree that's easier to understand and that that looks correct. A few random comments: === 1 It looks like that firstNumaTouch is not set to false anymore. === 2 + pfree(os_page_status); + pfree(os_page_ptrs); Not sure that's needed, we should be in a short-lived memory context here (ExprContext or such). === 3 + ro_volatile_var = *(uint64 *)ptr space missing before "ptr"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: > > On 4/3/25 15:12, Jakub Wartak wrote: > > > On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > > > > >> ... > > >> > > >> So unless someone can demonstrate a use case where this would matter, > > >> I'd not worry about it too much. > > > > > > OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, > > > it's just that I don't have cycles left today and probably lack skills > > > (i've never dealt with arrays so far) thus it would be slow to get it > > > right... but I can pick up anything tomorrow morning. > > > > > > > OK, I took a stab at reworking/simplifying this the way I proposed. > > Here's v24 - needs more polishing, but hopefully enough to show what I > > had in mind. > > > > It does these changes: > > > > 1) Drops 0002 with the pg_buffercache refactoring, because the new view > > is not "extending" the existing one. > > I think that makes sense. One would just need to join on the pg_buffercache > view to get more information about the buffer if needed. > > The pg_buffercache_numa_pages() doc needs an update though as I don't think that > "+ The <function>pg_buffercache_numa_pages()</function> provides the same > information as <function>pg_buffercache_pages()</function>" is still true. > > > 2) Reworks pg_buffercache_num to return just three columns, bufferid, > > page_num and node_id. page_num is a sequence starting from 0 for each > > buffer. > > +1 on the idea > > > 3) It now builds an array of records, with one record per buffer/page. > > > > 4) I realized we don't really need to worry about buffers_per_page very > > much, except for logging/debugging. There's always "at least one page" > > per buffer, even if an incomplete one, so we can do this: > > > > os_page_count = NBuffers * Max(1, pages_per_buffer); > > > > and then > > > > for (i = 0; i < NBuffers; i++) > > { > > for (j = 0; j < Max(1, pages_per_buffer); j++) > > That's a nice simplification as we always need to take care of at least one page > per buffer. > > > and everything just works fine, I think. > > I think the same. > > > Opinions? I personally find this much cleaner / easier to understand. > > I agree that's easier to understand and that that looks correct. > > A few random comments: > > === 1 > > It looks like that firstNumaTouch is not set to false anymore. > > === 2 > > + pfree(os_page_status); > + pfree(os_page_ptrs); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). > > === 3 > > + ro_volatile_var = *(uint64 *)ptr > > space missing before "ptr"? > +my feedback as I've noticed that Bertrand already provided a review. Right, the code is now simple , and that Max() is brilliant. I've attached some review findings as .txt 0001 100%LGTM 0002 doc fix + pgident + Tomas, you should take Authored-by yourself there for sure, I couldn't pull this off alone in time! So big thanks! 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have NUMA on ming32...:)) When started with interleave=all on serious hardware, I'm getting (~5s for s_b=64GB) from pg_buffercache_numa node_id | count ---------+--------- 3 | 2097152 0 | 2097152 2 | 2097152 1 | 2097152 so this is valid result (2097152*4 numa nodes*8192 buffer size/1024/1024/1024 = 64GB) Also with pgbench -i -s 20 , after ~8s: select c.relname, n.node_id, count(*) from pg_buffercache_numa n join pg_buffercache b on (b.bufferid = n.bufferid) join pg_class c on (c.relfilenode = b.relfilenode) group by c.relname, n.node_id order by count(*) desc; relname | node_id | count -----------------------------------------------+---------+------- pgbench_accounts | 2 | 8217 pgbench_accounts | 0 | 8190 pgbench_accounts | 3 | 8189 pgbench_accounts | 1 | 8187 pg_statistic | 2 | 32 pg_operator | 2 | 14 pg_depend | 3 | 14 [..] pg_shm_allocations_numa also looks good. I think v24+tiny fixes is good enough to go in. -J.
Attachment
On 4/4/25 09:35, Jakub Wartak wrote: > On Fri, Apr 4, 2025 at 8:50 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: >>> On 4/3/25 15:12, Jakub Wartak wrote: >>>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> >>>>> ... >>>>> >>>>> So unless someone can demonstrate a use case where this would matter, >>>>> I'd not worry about it too much. >>>> >>>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, >>>> it's just that I don't have cycles left today and probably lack skills >>>> (i've never dealt with arrays so far) thus it would be slow to get it >>>> right... but I can pick up anything tomorrow morning. >>>> >>> >>> OK, I took a stab at reworking/simplifying this the way I proposed. >>> Here's v24 - needs more polishing, but hopefully enough to show what I >>> had in mind. >>> >>> It does these changes: >>> >>> 1) Drops 0002 with the pg_buffercache refactoring, because the new view >>> is not "extending" the existing one. >> >> I think that makes sense. One would just need to join on the pg_buffercache >> view to get more information about the buffer if needed. >> >> The pg_buffercache_numa_pages() doc needs an update though as I don't think that >> "+ The <function>pg_buffercache_numa_pages()</function> provides the same >> information as <function>pg_buffercache_pages()</function>" is still true. >> >>> 2) Reworks pg_buffercache_num to return just three columns, bufferid, >>> page_num and node_id. page_num is a sequence starting from 0 for each >>> buffer. >> >> +1 on the idea >> >>> 3) It now builds an array of records, with one record per buffer/page. >>> >>> 4) I realized we don't really need to worry about buffers_per_page very >>> much, except for logging/debugging. There's always "at least one page" >>> per buffer, even if an incomplete one, so we can do this: >>> >>> os_page_count = NBuffers * Max(1, pages_per_buffer); >>> >>> and then >>> >>> for (i = 0; i < NBuffers; i++) >>> { >>> for (j = 0; j < Max(1, pages_per_buffer); j++) >> >> That's a nice simplification as we always need to take care of at least one page >> per buffer. >> >>> and everything just works fine, I think. >> >> I think the same. >> >>> Opinions? I personally find this much cleaner / easier to understand. >> >> I agree that's easier to understand and that that looks correct. >> >> A few random comments: >> >> === 1 >> >> It looks like that firstNumaTouch is not set to false anymore. >> >> === 2 >> >> + pfree(os_page_status); >> + pfree(os_page_ptrs); >> >> Not sure that's needed, we should be in a short-lived memory context here >> (ExprContext or such). >> >> === 3 >> >> + ro_volatile_var = *(uint64 *)ptr >> >> space missing before "ptr"? >> > > +my feedback as I've noticed that Bertrand already provided a review. > > Right, the code is now simple , and that Max() is brilliant. I've > attached some review findings as .txt > > 0001 100%LGTM > 0002 doc fix + pgident + Tomas, you should take Authored-by yourself > there for sure, I couldn't pull this off alone in time! So big thanks! > 0003 fixes elog UINT64_FORMAT for ming32 (a little bit funny to have > NUMA on ming32...:)) > OK > When started with interleave=all on serious hardware, I'm getting (~5s > for s_b=64GB) from pg_buffercache_numa > > node_id | count > ---------+--------- > 3 | 2097152 > 0 | 2097152 > 2 | 2097152 > 1 | 2097152 > > so this is valid result (2097152*4 numa nodes*8192 buffer > size/1024/1024/1024 = 64GB) > > Also with pgbench -i -s 20 , after ~8s: > select c.relname, n.node_id, count(*) from pg_buffercache_numa n > join pg_buffercache b on (b.bufferid = n.bufferid) > join pg_class c on (c.relfilenode = b.relfilenode) > group by c.relname, n.node_id order by count(*) desc; > relname | node_id | count > -----------------------------------------------+---------+------- > pgbench_accounts | 2 | 8217 > pgbench_accounts | 0 | 8190 > pgbench_accounts | 3 | 8189 > pgbench_accounts | 1 | 8187 > pg_statistic | 2 | 32 > pg_operator | 2 | 14 > pg_depend | 3 | 14 > [..] > > pg_shm_allocations_numa also looks good. > > I think v24+tiny fixes is good enough to go in. > OK. Do you have any suggestions regarding the column names in the new view? I'm not sure I like node_id and page_num. regards -- Tomas Vondra
On 4/4/25 08:50, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 03, 2025 at 08:53:57PM +0200, Tomas Vondra wrote: >> On 4/3/25 15:12, Jakub Wartak wrote: >>> On Thu, Apr 3, 2025 at 1:52 PM Tomas Vondra <tomas@vondra.me> wrote: >>> >>>> ... >>>> >>>> So unless someone can demonstrate a use case where this would matter, >>>> I'd not worry about it too much. >>> >>> OK, fine for me - just 3 cols for pg_buffercache_numa is fine for me, >>> it's just that I don't have cycles left today and probably lack skills >>> (i've never dealt with arrays so far) thus it would be slow to get it >>> right... but I can pick up anything tomorrow morning. >>> >> >> OK, I took a stab at reworking/simplifying this the way I proposed. >> Here's v24 - needs more polishing, but hopefully enough to show what I >> had in mind. >> >> It does these changes: >> >> 1) Drops 0002 with the pg_buffercache refactoring, because the new view >> is not "extending" the existing one. > > I think that makes sense. One would just need to join on the pg_buffercache > view to get more information about the buffer if needed. > > The pg_buffercache_numa_pages() doc needs an update though as I don't think that > "+ The <function>pg_buffercache_numa_pages()</function> provides the same > information as <function>pg_buffercache_pages()</function>" is still true. > Right, thanks for checking the docs. >> 2) Reworks pg_buffercache_num to return just three columns, bufferid, >> page_num and node_id. page_num is a sequence starting from 0 for each >> buffer. > > +1 on the idea > >> 3) It now builds an array of records, with one record per buffer/page. >> >> 4) I realized we don't really need to worry about buffers_per_page very >> much, except for logging/debugging. There's always "at least one page" >> per buffer, even if an incomplete one, so we can do this: >> >> os_page_count = NBuffers * Max(1, pages_per_buffer); >> >> and then >> >> for (i = 0; i < NBuffers; i++) >> { >> for (j = 0; j < Max(1, pages_per_buffer); j++) > > That's a nice simplification as we always need to take care of at least one page > per buffer. > OK. I think I'll consider moving some of this code "building" the entries into a separate function, to keep the main function easier to understand. >> and everything just works fine, I think. > > I think the same. > >> Opinions? I personally find this much cleaner / easier to understand. > > I agree that's easier to understand and that that looks correct. > > A few random comments: > > === 1 > > It looks like that firstNumaTouch is not set to false anymore. > Damn, my mistake. > === 2 > > + pfree(os_page_status); > + pfree(os_page_ptrs); > > Not sure that's needed, we should be in a short-lived memory context here > (ExprContext or such). > Yeah, maybe. It's not allocated in the multi-call context, but I wasn't sure. Will check. > === 3 > > + ro_volatile_var = *(uint64 *)ptr > > space missing before "ptr"? > Interesting the pgindent didn't tweak this. regards -- Tomas Vondra
On Fri, Apr 4, 2025 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, > Do you have any suggestions regarding the column names in the new view? > I'm not sure I like node_id and page_num. They actually look good to me. We've discussed earlier dropping s/numa_//g for column names (after all views contain it already) so they are fine in this regard. There's also the question of consistency: (bufferid, page_num, node_id) -- maybe should just drop "_" and that's it? Well I would even possibly consider page_num -> ospagenumber, but that's ugly. -J.
OK, here's v25 after going through the patches once more, fixing the issues mentioned by Bertrand, etc. I think 0001 and 0002 are fine, I have a couple minor questions about 0003. 0002 ---- - Adds the new types to typedefs.list, to make pgindent happy. - Improves comment for pg_buffercache_numa_pages - Minor formatting tweaks. - I was wondering if maybe we should have some "global ID" of memory page, so that with large memory pages it's indicated the buffers are on the same memory page. Right now each buffer starts page_num from 0, but it should not be very hard to have a global counter. Opinions? 0003 ---- - Minor formatting tweaks, comment improvements. - Isn't this comment a bit confusing / misleading? /* Get number of OS aligned pages */ AFAICS the point is to adjust the allocated_size to be a multiple of os-page_size, to get "all" memory pages the segment uses. But that's not what I understand by "aligned page" (which is about there the page is expected to start). Or did I get this wrong? - There's a comment at the end which talks about "ignored segments". IMHO that type of information should be in the function comment, but I'm also not quite sure I understand what "output shared memory" is ... regards -- Tomas Vondra