From: Bertrand Drouvot
Subject: Re: Draft for basic NUMA observability
Msg-id: Z64Pr8CTG0RTrGR3@ip-10-97-1-34.eu-west-3.compute.internal
List: pgsql-hackers
Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time
> ago, I'm attaching the very brief (and potentially way too rushed)
> draft of the first step into NUMA observability on PostgreSQL that was
> based on his presentation [0]. It might be rough, but it is to get us
> started. The patches were not really even basically tested, they are
> more like input for discussion - rather than solid code - to shake out
> what should be the proper form of this.
>
> Right now it gives:
>
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by
> numa_zone_id;
> NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch!

Not doing a code review, but sharing some experimentation.

First, I had to change:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
     Size        os_page_size;
     void      **os_page_ptrs;
     int        *os_pages_status;
-    int         os_page_count;
+    uint64      os_page_count;

and

-    os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+    os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;

to make it work with non-tiny shared_buffers (NBuffers * BLCKSZ overflows a
32-bit int once shared_buffers exceeds 2GB).

Observations when using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'.

Session 2 does nothing but runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'.

I see a lot of '-2' for numa_zone_id in session 2: pages appear as unmapped
when viewed from a process that has not accessed them, even though the same
pages are reported as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, ...).

Let me show the results.

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580

In session 2:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session 1's pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE:  os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

The results show:

- a correct NUMA distribution in session 1;
- a correct NUMA distribution in session 2 only when using
  pg_buffercache_pages_from_pid() with the pid of session 1 (the session that
  actually accessed the buffers) as a parameter.

Which makes me wonder if using numa_move_pages()/move_pages() is the right
approach.
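For context, here is roughly the move_pages() usage the experiment relies on,
reduced to a minimal standalone sketch (not the actual patch code: the shared
memory attach and the buffer-to-page mapping are left out, and the helper name
is made up):

#include <numaif.h>             /* move_pages(); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Ask the kernel on which NUMA node each OS page of a memory range lives,
 * as seen from process "pid".  Passing nodes = NULL turns move_pages()
 * into a pure query: nothing is moved, and status[i] receives the node id
 * of page i, or a negative errno such as -2 (-ENOENT) when the page is
 * not present/mapped for that process.
 */
static void
report_numa_placement(pid_t pid, char *start, size_t len, size_t os_page_size)
{
    size_t      npages = len / os_page_size;
    void      **pages = malloc(npages * sizeof(void *));
    int        *status = malloc(npages * sizeof(int));
    size_t      i;

    for (i = 0; i < npages; i++)
        pages[i] = start + i * os_page_size;

    if (move_pages(pid, npages, pages, NULL, status, 0) == 0)
    {
        for (i = 0; i < npages; i++)
            printf("page %zu: node %d\n", i, status[i]);
    }
    else
        perror("move_pages");

    free(pages);
    free(status);
}

(Also worth keeping in mind: with a non-zero pid, move_pages() needs
appropriate privileges over the target process, which could matter for a
monitoring function.)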
Would be curious to know if you observe the same behavior though.

The initial idea that you shared on Discord was to use get_mempolicy(), but as
Andres stated:

"
One annoying thing about get_mempolicy() is this:

  If no page has yet been allocated for the specified address,
  get_mempolicy() will allocate a page as if the thread had performed a
  read (load) access to that address, and return the ID of the node where
  that page was allocated.

Forcing the allocation to happen inside a monitoring function is decidedly
not great.
"

The man page looks correct: I verified with "perf record -e
page-faults,kmem:mm_page_alloc -p <pid>" that pages are indeed allocated
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers, i.e.
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)) (see the rough sketch
in the PS below). Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
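PS: a very rough, untested sketch of what I have in mind for the
get_mempolicy() idea (the helper name is made up, it probes a single OS page
address for illustration, and the buf_state check on BM_VALID / BM_TAG_VALID
would have to happen in the pg_buffercache loop before calling it):

#include <numaif.h>     /* get_mempolicy(), MPOL_F_NODE, MPOL_F_ADDR; -lnuma */

/*
 * Return the NUMA node backing the page at "addr", or -1 on error.
 * With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() reports the node the
 * page is allocated on, but it also allocates the page on first touch,
 * which is why we would only call it for buffers whose header shows
 * BM_VALID and BM_TAG_VALID.
 */
static int
buffer_numa_node(void *addr)
{
    int         node = -1;

    if (get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR) != 0)
        return -1;

    return node;
}

The assumption being that a valid buffer has already been touched by some
backend, so the underlying shared memory page should already exist and the
call should not trigger a fresh allocation.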