Re: Draft for basic NUMA observability - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: Draft for basic NUMA observability |
Date | |
Msg-id | CAKZiRmwt7t0wyLwhUKiWchgdpJfemW-ae+7x_MdW-CN1gbfbqA@mail.gmail.com |
In response to | Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Draft for basic NUMA observability
|
List | pgsql-hackers |
On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:

> > So it looks like that the new way to iterate on the buffers that has been introduced
> > in v26/v27 has some issue?
> >
>
> Yeah, the calculations of the end pointers were wrong - we need to round
> up (using TYPEALIGN()) when calculating number of pages, and just add
> BLCKSZ (without any rounding) when calculating end of buffer. The 0004
> fixes this for me (I tried this with various blocksizes / page sizes).
>
> Thanks for noticing this!

Hi,

v28-0001 LGTM

v28-0002 got this warning Andres was talking about, so LGTM

v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite some time that we have a 2nd bug there. It appears instead that PG never properly aligned the whole s_b to os_page_size (HP)? ... Thus we cannot assume count(*) pg_buffercache_numa == count(*) pg_buffercache.

So, before anybody else reports this as a bug about duplicate bufferids:

    # select * from pg_buffercache_numa where os_page_num <= 2;
     bufferid | os_page_num | numa_node
    ----------+-------------+-----------
    [..]
          195 |           0 |         0
          196 |           0 |         0  <-- duplicate?
          196 |           1 |         0  <-- duplicate?
          197 |           1 |         0
          198 |           1 |         0

That is strange, because at first look one could assume we get 257x 8192 blocks per os_page (2^21) that way, which is impossible. Exercises in pointers show this:

    # select * from pg_buffercache_numa where os_page_num <= 2;
    DEBUG: NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
    DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
    -- custom elog(DEBUG1)
    DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real buffptr=0x7f8661079000
    [..]
    DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real buffptr=0x7f86611fb000
    DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real buffptr=0x7f86611fd000
    DEBUG: ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real buffptr=0x7f86611ff000 (!)
    DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real buffptr=0x7f86611ff000 (!)
    DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real buffptr=0x7f8661201000
    DEBUG: ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real buffptr=0x7f8661203000

So we have bufferid=196 with buffptr=0x7f86611ff000 that is 8kB big (and ends at 0x7f8661201000), while the HP that hosts it lies between 0x7f8661000000 and 0x7f8661200000. So buffer 196 spans 2 hugepages.

An open question for another day (outside of this $thread, of course): shouldn't we align s_b to the HP size? As shown above, even bufferid=1 sits at 0x7f8661079000 while the page starts at 0x7f8661000000 (that's a 495616-byte difference).

-J.
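
The pointer arithmetic above can be double-checked with a small sketch. This is plain Python, not PostgreSQL code: the constants are copied from the DEBUG output, the function names are hypothetical, and it assumes buffers are laid out contiguously starting at the address reported for bufferid=1.

```python
# Constants copied from the DEBUG output above; all names are illustrative.
BLCKSZ = 8192                       # PostgreSQL block size in this run
OS_PAGE_SIZE = 2097152              # os_page_size (2 MB huge page)
FIRST_PAGE_START = 0x7f8661000000   # start of the huge page hosting bufferid=1
FIRST_BUFFER_PTR = 0x7f8661079000   # "real buffptr" reported for bufferid=1


def buffer_range(bufferid):
    """Return (start, end) addresses of a buffer; end is exclusive."""
    start = FIRST_BUFFER_PTR + (bufferid - 1) * BLCKSZ
    return start, start + BLCKSZ


def os_pages_spanned(bufferid):
    """Return the OS page numbers touched by any byte of the buffer."""
    start, end = buffer_range(bufferid)
    first = (start - FIRST_PAGE_START) // OS_PAGE_SIZE
    last = (end - 1 - FIRST_PAGE_START) // OS_PAGE_SIZE  # last byte, not end
    return list(range(first, last + 1))


if __name__ == "__main__":
    for bufferid in (195, 196, 197):
        start, end = buffer_range(bufferid)
        print(bufferid, hex(start), hex(end), os_pages_spanned(bufferid))
```

Under these assumptions, bufferid=196 is the first buffer whose 8kB range crosses a 2MB boundary, which is why it shows up once for os_page_num=0 and once for os_page_num=1.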