
Re: Draft for basic NUMA observability - Mailing list pgsql-hackers

From	Jakub Wartak
Subject	Re: Draft for basic NUMA observability
Date	April 8 00:01:17
Msg-id	CAKZiRmwt7t0wyLwhUKiWchgdpJfemW-ae+7x_MdW-CN1gbfbqA@mail.gmail.com
In response to	Re: Draft for basic NUMA observability (Tomas Vondra <tomas@vondra.me>)
Responses	Re: Draft for basic NUMA observability
List	pgsql-hackers


On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <tomas@vondra.me> wrote:

> > So it looks like that the new way to iterate on the buffers that has been introduced
> > in v26/v27 has some issue?
> >
>
> Yeah, the calculations of the end pointers were wrong - we need to round
> up (using TYPEALIGN()) when calculating number of pages, and just add
> BLCKSZ (without any rounding) when calculating end of buffer. The 0004
> fixes this for me (I tried this with various blocksizes / page sizes).
>
> Thanks for noticing this!

Hi,

v28-0001 LGTM
v28-0002 got the warning Andres was talking about, so LGTM
v28-0003 (pg_buffercache_numa now), LGTM, but for quite some time I
*thought* we had a second bug there. Instead, it appears that PG has
never properly aligned the whole s_b to os_page_size (HP). Thus we
cannot assume that count(*) from pg_buffercache_numa == count(*) from
pg_buffercache.

So, before anybody else reports this as a bug about duplicate bufferids:

# select * from pg_buffercache_numa where os_page_num <= 2;
 bufferid | os_page_num | numa_node
----------+-------------+-----------
[..]
      195 |           0 |         0
      196 |           0 |         0 <-- duplicate?
      196 |           1 |         0 <-- duplicate?
      197 |           1 |         0
      198 |           1 |         0

That looks strange because at first glance one could assume that we get
257 blocks of 8192 bytes per os_page (2^21 bytes) that way, which is
impossible, as only 256 fit. Some exercises in pointers show this:

# select * from pg_buffercache_numa where os_page_num <= 2;
DEBUG:  NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
-- custom elog(DEBUG1)
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real buffptr=0x7f8661079000
[..]
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real buffptr=0x7f86611fb000
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real buffptr=0x7f86611fd000
DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real buffptr=0x7f86611ff000 (!)
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661000000 endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real buffptr=0x7f86611ff000 (!)
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real buffptr=0x7f8661201000
DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000 endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real buffptr=0x7f8661203000

So we have bufferid=196 with buffptr=0x7f86611ff000 that is 8kB big
(and ends at 0x7f8661201000), while the HP that hosts its start lies
between 0x7f8661000000 and 0x7f8661200000. So buffer 196 spans 2
hugepages. An open question for another day (outside of this $thread,
of course): should we align s_b to the HP size or not? As per above,
even bufferid=1 sits at 0x7f8661079000 while the page starts at
0x7f8661000000 (that's a 495616 byte difference).
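
For anyone who wants to redo the pointer arithmetic above, here is a
minimal standalone sketch. It is not code from the patch: the helper
names (page_align_down, os_pages_touched, buffer_addr) are made up for
illustration, and it assumes a 64-bit build, a power-of-two OS page
size, and buffers laid out back to back starting at the address logged
for bufferid=1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a PostgreSQL function): round ptr down to a
 * multiple of page_size. page_size must be a power of two.
 */
static uintptr_t
page_align_down(uintptr_t ptr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ptr & ~(page_size - 1);
}

/*
 * Number of OS pages touched by the byte range [start, start + len).
 * The last byte of the range is start + len - 1, hence the "- 1".
 */
static size_t
os_pages_touched(uintptr_t start, uintptr_t len, uintptr_t page_size)
{
    uintptr_t first = page_align_down(start, page_size);
    uintptr_t last = page_align_down(start + len - 1, page_size);

    return (size_t) ((last - first) / page_size) + 1;
}

/*
 * Address of the buffer with the given 1-based bufferid, ASSUMING the
 * buffers are contiguous and first_buf is the address of bufferid=1.
 */
static uintptr_t
buffer_addr(uintptr_t first_buf, uintptr_t blcksz, uintptr_t bufferid)
{
    return first_buf + (bufferid - 1) * blcksz;
}
```

With the values from the log (first buffer at 0x7f8661079000, 8192 byte
blocks, 2097152 byte pages), buffer_addr() for bufferid=196 gives
0x7f86611ff000, and os_pages_touched() reports 2 pages for that buffer
but 1 page for bufferids 195 and 197. Applied to the whole range of
16384 buffers, the same arithmetic gives 65 pages, which matches the
logged os_page_count.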

-J.
