Thread: Add os_page_num to pg_buffercache

Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

10 April, 16:17:55

Hi hackers,

I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
thought that seeing how buffers are spread across multiple OS pages (if that's
the case) thanks to the os_page_num field is good information to have.

The thing that I think is annoying is that to get this information (os_page_num):

- One needs to use pg_buffercache_numa (which is more costly/slower) than pg_buffercache
- One needs a system with NUMA support enabled

So why not also add this information (os_page_num) in pg_buffercache?

- It would make this information available on all systems, not just NUMA-enabled ones
- It would help understand the memory layout implications of configuration changes
such as database block size, OS page size (huge pages for example) and see how the
buffers are spread across OS pages (if that's the case).

So, please find attached a patch to $SUBJECT then.

Remarks:

- Maybe we could create a helper function to reduce the code duplication between
pg_buffercache_pages() and pg_buffercache_numa_pages()
- I think it would have made sense to also add this information while working
on ba2a3c2302f but (unfortunately) I doubt that this patch is candidate for v18
post freeze (it looks more a feature enhancement than anything else)
- It's currently doing the changes in pg_buffercache v1.6 but will need to
create v1.7 for 19 (if the above stands true)

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v1-0001-Add-os_page_num-to-pg_buffercache.patch

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

10 April, 16:35:24


On 4/10/25 15:17, Bertrand Drouvot wrote:
> Hi hackers,
> 
> I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
> thought that seeing how buffers are spread across multiple OS pages (if that's
> the case) thanks to the os_page_num field is good information to have.
> 
> The thing that I think is annoying is that to get this information (os_page_num):
> 
> - One needs to use pg_buffercache_numa (which is more costly/slower) than pg_buffercache
> - One needs a system with NUMA support enabled
> 
> So why not also add this information (os_page_num) in pg_buffercache?
> 
> - It would make this information available on all systems, not just NUMA-enabled ones
> - It would help understand the memory layout implications of configuration changes
> such as database block size, OS page size (huge pages for example) and see how the
> buffers are spread across OS pages (if that's the case).
> 
> So, please find attached a patch to $SUBJECT then.
> 
> Remarks:
> 
> - Maybe we could create a helper function to reduce the code duplication between
> pg_buffercache_pages() and pg_buffercache_numa_pages()
> - I think it would have made sense to also add this information while working
> on ba2a3c2302f but (unfortunately) I doubt that this patch is candidate for v18
> post freeze (it looks more a feature enhancement than anything else)
> - It's currently doing the changes in pg_buffercache v1.6 but will need to
> create v1.7 for 19 (if the above stands true)
> 

This seems like a good idea in principle, but at this point it has to
wait for PG19. Please add it to the July commitfest.


regards

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Nathan Bossart

Date:

10 April, 17:58:18

On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> This seems like a good idea in principle, but at this point it has to
> wait for PG19. Please add it to the July commitfest.

+1.  From a glance, this seems to fall in the "new feature" bucket and
should likely wait for v19.

-- 
nathan

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

10 April, 18:05:29

Hi,

On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > This seems like a good idea in principle, but at this point it has to
> > wait for PG19. Please add it to the July commitfest.
> 
> +1.  From a glance, this seems to fall in the "new feature" bucket and
> should likely wait for v19.

Thank you both for providing your thoughts that confirm my initial doubt. Let's
come back to that one later then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

01 July, 16:45:55

Hi,

On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> > On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > > This seems like a good idea in principle, but at this point it has to
> > > wait for PG19. Please add it to the July commitfest.
> > 
> > +1.  From a glance, this seems to fall in the "new feature" bucket and
> > should likely wait for v19.
> 
> Thank you both for providing your thoughts that confirm my initial doubt. Let's
> come back to that one later then.
> 

Here we are.

Please find attached a rebased version and while at it, v2 adds a new macro and
a function to avoid some code duplication between pg_buffercache_pages() and
pg_buffercache_numa_pages().

So, PFA:

0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries

Those new macro and function are extracted from pg_buffercache_numa_pages() and
pg_buffercache_numa_pages() makes use of them.

0002 - Add os_page_num to pg_buffercache

Making use of the new macro and function from 0001.

As it's for v19, also bumping pg_buffercache's version to 1.7.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

01 July, 17:31:01

On 7/1/25 15:45, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
>> Hi,
>>
>> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
>>> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
>>>> This seems like a good idea in principle, but at this point it has to
>>>> wait for PG19. Please add it to the July commitfest.
>>>
>>> +1.  From a glance, this seems to fall in the "new feature" bucket and
>>> should likely wait for v19.
>>
>> Thank you both for providing your thoughts that confirm my initial doubt. Let's
>> come back to that one later then.
>>
> 
> Here we are.
> 
> Please find attached a rebased version and while at it, v2 adds a new macro and
> a function to avoid some code duplication between pg_buffercache_pages() and
> pg_buffercache_numa_pages().
> 
> So, PFA:
> 
> 0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries
> 
> Those new macro and function are extracted from pg_buffercache_numa_pages() and
> pg_buffercache_numa_pages() makes use of them.
> 
> 0002 - Add os_page_num to pg_buffercache
> 
> Making use of the new macro and function from 0001.
> 
> As it's for v19, also bumping pg_buffercache's version to 1.7.
> 

Thanks for the updated patch!

I took a quick look on this, and I doubt we want to change the schema of
pg_buffercache like this. Adding columns is fine, but it seems rather
wrong to change the cardinality. The view is meant to be 1:1 mapping for
buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
page), to be precise.

I think this will break a lot of monitoring queries, and possibly in a
very subtle way - especially on systems with huge pages, where most
buffers will have one row, but then a buffer that happens to be split on
two pages will have two rows. That seems not great.

Just look at the changes needed in regression tests, where the first
test now needs to be

  -select count(*) = (select setting::bigint
  +select count(*) >= (select setting::bigint
                    from pg_settings
                    where name = 'shared_buffers')

which seems like a much weaker check.

IMHO it'd be better to have a new view for this info, something like
pg_buffercache_pages, or something like that.

But I'm also starting to question if the patch really is that useful.
Sure, people may not have NUMA support enabled (e.g. on non-linux
platforms), and even if they do the _numa view is quite expensive.

But I don't recall ever asking how the buffers map to memory pages. At
least not before/without the NUMA stuff.

regards

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

01 July, 19:34:56

Hi,

On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
> On 7/1/25 15:45, Bertrand Drouvot wrote:
> 
> I took a quick look on this,

Thanks for looking at it!

> and I doubt we want to change the schema of
> pg_buffercache like this. Adding columns is fine, but it seems rather
> wrong to change the cardinality. The view is meant to be 1:1 mapping for
> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
> page), to be precise.
> 
> I think this will break a lot of monitoring queries, and possibly in a
> very subtle way - especially on systems with huge pages, where most
> buffers will have one row, but then a buffer that happens to be split on
> two pages will have two rows. That seems not great.
> 
> IMHO it'd be better to have a new view for this info, something like
> pg_buffercache_pages, or something like that.

That's a good point, fully agree!

> But I'm also starting to question if the patch really is that useful.
> Sure, people may not have NUMA support enabled (e.g. on non-linux
> platforms), and even if they do the _numa view is quite expensive.
> 

Yeah, it's not for day to day activities, more for configuration testing and
also for development activity/testing.

For example, If I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
expect to see buffers not spread across pages.

But what I can see is:

SELECT
    pages_per_buffer,
    COUNT(*) as buffer_count
FROM (
    SELECT bufferid, COUNT(*) as pages_per_buffer
    FROM pg_buffercache
    GROUP BY bufferid
) subq
GROUP BY pages_per_buffer
ORDER BY pages_per_buffer;

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       261120
                2 |         1024

This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.

If I change it to:

BufferManagerShmemInit(void)

        /* Align buffer pool on IO page size boundary. */
        BufferBlocks = (char *)
-               TYPEALIGN(PG_IO_ALIGN_SIZE,
+               TYPEALIGN(2 * 1024 * 1024,
                                  ShmemInitStruct("Buffer Blocks",
-                                                                 NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+                                                                 NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
                                                                  &foundBufs));

Then I get:

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       262144
(1 row)


So we've been able to see that some buffers were spread across pages due to 
shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment
to be set to 2MB then I don't see any buffers spread across pages anymore.

I think that it helps "visualize" some configuration or code changes.

What are your thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

01 July, 19:45:37

On 7/1/25 18:34, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
>> On 7/1/25 15:45, Bertrand Drouvot wrote:
>>
>> I took a quick look on this,
> 
> Thanks for looking at it!
> 
>> and I doubt we want to change the schema of
>> pg_buffercache like this. Adding columns is fine, but it seems rather
>> wrong to change the cardinality. The view is meant to be 1:1 mapping for
>> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
>> page), to be precise.
>>
>> I think this will break a lot of monitoring queries, and possibly in a
>> very subtle way - especially on systems with huge pages, where most
>> buffers will have one row, but then a buffer that happens to be split on
>> two pages will have two rows. That seems not great.
>>
>> IMHO it'd be better to have a new view for this info, something like
>> pg_buffercache_pages, or something like that.
> 
> That's a good point, fully agree!
> 
>> But I'm also starting to question if the patch really is that useful.
>> Sure, people may not have NUMA support enabled (e.g. on non-linux
>> platforms), and even if they do the _numa view is quite expensive.
>>
> 
> Yeah, it's not for day to day activities, more for configuration testing and
> also for development activity/testing.
> 
> For example, If I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
> expect to see buffers not spread across pages.
> 
> But what I can see is:
> 
> SELECT
>     pages_per_buffer,
>     COUNT(*) as buffer_count
> FROM (
>     SELECT bufferid, COUNT(*) as pages_per_buffer
>     FROM pg_buffercache
>     GROUP BY bufferid
> ) subq
> GROUP BY pages_per_buffer
> ORDER BY pages_per_buffer;
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       261120
>                 2 |         1024
> 
> This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.
> 
> If I change it to:
> 
> BufferManagerShmemInit(void)
> 
>         /* Align buffer pool on IO page size boundary. */
>         BufferBlocks = (char *)
> -               TYPEALIGN(PG_IO_ALIGN_SIZE,
> +               TYPEALIGN(2 * 1024 * 1024,
>                                   ShmemInitStruct("Buffer Blocks",
> -                                                                 NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
> +                                                                 NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
>                                                                   &foundBufs));
> 
> Then I get:
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       262144
> (1 row)
> 
> 
> So we've been able to see that some buffers were spread across pages due to 
> shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment
> to be set to 2MB then I don't see any buffers spread across pages anymore.
> 
> I think that it helps "visualize" some configuration or code changes.
> 
> What are your thoughts?
> 

But isn't the _numa view good enough for this? Sure, you need NUMA
support for it, and it may take a fair amount of time, but how often you
need to do such queries? I don't plan to block improving this use case,
but I'm not sure it's worth the effort.


cheers

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

01 July, 20:20:06

Hi,

On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
> On 7/1/25 18:34, Bertrand Drouvot wrote:
> 
> But isn't the _numa view good enough for this? Sure, you need NUMA
> support for it, and it may take a fair amount of time, but how often you
> need to do such queries?

Not that often, but my reasoning was more like:

why people managing engines and/or developing on platform that does not support
libnuma would not deserve access to this information?

> I don't plan to block improving this use case,

No worries at all, I do appreciate that you're looking at it and provide feedback
whatever the outcome would be.

> but I'm not sure it's worth the effort.

I think that the hard work has already been done while creating
pg_buffercache_numa_pages().

Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages()
so that:

* the new view could make use of it
* the maintenance burden should be low (thanks to code dedeuplication)
* people that don't have access to a platform that supports libnuma can have
access to this information

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

01 July, 20:46:30

On 7/1/25 19:20, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
>> On 7/1/25 18:34, Bertrand Drouvot wrote:
>>
>> But isn't the _numa view good enough for this? Sure, you need NUMA
>> support for it, and it may take a fair amount of time, but how often you
>> need to do such queries?
> 
> Not that often, but my reasoning was more like:
> 
> why people managing engines and/or developing on platform that does not support
> libnuma would not deserve access to this information?
> 

True. I always forget we only support libnuma on linux for now.

>> I don't plan to block improving this use case,
> 
> No worries at all, I do appreciate that you're looking at it and provide feedback
> whatever the outcome would be.
> 
>> but I'm not sure it's worth the effort.
> 
> I think that the hard work has already been done while creating
> pg_buffercache_numa_pages().
> 
> Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages()
> so that:
> 
> * the new view could make use of it
> * the maintenance burden should be low (thanks to code dedeuplication)
> * people that don't have access to a platform that supports libnuma can have
> access to this information
> 

+1


regards

-- 
Tomas Vondra