Thread: Add os_page_num to pg_buffercache
Hi hackers,

I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
thought that seeing how buffers are spread across multiple OS pages (if that's
the case), thanks to the os_page_num field, is good information to have.

The thing that I think is annoying is that to get this information (os_page_num):

- One needs to use pg_buffercache_numa (which is more costly/slower than
  pg_buffercache)
- One needs a system with NUMA support enabled

So why not also add this information (os_page_num) to pg_buffercache?

- It would make this information available on all systems, not just NUMA-enabled ones
- It would help understand the memory layout implications of configuration changes
  such as database block size or OS page size (huge pages for example), and see how
  the buffers are spread across OS pages (if that's the case)

So, please find attached a patch to $SUBJECT then.

Remarks:

- Maybe we could create a helper function to reduce the code duplication between
  pg_buffercache_pages() and pg_buffercache_numa_pages()
- I think it would have made sense to also add this information while working
  on ba2a3c2302f, but (unfortunately) I doubt that this patch is a candidate for
  v18 post-freeze (it looks more like a feature enhancement than anything else)
- It's currently doing the changes in pg_buffercache v1.6 but will need to
  create v1.7 for v19 (if the above stands true)

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
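To give an idea of the kind of check this enables, here is a sketch (it assumes
the proposed os_page_num column and the same one-row-per-(buffer, OS page)
layout as pg_buffercache_numa; the exact output shape is of course up for
review):

-- how many buffers are currently split across more than one OS page?
SELECT count(*) AS split_buffers
FROM (SELECT bufferid
      FROM pg_buffercache
      GROUP BY bufferid
      HAVING count(*) > 1) AS multi_page_buffers;

With 4kB OS pages and the default 8kB block size essentially every buffer shows
up here, while with 2MB huge pages only the few buffers straddling a huge-page
boundary would.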
On 4/10/25 15:17, Bertrand Drouvot wrote:
> Hi hackers,
> 
> I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
> thought that seeing how buffers are spread across multiple OS pages (if that's
> the case), thanks to the os_page_num field, is good information to have.
> 
> The thing that I think is annoying is that to get this information (os_page_num):
> 
> - One needs to use pg_buffercache_numa (which is more costly/slower than
>   pg_buffercache)
> - One needs a system with NUMA support enabled
> 
> So why not also add this information (os_page_num) to pg_buffercache?
> 
> - It would make this information available on all systems, not just NUMA-enabled ones
> - It would help understand the memory layout implications of configuration changes
>   such as database block size or OS page size (huge pages for example), and see how
>   the buffers are spread across OS pages (if that's the case)
> 
> So, please find attached a patch to $SUBJECT then.
> 
> Remarks:
> 
> - Maybe we could create a helper function to reduce the code duplication between
>   pg_buffercache_pages() and pg_buffercache_numa_pages()
> - I think it would have made sense to also add this information while working
>   on ba2a3c2302f, but (unfortunately) I doubt that this patch is a candidate for
>   v18 post-freeze (it looks more like a feature enhancement than anything else)
> - It's currently doing the changes in pg_buffercache v1.6 but will need to
>   create v1.7 for v19 (if the above stands true)
> 

This seems like a good idea in principle, but at this point it has to
wait for PG19. Please add it to the July commitfest.

regards

--
Tomas Vondra
On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> This seems like a good idea in principle, but at this point it has to
> wait for PG19. Please add it to the July commitfest.

+1. From a glance, this seems to fall in the "new feature" bucket and
should likely wait for v19.

--
nathan
Hi,

On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > This seems like a good idea in principle, but at this point it has to
> > wait for PG19. Please add it to the July commitfest.
> 
> +1. From a glance, this seems to fall in the "new feature" bucket and
> should likely wait for v19.

Thank you both for providing your thoughts, which confirm my initial doubt.
Let's come back to that one later then.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> > On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > > This seems like a good idea in principle, but at this point it has to
> > > wait for PG19. Please add it to the July commitfest.
> > 
> > +1. From a glance, this seems to fall in the "new feature" bucket and
> > should likely wait for v19.
> 
> Thank you both for providing your thoughts, which confirm my initial doubt.
> Let's come back to that one later then.
> 

Here we are.

Please find attached a rebased version; while at it, v2 adds a new macro and
a function to avoid some code duplication between pg_buffercache_pages() and
pg_buffercache_numa_pages().

So, PFA:

0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries

This new macro and function are extracted from pg_buffercache_numa_pages(),
which now makes use of them.

0002 - Add os_page_num to pg_buffercache

Making use of the new macro and function from 0001.

As it's for v19, also bumping pg_buffercache's version to 1.7.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
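For anyone wanting to try the attached patches, the usual extension upgrade
dance should be enough (a sketch only; it assumes the patched contrib module
has been built and installed, and that an older version of the extension is
already created in the database):

ALTER EXTENSION pg_buffercache UPDATE TO '1.7';
SELECT bufferid, os_page_num FROM pg_buffercache LIMIT 5;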
On 7/1/25 15:45, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
>> Hi,
>>
>> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
>>> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
>>>> This seems like a good idea in principle, but at this point it has to
>>>> wait for PG19. Please add it to the July commitfest.
>>>
>>> +1. From a glance, this seems to fall in the "new feature" bucket and
>>> should likely wait for v19.
>>
>> Thank you both for providing your thoughts, which confirm my initial doubt.
>> Let's come back to that one later then.
>>
> 
> Here we are.
> 
> Please find attached a rebased version; while at it, v2 adds a new macro and
> a function to avoid some code duplication between pg_buffercache_pages() and
> pg_buffercache_numa_pages().
> 
> So, PFA:
> 
> 0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries
> 
> This new macro and function are extracted from pg_buffercache_numa_pages(),
> which now makes use of them.
> 
> 0002 - Add os_page_num to pg_buffercache
> 
> Making use of the new macro and function from 0001.
> 
> As it's for v19, also bumping pg_buffercache's version to 1.7.
> 

Thanks for the updated patch!

I took a quick look at this, and I doubt we want to change the schema of
pg_buffercache like this. Adding columns is fine, but it seems rather
wrong to change the cardinality. The view is meant to be a 1:1 mapping for
buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
page), to be precise.

I think this will break a lot of monitoring queries, and possibly in a
very subtle way - especially on systems with huge pages, where most
buffers will have one row, but then a buffer that happens to be split
across two pages will have two rows. That seems not great.

Just look at the changes needed in the regression tests, where the first
test now needs to be

-select count(*) = (select setting::bigint
+select count(*) >= (select setting::bigint
         from pg_settings
         where name = 'shared_buffers')

which seems like a much weaker check.

IMHO it'd be better to have a new view for this info, something like
pg_buffercache_pages.

But I'm also starting to question if the patch really is that useful.
Sure, people may not have NUMA support enabled (e.g. on non-linux
platforms), and even if they do the _numa view is quite expensive. But I
don't recall ever asking how the buffers map to memory pages. At least not
before/without the NUMA stuff.

regards

--
Tomas Vondra
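To make the concern concrete, a monitoring query along these lines (a
hypothetical example, not taken from any particular tool), which assumes one
row per buffer, would silently start over-counting once a buffer can show up
once per OS page:

-- number of buffers currently caching a relation block; with one row per
-- (buffer, OS page), a buffer straddling two pages is counted twice
SELECT count(*) AS used_buffers
FROM pg_buffercache
WHERE relfilenode IS NOT NULL;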
Hi,

On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
> On 7/1/25 15:45, Bertrand Drouvot wrote:
> 
> I took a quick look at this,

Thanks for looking at it!

> and I doubt we want to change the schema of
> pg_buffercache like this. Adding columns is fine, but it seems rather
> wrong to change the cardinality. The view is meant to be a 1:1 mapping for
> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
> page), to be precise.
> 
> I think this will break a lot of monitoring queries, and possibly in a
> very subtle way - especially on systems with huge pages, where most
> buffers will have one row, but then a buffer that happens to be split
> across two pages will have two rows. That seems not great.
> 
> IMHO it'd be better to have a new view for this info, something like
> pg_buffercache_pages.

That's a good point, fully agree!

> But I'm also starting to question if the patch really is that useful.
> Sure, people may not have NUMA support enabled (e.g. on non-linux
> platforms), and even if they do the _numa view is quite expensive.
> 

Yeah, it's not for day to day activities, more for configuration testing and
also for development activity/testing.

For example, if I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
expect to see buffers not spread across pages.

But what I can see is:

SELECT
    pages_per_buffer,
    COUNT(*) as buffer_count
FROM (
    SELECT bufferid, COUNT(*) as pages_per_buffer
    FROM pg_buffercache
    GROUP BY bufferid
) subq
GROUP BY pages_per_buffer
ORDER BY pages_per_buffer;

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       261120
                2 |         1024

This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.

If I change it to:

BufferManagerShmemInit(void)

 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
-		TYPEALIGN(PG_IO_ALIGN_SIZE,
+		TYPEALIGN(2 * 1024 * 1024,
 				  ShmemInitStruct("Buffer Blocks",
-								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
 								  &foundBufs));

then I get:

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       262144
(1 row)

So we've been able to see that some buffers were spread across pages due to
the shared buffer alignment on PG_IO_ALIGN_SIZE, and that if we change the
alignment to 2MB, no buffers are spread across pages anymore.

I think that it helps "visualize" some configuration or code changes.

What are your thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
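For reference, the numbers above are consistent with shared_buffers = 2GB on
that instance (an assumption, the setting isn't shown above): 262144 buffers
of 8kB fit into exactly 1024 huge pages of 2MB, so a buffer pool start that is
only PG_IO_ALIGN_SIZE-aligned ends up with one straddling buffer per huge-page
boundary it crosses:

SELECT (2::bigint * 1024 * 1024 * 1024) / 8192 AS buffers,                  -- 262144
       (2::bigint * 1024 * 1024 * 1024) / (2 * 1024 * 1024) AS huge_pages;  -- 1024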
On 7/1/25 18:34, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
>> On 7/1/25 15:45, Bertrand Drouvot wrote:
>>
>> I took a quick look at this,
> 
> Thanks for looking at it!
> 
>> and I doubt we want to change the schema of
>> pg_buffercache like this. Adding columns is fine, but it seems rather
>> wrong to change the cardinality. The view is meant to be a 1:1 mapping for
>> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
>> page), to be precise.
>>
>> I think this will break a lot of monitoring queries, and possibly in a
>> very subtle way - especially on systems with huge pages, where most
>> buffers will have one row, but then a buffer that happens to be split
>> across two pages will have two rows. That seems not great.
>>
>> IMHO it'd be better to have a new view for this info, something like
>> pg_buffercache_pages.
> 
> That's a good point, fully agree!
> 
>> But I'm also starting to question if the patch really is that useful.
>> Sure, people may not have NUMA support enabled (e.g. on non-linux
>> platforms), and even if they do the _numa view is quite expensive.
>>
> 
> Yeah, it's not for day to day activities, more for configuration testing and
> also for development activity/testing.
> 
> For example, if I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
> expect to see buffers not spread across pages.
> 
> But what I can see is:
> 
> SELECT
>     pages_per_buffer,
>     COUNT(*) as buffer_count
> FROM (
>     SELECT bufferid, COUNT(*) as pages_per_buffer
>     FROM pg_buffercache
>     GROUP BY bufferid
> ) subq
> GROUP BY pages_per_buffer
> ORDER BY pages_per_buffer;
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       261120
>                 2 |         1024
> 
> This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.
> 
> If I change it to:
> 
> BufferManagerShmemInit(void)
> 
>  	/* Align buffer pool on IO page size boundary. */
>  	BufferBlocks = (char *)
> -		TYPEALIGN(PG_IO_ALIGN_SIZE,
> +		TYPEALIGN(2 * 1024 * 1024,
>  				  ShmemInitStruct("Buffer Blocks",
> -								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
> +								  NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
>  								  &foundBufs));
> 
> then I get:
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       262144
> (1 row)
> 
> So we've been able to see that some buffers were spread across pages due to
> the shared buffer alignment on PG_IO_ALIGN_SIZE, and that if we change the
> alignment to 2MB, no buffers are spread across pages anymore.
> 
> I think that it helps "visualize" some configuration or code changes.
> 
> What are your thoughts?
> 

But isn't the _numa view good enough for this? Sure, you need NUMA support
for it, and it may take a fair amount of time, but how often do you need to
run such queries?

I don't plan to block improving this use case, but I'm not sure it's worth
the effort.

cheers

--
Tomas Vondra
Hi,

On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
> On 7/1/25 18:34, Bertrand Drouvot wrote:
> 
> But isn't the _numa view good enough for this? Sure, you need NUMA
> support for it, and it may take a fair amount of time, but how often do you
> need to run such queries?

Not that often, but my reasoning was more like:

why shouldn't people managing engines and/or developing on platforms that do
not support libnuma have access to this information?

> I don't plan to block improving this use case,

No worries at all, I do appreciate that you're looking at it and providing
feedback, whatever the outcome would be.

> but I'm not sure it's worth the effort.

I think that the hard work has already been done while creating
pg_buffercache_numa_pages().

Now it's just a matter of extracting the necessary pieces from
pg_buffercache_numa_pages() so that:

* the new view could make use of them
* the maintenance burden should be low (thanks to the code deduplication)
* people that don't have access to a platform that supports libnuma can have
  access to this information

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 7/1/25 19:20, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
>> On 7/1/25 18:34, Bertrand Drouvot wrote:
>>
>> But isn't the _numa view good enough for this? Sure, you need NUMA
>> support for it, and it may take a fair amount of time, but how often do you
>> need to run such queries?
> 
> Not that often, but my reasoning was more like:
> 
> why shouldn't people managing engines and/or developing on platforms that do
> not support libnuma have access to this information?
> 

True. I always forget we only support libnuma on linux for now.

>> I don't plan to block improving this use case,
> 
> No worries at all, I do appreciate that you're looking at it and providing
> feedback, whatever the outcome would be.
> 
>> but I'm not sure it's worth the effort.
> 
> I think that the hard work has already been done while creating
> pg_buffercache_numa_pages().
> 
> Now it's just a matter of extracting the necessary pieces from
> pg_buffercache_numa_pages() so that:
> 
> * the new view could make use of them
> * the maintenance burden should be low (thanks to the code deduplication)
> * people that don't have access to a platform that supports libnuma can have
>   access to this information
> 

+1

regards

--
Tomas Vondra