Thread: Add tracking of backend memory allocated to pg_stat_activity
Hi Hackers, Attached is a patch to Add tracking of backend memory allocated to pg_stat_activity This new field displays the current bytes of memory allocated to the backend process. It is updated as memory for the process is malloc'd/free'd. Memory allocated to items on the freelist is included in the displayed value. Dynamic shared memory allocations are included only in the value displayed for the backend that created them, they are not included in the value for backends that are attached to them to avoid double counting. On occasion, orphaned memory segments may be cleaned up on postmaster startup. This may result in decreasing the sum without a prior increment. We limit the floor of backend_mem_allocated to zero. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > Hi Hackers, > > Attached is a patch to > Add tracking of backend memory allocated to pg_stat_activity > + proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', In the past, there was concern about making pg_stat_activity wider by adding information that's less-essential than what's been there for years. This is only an int64, so it's not "wide", but I wonder if there's another way to expose this information? Like adding backends to pg_get_backend_memory_contexts() , maybe with another view on top of that ? + * shown allocated in pgstat_activity when the creator destroys the pg_stat > + * Posix creation calls dsm_impl_posix_resize implying that resizing > + * occurs or may be added in the future. As implemented > + * dsm_impl_posix_resize utilizes fallocate or truncate, passing the > + * whole new size as input, growing the allocation as needed * (only > + * truncate supports shrinking). We update by replacing the * old wrapping caused extraneous stars > + * Do not allow backend_mem_allocated to go below zero. ereport if we > + * would have. There's no need for a lock around the read here asit's as it's > + ereport(LOG, (errmsg("decrease reduces reported backend memory allocated below zero; setting reported to 0"))); errmsg() doesn't require the outside paranthesis since a couple years ago. > + /* > + * Until header allocation is included in context->mem_allocated cast to > + * slab and decrement the headerSize add a comma before "cast" ? -- Justin
At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in > On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > > Hi Hackers, > > > > Attached is a patch to > > Add tracking of backend memory allocated to pg_stat_activity > > > + proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > > In the past, there was concern about making pg_stat_activity wider by > adding information that's less-essential than what's been there for > years. This is only an int64, so it's not "wide", but I wonder if > there's another way to expose this information? Like adding backends to The view looks already too wide to me. I don't want the numbers for metrics are added to the view. > pg_get_backend_memory_contexts() , maybe with another view on top of > that ? +1 > + * shown allocated in pgstat_activity when the creator destroys the > > pg_stat > > > + * Posix creation calls dsm_impl_posix_resize implying that resizing > > + * occurs or may be added in the future. As implemented > > + * dsm_impl_posix_resize utilizes fallocate or truncate, passing the > > + * whole new size as input, growing the allocation as needed * (only > > + * truncate supports shrinking). We update by replacing the * old > > wrapping caused extraneous stars > > > + * Do not allow backend_mem_allocated to go below zero. ereport if we > > + * would have. There's no need for a lock around the read here asit's > > as it's > > > + ereport(LOG, (errmsg("decrease reduces reported backend memory allocated below zero; setting reported to 0"))); > > errmsg() doesn't require the outside paranthesis since a couple years > ago. +1 > > + /* > > + * Until header allocation is included in context->mem_allocated cast to > > + * slab and decrement the headerSize > > add a comma before "cast" ? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 31 Aug 2022 12:03:06 -0400, Reid Thompson <reid.thompson@crunchydata.com> wrote in > Attached is a patch to > Add tracking of backend memory allocated to pg_stat_activity > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > return NULL; > > context->mem_allocated += blksize; > + pgstat_report_backend_mem_allocated_increase(blksize); I'm not sure this is acceptable. The function adds a branch even when the feature is turned off, which I think may cause a certain extent of performance degradation. A past threads [1], [2] and [3] might be informative. [1] https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop [2] https://www.postgresql.org/message-id/72a656e0f71d0860161e0b3f67e4d771%40oss.nttdata.com [3] https://www.postgresql.org/message-id/0271f440ac77f2a4180e0e56ebd944d1%40oss.nttdata.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On 9/1/22 3:28 AM, Kyotaro Horiguchi wrote: > At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in >> On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: >>> Hi Hackers, >>> >>> Attached is a patch to >>> Add tracking of backend memory allocated Thanks for the patch. + 1 on the idea. >>> + proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', >> In the past, there was concern about making pg_stat_activity wider by >> adding information that's less-essential than what's been there for >> years. This is only an int64, so it's not "wide", but I wonder if >> there's another way to expose this information? Like adding backends to > The view looks already too wide to me. I don't want the numbers for > metrics are added to the view. +1 for a dedicated view. While we are at it, what do you think about also recording the max memory allocated by a backend? (could be useful and would avoid sampling for which there is no guarantee to sample the max anyway). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, 2022-08-31 at 12:05 -0500, Justin Pryzby wrote: > > + proargmodes => > > '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}' > > , > > In the past, there was concern about making pg_stat_activity wider by > adding information that's less-essential than what's been there for > years. This is only an int64, so it's not "wide", but I wonder if > there's another way to expose this information? Like adding backends > to > pg_get_backend_memory_contexts() , maybe with another view on top of I will take a look at pg_get_backend_memory_contexts. I will also look at the other suggestions in the thread. > + * shown allocated in pgstat_activity when the > > pg_stat Corrected, > > replacing the * old > > wrapping caused extraneous stars Corrected > > here asit's > > as it's Corrected > errmsg() doesn't require the outside paranthesis since a couple years > ago. Corrected > > > mem_allocated cast to > add a comma before "cast" ? Corrected Patch with the corrections attached -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
On Thu, 2022-09-01 at 13:43 +0900, Kyotaro Horiguchi wrote: > > > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > return NULL; > > > > context->mem_allocated += blksize; > > + pgstat_report_backend_mem_allocated_increase(blksi > > ze); > > I'm not sure this is acceptable. The function adds a branch even when > the feature is turned off, which I think may cause a certain extent > of > performance degradation. A past threads [1], [2] and [3] might be > informative. Stated above is '...even when the feature is turned off...', I want to note that this feature/patch (for tracking memory allocated) doesn't have an 'on/off'. Tracking would always occur. I'm open to guidance on testing for performance degradation. I did note some basic pgbench comparison numbers in the thread regarding limiting backend memory allocations. > -- > Kyotaro Horiguchi > NTT Open Source Software Center
At Tue, 06 Sep 2022 17:10:49 -0400, Reid Thompson <reid.thompson@crunchydata.com> wrote in > On Thu, 2022-09-01 at 13:43 +0900, Kyotaro Horiguchi wrote: > > > > > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > > return NULL; > > > > > > context->mem_allocated += blksize; > > > + pgstat_report_backend_mem_allocated_increase(blksi > > > ze); > > > > I'm not sure this is acceptable. The function adds a branch even when > > the feature is turned off, which I think may cause a certain extent > > of > > performance degradation. A past threads [1], [2] and [3] might be > > informative. > > Stated above is '...even when the feature is turned off...', I want to > note that this feature/patch (for tracking memory allocated) doesn't > have an 'on/off'. Tracking would always occur. In the patch, I see that pgstat_report_backend_mem_allocated_increase() runs the following code, which seems like to me to be a branch.. + if (!beentry || !pgstat_track_activities) + { + /* + * Account for memory before pgstats is initialized. This will be + * migrated to pgstats on initialization. + */ + backend_mem_allocated += allocation; + + return; + } > I'm open to guidance on testing for performance degradation. I did > note some basic pgbench comparison numbers in the thread regarding > limiting backend memory allocations. Yeah.. That sounds good.. (I have a patch that is stuck at benchmarking on slight possible degradation caused by a branch (or indirect call) on a hot path similary to this one. The test showed fluctuation that is not clearly distinguishable between noise and degradation by running the target functions in a busy loop..) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 07 Sep 2022 17:08:41 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 06 Sep 2022 17:10:49 -0400, Reid Thompson <reid.thompson@crunchydata.com> wrote in > > On Thu, 2022-09-01 at 13:43 +0900, Kyotaro Horiguchi wrote: > > > > > > > @@ -916,6 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > > > return NULL; > > > > > > > > context->mem_allocated += blksize; > > > > + pgstat_report_backend_mem_allocated_increase(blksi > > > > ze); > > > > > > I'm not sure this is acceptable. The function adds a branch even when > > > the feature is turned off, which I think may cause a certain extent > > > of > > > performance degradation. A past threads [1], [2] and [3] might be > > > informative. > > > > Stated above is '...even when the feature is turned off...', I want to > > note that this feature/patch (for tracking memory allocated) doesn't > > have an 'on/off'. Tracking would always occur. > > In the patch, I see that > pgstat_report_backend_mem_allocated_increase() runs the following > code, which seems like to me to be a branch.. Ah.. sorry. > pgstat_report_backend_mem_allocated_increase() runs the following - code, which seems like to me to be a branch.. + code, which seems like to me to be a branch that can turn of the feature.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Greetings, * Drouvot, Bertrand (bdrouvot@amazon.com) wrote: > On 9/1/22 3:28 AM, Kyotaro Horiguchi wrote: > >At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in > >>On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > >>>Attached is a patch to > >>>Add tracking of backend memory allocated > > Thanks for the patch. > > + 1 on the idea. Glad folks are in support of the general idea. > >>>+ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > >>In the past, there was concern about making pg_stat_activity wider by > >>adding information that's less-essential than what's been there for > >>years. This is only an int64, so it's not "wide", but I wonder if > >>there's another way to expose this information? Like adding backends to > >The view looks already too wide to me. I don't want the numbers for > >metrics are added to the view. > > +1 for a dedicated view. A dedicated view with a single column in it hardly seems sensible. I'd also argue that this particular bit of information is extremely useful and therefore worthy of being put directly into pg_stat_activity. I could see a dedicated view possibly *also* being added later if/when we provide a more detailed break-down of how the memory is being used but that's a whole other thing and I'm not even 100% sure we'll ever actually get there, as you can already poke a backend and have it dump out the memory context-level information on an as-needed basis. > While we are at it, what do you think about also recording the max memory > allocated by a backend? (could be useful and would avoid sampling for which > there is no guarantee to sample the max anyway). What would you do with that information..? By itself, it doesn't strike me as useful. Perhaps it'd be interesting to grab the max required for a particular query in pg_stat_statements or such but again, that's a very different thing. Thanks, Stephen
Attachment
Greetings, * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > At Tue, 06 Sep 2022 17:10:49 -0400, Reid Thompson <reid.thompson@crunchydata.com> wrote in > > I'm open to guidance on testing for performance degradation. I did > > note some basic pgbench comparison numbers in the thread regarding > > limiting backend memory allocations. > > Yeah.. That sounds good.. > > (I have a patch that is stuck at benchmarking on slight possible > degradation caused by a branch (or indirect call) on a hot path > similary to this one. The test showed fluctuation that is not clearly > distinguishable between noise and degradation by running the target > functions in a busy loop..) Just to be clear- this path is (hopefully) not *super* hot as we're only tracking actual allocations (that is- malloc() calls), this isn't changing anything for palloc() calls that aren't also needing to do a malloc(), and we already try to reduce the amount of malloc() calls we're doing by allocating more and more each time we run out in a given context. While I'm generally supportive of doing some benchmarking around this, I don't think the bar is as high as it would be if we were actually changing the cost of routine palloc() or such calls. Thanks, Stephen
Attachment
On Fri, Sep 09, 2022 at 12:34:15PM -0400, Stephen Frost wrote: > > While we are at it, what do you think about also recording the max memory > > allocated by a backend? (could be useful and would avoid sampling for which > > there is no guarantee to sample the max anyway). FYI, that's already kind-of available from getrusage: $ psql ts -c "SET log_executor_stats=on; SET client_min_messages=debug; SELECT a, COUNT(1) FROM generate_series(1,999999) a GROUP BY 1;" |wc LOG: EXECUTOR STATISTICS ... ! 194568 kB max resident size Note that max rss counts things allocated outside postgres (like linked libraries). > What would you do with that information..? By itself, it doesn't strike > me as useful. Perhaps it'd be interesting to grab the max required for > a particular query in pg_stat_statements or such but again, that's a > very different thing. log_executor_stats is at level "debug", so it's not great to enable it for a single session, and painful to think about enabling it globally. This would be a lot friendlier. Storing the maxrss per backend somewhere would be useful (and avoid the issue of "sampling" with top), after I agree that it ought to be exposed to a view. For example, it might help to determine whether (and which!) backends are using large multiple of work_mem, and then whether that can be increased. If/when we had a "memory budget allocator", this would help to determine how to set its GUCs, maybe to see "which backends are using the work_mem that are precluding this other backend from using efficient query plan". I wonder if it's better to combine these two threads into one. The 0001 patch of course can be considered independently from the 0002 patch, as usual. Right now, there's different parties on both threads ... -- Justin
Hi,
On 9/9/22 7:08 PM, Justin Pryzby wrote:
On Fri, Sep 09, 2022 at 12:34:15PM -0400, Stephen Frost wrote:While we are at it, what do you think about also recording the max memory allocated by a backend? (could be useful and would avoid sampling for which there is no guarantee to sample the max anyway).What would you do with that information..? By itself, it doesn't strike me as useful. Perhaps it'd be interesting to grab the max required for a particular query in pg_stat_statements or such but again, that's a very different thing.Storing the maxrss per backend somewhere would be useful (and avoid the issue of "sampling" with top), after I agree that it ought to be exposed to a view. For example, it might help to determine whether (and which!) backends are using large multiple of work_mem, and then whether that can be increased. If/when we had a "memory budget allocator", this would help to determine how to set its GUCs, maybe to see "which backends are using the work_mem that are precluding this other backend from using efficient query plan".
+1.
Regards,
-- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Greetings, * Drouvot, Bertrand (bdrouvot@amazon.com) wrote: > On 9/9/22 7:08 PM, Justin Pryzby wrote: > >On Fri, Sep 09, 2022 at 12:34:15PM -0400, Stephen Frost wrote: > >>>While we are at it, what do you think about also recording the max memory > >>>allocated by a backend? (could be useful and would avoid sampling for which > >>>there is no guarantee to sample the max anyway). > >>What would you do with that information..? By itself, it doesn't strike > >>me as useful. Perhaps it'd be interesting to grab the max required for > >>a particular query in pg_stat_statements or such but again, that's a > >>very different thing. > > >Storing the maxrss per backend somewhere would be useful (and avoid the > >issue of "sampling" with top), after I agree that it ought to be exposed > >to a view. For example, it might help to determine whether (and which!) > >backends are using large multiple of work_mem, and then whether that can > >be increased. If/when we had a "memory budget allocator", this would > >help to determine how to set its GUCs, maybe to see "which backends are > >using the work_mem that are precluding this other backend from using > >efficient query plan". > > +1. I still have a hard time seeing the value in tracking which backends are using the most memory over the course of a backend's entire lifetime, which would involve lots of different queries, some of which might use many multiples of work_mem and others not. Much more interesting would be to track this as part of pg_stat_statements and associated with queries. Either way, this looks like an independent feature which someone who has interest in could work on but generally doesn't impact what the feature of this thread is about; a feature which has already shown merit in finding a recently introduced memory leak and is the basis of another feature being contemplated to help avoid OOM-killer introduced crashes. Thanks, Stephen
Attachment
patch rebased to current master -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
On Tue, 2022-10-25 at 14:51 -0400, Reid Thompson wrote: > patch rebased to current master > actually attach the patch -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
Hi, On 10/25/22 8:59 PM, Reid Thompson wrote: > On Tue, 2022-10-25 at 14:51 -0400, Reid Thompson wrote: >> patch rebased to current master >> > actually attach the patch > It looks like the patch does not apply anymore since b1099eca8f. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, 2022-11-04 at 11:06 +0100, Drouvot, Bertrand wrote: > Hi, > > It looks like the patch does not apply anymore since b1099eca8f. > > Regards, > Thanks, rebased to current master attached. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
Hi, On 2022-11-04 08:56:13 -0400, Reid Thompson wrote: > From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 2001 > From: Reid Thompson <jreidthompson@nc.rr.com> > Date: Thu, 11 Aug 2022 12:01:25 -0400 > Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity > > This new field displays the current bytes of memory allocated to the > backend process. It is updated as memory for the process is > malloc'd/free'd. Memory allocated to items on the freelist is included in > the displayed value. Dynamic shared memory allocations are included > only in the value displayed for the backend that created them, they are > not included in the value for backends that are attached to them to > avoid double counting. On occasion, orphaned memory segments may be > cleaned up on postmaster startup. This may result in decreasing the sum > without a prior increment. We limit the floor of backend_mem_allocated > to zero. Updated pg_stat_activity documentation for the new column. I'm not convinced that counting DSM values this way is quite right. There are a few uses of DSMs that track shared resources, with the biggest likely being the stats for relations etc. I suspect that tracking that via backend memory usage will be quite confusing, because fairly random backends that had to grow the shared state end up being blamed with the memory usage in perpituity - and then suddenly that memory usage will vanish when that backend exits, despite the memory continuing to exist. > @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size size) > return NULL; > > context->mem_allocated += blksize; > + pgstat_report_backend_allocated_bytes_increase(blksize); > > block->aset = set; > block->freeptr = block->endptr = ((char *) block) + blksize; > @@ -944,6 +958,7 @@ AllocSetAlloc(MemoryContext context, Size size) > return NULL; > > context->mem_allocated += blksize; > + pgstat_report_backend_allocated_bytes_increase(blksize); > > block->aset = set; > block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; > @@ -1043,6 +1058,7 @@ AllocSetFree(void *pointer) > block->next->prev = block->prev; > > set->header.mem_allocated -= block->endptr - ((char *) block); > + pgstat_report_backend_allocated_bytes_decrease(block->endptr - ((char *) block)); > > #ifdef CLOBBER_FREED_MEMORY > wipe_mem(block, block->freeptr - ((char *) block)); I suspect this will be noticable cost-wise. Even though these paths aren't the hottest memory allocation paths, by nature of going down into malloc, adding an external function call that then does a bunch of branching etc. seems likely to add up to some. See below for how I think we can deal with that... > + > +/* -------- > + * pgstat_report_backend_allocated_bytes_increase() - > + * > + * Called to report increase in memory allocated for this backend > + * -------- > + */ > +void > +pgstat_report_backend_allocated_bytes_increase(uint64 allocation) > +{ > + volatile PgBackendStatus *beentry = MyBEEntry; > + > + if (!beentry || !pgstat_track_activities) > + { > + /* > + * Account for memory before pgstats is initialized. This will be > + * migrated to pgstats on initialization. > + */ > + backend_allocated_bytes += allocation; > + > + return; > + } > + > + /* > + * Update my status entry, following the protocol of bumping > + * st_changecount before and after. We use a volatile pointer here to > + * ensure the compiler doesn't try to get cute. > + */ > + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry); > + beentry->backend_allocated_bytes += allocation; > + PGSTAT_END_WRITE_ACTIVITY(beentry); > +} This is quite a few branches, including write/read barriers. It doesn't really make sense to use the PGSTAT_BEGIN_WRITE_ACTIVITY() pattern here - you're just updating a single value, there's nothing to be gained by it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a consistent view of multiple values - but there aren't multiple values here! To avoid the overhead of checking (!beentry || !pgstat_track_activities) and the external function call, I think you'd be best off copying the trickery I introduced for pgstat_report_wait_start(), in 225a22b19ed. I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline function that unconditionally updates something like *my_backend_allocated_memory. To deal with the case of (!beentry || !pgstat_track_activities), that variable initially points to some backend local state and is set to the shared state in pgstat_bestart(). This additionally has the nice benefit that you can track memory usage from before pgstat_bestart(), it'll be in the local variable. > +void > +pgstat_report_backend_allocated_bytes_decrease(uint64 deallocation) > +{ > + volatile PgBackendStatus *beentry = MyBEEntry; > + > + /* > + * Cases may occur where shared memory from a previous postmaster > + * invocation still exist. These are cleaned up at startup by > + * dsm_cleanup_using_control_segment. Limit decreasing memory allocated to > + * zero in case no corresponding prior increase exists or decrease has > + * already been accounted for. > + */ I don't really follow - postmaster won't ever have a backend status array, so how would they be tracked here? Greetings, Andres Freund
On Fri, Nov 04, 2022 at 08:56:13AM -0400, Reid Thompson wrote: > From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 2001 > From: Reid Thompson <jreidthompson@nc.rr.com> > Date: Thu, 11 Aug 2022 12:01:25 -0400 > Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity > > This new field displays the current bytes of memory allocated to the > backend process. It is updated as memory for the process is > malloc'd/free'd. Memory allocated to items on the freelist is included in > the displayed value. Dynamic shared memory allocations are included > only in the value displayed for the backend that created them, they are > not included in the value for backends that are attached to them to > avoid double counting. On occasion, orphaned memory segments may be > cleaned up on postmaster startup. This may result in decreasing the sum > without a prior increment. We limit the floor of backend_mem_allocated > to zero. Updated pg_stat_activity documentation for the new column. > --- > doc/src/sgml/monitoring.sgml | 12 +++ > src/backend/catalog/system_views.sql | 1 + > src/backend/storage/ipc/dsm_impl.c | 81 +++++++++++++++ > src/backend/utils/activity/backend_status.c | 105 ++++++++++++++++++++ > src/backend/utils/adt/pgstatfuncs.c | 4 +- > src/backend/utils/mmgr/aset.c | 18 ++++ > src/backend/utils/mmgr/generation.c | 15 +++ > src/backend/utils/mmgr/slab.c | 21 ++++ > src/include/catalog/pg_proc.dat | 6 +- > src/include/utils/backend_status.h | 7 +- > src/test/regress/expected/rules.out | 9 +- > 11 files changed, 270 insertions(+), 9 deletions(-) > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > index e5d622d514..972805b85a 100644 > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -947,6 +947,18 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser > </para></entry> > </row> > > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>backend_allocated_bytes</structfield> <type>bigint</type> > + </para> > + <para> > + The byte count of memory allocated to this backend. Dynamic shared memory > + allocations are included only in the value displayed for the backend that > + created them, they are not included in the value for backends that are > + attached to them to avoid double counting. > + </para></entry> > + </row> > + It doesn't seem like you need the backend_ prefix in the view since that is implied by it being in pg_stat_activity. For the wording on the description, I find "memory allocated to this backend" a bit confusing. Perhaps you could reword it to make clear you mean that the number represents the balance of allocations by this backend. Something like: Memory currently allocated to this backend in bytes. This is the balance of bytes allocated and freed by this backend. I would also link to the system administration function pg_size_pretty() so users know how to easily convert the value. If you end up removing shared memory as Andres suggests in [1], I would link pg_shmem_allocations view here and point out that shared memory allocations can be viewed there instead (and why). You could instead add dynamic shared memory allocation to the pg_shmem_allocations view as suggested as follow-on work by the commit which introduced it, ed10f32e3. > +/* -------- > + * pgstat_report_backend_allocated_bytes_increase() - > + * > + * Called to report increase in memory allocated for this backend > + * -------- > + */ It seems like you could combine the pgstat_report_backend_allocated_bytes_decrease/increase() by either using a signed integer to represent the allocation/deallocation or passing in a "direction" that is just a positive or negative multiplier enum. Especially if you don't use the write barriers, I think you could simplify the logic in the two functions. If you do combine them, you might shorten the name to pgstat_report_backend_allocation() or pgstat_report_allocation(). > +void > +pgstat_report_backend_allocated_bytes_increase(uint64 allocation) > +{ > + volatile PgBackendStatus *beentry = MyBEEntry; > + > + if (!beentry || !pgstat_track_activities) > + { > + /* > + * Account for memory before pgstats is initialized. This will be > + * migrated to pgstats on initialization. > + */ > + backend_allocated_bytes += allocation; > + > + return; > + } > + > + /* > + * Update my status entry, following the protocol of bumping > + * st_changecount before and after. We use a volatile pointer here to > + * ensure the compiler doesn't try to get cute. > + */ > + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry); > + beentry->backend_allocated_bytes += allocation; > + PGSTAT_END_WRITE_ACTIVITY(beentry); > +} > + > +/* -------- > + * pgstat_report_backend_allocated_bytes_decrease() - > + * > + * Called to report decrease in memory allocated for this backend > + * -------- > + */ > +void > +pgstat_report_backend_allocated_bytes_decrease(uint64 deallocation) > +{ > + volatile PgBackendStatus *beentry = MyBEEntry; > + > + /* > + * Cases may occur where shared memory from a previous postmaster > + * invocation still exist. These are cleaned up at startup by > + * dsm_cleanup_using_control_segment. Limit decreasing memory allocated to > + * zero in case no corresponding prior increase exists or decrease has > + * already been accounted for. > + */ > + > + if (!beentry || !pgstat_track_activities) > + { > + /* > + * Account for memory before pgstats is initialized. This will be > + * migrated to pgstats on initialization. Do not allow > + * backend_allocated_bytes to go below zero. If pgstats has not been > + * initialized, we are in startup and we set backend_allocated_bytes > + * to zero in cases where it would go negative and skip generating an > + * ereport. > + */ > + if (deallocation > backend_allocated_bytes) > + backend_allocated_bytes = 0; > + else > + backend_allocated_bytes -= deallocation; > + > + return; > + } > + > + /* > + * Do not allow backend_allocated_bytes to go below zero. ereport if we > + * would have. There's no need for a lock around the read here as it's > + * being referenced from the same backend which means that there shouldn't > + * be concurrent writes. We want to generate an ereport in these cases. > + */ > + if (deallocation > beentry->backend_allocated_bytes) > + { > + ereport(LOG, errmsg("decrease reduces reported backend memory allocated below zero; setting reported to 0")); > + I also think it would be nice to include the deallocation amount and backend_allocated_bytes amount in the log message. It also might be nice to start the message with something more clear than "decrease". For example, I would find this clear as a user: Backend [backend_type or pid] deallocated [deallocation number] bytes, [backend_allocated_bytes - deallocation number] more than this backend has reported allocating. > diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c > index 96bffc0f2a..b6d135ad2f 100644 > --- a/src/backend/utils/adt/pgstatfuncs.c > +++ b/src/backend/utils/adt/pgstatfuncs.c > @@ -553,7 +553,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS) > Datum > pg_stat_get_activity(PG_FUNCTION_ARGS) > { > -#define PG_STAT_GET_ACTIVITY_COLS 30 > +#define PG_STAT_GET_ACTIVITY_COLS 31 > int num_backends = pgstat_fetch_stat_numbackends(); > int curr_backend; > int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0); > @@ -609,6 +609,8 @@ pg_stat_get_activity(PG_FUNCTION_ARGS) > else > nulls[16] = true; > > + values[30] = UInt64GetDatum(beentry->backend_allocated_bytes); Though not the fault of this patch, it is becoming very difficult to keep the columns straight in pg_stat_get_activity(). Perhaps you could add a separate commit to add an enum for the columns so the function is easier to understand. > diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h > index b582b46e9f..75d87e8308 100644 > --- a/src/include/utils/backend_status.h > +++ b/src/include/utils/backend_status.h > @@ -169,6 +169,9 @@ typedef struct PgBackendStatus > > /* query identifier, optionally computed using post_parse_analyze_hook */ > uint64 st_query_id; > + > + /* Current memory allocated to this backend */ > + uint64 backend_allocated_bytes; > } PgBackendStatus; I don't think you need the backend_ prefix here since it is in PgBackendStatus. > @@ -313,7 +316,9 @@ extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser); > extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer, > int buflen); > extern uint64 pgstat_get_my_query_id(void); > - > +extern void pgstat_report_backend_allocated_bytes_increase(uint64 allocation); > +extern void pgstat_report_backend_allocated_bytes_decrease(uint64 deallocation); > +extern uint64 pgstat_get_all_backend_memory_allocated(void); > > /* ---------- > * Support functions for the SQL-callable functions to > diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out > index 624d0e5aae..ba9f494806 100644 > --- a/src/test/regress/expected/rules.out > +++ b/src/test/regress/expected/rules.out > @@ -1753,10 +1753,11 @@ pg_stat_activity| SELECT s.datid, > s.state, > s.backend_xid, > s.backend_xmin, > + s.backend_allocated_bytes, > s.query_id, > s.query, > s.backend_type Seems like it would be possible to add a functional test to stats.sql. - Melanie [1] https://www.postgresql.org/message-id/20221105024146.xxlbtsxh2niyz2fu%40awork3.anarazel.de
Hi Andres, Thanks for looking at this and for the feedback. Responses inline below. On Fri, 2022-11-04 at 19:41 -0700, Andres Freund wrote: > Hi, > > On 2022-11-04 08:56:13 -0400, Reid Thompson wrote: > > I'm not convinced that counting DSM values this way is quite right. > There are a few uses of DSMs that track shared resources, with the biggest > likely being the stats for relations etc. I suspect that tracking that via > backend memory usage will be quite confusing, because fairly random backends that > had to grow the shared state end up being blamed with the memory usage in > perpituity - and then suddenly that memory usage will vanish when that backend exits, > despite the memory continuing to exist. Ok, I'll make an attempt to identify these allocations and manage them elsewhere. > > > > @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > return NULL; > > > > context->mem_allocated += blksize; > > + pgstat_report_backend_allocated_bytes_increase(blksize); > > > > block->aset = set; > > block->freeptr = block->endptr = ((char *) block) + blksize; > > @@ -944,6 +958,7 @@ AllocSetAlloc(MemoryContext context, Size size) > > return NULL; > > > > context->mem_allocated += blksize; > > + pgstat_report_backend_allocated_bytes_increase(blksize); > > > > block->aset = set; > > block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; > > @@ -1043,6 +1058,7 @@ AllocSetFree(void *pointer) > > block->next->prev = block->prev; > > > > set->header.mem_allocated -= block->endptr - ((char *) block); > > + pgstat_report_backend_allocated_bytes_decrease(block->endptr - ((char *) block)); > > > > #ifdef CLOBBER_FREED_MEMORY > > wipe_mem(block, block->freeptr - ((char *) block)); > > I suspect this will be noticable cost-wise. Even though these paths aren't the > hottest memory allocation paths, by nature of going down into malloc, adding > an external function call that then does a bunch of branching etc. seems > likely to add up to some. See below for how I think we can deal with > that... > > > > + > > +/* -------- > > + * pgstat_report_backend_allocated_bytes_increase() - > > + * > > + * Called to report increase in memory allocated for this backend > > + * -------- > > + */ > > +void > > +pgstat_report_backend_allocated_bytes_increase(uint64 allocation) > > +{ > > + volatile PgBackendStatus *beentry = MyBEEntry; > > + > > + if (!beentry || !pgstat_track_activities) > > + { > > + /* > > + * Account for memory before pgstats is initialized. This will be > > + * migrated to pgstats on initialization. > > + */ > > + backend_allocated_bytes += allocation; > > + > > + return; > > + } > > + > > + /* > > + * Update my status entry, following the protocol of bumping > > + * st_changecount before and after. We use a volatile pointer here to > > + * ensure the compiler doesn't try to get cute. > > + */ > > + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry); > > + beentry->backend_allocated_bytes += allocation; > > + PGSTAT_END_WRITE_ACTIVITY(beentry); > > +} > > This is quite a few branches, including write/read barriers. > > It doesn't really make sense to use the PGSTAT_BEGIN_WRITE_ACTIVITY() pattern > here - you're just updating a single value, there's nothing to be gained by > it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a > consistent view of multiple values - but there aren't multiple values > here! I'll remove the barriers - initially I copied how prior functions were coded as my template ala pgstat_report_query_id, pgstat_report_xact_timestamp. > > To avoid the overhead of checking (!beentry || !pgstat_track_activities) and > the external function call, I think you'd be best off copying the trickery I > introduced for pgstat_report_wait_start(), in 225a22b19ed. > > I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline > function that unconditionally updates something like > *my_backend_allocated_memory. To deal with the case of (!beentry || > !pgstat_track_activities), that variable initially points to some backend > local state and is set to the shared state in pgstat_bestart(). > > This additionally has the nice benefit that you can track memory usage from > before pgstat_bestart(), it'll be in the local variable. OK, I think I can mimic the code you reference. > > > +void > > +pgstat_report_backend_allocated_bytes_decrease(uint64 > > deallocation) > > +{ > > + volatile PgBackendStatus *beentry = MyBEEntry; > > + > > + /* > > + * Cases may occur where shared memory from a previous postmaster > > + * invocation still exist. These are cleaned up at startup by > > + * dsm_cleanup_using_control_segment. Limit decreasing memory allocated to > > + * zero in case no corresponding prior increase exists or decrease has > > + * already been accounted for. > > + */ > > I don't really follow - postmaster won't ever have a backend status > array, so how would they be tracked here? On startup, a check is made for leftover dsm control segments in the DataDir. It appears possible that in certain situations on startup we may find and destroy stale segments and thus decrement the allocation variable. I based this off of: /ipc/dsm.c dsm_postmaster_startup: 150 dsm_postmaster_startup(PGShmemHeader *shim) { ...snip... 158 /* 159 ¦* If we're using the mmap implementations, clean up any leftovers. 160 ¦* Cleanup isn't needed on Windows, and happens earlier in startup for 161 ¦* POSIX and System V shared memory, via a direct call to 162 ¦* dsm_cleanup_using_control_segment. 163 ¦*/ ...snip... } dsm_cleanup_using_control_segment: 206 /* 207 * Determine whether the control segment from the previous postmaster 208 * invocation still exists. If so, remove the dynamic shared memory 209 * segments to which it refers, and then the control segment itself. 210 */ 211 void 212 dsm_cleanup_using_control_segment(dsm_handle old_control_handle) 213 { ...snip... 270 /* Destroy the referenced segment. */ 271 dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private, 272 &junk_mapped_address, &junk_mapped_size, LOG); 273 } 274 275 /* Destroy the old control segment, too. */ 276 elog(DEBUG2, 277 ¦"cleaning up dynamic shared memory control segment with ID %u", 278 ¦old_control_handle); 279 dsm_impl_op(DSM_OP_DESTROY, old_control_handle, 0, &impl_private, 280 &mapped_address, &mapped_size, LOG); 281 } > > Greetings, > > Andres Freund Thanks again, Reid -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Hi Melanie, Thank you for looking at this and for the feedback. Responses inline below. On Mon, 2022-11-07 at 16:17 -0500, Melanie Plageman wrote: > On Fri, Nov 04, 2022 at 08:56:13AM -0400, Reid Thompson wrote: > > From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 > > 2001 > > From: Reid Thompson <jreidthompson@nc.rr.com> > > Date: Thu, 11 Aug 2022 12:01:25 -0400 > > Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity > > + > > It doesn't seem like you need the backend_ prefix in the view since > that is implied by it being in pg_stat_activity. I will remove the prefix. > For the wording on the description, I find "memory allocated to this > backend" a bit confusing. Perhaps you could reword it to make clear > you mean that the number represents the balance of allocations by this > backend. Something like: > > Memory currently allocated to this backend in bytes. This is the > balance of bytes allocated and freed by this backend. > I would also link to the system administration function > pg_size_pretty() so users know how to easily convert the value. Thanks, I'll make these changes > If you end up removing shared memory as Andres suggests in [1], I > would link pg_shmem_allocations view here and point out that shared memory > allocations can be viewed there instead (and why). > > You could instead add dynamic shared memory allocation to the > pg_shmem_allocations view as suggested as follow-on work by the > commit which introduced it, ed10f32e3. > > > +/* -------- > > + * pgstat_report_backend_allocated_bytes_increase() - > > + * > > + * Called to report increase in memory allocated for this backend > > + * -------- > > + */ > > It seems like you could combine the > pgstat_report_backend_allocated_bytes_decrease/increase() by either > using a signed integer to represent the allocation/deallocation or passing in > a "direction" that is just a positive or negative multiplier enum. > > Especially if you don't use the write barriers, I think you could > simplify the logic in the two functions. > > If you do combine them, you might shorten the name to > pgstat_report_backend_allocation() or pgstat_report_allocation(). Agreed. This seems a cleaner, simpler way to go. I'll add it to the TODO list. > > + /* > > + * Do not allow backend_allocated_bytes to go below zero. > > ereport if we > > + * would have. There's no need for a lock around the read > > here as it's > > + * being referenced from the same backend which means that > > there shouldn't > > + * be concurrent writes. We want to generate an ereport in > > these cases. > > + */ > > + if (deallocation > beentry->backend_allocated_bytes) > > + { > > + ereport(LOG, errmsg("decrease reduces reported > > backend memory allocated below zero; setting reported to 0")); > > + > > I also think it would be nice to include the deallocation amount and > backend_allocated_bytes amount in the log message. > It also might be nice to start the message with something more clear > than "decrease". > For example, I would find this clear as a user: > > Backend [backend_type or pid] deallocated [deallocation number] bytes, > [backend_allocated_bytes - deallocation number] more than this backend > has reported allocating. Sounds good, I'll implement these changes > > diff --git a/src/backend/utils/adt/pgstatfuncs.c > > b/src/backend/utils/adt/pgstatfuncs.c > > index 96bffc0f2a..b6d135ad2f 100644 > > --- a/src/backend/utils/adt/pgstatfuncs.c > > +++ b/src/backend/utils/adt/pgstatfuncs.c > > @@ -553,7 +553,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS) > > Datum > > pg_stat_get_activity(PG_FUNCTION_ARGS) > > { > > -#define PG_STAT_GET_ACTIVITY_COLS 30 > > +#define PG_STAT_GET_ACTIVITY_COLS 31 > > int num_backends = > > > > + values[30] = UInt64GetDatum(beentry->backend_allocated_bytes); > > Though not the fault of this patch, it is becoming very difficult to > keep the columns straight in pg_stat_get_activity(). Perhaps you > could add a separate commit to add an enum for the columns so the function > is easier to understand. > > > diff --git a/src/include/utils/backend_status.h > > b/src/include/utils/backend_status.h > > index b582b46e9f..75d87e8308 100644 > > --- a/src/include/utils/backend_status.h > > +++ b/src/include/utils/backend_status.h > > @@ -169,6 +169,9 @@ typedef struct PgBackendStatus > > > > /* query identifier, optionally computed using > > post_parse_analyze_hook */ > > uint64 st_query_id; > > + > > + /* Current memory allocated to this backend */ > > + uint64 backend_allocated_bytes; > > } PgBackendStatus; > > I don't think you need the backend_ prefix here since it is in > PgBackendStatus. Agreed again, I'll remove the prefix. > > @@ -313,7 +316,9 @@ extern const char > > *pgstat_get_backend_current_activity(int pid, bool checkUser); > > extern const char *pgstat_get_crashed_backend_activity(int pid, > > char *buffer, > > > > int buflen); > > extern uint64 pgstat_get_my_query_id(void); > > - > > +extern void pgstat_report_backend_allocated_bytes_increase(uint64 > > allocation); > > +extern void pgstat_report_backend_allocated_bytes_decrease(uint64 > > deallocation); > > +extern uint64 pgstat_get_all_backend_memory_allocated(void); > > > > /* ---------- > > * Support functions for the SQL-callable functions to > > diff --git a/src/test/regress/expected/rules.out > > b/src/test/regress/expected/rules.out > > index 624d0e5aae..ba9f494806 100644 > > --- a/src/test/regress/expected/rules.out > > +++ b/src/test/regress/expected/rules.out > > @@ -1753,10 +1753,11 @@ pg_stat_activity| SELECT s.datid, > > s.state, > > s.backend_xid, > > s.backend_xmin, > > + s.backend_allocated_bytes, > > s.query_id, > > s.query, > > s.backend_type > > Seems like it would be possible to add a functional test to > stats.sql. I will look at adding this. > - Melanie > > [1] > https://www.postgresql.org/message-id/20221105024146.xxlbtsxh2niyz2fu%40awork3.anarazel.de Thanks again, Reid -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Hi, 2022-11-09 08:54:54 -0500, Reid Thompson wrote: > Thanks for looking at this and for the feedback. Responses inline below. > > > +void > > > +pgstat_report_backend_allocated_bytes_decrease(uint64 > > > deallocation) > > > +{ > > > + volatile PgBackendStatus *beentry = MyBEEntry; > > > + > > > + /* > > > + * Cases may occur where shared memory from a previous postmaster > > > + * invocation still exist. These are cleaned up at startup by > > > + * dsm_cleanup_using_control_segment. Limit decreasing memory allocated to > > > + * zero in case no corresponding prior increase exists or decrease has > > > + * already been accounted for. > > > + */ > > > > I don't really follow - postmaster won't ever have a backend status > > array, so how would they be tracked here? > > On startup, a check is made for leftover dsm control segments in the > DataDir. It appears possible that in certain situations on startup we > may find and destroy stale segments and thus decrement the allocation > variable. > > I based this off of: > /ipc/dsm.c > > dsm_postmaster_startup: > 150 dsm_postmaster_startup(PGShmemHeader *shim) > { > ... > 281 } I don't think we should account for memory allocations done in postmaster in this patch. They'll otherwise be counted again in each of the forked backends. As this cleanup happens during postmaster startup, we'll have to make sure accounting is reset during backend startup. Greetings, Andres Freund
Code rebased to current master. Updated to incorporate additional recommendations from the the list - add units to variables in view - remove 'backend_' prefix from variables/functions - update documentation - add functional test for allocated_bytes - refactor allocation reporting to reduce number of functions and branches/reduce performance hit - zero allocated bytes after fork to avoid double counting postmaster allocations -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
On Wed, 2022-11-09 at 09:23 -0500, Reid Thompson wrote: > Hi Melanie, > Thank you for looking at this and for the feedback. Responses inline > below. > > On Mon, 2022-11-07 at 16:17 -0500, Melanie Plageman wrote: > > > > It doesn't seem like you need the backend_ prefix in the view since > > that is implied by it being in pg_stat_activity. > > I will remove the prefix. done > > > For the wording on the description, I find "memory allocated to > > this > > backend" a bit confusing. Perhaps you could reword it to make clear > > you mean that the number represents the balance of allocations by > > this > > backend. Something like: > > > > Memory currently allocated to this backend in bytes. This > > is the > > balance of bytes allocated and freed by this backend. > > I would also link to the system administration function > > pg_size_pretty() so users know how to easily convert the value. > > Thanks, I'll make these changes done > > > +/* -------- > > > + * pgstat_report_backend_allocated_bytes_increase() - > > > + * > > > + * Called to report increase in memory allocated for this > > > backend > > > + * -------- > > > + */ > > > > It seems like you could combine the > > pgstat_report_backend_allocated_bytes_decrease/increase() by either > > using a signed integer to represent the allocation/deallocation or > > passing in > > a "direction" that is just a positive or negative multiplier enum. > > > > Especially if you don't use the write barriers, I think you could > > simplify the logic in the two functions. > > > > If you do combine them, you might shorten the name to > > pgstat_report_backend_allocation() or pgstat_report_allocation(). > > Agreed. This seems a cleaner, simpler way to go. I'll add it to the > TODO list. done > > > > + /* > > > + * Do not allow backend_allocated_bytes to go below zero. > > > ereport if we > > > + * would have. There's no need for a lock around the read > > > here as it's > > > + * being referenced from the same backend which means > > > that > > > there shouldn't > > > + * be concurrent writes. We want to generate an ereport > > > in > > > these cases. > > > + */ > > > + if (deallocation > beentry->backend_allocated_bytes) > > > + { > > > + ereport(LOG, errmsg("decrease reduces reported > > > backend memory allocated below zero; setting reported to 0")); > > > + > > > > I also think it would be nice to include the deallocation amount > > and > > backend_allocated_bytes amount in the log message. > > It also might be nice to start the message with something more > > clear > > than "decrease". > > For example, I would find this clear as a user: > > > > Backend [backend_type or pid] deallocated [deallocation > > number] bytes, > > [backend_allocated_bytes - deallocation number] more than > > this backend > > has reported allocating. > > Sounds good, I'll implement these changes done > > > diff --git a/src/include/utils/backend_status.h > > > b/src/include/utils/backend_status.h > > > index b582b46e9f..75d87e8308 100644 > > > --- a/src/include/utils/backend_status.h > > > +++ b/src/include/utils/backend_status.h > > > @@ -169,6 +169,9 @@ typedef struct PgBackendStatus > > > > > > /* query identifier, optionally computed using > > > post_parse_analyze_hook */ > > > uint64 st_query_id; > > > + > > > + /* Current memory allocated to this backend */ > > > + uint64 backend_allocated_bytes; > > > } PgBackendStatus; > > > > I don't think you need the backend_ prefix here since it is in > > PgBackendStatus. > > Agreed again, I'll remove the prefix. done > > > /* ---------- > > > * Support functions for the SQL-callable functions to > > > diff --git a/src/test/regress/expected/rules.out > > > b/src/test/regress/expected/rules.out > > > index 624d0e5aae..ba9f494806 100644 > > > --- a/src/test/regress/expected/rules.out > > > +++ b/src/test/regress/expected/rules.out > > > @@ -1753,10 +1753,11 @@ pg_stat_activity| SELECT s.datid, > > > s.state, > > > s.backend_xid, > > > s.backend_xmin, > > > + s.backend_allocated_bytes, > > > s.query_id, > > > s.query, > > > s.backend_type > > > > Seems like it would be possible to add a functional test to > > stats.sql. > > I will look at adding this. done patch attached to https://www.postgresql.org/message-id/06b4922193b80776a31e08a3809f2414b0d4bf90.camel%40crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
On Wed, 2022-11-09 at 08:54 -0500, Reid Thompson wrote: > Hi Andres, > Thanks for looking at this and for the feedback. Responses inline > below. > >> On Fri, 2022-11-04 at 19:41 -0700, Andres Freund wrote: > > Hi, > > > On 2022-11-04 08:56:13 -0400, Reid Thompson wrote: > > > > I'm not convinced that counting DSM values this way is quite right. > > There are a few uses of DSMs that track shared resources, with the biggest > > likely being the stats for relations etc. I suspect that tracking that via > > backend memory usage will be quite confusing, because fairly random backends that > > had to grow the shared state end up being blamed with the memory usage in > > perpituity - and then suddenly that memory usage will vanish when that backend exits, > > despite the memory continuing to exist. > > Ok, I'll make an attempt to identify these allocations and manage > them elsewhere. still TBD > > > > > > > @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size > > > size) > > > return NULL; > > > > > > context->mem_allocated += blksize; > > > + pgstat_report_backend_allocated_bytes_increase(bl > > > ksize); > > > > I suspect this will be noticable cost-wise. Even though these paths aren't the > > hottest memory allocation paths, by nature of going down into malloc, adding > > an external function call that then does a bunch of branching etc. seems > > likely to add up to some. See below for how I think we can deal with that... > > > > This is quite a few branches, including write/read barriers. > > > > It doesn't really make sense to use the > > PGSTAT_BEGIN_WRITE_ACTIVITY() pattern > > here - you're just updating a single value, there's nothing to be gained by > > it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a > > consistent view of multiple values - but there aren't multiple values > > here! > > I'll remove the barriers - initially I copied how prior functions were barriers removed > > > > To avoid the overhead of checking (!beentry || !pgstat_track_activities) and > > the external function call, I think you'd be best off copying the trickery I > > introduced for pgstat_report_wait_start(), in 225a22b19ed. > > > > I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline > > function that unconditionally updates something like > > *my_backend_allocated_memory. To deal with the case of (!beentry || > > !pgstat_track_activities), that variable initially points to some backend > > local state and is set to the shared state in pgstat_bestart(). > > > > This additionally has the nice benefit that you can track memory usage from > > before pgstat_bestart(), it'll be in the local variable. > > OK, I think I can mimic the code you reference. done patch attached to https://www.postgresql.org/message-id/06b4922193b80776a31e08a3809f2414b0d4bf90.camel%40crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
On Sat, 2022-11-26 at 22:10 -0500, Reid Thompson wrote: > Code rebased to current master. > Updated to incorporate additional recommendations from the the list > - add units to variables in view > - remove 'backend_' prefix from variables/functions > - update documentation > - add functional test for allocated_bytes > - refactor allocation reporting to reduce number of functions and > branches/reduce performance hit > - zero allocated bytes after fork to avoid double counting > postmaster allocations > > > > attempt to remedy cfbot windows build issues -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
On Sun, Nov 27, 2022 at 12:32:19AM -0500, Reid Thompson wrote: > attempt to remedy cfbot windows build issues You can trigger those tests under your own/private repo by pushing a branch to github. See src/tools/ci/README I suppose the cfbot page ought to point that out. -- Justin
On Sat, Nov 26, 2022 at 9:32 PM Reid Thompson <reid.thompson@crunchydata.com> wrote:
On Sat, 2022-11-26 at 22:10 -0500, Reid Thompson wrote:
> Code rebased to current master.
> Updated to incorporate additional recommendations from the the list
> - add units to variables in view
> - remove 'backend_' prefix from variables/functions
> - update documentation
> - add functional test for allocated_bytes
> - refactor allocation reporting to reduce number of functions and
> branches/reduce performance hit
> - zero allocated bytes after fork to avoid double counting
> postmaster allocations
>
>
>
>
attempt to remedy cfbot windows build issues
Hi,
+ {
+ pgstat_report_allocated_bytes(*mapped_size, DECREASE);
+ pgstat_report_allocated_bytes(request_size, INCREASE);
pgstat_report_allocated_bytes is called twice for this case. Can the two calls be combined into one (with request_size - *mapped_size, INCREASE) ?
Cheers
On Sun, Nov 27, 2022 at 12:32:19AM -0500, Reid Thompson wrote: > @@ -32,6 +33,12 @@ typedef enum BackendState > STATE_DISABLED > } BackendState; > > +/* Enum helper for reporting memory allocated bytes */ > +enum allocation_direction > +{ > + DECREASE = -1, > + INCREASE = 1, > +}; BTW, these should have some kind of prefix, like PG_ALLOC_* to avoid causing the same kind of problem for someone else that another header caused for you by defining something somewhere called IGNORE (ignore what, I don't know). The other problem was probably due to a define, though. Maybe instead of an enum, the function should take a boolean. I still wonder whether there needs to be a separate CF entry for the 0001 patch. One issue is that there's two different lists of people involved in the threads. -- Justin
On Sun, Nov 27, 2022 at 7:41 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sun, Nov 27, 2022 at 12:32:19AM -0500, Reid Thompson wrote:
> @@ -32,6 +33,12 @@ typedef enum BackendState
> STATE_DISABLED
> } BackendState;
>
> +/* Enum helper for reporting memory allocated bytes */
> +enum allocation_direction
> +{
> + DECREASE = -1,
> + INCREASE = 1,
> +};
BTW, these should have some kind of prefix, like PG_ALLOC_* to avoid
causing the same kind of problem for someone else that another header
caused for you by defining something somewhere called IGNORE (ignore
what, I don't know). The other problem was probably due to a define,
though. Maybe instead of an enum, the function should take a boolean.
I still wonder whether there needs to be a separate CF entry for the
0001 patch. One issue is that there's two different lists of people
involved in the threads.
--
Justin
I am a bit curious: why is the allocation_direction enum needed ?
pgstat_report_allocated_bytes() can be given the amount (either negative or positive) to adjust directly.
Cheers
On 2022-11-26 22:10:06 -0500, Reid Thompson wrote: > - zero allocated bytes after fork to avoid double counting postmaster allocations I still don't understand this - postmaster shouldn't be counted at all. It doesn't have a PgBackendStatus. There simply shouldn't be any tracked allocations from it. Greetings, Andres Freund
On Mon, 2022-11-28 at 10:59 -0800, Andres Freund wrote: > On 2022-11-26 22:10:06 -0500, Reid Thompson wrote: > > - zero allocated bytes after fork to avoid double counting > > postmaster allocations > > I still don't understand this - postmaster shouldn't be counted at > all. It > doesn't have a PgBackendStatus. There simply shouldn't be any tracked > allocations from it. > > Greetings, > > Andres Freund Hi Andres, I based this on the following. It appears to me that Postmaster populates the local variable that *my_allocated_bytes points to. That allocation is passed to forked children, and if not zeroed out, will be double counted as part of the child allocation. Is this invalid? $ ps -ef|grep postgres postgres 6389 1 0 Dec01 ? 00:00:17 /usr/sbin/pgbouncer -d /etc/pgbouncer/pgbouncer.ini rthompso 2937799 1 0 09:45 ? 00:00:00 /tmp/postgres/install/pg-stats-memory/bin/postgres -D /var/tmp/pg-stats-memory-p 5433 rthompso 2937812 2937799 0 09:45 ? 00:00:00 postgres: checkpointer rthompso 2937813 2937799 0 09:45 ? 00:00:00 postgres: background writer rthompso 2937816 2937799 0 09:45 ? 00:00:00 postgres: walwriter rthompso 2937817 2937799 0 09:45 ? 00:00:00 postgres: autovacuum launcher rthompso 2937818 2937799 0 09:45 ? 00:00:00 postgres: logical replication launcher rthompso 2938877 2636586 0 09:46 pts/4 00:00:00 /usr/lib/postgresql/12/bin/psql -h localhost -p 5433 postgres rthompso 2938909 2937799 0 09:46 ? 00:00:00 postgres: rthompso postgres 127.0.0.1(44532) idle rthompso 2942164 1987403 0 09:49 pts/3 00:00:00 grep postgres Bracketing fork_process() calls with logging to print *my_allocated_bytes immediately prior and after fork_process... To me, this indicates that the forked children for this invocation (checkpointer, walwriter, autovac launcher, client backend, autovac worker, etc) are inheriting 240672 bytes from postmaster. $ ccat logfile 2022-12-02 09:45:05.871 EST [2937799] LOG: starting PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)9.4.0, 64-bit 2022-12-02 09:45:05.872 EST [2937799] LOG: listening on IPv4 address "127.0.0.1", port 5433 2022-12-02 09:45:05.874 EST [2937799] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433" parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672 parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672 parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672 child StartChildProcess process. pid: 2937812 *my_allocated_bytes: 240672 child StartChildProcess process. pid: 2937813 *my_allocated_bytes: 240672 child StartChildProcess process. pid: 2937814 *my_allocated_bytes: 240672 2022-12-02 09:45:05.884 EST [2937814] LOG: database system was shut down at 2022-12-02 09:41:13 EST parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672 parent StartAutoVacLauncher process. pid: 2937799 *my_allocated_bytes: 240672 child StartChildProcess process. pid: 2937816 *my_allocated_bytes: 240672 parent do_start_bgworker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacLauncher process. pid: 2937817 *my_allocated_bytes: 240672 2022-12-02 09:45:05.889 EST [2937799] LOG: database system is ready to accept connections child do_start_bgworker process. pid: 2937818 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2938417 *my_allocated_bytes: 240672 parent BackendStartup process. pid: 2937799 *my_allocated_bytes: 240672 child BackendStartup process. pid: 2938909 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2938910 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2939340 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2939665 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2940038 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2940364 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2940698 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2941317 *my_allocated_bytes: 240672 parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672 child StartAutoVacWorker process. pid: 2941825 *my_allocated_bytes: 240672 Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Hi, On 2022-12-02 11:09:30 -0500, Reid Thompson wrote: > It appears to me that Postmaster populates the local variable that > *my_allocated_bytes points to. That allocation is passed to forked > children, and if not zeroed out, will be double counted as part of > the child allocation. Is this invalid? I don't think we should count allocations made before backend_status.c has been initialized. Greetings, Andres Freund
On Fri, 2022-12-02 at 09:18 -0800, Andres Freund wrote: > Hi, > > On 2022-12-02 11:09:30 -0500, Reid Thompson wrote: > > It appears to me that Postmaster populates the local variable that > > *my_allocated_bytes points to. That allocation is passed to forked > > children, and if not zeroed out, will be double counted as part of > > the child allocation. Is this invalid? > > I don't think we should count allocations made before > backend_status.c has > been initialized. > > Greetings, > > Andres Freund Hi, The intent was to capture and display all the memory allocated to the backends, for admins/users/max_total_backend_memory/others to utilize. Why should we ignore the allocations prior to backend_status.c? Thanks, Reid -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Hi, On 2022-12-06 08:47:55 -0500, Reid Thompson wrote: > The intent was to capture and display all the memory allocated to the > backends, for admins/users/max_total_backend_memory/others to utilize. It's going to be far less accurate than that. For one, there's a lot of operating system resources, like the page table, that are going to be ignored. We're also not going to capture allocations done directly via malloc(). There's also copy-on-write effects that we're ignoring. If we just want to show an accurate picture of the current memory usage, we need to go to operating system specific interfaces. > Why should we ignore the allocations prior to backend_status.c? It seems to add complexity without a real increase in accuracy to me. But I'm not going to push harder on it than I already have. Greetings, Andres Freund
On Sun, 2022-11-27 at 09:40 -0600, Justin Pryzby wrote: > > BTW, these should have some kind of prefix, like PG_ALLOC_* to > > avoid causing the same kind of problem for someone else that > > another header caused for you by defining something somewhere > > called IGNORE (ignore what, I don't know). The other problem was > > probably due to a define, though. Maybe instead of an enum, the > > function should take a boolean. > > Patch updated to current master and includes above prefix recommendation and combining of two function calls to one recommended by Ted Yu. > > > > I still wonder whether there needs to be a separate CF entry for > > the 0001 patch. One issue is that there's two different lists of > > people involved in the threads. > > I'm OK with containing the conversation to one thread if everyone else is. If there's no argument against, then patches after today will go to the "Add the ability to limit the amount of memory that can be allocated to backends" thread https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Attachment
On Thu, 8 Dec 2022 at 19:44, Reid Thompson <reid.thompson@crunchydata.com> wrote: > > On Sun, 2022-11-27 at 09:40 -0600, Justin Pryzby wrote: > > > BTW, these should have some kind of prefix, like PG_ALLOC_* to > > > avoid causing the same kind of problem for someone else that > > > another header caused for you by defining something somewhere > > > called IGNORE (ignore what, I don't know). The other problem was > > > probably due to a define, though. Maybe instead of an enum, the > > > function should take a boolean. > > > > > Patch updated to current master and includes above prefix > recommendation and combining of two function calls to one recommended > by Ted Yu. > > > > > > > I still wonder whether there needs to be a separate CF entry for > > > the 0001 patch. One issue is that there's two different lists of > > > people involved in the threads. > > > > > I'm OK with containing the conversation to one thread if everyone else > is. If there's no argument against, then patches after today will go > to the "Add the ability to limit the amount of memory that can be > allocated to backends" thread > https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID e351f85418313e97c203c73181757a007dfda6d0 === === applying patch ./0001-Add-tracking-of-backend-memory-allocated-to-pg_stat_.patch patching file src/backend/utils/mmgr/slab.c Hunk #1 succeeded at 69 (offset 16 lines). Hunk #2 succeeded at 414 (offset 175 lines). Hunk #3 succeeded at 436 with fuzz 2 (offset 176 lines). Hunk #4 FAILED at 286. Hunk #5 succeeded at 488 (offset 186 lines). Hunk #6 FAILED at 381. Hunk #7 FAILED at 554. 3 out of 7 hunks FAILED -- saving rejects to file src/backend/utils/mmgr/slab.c.rej [1] - http://cfbot.cputube.org/patch_41_3865.log Regards, Vignesh
On Tue, 2023-01-03 at 16:25 +0530, vignesh C wrote: > ... > The patch does not apply on top of HEAD as in [1], please post a > rebased patch: >... > Regards, > Vignesh Per conversation in thread listed below, patches have been submitted to the "Add the ability to limit the amount of memorythat can be allocated to backends" thread https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com 0001-Add-tracking-of-backend-memory-allocated-to-pg_stat_.patch 0002-Add-the-ability-to-limit-the-amount-of-memory-that-c.patch On Thu, 8 Dec 2022 at 19:44, Reid Thompson <reid(dot)thompson(at)crunchydata(dot)com> wrote: > > On Sun, 2022-11-27 at 09:40 -0600, Justin Pryzby wrote: > > > ... > > > I still wonder whether there needs to be a separate CF entry for > > > the 0001 patch. One issue is that there's two different lists of > > > people involved in the threads. > > > > > I'm OK with containing the conversation to one thread if everyone else > is. If there's no argument against, then patches after today will go > to the "Add the ability to limit the amount of memory that can be > allocated to backends" thread > https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
On Thu, Jan 05, 2023 at 01:58:33PM -0500, Reid Thompson wrote: > On Tue, 2023-01-03 at 16:25 +0530, vignesh C wrote: > > ... > > The patch does not apply on top of HEAD as in [1], please post a > > rebased patch: > >... > > Regards, > > Vignesh > > Per conversation in thread listed below, patches have been submitted to the "Add the ability to limit the amount of memorythat can be allocated to backends" thread > https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com I suggest to close the associated CF entry. (Also, the people who participated in this thread may want to be included in the other thread going forward.) > 0001-Add-tracking-of-backend-memory-allocated-to-pg_stat_.patch > 0002-Add-the-ability-to-limit-the-amount-of-memory-that-c.patch > > On Thu, 8 Dec 2022 at 19:44, Reid Thompson > <reid(dot)thompson(at)crunchydata(dot)com> wrote: > > > > On Sun, 2022-11-27 at 09:40 -0600, Justin Pryzby wrote: > > > > ... > > > > I still wonder whether there needs to be a separate CF entry for > > > > the 0001 patch. One issue is that there's two different lists of > > > > people involved in the threads. > > > > > > > > I'm OK with containing the conversation to one thread if everyone else > > is. If there's no argument against, then patches after today will go > > to the "Add the ability to limit the amount of memory that can be > > allocated to backends" thread > > https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com
On Thu, 2023-01-05 at 14:13 -0600, Justin Pryzby wrote: > > I suggest to close the associated CF entry. Closed with status of Withdrawn. If that is invalid, please advise. > (Also, the people who participated in this thread may want to be > included in the other thread going forward.) Explicitly adding previously missed participants to Cc: - that conversation/patches are being consolidated to the thread "Add the ability to limit the amount of memory that can be allocated to backends" https://www.postgresql.org/message-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com Thanks, Reid -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thompson@crunchydata.com www.crunchydata.com
Here is an updated version of the earlier work. This version: 1) Tracks memory as requested by the backend. 2) Includes allocations made during program startup. 3) Optimizes the "fast path" to only update two local variables. 4) Places a cluster wide limit on total memory allocated. The cluster wide limit is useful for multi-hosting. One greedy cluster doesn't starve the other clusters of memory. Note there isn't a good way to track actual memory used by a cluster. Ideally, we like to get the working set size of each memory segment along with the size of the associated kernel data structures. Gathering that info in a portable way is a "can of worms". Instead, we're managing memory as requested by the application. While not identical, the two approaches are strongly correlated. The memory model used is 1) Each process is assumed to use a certain amount of memory simply by existing. 2) All pg memory allocations are counted, including those before the process is fully initialized. 3) Each process maintains its own local counters. These are the "truth". 4) Periodically, - local counters are added into the global, shared memory counters. - pgstats is updated - total memory is checked. For efficiency, the global total is an approximation, not a precise number. It can be off by as much as 1 MB per process. Memory limiting doesn't need precision, just a consistent and reasonable approximation. Repeating the earlier benchmark test, there is no measurable loss of performance.
Attachment
On Thu, Aug 31, 2023 at 9:19 AM John Morris <john.morris@crunchydata.com> wrote:
Here is an updated version of the earlier work.
This version:
1) Tracks memory as requested by the backend.
2) Includes allocations made during program startup.
3) Optimizes the "fast path" to only update two local variables.
4) Places a cluster wide limit on total memory allocated.
The cluster wide limit is useful for multi-hosting. One greedy cluster doesn't starve
the other clusters of memory.
Note there isn't a good way to track actual memory used by a cluster.
Ideally, we like to get the working set size of each memory segment along with
the size of the associated kernel data structures.
Gathering that info in a portable way is a "can of worms".
Instead, we're managing memory as requested by the application.
While not identical, the two approaches are strongly correlated.
The memory model used is
1) Each process is assumed to use a certain amount of memory
simply by existing.
2) All pg memory allocations are counted, including those before
the process is fully initialized.
3) Each process maintains its own local counters. These are the "truth".
4) Periodically,
- local counters are added into the global, shared memory counters.
- pgstats is updated
- total memory is checked.
For efficiency, the global total is an approximation, not a precise number.
It can be off by as much as 1 MB per process. Memory limiting
doesn't need precision, just a consistent and reasonable approximation.
Repeating the earlier benchmark test, there is no measurable loss of performance.
Hi,
In `InitProcGlobal`:
+ elog(WARNING, "proc init: max_total=%d result=%d\n", max_total_bkend_mem, result);
Is the above log for debugging purposes ? Maybe the log level should be changed.
+ errmsg("max_total_backend_memory %dMB - shared_memory_size %dMB is <= 100MB",
The `<=` in the error message is inconsistent with the `max_total_bkend_mem < result + 100` check.
Please modify one of them.
For update_global_allocation :
+ Assert((int64)pg_atomic_read_u64(&ProcGlobal->total_bkend_mem_bytes) >= 0);
+ Assert((int64)pg_atomic_read_u64(&ProcGlobal->global_dsm_allocation) >= 0);
+ Assert((int64)pg_atomic_read_u64(&ProcGlobal->global_dsm_allocation) >= 0);
Should the assertions be done earlier in the function?
For reserve_backend_memory:
+ return true;
+
+ /* CASE: the new allocation is within bounds. Take the fast path. */
+ else if (my_memory.allocated_bytes + size <= allocation_upper_bound)
+
+ /* CASE: the new allocation is within bounds. Take the fast path. */
+ else if (my_memory.allocated_bytes + size <= allocation_upper_bound)
The `else` can be omitted (the preceding if block returns).
For `atomic_add_within_bounds_i64`
+ newval = oldval + add;
+
+ /* check if we are out of bounds */
+ if (newval < lower_bound || newval > upper_bound || addition_overflow(oldval, add))
+
+ /* check if we are out of bounds */
+ if (newval < lower_bound || newval > upper_bound || addition_overflow(oldval, add))
Since the summation is stored in `newval`, you can pass newval to `addition_overflow` so that `addition_overflow` doesn't add them again.
There are debug statements, such as:
+ debug("done\n");
you can drop them in the next patch.
Thanks