Re: Enhancing Memory Context Statistics Reporting - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Enhancing Memory Context Statistics Reporting
Date
Msg-id 5bxhxniyvjyfldi7yjxcnxkl3i2ghci2grjyeclrbkfqnyhowk@dfkqzbvtbpml
Whole thread Raw
In response to Re: Enhancing Memory Context Statistics Reporting  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Enhancing Memory Context Statistics Reporting
Re: Enhancing Memory Context Statistics Reporting
List pgsql-hackers
Hi,

On 2025-04-07 15:41:37 +0200, Daniel Gustafsson wrote:
> I think this function can be a valuable debugging aid going forward.

What I am most excited about for this is to be able to measure server-wide and
fleet-wide memory usage over time. Today I have actually very little idea
about what memory is being used for across all connections, not to speak of a
larger number of servers.


> diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
> index 4f6795f7265..d3b4df27935 100644
> --- a/src/backend/postmaster/auxprocess.c
> +++ b/src/backend/postmaster/auxprocess.c
> @@ -84,6 +84,13 @@ AuxiliaryProcessMainCommon(void)
>      /* register a before-shutdown callback for LWLock cleanup */
>      before_shmem_exit(ShutdownAuxiliaryProcess, 0);
>  
> +    /*
> +     * The before shmem exit callback frees the DSA memory occupied by the
> +     * latest memory context statistics that could be published by this aux
> +     * proc if requested.
> +     */
> +    before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
> +
>      SetProcessingMode(NormalProcessing);
>  }

How about putting it into BaseInit()?  Or maybe just register it when its
first used?


> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> index fda91ffd1ce..d3cb3f1891c 100644
> --- a/src/backend/postmaster/checkpointer.c
> +++ b/src/backend/postmaster/checkpointer.c
> @@ -663,6 +663,10 @@ ProcessCheckpointerInterrupts(void)
>      /* Perform logging of memory contexts of this process */
>      if (LogMemoryContextPending)
>          ProcessLogMemoryContextInterrupt();
> +
> +    /* Publish memory contexts of this process */
> +    if (PublishMemoryContextPending)
> +        ProcessGetMemoryContextInterrupt();
>  }
>  
>  /*

Not this patch's responsibility, but we really desperately need to unify our
interrupt handling.  Manually keeping a ~dozen of functions similar, but not
exactly the same, is an insane approach.


> --- a/src/backend/utils/activity/wait_event_names.txt
> +++ b/src/backend/utils/activity/wait_event_names.txt
> @@ -161,6 +161,7 @@ WAL_RECEIVER_EXIT    "Waiting for the WAL receiver to exit."
>  WAL_RECEIVER_WAIT_START    "Waiting for startup process to send initial data for streaming replication."
>  WAL_SUMMARY_READY    "Waiting for a new WAL summary to be generated."
>  XACT_GROUP_UPDATE    "Waiting for the group leader to update transaction status at transaction end."
> +MEM_CTX_PUBLISH    "Waiting for a process to publish memory information."

The memory context stuff abbreviates as cxt not ctx.  There's a few more cases
of that in the patch.


> +const char *
> +ContextTypeToString(NodeTag type)
> +{
> +    const char *context_type;
> +
> +    switch (type)
> +    {
> +        case T_AllocSetContext:
> +            context_type = "AllocSet";
> +            break;
> +        case T_GenerationContext:
> +            context_type = "Generation";
> +            break;
> +        case T_SlabContext:
> +            context_type = "Slab";
> +            break;
> +        case T_BumpContext:
> +            context_type = "Bump";
> +            break;
> +        default:
> +            context_type = "???";
> +            break;
> +    }
> +    return (context_type);

Why these parens?



> + * If the publishing backend does not respond before the condition variable
> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry given that there is
> + * time left within the timeout specified by the user, before giving up and
> + * returning previously published statistics, if any. If no previous statistics
> + * exist, return NULL.

Why do we need to repeatedly wake up rather than just sleeping with the
"remaining" amount of time based on the time the function was called and the
time that has passed since?


> +    /*
> +     * A valid DSA pointer isn't proof that statistics are available, it can
> +     * be valid due to previously published stats.

Somehow "valid DSA pointer" is a bit too much about the precise mechanics and
not enough about what's actually happening. I'd rather say something like

"Even if the proc has published statistics, they may not be due to the current
request, but previously published stats."


> +        if (ConditionVariableTimedSleep(&memCtxState[procNumber].memctx_cv,
> +                                        MEMSTATS_WAIT_TIMEOUT,
> +                                        WAIT_EVENT_MEM_CTX_PUBLISH))
> +        {
> +            timer += MEMSTATS_WAIT_TIMEOUT;
> +
> +            /*
> +             * Wait for the timeout as defined by the user. If no updated
> +             * statistics are available within the allowed time then display
> +             * previously published statistics if there are any. If no
> +             * previous statistics are available then return NULL.  The timer
> +             * is defined in milliseconds since thats what the condition
> +             * variable sleep uses.
> +             */
> +            if ((timer * 1000) >= timeout)
> +            {

I'd suggest just comparing how much time has elapsed since the timestamp
you've requested earlier.


> +                LWLockAcquire(&memCtxState[procNumber].lw_lock, LW_EXCLUSIVE);
> +                /* Displaying previously published statistics if available */
> +                if (DsaPointerIsValid(memCtxState[procNumber].memstats_dsa_pointer))
> +                    break;
> +                else
> +                {
> +                    LWLockRelease(&memCtxState[procNumber].lw_lock);
> +                    PG_RETURN_NULL();
> +                }
> +            }
> +        }
> +    }


> +/*
> + * Initialize shared memory for displaying memory context statistics
> + */
> +void
> +MemoryContextReportingShmemInit(void)
> +{
> +    bool        found;
> +
> +    memCtxArea = (MemoryContextState *)
> +        ShmemInitStruct("MemoryContextState", sizeof(MemoryContextState), &found);
> +
> +    if (!IsUnderPostmaster)
> +    {
> +        Assert(!found);

I don't really understand why this uses IsUnderPostmaster?  Seems like this
should just use found like most (or all) the other *ShmemInit() functions do?


> +        LWLockInitialize(&memCtxArea->lw_lock, LWLockNewTrancheId());

I think for builtin code we just hardcode the tranches in BuiltinTrancheIds.



> +    memCtxState = (MemoryContextBackendState *)
> +        ShmemInitStruct("MemoryContextBackendState",
> +                        ((MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(MemoryContextBackendState)),
> +                        &found);

FWIW, I think it'd be mildly better if these two ShmemInitStruct()'s were
combined.


>  static void
>  MemoryContextStatsInternal(MemoryContext context, int level,
>                             int max_level, int max_children,
>                             MemoryContextCounters *totals,
> -                           bool print_to_stderr)
> +                           PrintDestination print_location, int *num_contexts)
>  {
>      MemoryContext child;
>      int            ichild;
> @@ -884,10 +923,39 @@ MemoryContextStatsInternal(MemoryContext context, int level,
>      Assert(MemoryContextIsValid(context));
>  
>      /* Examine the context itself */
> -    context->methods->stats(context,
> -                            MemoryContextStatsPrint,
> -                            &level,
> -                            totals, print_to_stderr);
> +    switch (print_location)
> +    {
> +        case PRINT_STATS_TO_STDERR:
> +            context->methods->stats(context,
> +                                    MemoryContextStatsPrint,
> +                                    &level,
> +                                    totals, true);
> +            break;
> +
> +        case PRINT_STATS_TO_LOGS:
> +            context->methods->stats(context,
> +                                    MemoryContextStatsPrint,
> +                                    &level,
> +                                    totals, false);
> +            break;
> +
> +        case PRINT_STATS_NONE:
> +
> +            /*
> +             * Do not print the statistics if print_location is
> +             * PRINT_STATS_NONE, only compute totals. This is used in
> +             * reporting of memory context statistics via a sql function. Last
> +             * parameter is not relevant.
> +             */
> +            context->methods->stats(context,
> +                                    NULL,
> +                                    NULL,
> +                                    totals, false);
> +            break;
> +    }
> +
> +    /* Increment the context count for each of the recursive call */
> +    *num_contexts = *num_contexts + 1;

It feels a bit silly to duplicate the call to context->methods->stats three
times. We've changed these parameters a bunch in the past, having more callers
to fix makes that more work. Can't the switch just set up the args that are
then passed to one call to context->methods->stats?


> +
> +    /* Compute the number of stats that can fit in the defined limit */
> +    max_stats = (MAX_SEGMENTS_PER_BACKEND * DSA_DEFAULT_INIT_SEGMENT_SIZE)
> +        / (MAX_MEMORY_CONTEXT_STATS_SIZE);

MAX_SEGMENTS_PER_BACKEND sounds way too generic to me for something defined in
memutils.h.  I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
something that makes sense to use here?

The header says:

> +/* Maximum size (in Mb) of DSA area per process */
> +#define MAX_SEGMENTS_PER_BACKEND 1

But the name doesn't at all indicate it's in megabytes. Nor does the way it's
used clearly indicate that. That seems to be completely incidental based on
the current default value DSA_DEFAULT_INIT_SEGMENT_SIZE.


> +    /*
> +     * Hold the process lock to protect writes to process specific memory. Two
> +     * processes publishing statistics do not block each other.
> +     */

s/specific/process specific/


> +    LWLockAcquire(&memCtxState[idx].lw_lock, LW_EXCLUSIVE);
> +    memCtxState[idx].proc_id = MyProcPid;
> +
> +    if (DsaPointerIsValid(memCtxState[idx].memstats_dsa_pointer))
> +    {
> +        /*
> +         * Free any previous allocations, free the name, ident and path
> +         * pointers before freeing the pointer that contains them.
> +         */
> +        free_memorycontextstate_dsa(area, memCtxState[idx].total_stats,
> +                                    memCtxState[idx].memstats_dsa_pointer);
> +
> +        dsa_free(area, memCtxState[idx].memstats_dsa_pointer);
> +        memCtxState[idx].memstats_dsa_pointer = InvalidDsaPointer;

Both callers to free_memorycontextstate_dsa() do these lines immediately after
calling free_memorycontextstate_dsa(), why not do that inside?



> +        for (MemoryContext c = TopMemoryContext->firstchild; c != NULL;
> +             c = c->nextchild)
> +        {
> +            MemoryContextCounters grand_totals;
> +            int            num_contexts = 0;
> +            int            level = 0;
> +
> +            path = NIL;
> +            memset(&grand_totals, 0, sizeof(grand_totals));
> +
> +            MemoryContextStatsInternal(c, level, 100, 100, &grand_totals,
> +                                       PRINT_STATS_NONE, &num_contexts);
> +
> +            path = compute_context_path(c, context_id_lookup);
> +
> +            PublishMemoryContext(meminfo, ctx_id, c, path,
> +                                 grand_totals, num_contexts, area, 100);
> +            ctx_id = ctx_id + 1;
> +        }
> +        memCtxState[idx].total_stats = ctx_id;
> +        /* Notify waiting backends and return */
> +        hash_destroy(context_id_lookup);
> +        dsa_detach(area);
> +        signal_memorycontext_reporting();
> +    }
> +
> +    foreach_ptr(MemoryContextData, cur, contexts)
> +    {
> +        List       *path = NIL;
> +
> +        /*
> +         * Figure out the transient context_id of this context and each of its
> +         * ancestors, to compute a path for this context.
> +         */
> +        path = compute_context_path(cur, context_id_lookup);
> +
> +        /* Account for saving one statistics slot for cumulative reporting */
> +        if (context_id < (max_stats - 1) || stats_count <= max_stats)
> +        {
> +            /* Examine the context stats */
> +            memset(&stat, 0, sizeof(stat));
> +            (*cur->methods->stats) (cur, NULL, NULL, &stat, true);

Hm. So here we call the callback ourselves, even though we extended
MemoryContextStatsInternal() to satisfy the summary output.  I guess it's
tolerable, but it's not great.


> +            /* Copy statistics to DSA memory */
> +            PublishMemoryContext(meminfo, context_id, cur, path, stat, 1, area, 100);
> +        }
> +        else
> +        {
> +            /* Examine the context stats */
> +            memset(&stat, 0, sizeof(stat));
> +            (*cur->methods->stats) (cur, NULL, NULL, &stat, true);

But do we really do it twice in a row?  The lines are exactly the same, so it
seems that should just be done before the if?

> +
> +    /* Notify waiting backends and return */
> +    hash_destroy(context_id_lookup);
> +    dsa_detach(area);
> +    signal_memorycontext_reporting();
> +}
> +
> +/*
> + * Signal all the waiting client backends after copying all the statistics.
> + */
> +static void
> +signal_memorycontext_reporting(void)
> +{
> +    memCtxState[MyProcNumber].stats_timestamp = GetCurrentTimestamp();
> +    LWLockRelease(&memCtxState[MyProcNumber].lw_lock);
> +    ConditionVariableBroadcast(&memCtxState[MyProcNumber].memctx_cv);
> +}

IMO somewhat confusing to release the lock in a function named
signal_memorycontext_reporting().  Why do we do that after
hash_destroy()/dsa_detach()?



> +static void
> +compute_contexts_count_and_ids(List *contexts, HTAB *context_id_lookup,
> +                               int *stats_count, bool summary)
> +{
> +    foreach_ptr(MemoryContextData, cur, contexts)
> +    {
> +        MemoryContextId *entry;
> +        bool        found;
> +
> +        entry = (MemoryContextId *) hash_search(context_id_lookup, &cur,
> +                                                HASH_ENTER, &found);
> +        Assert(!found);
> +
> +        /* context id starts with 1 */
> +        entry->context_id = ++(*stats_count);

Given that we don't actually do anything here relating to starting with 1, I
find that comment confusing.


> +static void
> +PublishMemoryContext(MemoryContextStatsEntry *memctx_info, int curr_id,
> +                     MemoryContext context, List *path,
> +                     MemoryContextCounters stat, int num_contexts,
> +                     dsa_area *area, int max_levels)
> +{
> +    const char *ident = context->ident;
> +    const char *name = context->name;
> +    int           *path_list;
> +
> +    /*
> +     * To be consistent with logging output, we label dynahash contexts with
> +     * just the hash table name as with MemoryContextStatsPrint().
> +     */
> +    if (context->ident && strncmp(context->name, "dynahash", 8) == 0)
> +    {
> +        name = context->ident;
> +        ident = NULL;
> +    }
> +
> +    if (name != NULL)
> +    {
> +        int            namelen = strlen(name);
> +        char       *nameptr;
> +
> +        if (strlen(name) >= MEMORY_CONTEXT_IDENT_SHMEM_SIZE)
> +            namelen = pg_mbcliplen(name, namelen,
> +                                   MEMORY_CONTEXT_IDENT_SHMEM_SIZE - 1);
> +
> +        memctx_info[curr_id].name = dsa_allocate0(area, namelen + 1);

Given the number of references to memctx_info[curr_id] I'd put it in a local variable.

Why is this a dsa_allocate0 given that we're immediately overwriting it?


> +        nameptr = (char *) dsa_get_address(area, memctx_info[curr_id].name);
> +        strlcpy(nameptr, name, namelen + 1);
> +    }
> +    else
> +        memctx_info[curr_id].name = InvalidDsaPointer;
> +
> +    /* Trim and copy the identifier if it is not set to NULL */
> +    if (ident != NULL)
> +    {
> +        int            idlen = strlen(context->ident);
> +        char       *identptr;
> +
> +        /*
> +         * Some identifiers such as SQL query string can be very long,
> +         * truncate oversize identifiers.
> +         */
> +        if (idlen >= MEMORY_CONTEXT_IDENT_SHMEM_SIZE)
> +            idlen = pg_mbcliplen(ident, idlen,
> +                                 MEMORY_CONTEXT_IDENT_SHMEM_SIZE - 1);
> +
> +        memctx_info[curr_id].ident = dsa_allocate0(area, idlen + 1);
> +        identptr = (char *) dsa_get_address(area, memctx_info[curr_id].ident);
> +        strlcpy(identptr, ident, idlen + 1);

Hm. First I thought we'd leak memory if this second (and subsequent)
dsa_allocate failed. Then I thought we'd be ok, because the memory would be
memory because it'd be reachable from memCtxState[idx].memstats_dsa_pointer.

But I think it wouldn't *quite* work, because memCtxState[idx].total_stats is
only set *after* we would have failed.




> +    /* Allocate DSA memory for storing path information */
> +    if (path == NIL)
> +        memctx_info[curr_id].path = InvalidDsaPointer;
> +    else
> +    {
> +        int            levels = Min(list_length(path), max_levels);
> +
> +        memctx_info[curr_id].path_length = levels;
> +        memctx_info[curr_id].path = dsa_allocate0(area, levels * sizeof(int));
> +        memctx_info[curr_id].levels = list_length(path);
> +        path_list = (int *) dsa_get_address(area, memctx_info[curr_id].path);
> +
> +        foreach_int(i, path)
> +        {
> +            path_list[foreach_current_index(i)] = i;
> +            if (--levels == 0)
> +                break;
> +        }
> +    }
> +    memctx_info[curr_id].type = ContextTypeToString(context->type);

I don't think this works across platforms. On windows / EXEC_BACKEND builds
the location of string constants can differ across backends.  And: Why do we
need the string here? You can just call ContextTypeToString when reading?


> +/*
> + * Free the memory context statistics stored by this process
> + * in DSA area.
> + */
> +void
> +AtProcExit_memstats_dsa_free(int code, Datum arg)
> +{

FWIW, to me the fact that it does a dsa_free() is an implementation
detail. It's also not the only thing this does.

And, I don't think AtProcExit* really is accurate, given that it runs *before*
shmem is cleaned up?

I wonder if the best approach here wouldn't be to forgo the use of a
before_shmem_exit() callback, but instead use on_dsm_detach(). That would
require we'd not constantly detach from the dsm segment, but I don't
understand why we do that in the first place?


> +    int            idx = MyProcNumber;
> +    dsm_segment *dsm_seg = NULL;
> +    dsa_area   *area = NULL;
> +
> +    if (memCtxArea->memstats_dsa_handle == DSA_HANDLE_INVALID)
> +        return;
> +
> +    dsm_seg = dsm_find_mapping(memCtxArea->memstats_dsa_handle);
> +
> +    LWLockAcquire(&memCtxState[idx].lw_lock, LW_EXCLUSIVE);
> +
> +    if (!DsaPointerIsValid(memCtxState[idx].memstats_dsa_pointer))
> +    {
> +        LWLockRelease(&memCtxState[idx].lw_lock);
> +        return;
> +    }
> +
> +    /* If the dsm mapping could not be found, attach to the area */
> +    if (dsm_seg != NULL)
> +        return;

I don't understand what we do here with the dsm?  Why do we not need cleanup
if we are already attached to the dsm segment?


> +/*
> + * Static shared memory state representing the DSA area created for memory
> + * context statistics reporting.  A single DSA area is created and used by all
> + * the processes, each having its specific DSA allocations for sharing memory
> + * statistics, tracked by per backend static shared memory state.
> + */
> +typedef struct MemoryContextState
> +{
> +    dsa_handle    memstats_dsa_handle;
> +    LWLock        lw_lock;
> +} MemoryContextState;

IMO that's too generic a name for something in a header.


> +/*
> + * Used for storage of transient identifiers for pg_get_backend_memory_contexts
> + */
> +typedef struct MemoryContextId
> +{
> +    MemoryContext context;
> +    int            context_id;
> +} MemoryContextId;

This too.  Particularly because MemoryContextData->ident exist but is
something different.



> +DO $$
> +DECLARE
> +    launcher_pid int;
> +    r RECORD;
> +BEGIN
> +        SELECT pid from pg_stat_activity where backend_type='autovacuum launcher'
> +     INTO launcher_pid;
> +
> +        select type, name, ident
> +        from pg_get_process_memory_contexts(launcher_pid, false, 20)
> +     where path = '{1}' into r;
> +    RAISE NOTICE '%', r;
> +        select type, name, ident
> +        from pg_get_process_memory_contexts(pg_backend_pid(), false, 20)
> +     where path = '{1}' into r;
> +    RAISE NOTICE '%', r;
> +END $$;

I'd also test an aux process.  I think the AV launcher isn't one, because it
actually does "table" access of shared relations.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Logging which local address was connected to in log_line_prefix
Next
From: Tom Lane
Date:
Subject: Re: [PoC] Reducing planning time when tables have many partitions