Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From Lukas Fittl
Subject Re: Stack-based tracking of per-node WAL/buffer usage
Date
Msg-id CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com
Whole thread Raw
In response to Re: Stack-based tracking of per-node WAL/buffer usage  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi Andres,

Thanks for the detailed review!

Attached v4 patchset that addresses feedback (unless otherwise noted below) and is rebased on master. Other changes:

- Ensured each patch is individually pgindent clean (and compiles)
- Refactored 0003 a bit to consistently use InstrPushStack/InstrPopStack helpers for modifying the active stack entry
- Building on that refactoring, merged v3/0006 "Introduce alternate Instrumentation stack mechanism relying on PG_FINALLY" into the main commit introducing the stack mechanism (the "alternate" mechanism is just using these helpers directly and making sure InstrPopStack is called via PG_FINALLY, instead of using resource owners)
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that illustrates how the stack mechanism can be used to separate index and table buffer accesses in the EXPLAIN for Index Scans

On Wed, Oct 22, 2025 at 5:59 AM Andres Freund <andres@anarazel.de> wrote:
> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
>  typedef struct Instrumentation
> +{
> +     /* Parameters set at creation: */
> +     bool            need_timer;             /* true if we need timer data */
> +     bool            need_bufusage;  /* true if we need buffer usage data */
> +     bool            need_walusage;  /* true if we need WAL usage data */
> +     /* Internal state keeping: */
> +     instr_time      starttime;              /* start time of last InstrStart */
> +     BufferUsage bufusage_start; /* buffer usage at start */
> +     WalUsage        walusage_start; /* WAL usage at start */
> +     /* Accumulated statistics: */
> +     instr_time      total;                  /* total runtime */
> +     double          ntuples;                /* total tuples counted in InstrStop */
> +     BufferUsage bufusage;           /* total buffer usage */
> +     WalUsage        walusage;               /* total WAL usage */
> +} Instrumentation;

Maybe add a comment explaining why ntuples is in here?

After thinking about this some more, I'd think we should just go ahead and special case trigger instrumentation, and specifically count firings of the trigger (which was counted in "ntuples" before).

I've adjusted the 0002 patch accordingly to split out both node and trigger instrumentation.
 


> ---
>  .../pg_stat_statements/pg_stat_statements.c   |   4 +-
>  src/backend/commands/explain.c                |   8 +-
>  src/backend/commands/trigger.c                |   4 +-
>  src/backend/executor/execMain.c               |  25 ++-
>  src/backend/executor/execProcnode.c           |  29 +++
>  src/backend/executor/instrument.c             | 199 ++++++++++++++----
>  src/include/executor/executor.h               |   1 +
>  src/include/executor/instrument.h             |  53 ++++-
>  8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                                  PGSS_EXEC,
>                                  INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                                  queryDesc->estate->es_total_processed,
> -                                &queryDesc->totaltime->bufusage,
> -                                &queryDesc->totaltime->walusage,
> +                                &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> +                                &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?

I've dropped the macro (its just one additional indirection after all) - I think we could refactor this further (i.e. pass the stack), but that doesn't seem strictly necessary.

>  /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>       return false;
>  }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> +     (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.

Yeah, I can't think of a better name, so I've left this the same for now.
 

> +     if (node->instrument && node->instrument->stack.previous)
> +             InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> +     return false;
> +}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too?  It probably is ok, but it's worth explicitly checking and adding a
comment.

I went through and double checked - we don't need to special case these node types in my understanding, and I couldn't see any specific cases where intermediary nodes would be removed in shutdown either (o.e. 
causing dropped stats since a node was removed with its associated stats not being added to the parent yet).

I added a comment to make it clear this must run after ExecShutdownNode.
 
> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> +     .name = "instrumentation",
> +     .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> +     .release_priority = RELEASE_PRIO_FIRST,
> +     .ReleaseResource = ResOwnerReleaseInstrumentation,
> +     .DebugPrint = NULL,                     /* default message is fine */
> +};

Is there a reason to do the release here before the lock release? And why
_FIRST?

Adjusted to have its own phase, and after lock release.
 
> +     if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> +             pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right?  It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?

Yeah, I added a comment - I don't see a case where this is a bottleneck today given the limited nesting of resource owner using stacks.
 
> +     /*
> +      * Always accumulate all collected stats before the abort, even if we
> +      * already walked up the stack with an earlier resource.
> +      */
> +     if (pgInstrStack)
> +             InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...

In my understanding, we need to do this in case of functions called in a query that catch a rollback/error, since we'd otherwise not account for that function's activity as part of the top-level query.

>  /* General purpose instrumentation handling */
>  Instrumentation *
>  InstrAlloc(int n, int instrument_options)
>  {
...
> +     if (need_buffers || need_wal)
> +             instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> +     else
> +             instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?

I ended up refactoring this a bit, since it seemed useful to do an explicit pfree at InstrStop or when aborting, both to avoid leaks, and to theoretically support using TopMemoryContext here.

That said, from my testing I think CurTransactionContext is sufficient, because we just need something that lives long enough during resource owner abort situations (e.g. per-query context doesn't work, since the abort frees it before we do our resource owner handling).

The pgss+CIC case isn't relevant here (I think), because utility statements don't use the resource owner mechanism at all (with the exception of EXPLAIN which calls into ExecutorStart), instead we use a PG_TRY/PG_FINALLY in pg_stat_statements to pop the stack.
 
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> +     Assert(dst != NULL);
> +     Assert(add != NULL);
> +
> +     BufferUsageAdd(&dst->bufusage, &add->bufusage);
> +     WalUsageAdd(&dst->walusage, &add->walusage);
>  }
>

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...

I briefly considered whether we could add this to the InstrStack itself (i.e. whether we actually care about buffer, WAL usage or both), but I think where it gets messy is that we can have indirect requirements to track this - you might have pg_stat_statements capturing both, but e.g. the utility statement being executed only caring about emitting WAL usage in the log.

I'm also not familiar with an in-core use case today where only WAL (but not buffers) is needed, short of doing something like "EXPLAIN (ANALYZE, BUFFERS OFF, WAL ON)" without having pg_stat_statements or similar enabled. Do you have a specific example where this could help?

>  /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
>  void
>  InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
>  {
> +     if (pgInstrStack != NULL)
> +     {
> +             InstrStack *dst = pgInstrStack;
> +
> +             BufferUsageAdd(&dst->bufusage, bufusage);
> +             WalUsageAdd(&dst->walusage, walusage);
> +     }
> +
>       BufferUsageAdd(&pgBufferUsage, bufusage);
>       WalUsageAdd(&pgWalUsage, walusage);
>  }

Is the pgInstrStack == NULL case actually reachable?

In my reading of the code, its necessary because we unconditionally track WAL/buffer usage in parallel workers, even if the leader doesn't actually need it.

We could be smarter about this (i.e. tell the workers not to collect the information in the first place), but for now it seemed easiest to just discard it.

Thanks,
Lukas

--
Lukas Fittl
Attachment

pgsql-hackers by date:

Previous
From: Nazir Bilal Yavuz
Date:
Subject: Re: Update Windows CI Task Names: Server 2022 + VS 2022 Upgrade
Next
From: Quan Zongliang
Date:
Subject: Re: [PATCH] Add pg_get_type_ddl() to retrieve the CREATE TYPE statement