Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Stack-based tracking of per-node WAL/buffer usage
Date
Msg-id k2uxwancbi5blb2duny3kbuz667m47hqc4jv7xstlx3yyzcd3a@fjqww6avalt3
Whole thread Raw
In response to Re: Stack-based tracking of per-node WAL/buffer usage  (Lukas Fittl <lukas@fittl.com>)
List pgsql-hackers
Hi,


On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> Attached v4 patchset that addresses feedback (unless otherwise noted below)
> and is rebased on master. Other changes:
>
> [...]
> - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
> illustrates how the stack mechanism can be used to separate index and table
> buffer accesses in the EXPLAIN for Index Scans

Nice!


I pushed 0001. The only changes I made were to break a few long lines.


> From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
>  use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This overuse of the same struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time and
> WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
> per-node instrumentation - these calls were added without any apparent
> benefit since the relevant fields were never read.
>
> Introduce the NodeInstrumentation struct to carry forward the per-node
> instrumentation information, and introduce TriggerInstrumentation to
> capture trigger timing and firings (previously counted in "ntuples").

Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?


> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
> index f098a5557cf..e87810d292e 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
>   * instrument_options: Same meaning here as in instrument.c.
>   *
>   * instrument_offset: Offset, relative to the start of this structure,
> - * of the first Instrumentation object.  This will depend on the length of
> + * of the first NodeInstrumentation object.  This will depend on the length of
>   * the plan_node_id array.
>   *
>   * num_workers: Number of workers.
> @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
>      int            num_workers;
>      int            num_plan_nodes;
>      int            plan_node_id[FLEXIBLE_ARRAY_MEMBER];
> -    /* array of num_plan_nodes * num_workers Instrumentation objects follows */
> +
> +    /*
> +     * array of num_plan_nodes * num_workers NodeInstrumentation objects
> +     * follows
> +     */
>  };

Spurious change, I think?



> +
> +/* Trigger instrumentation handling */
> +TriggerInstrumentation *
> +InstrAllocTrigger(int n, int instrument_options)
> +{
> +    TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> +    bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> +    bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +    bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> +    int            i;
> +
> +    for (i = 0; i < n; i++)
> +    {
> +        tginstr[i].instr.need_bufusage = need_buffers;
> +        tginstr[i].instr.need_walusage = need_wal;
> +        tginstr[i].instr.need_timer = need_timer;
> +    }
> +
> +    return tginstr;

Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?

> +/* Trigger instrumentation */
> +typedef struct TriggerInstrumentation
> +{
> +    Instrumentation instr;
> +    int            firings;        /* number of times the instrumented trigger
> +                                 * was fired */
> +}            TriggerInstrumentation;
> +
> +/*
> + * Specialized instrumentation for per-node execution statistics
> + */
> +typedef struct NodeInstrumentation
>  {
>      /* Parameters set at node creation: */
>      bool        need_timer;        /* true if we need timer data */
> @@ -92,25 +125,34 @@ typedef struct Instrumentation
>      double        nfiltered2;        /* # of tuples removed by "other" quals */
>      BufferUsage bufusage;        /* total buffer usage */
>      WalUsage    walusage;        /* total WAL usage */
> -} Instrumentation;
> +}            NodeInstrumentation;

The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list


>  typedef struct WorkerInstrumentation
>  {
>      int            num_workers;    /* # of structures that follow */
> -    Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> +    NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
>  } WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?


> From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:34:42 -0700
> Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
>  with INSTR_* macros

Looks sane to me.

I'd move that to the head of the queue in the next revision.


> From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
>  stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code section,
> we utilized continuously incrementing global counters that get updated when the
> actual activity (e.g. shared block read) occurred, and then calculated a diff when
> the code section ended. This resulted in a bottleneck for executor node instrumentation
> specifically, with the function BufferUsageAccumDiff showing up in profiles and
> in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity writes
> into the current stack entry. In the case of executor nodes, this means that
> each node gets its own stack entry that is pushed at InstrStartNode, and popped
> at InstrEndNode. Stack entries are zero initialized (avoiding the diff mechanism)
> and get added to their parent entry when they are finalized, i.e. no more
> modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks must
> involve either a top-level Instrumentation struct, and its associated InstrStart/
> InstrStop helpers (which use resource owners to handle aborts), or dedicated
> PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after
> an abort.

I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.


> @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                     PGSS_EXEC,
>                     INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                     queryDesc->estate->es_total_processed,
> -                   &queryDesc->totaltime->bufusage,
> -                   &queryDesc->totaltime->walusage,
> +
> +        /*
> +         * Check if stack is initialized - it is not when ExecutorRun wasn't
> +         * called
> +         */
> +                   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
> +                   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
>                     queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
>                     NULL,
>                     queryDesc->estate->es_parallel_workers_to_launch,

Maybe it's just the diff, but this looks a bit odd, indentation wise.



> @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>          resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>              palloc0(n * sizeof(ExprState *));
>          if (instrument_options)
> -            resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> +        {
> +            /*
> +             * Triggers do not individually track buffer/WAL usage, even if
> +             * otherwise tracked
> +             */

Why is that?


> @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
>          !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>          elog(ERROR, "InstrStart called twice in a row");
>
> -    if (instr->need_bufusage)
> -        instr->bufusage_start = pgBufferUsage;
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        Assert(CurrentResourceOwner != NULL);
> +        instr->owner = CurrentResourceOwner;
> +
> +        /*
> +         * Allocate the stack resource in a memory context that survives
> +         * during an abort. This will be freed by InstrStop (regular
> +         * execution) or ResOwnerReleaseInstrumentation (abort).
> +         *
> +         * We don't do this in InstrAlloc to avoid leaking when InstrStart +
> +         * InstrStop isn't called.
> +         */
> +        if (instr->stack == NULL)
> +            instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
>
> -    if (instr->need_walusage)
> -        instr->walusage_start = pgWalUsage;
> +        ResourceOwnerEnlarge(instr->owner);
> +        ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> +
> +        InstrPushStack(instr->stack);
> +    }
>  }

I'm still worried that allocating something in CurTransactionContext will bite
us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.


>  void
> -InstrStop(Instrumentation *instr)
> +InstrStop(Instrumentation *instr, bool finalize)
>  {
>      instr_time    endtime;
>
> @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
>          INSTR_TIME_SET_ZERO(instr->starttime);
>      }
>
> -    /* Add delta of buffer usage since entry to node's totals */
> -    if (instr->need_bufusage)
> -        BufferUsageAccumDiff(&instr->bufusage,
> -                             &pgBufferUsage, &instr->bufusage_start);
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        InstrPopStack(instr->stack, finalize);
>
> -    if (instr->need_walusage)
> -        WalUsageAccumDiff(&instr->walusage,
> -                          &pgWalUsage, &instr->walusage_start);
> +        Assert(instr->owner != NULL);
> +        ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> +        instr->owner = NULL;
> +
> +        if (finalize)
> +        {
> +            /*
> +             * To avoid keeping memory allocated beyond when its needed, copy
> +             * the result to the current memory context, and free it in the
> +             * transaction context.
> +             */
> +            InstrStack *stack = palloc(sizeof(InstrStack));
> +
> +            memcpy(stack, instr->stack, sizeof(InstrStack));
> +            pfree(instr->stack);
> +            instr->stack = stack;
> +        }
> +    }
>  }

Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?


>  /* Trigger instrumentation handling */
> @@ -98,15 +194,20 @@ TriggerInstrumentation *
>  InstrAllocTrigger(int n, int instrument_options)
>  {
>      TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> -    bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> -    bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
>      bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
>      int            i;
>
> +    /*
> +     * To avoid having to determine when the last trigger fired, we never
> +     * track WAL/buffer usage for now
> +     */
> +    Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
> +    Assert((instrument_options & INSTRUMENT_WAL) == 0);
> +
>      for (i = 0; i < n; i++)
>      {
> -        tginstr[i].instr.need_bufusage = need_buffers;
> -        tginstr[i].instr.need_walusage = need_wal;
> +        tginstr[i].instr.need_bufusage = false;
> +        tginstr[i].instr.need_walusage = false;
>          tginstr[i].instr.need_timer = need_timer;
>      }
>

That does seem problematic to me.




> @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>  }
>
>  /* start instrumentation during parallel executor startup */
> -void
> +Instrumentation *
>  InstrStartParallelQuery(void)
>  {
> -    save_pgBufferUsage = pgBufferUsage;
> -    save_pgWalUsage = pgWalUsage;
> +    Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> +    InstrStart(instr);
> +    return instr;
>  }

>  /* report usage after parallel executor shutdown */
>  void
> -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
>  {
> -    memset(bufusage, 0, sizeof(BufferUsage));
> -    BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
> -    memset(walusage, 0, sizeof(WalUsage));
> -    WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
> +    InstrStop(instr, true);
> +    memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
> +    memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
>  }


Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...



>          if (verbose || params.log_vacuum_min_duration == 0 ||
>              TimestampDifferenceExceeds(starttime, endtime,
>                                         params.log_vacuum_min_duration))
>          {
>              long        secs_dur;
>              int            usecs_dur;
> -            WalUsage    walusage;
> -            BufferUsage bufferusage;
>              StringInfoData buf;
>              char       *msgfmt;
>              int32        diff;
> @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
>              int64        total_blks_hit;
>              int64        total_blks_read;
>              int64        total_blks_dirtied;
> +            BufferUsage bufusage = instr->stack->bufusage;
> +            WalUsage    walusage = instr->stack->walusage;

>              TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
> -            memset(&walusage, 0, sizeof(WalUsage));
> -            WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
> -            memset(&bufferusage, 0, sizeof(BufferUsage));
> -            BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
> -
> -            total_blks_hit = bufferusage.shared_blks_hit +
> -                bufferusage.local_blks_hit;
> -            total_blks_read = bufferusage.shared_blks_read +
> -                bufferusage.local_blks_read;
> -            total_blks_dirtied = bufferusage.shared_blks_dirtied +
> -                bufferusage.local_blks_dirtied;
> +
> +            total_blks_hit = bufusage.shared_blks_hit +
> +                bufusage.local_blks_hit;
> +            total_blks_read = bufusage.shared_blks_read +
> +                bufusage.local_blks_read;
> +            total_blks_dirtied = bufusage.shared_blks_dirtied +
> +                bufusage.local_blks_dirtied;

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?

Kinda going the other way: Why copy this on the stack, rather than use a
pointer?

> diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
> index 34b6410d6a2..77b4c59e71c 100644
> --- a/src/backend/commands/prepare.c
> +++ b/src/backend/commands/prepare.c
> @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
>      ListCell   *p;
>      ParamListInfo paramLI = NULL;
>      EState       *estate = NULL;
> -    instr_time    planstart;
> -    instr_time    planduration;
> -    BufferUsage bufusage_start,
> -                bufusage;
> +    Instrumentation *instr = NULL;
>      MemoryContextCounters mem_counters;
>      MemoryContext planner_ctx = NULL;
>      MemoryContext saved_ctx = NULL;
>
> +    if (es->buffers)
> +        instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> +    else
> +        instr = InstrAlloc(1, INSTRUMENT_TIMER);

I'd not duplicate the call to InstrAlloc().



> From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Thu, 30 Oct 2025 22:27:30 -0700
> Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
>  reporting in EXPLAIN
>
> This makes it clear whether activity was on the index directly, or
> on the table based on heap fetches.

Yay^2.


> ---
>  src/backend/commands/explain.c       | 56 ++++++++++++++++------------
>  src/backend/executor/execProcnode.c  | 13 +++++++
>  src/backend/executor/instrument.c    | 25 +++++++++++++
>  src/backend/executor/nodeIndexscan.c | 15 +++++++-
>  src/include/access/genam.h           |  3 ++
>  src/include/executor/instrument.h    |  3 ++
>  6 files changed, 91 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index dd3bf615581..fb96dd5248c 100644

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d00cf820a27..f19af428d97 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
>      planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
>
>      if (node->instrument && node->instrument->stack.previous)
> +    {

Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.


> +        /*
> +         * Index Scan nodes account for heap buffer usage separately, so we
> +         * need to explitly add here
> +         */

s/heap/table/


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Reduce build times of pg_trgm GIN indexes
Next
From: Tom Lane
Date:
Subject: Maybe BF "timedout" failures are the client script's fault?