Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Stack-based tracking of per-node WAL/buffer usage
Date
Msg-id h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle
In response to Stack-based tracking of per-node WAL/buffer usage  (Lukas Fittl <lukas@fittl.com>)
List pgsql-hackers
Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].

I like this for a third reason: it lets us separate out buffer access statistics
for the index and the table in index scans. Right now it's very hard to tell
whether an index scan is slow because of the index lookups or because of
fetching the tuples from the table.


> 0001: Separate node instrumentation from other use of Instrumentation struct
> 
>     Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).


> 0003: Introduce stack for tracking per-node WAL/buffer usage


> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      DestReceiver *dest;
>      bool        sendTuples;
>      MemoryContext oldcontext;
> +    InstrStackResource *res;
>  
>      /* sanity checks */
>      Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      if (queryDesc->totaltime)
>          InstrStart(queryDesc->totaltime);
>  
> +    /* Start up per-query node level instrumentation */
> +    res = InstrStartQuery();
> +
>      /*
>       * extract information from the query descriptor and the query feature.
>       */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      if (sendTuples)
>          dest->rShutdown(dest);
>  
> +    /* Shut down per-query node level instrumentation */
> +    InstrShutdownQuery(res);
> +
>      if (queryDesc->totaltime)
>          InstrStop(queryDesc->totaltime, estate->es_processed);


Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.
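To make the suggestion concrete, here's a minimal stand-alone sketch of such a guard. The types and field names are simplified stand-ins modeled on the patch (the real `QueryDesc`/`InstrStackResource` definitions differ); the point is just to only pay the resowner registration cost when the query is actually instrumented:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, not the real executor definitions. */
typedef enum { INSTRUMENT_NONE = 0, INSTRUMENT_TIMER = 1 } InstrumentOption;
typedef struct { int instrument_options; } QueryDesc;
typedef struct { int dummy; } InstrStackResource;

/* Models the resowner registration cost we'd like to avoid. */
static int resowner_registrations = 0;

static InstrStackResource *
InstrStartQuery(void)
{
	static InstrStackResource res;

	resowner_registrations++;
	return &res;
}

static void
InstrShutdownQuery(InstrStackResource *res)
{
	if (res != NULL)
		resowner_registrations--;
}

/*
 * Only set up the per-query instrumentation stack (and its resowner
 * entry) when instrumentation was requested; return NULL otherwise so
 * the shutdown path can cheaply skip it too.
 */
static InstrStackResource *
maybe_start_instr(QueryDesc *queryDesc)
{
	if (queryDesc->instrument_options != INSTRUMENT_NONE)
		return InstrStartQuery();
	return NULL;
}
```

The matching shutdown call then becomes a no-op for uninstrumented queries, since `InstrShutdownQuery(NULL)` does nothing.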


> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>  
>      /* Stop the node if we started it above, reporting 0 tuples. */
>      if (node->instrument && node->instrument->running)
> +    {
>          InstrStopNode(node->instrument, 0);
>  
> +        /*
> +         * Propagate WAL/buffer stats to the parent node on the
> +         * instrumentation stack (which is where InstrStopNode returned us
> +         * to).
> +         */
> +        InstrNodeAddToCurrent(&node->instrument->stack);
> +    }
> +
>      return false;
>  }

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

    /*
     * If we know we won't need to back up, we can release resources at this
     * point.
     */
    if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
        ExecShutdownNode(planstate);


> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> +    /*
> +     * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> +     * get what was first registered first at shutdown. To avoid handling
> +     * that, we are resetting the stack here on abort (instead of recovering
> +     * to previous).
> +     */
> +    pgInstrStack = NULL;
> +}

Hm, doesn't that mean we lose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.
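A stand-alone sketch of that walk, using a simplified stand-in for whatever the patch's stack entry looks like: instead of resetting `pgInstrStack` to NULL, search the chain for the to-be-released element and unlink just that one, so callbacks firing in registration order don't clobber unrelated entries:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the patch's instrumentation stack entry. */
typedef struct InstrStackEntry
{
	const char *name;
	struct InstrStackEntry *parent;	/* next-outer stack entry */
} InstrStackEntry;

static InstrStackEntry *pgInstrStack = NULL;

static void
push(InstrStackEntry *entry)
{
	entry->parent = pgInstrStack;
	pgInstrStack = entry;
}

/*
 * Release one entry even when callbacks don't arrive in reverse
 * registration order: walk the chain of parent links until we find the
 * target, then splice it out, leaving the rest of the stack intact.
 */
static void
release(InstrStackEntry *target)
{
	InstrStackEntry **link = &pgInstrStack;

	while (*link != NULL && *link != target)
		link = &(*link)->parent;

	if (*link == target)
		*link = target->parent;
}
```

Depending on the intended abort semantics one might instead truncate the stack to the found element's parent (discarding everything stacked above it); the walk itself is the same either way.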

Greetings,

Andres Freund


