Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Stack-based tracking of per-node WAL/buffer usage |
| Date | |
| Msg-id | k2uxwancbi5blb2duny3kbuz667m47hqc4jv7xstlx3yyzcd3a@fjqww6avalt3 Whole thread Raw |
| In response to | Re: Stack-based tracking of per-node WAL/buffer usage (Lukas Fittl <lukas@fittl.com>) |
| List | pgsql-hackers |
Hi,
On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> Attached v4 patchset that addresses feedback (unless otherwise noted below)
> and is rebased on master. Other changes:
>
> [...]
> - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
> illustrates how the stack mechanism can be used to separate index and table
> buffer accesses in the EXPLAIN for Index Scans
Nice!
I pushed 0001. The only changes I made were to break a few long lines.
> From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
> use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This overuse of the same struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time and
> WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
> per-node instrumentation - these calls were added without any apparent
> benefit since the relevant fields were never read.
>
> Introduce the NodeInstrumentation struct to carry forward the per-node
> instrumentation information, and introduce TriggerInstrumentation to
> capture trigger timing and firings (previously counted in "ntuples").
Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?
> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
> index f098a5557cf..e87810d292e 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
> * instrument_options: Same meaning here as in instrument.c.
> *
> * instrument_offset: Offset, relative to the start of this structure,
> - * of the first Instrumentation object. This will depend on the length of
> + * of the first NodeInstrumentation object. This will depend on the length of
> * the plan_node_id array.
> *
> * num_workers: Number of workers.
> @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
> int num_workers;
> int num_plan_nodes;
> int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
> - /* array of num_plan_nodes * num_workers Instrumentation objects follows */
> +
> + /*
> + * array of num_plan_nodes * num_workers NodeInstrumentation objects
> + * follows
> + */
> };
Spurious change, I think?
> +
> +/* Trigger instrumentation handling */
> +TriggerInstrumentation *
> +InstrAllocTrigger(int n, int instrument_options)
> +{
> + TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> + int i;
> +
> + for (i = 0; i < n; i++)
> + {
> + tginstr[i].instr.need_bufusage = need_buffers;
> + tginstr[i].instr.need_walusage = need_wal;
> + tginstr[i].instr.need_timer = need_timer;
> + }
> +
> + return tginstr;
Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?
> +/* Trigger instrumentation */
> +typedef struct TriggerInstrumentation
> +{
> + Instrumentation instr;
> + int firings; /* number of times the instrumented trigger
> + * was fired */
> +} TriggerInstrumentation;
> +
> +/*
> + * Specialized instrumentation for per-node execution statistics
> + */
> +typedef struct NodeInstrumentation
> {
> /* Parameters set at node creation: */
> bool need_timer; /* true if we need timer data */
> @@ -92,25 +125,34 @@ typedef struct Instrumentation
> double nfiltered2; /* # of tuples removed by "other" quals */
> BufferUsage bufusage; /* total buffer usage */
> WalUsage walusage; /* total WAL usage */
> -} Instrumentation;
> +} NodeInstrumentation;
The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list
> typedef struct WorkerInstrumentation
> {
> int num_workers; /* # of structures that follow */
> - Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> + NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> } WorkerInstrumentation;
Hm. Shouldn't this be WorkerNodeInstrumentation now?
> From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:34:42 -0700
> Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
> with INSTR_* macros
Looks sane to me.
I'd move that to the head of the queue in the next revision.
> From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
> stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code section,
> we utilized continuously incrementing global counters that get updated when the
> actual activity (e.g. shared block read) occurred, and then calculated a diff when
> the code section ended. This resulted in a bottleneck for executor node instrumentation
> specifically, with the function BufferUsageAccumDiff showing up in profiles and
> in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity writes
> into the current stack entry. In the case of executor nodes, this means that
> each node gets its own stack entry that is pushed at InstrStartNode, and popped
> at InstrEndNode. Stack entries are zero initialized (avoiding the diff mechanism)
> and get added to their parent entry when they are finalized, i.e. no more
> modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks must
> involve either a top-level Instrumentation struct, and its associated InstrStart/
> InstrStop helpers (which use resource owners to handle aborts), or dedicated
> PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after
> an abort.
I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.
> @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> PGSS_EXEC,
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> queryDesc->estate->es_total_processed,
> - &queryDesc->totaltime->bufusage,
> - &queryDesc->totaltime->walusage,
> +
> + /*
> + * Check if stack is initialized - it is not when ExecutorRun wasn't
> + * called
> + */
> + queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
> + queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
> queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
> NULL,
> queryDesc->estate->es_parallel_workers_to_launch,
Maybe it's just the diff, but this looks a bit odd, indentation wise.
> @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0(n * sizeof(ExprState *));
> if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> + {
> + /*
> + * Triggers do not individually track buffer/WAL usage, even if
> + * otherwise tracked
> + */
Why is that?
> @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStart called twice in a row");
>
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + Assert(CurrentResourceOwner != NULL);
> + instr->owner = CurrentResourceOwner;
> +
> + /*
> + * Allocate the stack resource in a memory context that survives
> + * during an abort. This will be freed by InstrStop (regular
> + * execution) or ResOwnerReleaseInstrumentation (abort).
> + *
> + * We don't do this in InstrAlloc to avoid leaking when InstrStart +
> + * InstrStop isn't called.
> + */
> + if (instr->stack == NULL)
> + instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
>
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + ResourceOwnerEnlarge(instr->owner);
> + ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> +
> + InstrPushStack(instr->stack);
> + }
> }
I'm still worried that allocating something in CurTransactionContext will bite
us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.
> void
> -InstrStop(Instrumentation *instr)
> +InstrStop(Instrumentation *instr, bool finalize)
> {
> instr_time endtime;
>
> @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + InstrPopStack(instr->stack, finalize);
>
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + Assert(instr->owner != NULL);
> + ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> + instr->owner = NULL;
> +
> + if (finalize)
> + {
> + /*
> + * To avoid keeping memory allocated beyond when its needed, copy
> + * the result to the current memory context, and free it in the
> + * transaction context.
> + */
> + InstrStack *stack = palloc(sizeof(InstrStack));
> +
> + memcpy(stack, instr->stack, sizeof(InstrStack));
> + pfree(instr->stack);
> + instr->stack = stack;
> + }
> + }
> }
Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?
> /* Trigger instrumentation handling */
> @@ -98,15 +194,20 @@ TriggerInstrumentation *
> InstrAllocTrigger(int n, int instrument_options)
> {
> TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> int i;
>
> + /*
> + * To avoid having to determine when the last trigger fired, we never
> + * track WAL/buffer usage for now
> + */
> + Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
> + Assert((instrument_options & INSTRUMENT_WAL) == 0);
> +
> for (i = 0; i < n; i++)
> {
> - tginstr[i].instr.need_bufusage = need_buffers;
> - tginstr[i].instr.need_walusage = need_wal;
> + tginstr[i].instr.need_bufusage = false;
> + tginstr[i].instr.need_walusage = false;
> tginstr[i].instr.need_timer = need_timer;
> }
>
That does seem problematic to me.
> @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
> }
>
> /* start instrumentation during parallel executor startup */
> -void
> +Instrumentation *
> InstrStartParallelQuery(void)
> {
> - save_pgBufferUsage = pgBufferUsage;
> - save_pgWalUsage = pgWalUsage;
> + Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> + InstrStart(instr);
> + return instr;
> }
> /* report usage after parallel executor shutdown */
> void
> -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
> {
> - memset(bufusage, 0, sizeof(BufferUsage));
> - BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
> - memset(walusage, 0, sizeof(WalUsage));
> - WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
> + InstrStop(instr, true);
> + memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
> + memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
> }
Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...
> if (verbose || params.log_vacuum_min_duration == 0 ||
> TimestampDifferenceExceeds(starttime, endtime,
> params.log_vacuum_min_duration))
> {
> long secs_dur;
> int usecs_dur;
> - WalUsage walusage;
> - BufferUsage bufferusage;
> StringInfoData buf;
> char *msgfmt;
> int32 diff;
> @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
> int64 total_blks_hit;
> int64 total_blks_read;
> int64 total_blks_dirtied;
> + BufferUsage bufusage = instr->stack->bufusage;
> + WalUsage walusage = instr->stack->walusage;
> TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
> - memset(&walusage, 0, sizeof(WalUsage));
> - WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
> - memset(&bufferusage, 0, sizeof(BufferUsage));
> - BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
> -
> - total_blks_hit = bufferusage.shared_blks_hit +
> - bufferusage.local_blks_hit;
> - total_blks_read = bufferusage.shared_blks_read +
> - bufferusage.local_blks_read;
> - total_blks_dirtied = bufferusage.shared_blks_dirtied +
> - bufferusage.local_blks_dirtied;
> +
> + total_blks_hit = bufusage.shared_blks_hit +
> + bufusage.local_blks_hit;
> + total_blks_read = bufusage.shared_blks_read +
> + bufusage.local_blks_read;
> + total_blks_dirtied = bufusage.shared_blks_dirtied +
> + bufusage.local_blks_dirtied;
Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?
Kinda going the other way: Why copy this on the stack, rather than use a
pointer?
> diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
> index 34b6410d6a2..77b4c59e71c 100644
> --- a/src/backend/commands/prepare.c
> +++ b/src/backend/commands/prepare.c
> @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
> ListCell *p;
> ParamListInfo paramLI = NULL;
> EState *estate = NULL;
> - instr_time planstart;
> - instr_time planduration;
> - BufferUsage bufusage_start,
> - bufusage;
> + Instrumentation *instr = NULL;
> MemoryContextCounters mem_counters;
> MemoryContext planner_ctx = NULL;
> MemoryContext saved_ctx = NULL;
>
> + if (es->buffers)
> + instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> + else
> + instr = InstrAlloc(1, INSTRUMENT_TIMER);
I'd not duplicate the call to InstrAlloc().
> From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Thu, 30 Oct 2025 22:27:30 -0700
> Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
> reporting in EXPLAIN
>
> This makes it clear whether activity was on the index directly, or
> on the table based on heap fetches.
Yay^2.
> ---
> src/backend/commands/explain.c | 56 ++++++++++++++++------------
> src/backend/executor/execProcnode.c | 13 +++++++
> src/backend/executor/instrument.c | 25 +++++++++++++
> src/backend/executor/nodeIndexscan.c | 15 +++++++-
> src/include/access/genam.h | 3 ++
> src/include/executor/instrument.h | 3 ++
> 6 files changed, 91 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index dd3bf615581..fb96dd5248c 100644
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d00cf820a27..f19af428d97 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
>
> if (node->instrument && node->instrument->stack.previous)
> + {
Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.
> + /*
> + * Index Scan nodes account for heap buffer usage separately, so we
> + * need to explitly add here
> + */
s/heap/table/
Greetings,
Andres Freund
pgsql-hackers by date: