Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From Lukas Fittl
Subject Re: Stack-based tracking of per-node WAL/buffer usage
Date
Msg-id CAP53PkzpJ39HFRGaMR7n5kSsNuqui3TJ3Gw_FU7k_qjdBvT6hg@mail.gmail.com
Whole thread
In response to Re: Stack-based tracking of per-node WAL/buffer usage  (Andres Freund <andres@anarazel.de>)
Responses Re: Stack-based tracking of per-node WAL/buffer usage
List pgsql-hackers
On Tue, Apr 7, 2026 at 3:19 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote:
> > 0001 is the change to make queryDesc->totaltime be allocated by
> > ExecutorStart instead of plugins themselves, and adds a
> > queryDesc->totaltime_options to have plugins request which level of
> > summary instrumentation they need. This change is pretty simple, and
> > could still make sense to get into 19. Because of the earlier
> > Instrumentation refactoring that was pushed (thanks!) we're already
> > asking extensions allocating queryDesc->totaltime to modify their use
> > of InstrAlloc, so I think we might as well clean this up now.
>
> Hm.  That's a fair argument. They indeed would have to again change next
> release
>
>  It's not a complicated change and removes more lines than it adds.
>
> I guess one thing I'm not sure is whether the fields shouldn't be renamed at
> the same time:
>
> a) To prevent extensions from continuing to set it, most of them do not test
>    against assert enabled builds. With a different name they would get a
>    compiler error.
>
> b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
>    query level statistics.  If everyone has to change anyway, this is a good
>    occasion.
>
> 'query_instr[_options]'?
>
>
> Any opinions?

I think renaming makes sense - both to make sure extensions reconsider
how they use it, and because "totaltime" is a bad name anyway, because
its not just about timing (and hasn't been for many releases).

"query_instr[_options]" seems reasonable to me, although we could drop
the "query_" since it'd be "queryDesc->query_instr" vs
"queryDesc->instr".

>
> > 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> > suggested previously. We still get some quick performance wins from
> > doing that (see end of email), and again, its a simple change, so
> > could be considered if someone has bandwidth remaining. I've added a
> > later patch that then does the more complex inlining and gets us the
> > full speed up.
>
> Here it needs a few more inlines to get the full performance, otherwise it
> doesn't inline all the helpers.  I think on balance I didn't like the
> prototype in instrument.h, that's too widely included, and it might even cause
> some circularity issues.  It seems better to do the somewhat ugly thing and
> have the prototype be in executor.h.

Yeah, that makes sense.

>
> > 0002 measurements (with current master and TSC clock source used for
> > timing, best of three):
> >
> > CREATE TABLE lotsarows(key int not null);
> > INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> > VACUUM FREEZE lotsarows;
>
> With the somewhat more extreme benchmark I used in the rdtsc thread and the
> added inline mentioned above I see a bit bigger wins. See the attached
> explainbench.sql - it doesn't quite cover all the combinations, but I think it
> gives a good enough overview.
>
> c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
>
> master:
> statement latencies in milliseconds and failures:
>        200.800           0 SELECT pg_prewarm('pgbench_accounts');
>          0.098           0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
>        212.010           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
>        268.648           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
>        232.421           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
>        283.531           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
>          0.030           0 DEALLOCATE query;
>
>
> 0002:
>
> statement latencies in milliseconds and failures:
>        201.558           0 SELECT pg_prewarm('pgbench_accounts');
>          0.103           0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
>        188.696           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
>        244.479           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
>        223.773           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
>        266.947           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
>          0.034           0 DEALLOCATE query;
>
> That's something like 4-12%.
>
> Pretty nice for a patch that just adds a few lines around and adds a few
> inlines.

Agreed.


> > @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >
> >       if (auto_explain_enabled())
> >       {
> > +             /* We're always interested in runtime */
> > +             queryDesc->totaltime_options |= INSTRUMENT_TIMER;
>
> > -                     queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
>
> Not that it's going to make a significant difference, but it is nice that this
> now would need to track less.

Yup.

>
> Kinda wonder about having
>   EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
>
> in plenty cases that'd be all one needs, at substantially lower cost.

True. I don't like the name "totals_only", but I like the concept.
Today someone has to go to pg_stat_statements to get just the total
numbers, without running them for all nodes with EXPLAIN ANALYZE (and
incurring its overhead).

Thanks,
Lukas

--
Lukas Fittl



pgsql-hackers by date:

Previous
From: Andreas Karlsson
Date:
Subject: Re: updates for handling optional argument in system functions
Next
From: Heikki Linnakangas
Date:
Subject: Re: Optimization of the is_normalized() function.