Re: Get rid of pgstat_count_backend_io_op*() functions - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Get rid of pgstat_count_backend_io_op*() functions
Date
Msg-id aNVWe2tR1jj5Tsct@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Get rid of pgstat_count_backend_io_op*() functions  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi,

On Thu, Sep 25, 2025 at 03:42:33PM +0900, Michael Paquier wrote:
> On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:
> > On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
> >> As far the ordering concern for v1, what about:
> >> 
> >> - let backend kind enum defined as 6
> >> - move the backend flush outside of the loop in pgstat_report_stat()
> >> 
> >> That way the backend are flushed last and that solves your concern about custom
> >> stats kind.
> >> 
> >> The backend would not be the "only" exception in pgstat_report_stat(), the db
> >> stats are already handled as exception with pgstat_update_dbstats().
> > 
> > That would give something like v3 attached, thoughts?
> > 
> > Once we agree on the approach to deal with per backend pending stats, I'll make
> > use of the same in [1] and [2].
> 
>          for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
>          {
> [...]
> +            if (kind == PGSTAT_KIND_BACKEND)
> +                continue;
>              if (!kind_info)
>                  continue;
>              if (!kind_info->flush_static_cb)
>  
>              partial_flush |= kind_info->flush_static_cb(nowait);
>          }
> +
> +        /*
> +         * Do per-backend last as some of their pending stats are populated
> +         * during the flush of other stats kinds.
> +         */
> +        kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
> +        partial_flush |= kind_info->flush_static_cb(nowait);
>      }
> 
> Hmm.  I really have an issue with the ordering dependency this
> introduces between stats kinds, assumed to happen inside the code.  Up
> to now, we have always considered stats kinds as rather independent
> pieces at flush time, simplifying its code (I am not saying counter
> increment time).

We already have such dependencies for some database/relation stats, but that
works well because all of them are "variable" stats and their flush is based
on the pgStatPending list.
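
For context, the flush of those "variable" stats looks roughly like this (a
simplified sketch of pgstat_flush_pending_entries() in pgstat.c, not the exact
code): every pending entry, whatever its kind, goes through that kind's
flush_pending_cb, so the database counters that the relation flush updates go
through the same mechanism, without any explicit ordering of stats kinds:

    dlist_node *cur = dlist_is_empty(&pgStatPending) ? NULL :
        dlist_head_node(&pgStatPending);

    while (cur)
    {
        PgStat_EntryRef *entry_ref =
            dlist_container(PgStat_EntryRef, pending_node, cur);
        const PgStat_KindInfo *kind_info =
            pgstat_get_kind_info(entry_ref->shared_entry->key.kind);
        dlist_node *next;

        /* pick the next entry before the flush possibly unlinks this one */
        next = (cur == dlist_tail_node(&pgStatPending)) ? NULL :
            dlist_next_node(&pgStatPending, cur);

        /* flush this entry's pending data into the shared stats entry */
        (void) kind_info->flush_pending_cb(entry_ref, nowait);

        cur = next;
    }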

> I get that you are doing what you do here because
> the IO pending statistics (PendingIOStats) are reset each time the
> stat IO flush callback finishes its job, and we'd lose the amount of
> stats saved between two reports.

We don't lose the stats; it just takes one more reporting cycle to
get them flushed. The sequence (without the ordering being forced) is:

1. static flush of the backend stats (happens before IO because the backend's enum value < IO's enum value)
2. static flush of IO -> pending IOs are copied into the backend's pending IOs

later on (i.e. next report cycle):

3. static flush of the backend stats
4. static flush of IO -> pending IOs are copied into the backend's pending IOs

During 3. we now flush the backend IO stats coming from 2.

So the stats would still be reported correctly, just with a one-cycle delay in
the backend IO stats reporting.
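
To illustrate with the flush_static_cb of both kinds (pgstat_backend_flush_cb()
and pgstat_io_flush_cb(), if I get the names right), the call order per report
cycle is, schematically:

    /* report cycle N (enum order: PGSTAT_KIND_BACKEND < PGSTAT_KIND_IO) */
    pgstat_backend_flush_cb(nowait);   /* 1. backend pending IO still empty */
    pgstat_io_flush_cb(nowait);        /* 2. flushes IO, copies PendingIOStats
                                        *    into the backend's pending IOs */

    /* report cycle N+1 */
    pgstat_backend_flush_cb(nowait);   /* 3. flushes the backend IO copied in 2. */
    pgstat_io_flush_cb(nowait);        /* 4. copies this cycle's IO, to be
                                        *    flushed during cycle N+2 */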

> Putting aside the style issue (aka I like the pgstat_count_backend_*
> interface, rather than copying the pending data), I think that we lack
> data to be able to clearly identify how and actually how much we
> should try to optimize and change this code:
> - Should we try to use more inlining?
> - Should we try to use more static structures, without any functions
> at all.
> - Should we try to prefer copies? 

The fundamental issue is that as we add more stats to PGSTAT_KIND_BACKEND,
we increasingly end up doing duplicate work: the same operation gets counted in
both the specialized stat kind (IO, relation, etc.) and the backend stats.
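
Schematically (argument lists trimmed, from memory of pgstat_count_io_op*() in
pgstat_io.c), each I/O currently goes through something like:

    /* counted once for the backend stats kind... */
    if (pgstat_tracks_backend_bktype(MyBackendType))
        pgstat_count_backend_io_op(io_object, io_context, io_op, ...);

    /* ...and counted again for the IO stats kind */
    PendingIOStats.counts[io_object][io_context][io_op] += cnt;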

I think that backend stats should serve as a consolidated view of what a particular
backend is doing, not as an independent counting mechanism.

It makes more sense for backend stats to copy from the specialized pending stats
rather than having every operation increment multiple independent counters.
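
A rough sketch of what I mean (pgstat_backend_accumulate_io() is a made-up name
here, not necessarily what v3 does): the IO flush callback hands its pending
counters over to the backend pending stats once per flush, instead of every
single I/O touching two sets of counters:

    static bool
    pgstat_io_flush_cb(bool nowait)
    {
        /* ... existing code flushing PendingIOStats into shared memory ... */

        /* hand the just-flushed counters over to the backend's pending stats */
        if (pgstat_tracks_backend_bktype(MyBackendType))
            pgstat_backend_accumulate_io(&PendingIOStats);

        /* reset the local pending IO, as today (nowait handling omitted) */
        MemSet(&PendingIOStats, 0, sizeof(PendingIOStats));

        return false;
    }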

The specific optimization techniques you mention (inlining, static structures, etc.)
don't address the fundamental architectural issue of duplicate counting.

> In which case, could it make sense
> to introduce a concept of dependency between stats kinds?  This way,
> we could order how the flush actions are taken rather than having a
> for loop based on the sole stats kind IDs.

We could, but I don't think we need that complexity yet.
Backend stats are inherently special: they're a "consolidated" view of all the
activity of a specific backend, which naturally makes them dependent on other stats kinds.

This makes backend stats naturally suited to be "last" in the flush order.

I'm not sure we'll need such ordering dependencies for other kinds. And if we need
more later on, we could always work on a more sophisticated approach.

The v3 implementation is explicit about this ordering (with comments and clear
code structure).

> - Should we try to reduce diff operations at flush time?  This is
> something done by the WAL stats and worry about their costs, with
> diffs calculated each time between the current and previous states?
> With a non-forced report delay of 10s, this does not worry me, short
> transactions would, as we force reports each time a transaction
> finishes.

Right, that might be a concern for short transactions.

The copy approach makes more sense to me: it is just a simple memory copy, which
means fewer operations than iterating through all the counter arrays doing
arithmetic. I think it's hard to beat (even if we "could" reduce the diff
operations at flush time).

> For the first and second point, more optimization could be done for
> more stats kinds.  For the last point, we'd want to reconsider how
> pgstat_wal.c does its job.  As the proposal stands, at least it seems
> to me, this sacrifices code readability, though I get that the opinion
> of each may differ on the matter.  And perhaps there are practices
> that we may be able to use in core for all the stats kinds, as well.

I disagree on the readability point. The copy approach is actually quite readable
and is very clear about what's happening.

The diff approach requires understanding (see the sketch after this list):

- Why we maintain separate previous state variables
- How the diff calculations work across multi-dimensional arrays
- The complex time arithmetic with INSTR_TIME_ACCUM_DIFF
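
That's the kind of pattern I mean, simplified from pgstat_wal.c (from memory;
the INSTR_TIME_ACCUM_DIFF() dance for the timings comes on top of that):

    static WalUsage prevWalUsage;      /* previous state kept across flushes */

    /* at flush time */
    WalUsage    wal_usage_diff = {0};

    /* per-counter arithmetic between the current and the previous state */
    WalUsageAccumDiff(&wal_usage_diff, &pgWalUsage, &prevWalUsage);

    /* remember the current state for the next diff */
    prevWalUsage = pgWalUsage;

    /* ... then accumulate wal_usage_diff into the shared WAL stats entry ... */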

Regarding performance, I think it's hard to beat a simple memory copy.
Even with potential optimizations to other approaches, you're still doing more
work (arithmetic operations across arrays) versus a straightforward copy operation.

As for incrementing counters twice "just" for code readability, that seems backwards
to me.

> That would mean doing benchmarks with specific workloads on big
> hardware, likely.

I don't think we need to benchmark each approach.

I suspect benchmarks would show that avoiding duplicate work in the hot path (I/O operations)
provides benefits.

> So, what I am seeing currently here is an incomplete picture.

I think the picture is actually quite complete for the specific problem we're solving.

Backend stats are unique in that they're designed to "aggregate" activity from
other stats kinds, and that's what creates the duplicate counting issue.

The "incomplete picture" would be if we were trying to optimize all stat kinds
uniformly, but that's not the problem here.

The problem is specifically that backend stats create duplicate work by counting
the same operations that are already being counted in their respective dedicated
stats kinds.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


