At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund <andres@anarazel.de> wrote in
> I refreshed my memory: The startup process has indeed behaved that way for
> much longer than pg_stat_io existed - but it's harder to spot, because the
> stats are more coarsely aggregated :/. And it's very oddly inconsistent:
>
> The startup process doesn't report per-relation read/hit (it might when we
> create a fake relcache entry, too lazy to see what happens exactly), because we
The key difference is between relation-level and smgr-level access;
recovery doesn't call ReadBufferExtended.
> key those stats by oid. However, it *does* report the read/write time. But
> only at process exit, of course. The weird part is that the startup process
> does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
> basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
> the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't
> recent, afaict.
I see four issues here.
1. The current database stats omit buffer fetches that don't
originate from a relation.
In this case pgstat_relation can't work since recovery isn't aware
of relids. We might be able to resolve relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage, so we can use this for database
stats.
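To illustrate the idea only, here is a minimal standalone sketch, not
PostgreSQL code. ProcBufferUsage, DbBlockStats and flush_usage_to_db are
hypothetical stand-ins that I made up for this mail; the first models
the two pgBufferUsage fields mentioned above. The point is just that
database-level counts can be fed from the growth of process-wide
counters since the previous flush, with no relid involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the process-wide counters (cf. pgBufferUsage). */
typedef struct ProcBufferUsage
{
    int64_t shared_blks_hit;
    int64_t shared_blks_read;
} ProcBufferUsage;

/* Hypothetical stand-in for a per-database stats entry. */
typedef struct DbBlockStats
{
    int64_t blks_hit;
    int64_t blks_read;
} DbBlockStats;

/*
 * Add the growth of the process-wide counters since the last flush to the
 * database entry, then remember the current values as the new baseline.
 * No relation-level information is needed.
 */
void
flush_usage_to_db(const ProcBufferUsage *cur,
                  ProcBufferUsage *last_flushed,
                  DbBlockStats *db)
{
    /* The process-wide counters only ever grow. */
    assert(cur->shared_blks_hit >= last_flushed->shared_blks_hit);
    assert(cur->shared_blks_read >= last_flushed->shared_blks_read);

    db->blks_hit += cur->shared_blks_hit - last_flushed->shared_blks_hit;
    db->blks_read += cur->shared_blks_read - last_flushed->shared_blks_read;

    *last_flushed = *cur;
}
```

Calling it twice without any new buffer access adds nothing the second
time, so it would be safe to call from any reporting point.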
2. Even if we wanted to report stats for the startup process,
pgstat_report_stats wouldn't permit it since the transaction-end
timestamp doesn't advance.
I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.
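The following standalone toy model, again not PostgreSQL code, shows why
the clock source matters. It assumes the reporting function rate-limits
itself by comparing "now" against the time of the last flush. Reporter,
maybe_report, frozen_clock, wall_clock and the 1000 ms interval are all
hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t TimestampMs;
typedef TimestampMs (*ClockFn) (void);

/* Assumed minimum interval between two flushes, in milliseconds. */
#define MIN_REPORT_INTERVAL_MS 1000

typedef struct Reporter
{
    ClockFn     clock;          /* where "now" comes from */
    TimestampMs last_flush;     /* time of the previous flush */
    int         flushes;        /* number of flushes performed so far */
} Reporter;

/* Flush unless the previous flush happened too recently. */
bool
maybe_report(Reporter *r)
{
    TimestampMs now;

    assert(r != NULL && r->clock != NULL);
    now = r->clock();

    if (r->flushes > 0 && now - r->last_flush < MIN_REPORT_INTERVAL_MS)
        return false;           /* throttled */

    r->last_flush = now;
    r->flushes++;
    return true;
}

/* Simulated wall-clock time, advanced explicitly by the caller. */
static TimestampMs wall_ms = 0;

void
advance_wall_clock(TimestampMs delta)
{
    wall_ms += delta;
}

/* Models a transaction-end timestamp that never advances. */
TimestampMs
frozen_clock(void)
{
    return 0;
}

/* Models a clock that follows real time. */
TimestampMs
wall_clock(void)
{
    return wall_ms;
}
```

With frozen_clock the first call flushes and every later call is
throttled no matter how much real time has passed; with wall_clock a
later call flushes again once the interval has elapsed.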
3. When should we call pgstat_report_stats on the startup process?
During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.
4. In the existing ReadBuffer_common, there's an inconsistency in
counting hits and reads between pgstat_io and pgBufferUsage.
The difference comes from the case of RBM_ZERO pages. We should simply
align them.
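One way to keep them aligned, sketched as standalone C rather than as a
change to ReadBuffer_common: classify each access once and update both
counter sets from that single decision. BufferAccess, Counters,
count_access and ZEROED_COUNTS_AS_READ are hypothetical names, and
whether a zero-filled page should count as a read is an assumed policy
choice here, not something I'm asserting about the current code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum BufferAccess
{
    ACCESS_HIT,                 /* page was already in shared buffers */
    ACCESS_READ,                /* page was read from storage */
    ACCESS_ZEROED               /* page was zero-filled instead of read */
} BufferAccess;

typedef struct Counters
{
    int64_t hits;
    int64_t reads;
} Counters;

/* Assumed policy: a zero-filled page is not counted as a read. */
#define ZEROED_COUNTS_AS_READ false

/*
 * Classify the access once, then update both counter sets from the same
 * decision so that they cannot drift apart.
 */
void
count_access(BufferAccess kind, Counters *usage, Counters *io)
{
    bool is_hit;
    bool is_read;

    assert(usage != NULL && io != NULL);

    is_hit = (kind == ACCESS_HIT);
    is_read = (kind == ACCESS_READ) ||
        (kind == ACCESS_ZEROED && ZEROED_COUNTS_AS_READ);

    if (is_hit)
    {
        usage->hits++;
        io->hits++;
    }
    if (is_read)
    {
        usage->reads++;
        io->reads++;
    }
}
```

Whichever policy is chosen, both sets of counters then agree by
construction.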
> TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
> but doesn't maintain blks_read, blks_hit - which doesn't make sense.
As a result, the attached patch, which is meant for discussion, allows
pg_stat_database to show fetches and reads by the startup process as
the counts for the database with id 0.
There's still some difference between pg_stat_io and pg_stat_database,
but I haven't examined it in detail.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center