Re: shared-memory based stats collector - v70 - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: shared-memory based stats collector - v70
Date
Msg-id CAKFQuwbcj8WszSvnR+oqAv9CGtqwVS-OWg6fSAiZH-LxmtWreA@mail.gmail.com
Whole thread Raw
In response to Re: shared-memory based stats collector - v70  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: shared-memory based stats collector - v70  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
>
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].
>
>
> Working on the docs. Found a few things worth raising:
>
> 1)
> Existing text:
>    When the server shuts down cleanly, a permanent copy of the statistics
>    data is stored in the <filename>pg_stat</filename> subdirectory, so that
>    statistics can be retained across server restarts.  When recovery is
>    performed at server start (e.g., after immediate shutdown, server crash,
>    and point-in-time recovery), all statistics counters are reset.
>
> The existing docs patch hadn't updated yet. My current edit is
>
>    When the server shuts down cleanly, a permanent copy of the statistics
>    data is stored in the <filename>pg_stat</filename> subdirectory, so that
>    statistics can be retained across server restarts.  When crash recovery is
>    performed at server start (e.g., after immediate shutdown, server crash,
>    and point-in-time recovery, but not when starting a standby that was shut
>    down normally), all statistics counters are reset.
>
> but I'm not sure the parenthetical is easy enough to understand?

I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..

Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,

...
statistics can be retained across server restarts.  When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally then all counters are retained.


Maybe:
When the server shuts down cleanly a permanent copy of the statistics data is stored in the <filename>pg_stat</filename> subdirectory so that statistics can be retained across server restarts.  However, if crash recovery is performed (i.e., after immediate shutdown, server crash, or point-in-time recovery), all statistics counters are reset.  For any standby server, the initial startup to get the cluster initialized is a point-in-time crash recovery startup. For all subsequent startups it behaves like any other server.  For a hot standby server, statistics are retained during a failover promotion.

I'm pretty sure i.e., is correct since those three situations are not examples but rather the complete set.

Is crash recovery ever performed other than at server start?  If so I choose to remove the redundancy.

I feel like some of this detail about standby servers is/should be covered elsewhere and we are at least missing a cross-reference chance even if we leave the material coverage as-is.

> 2)
> The edit is not a problem, but it's hard to understand what the existing
> paragraph actually means?
>
> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
> index 3247e056663..8bfb584b752 100644
> --- a/doc/src/sgml/high-availability.sgml
> +++ b/doc/src/sgml/high-availability.sgml
> @@ -2222,17 +2222,17 @@ HINT:  You can then restart the server after making the necessary configuration
> ...
>     <para>
> -    The statistics collector is active during recovery. All scans, reads, blocks,
> +    The cumulative statistics system is active during recovery. All scans, reads, blocks,
>      index usage, etc., will be recorded normally on the standby. Replayed
>      actions will not duplicate their effects on primary, so replaying an
>      insert will not increment the Inserts column of pg_stat_user_tables.
>      The stats file is deleted at the start of recovery, so stats from primary
>      and standby will differ; this is considered a feature, not a bug.
>     </para>
>
>     <para>

Agreed partially. It's too detailed.  It might not need to mention WAL
replay.


The insert example seems like a poor one...IIUC cumulative statistics are not WAL logged and while in recovery INSERT is prohibited, so how would replaying the insert in the WAL result in a duplicated effect on pg_stat_user_tables.inserts?

I also have no idea what, in the fragment, "Replayed actions will not duplicate their effects on primary...", what "on primary" is supposed to mean.

I would like to write the following but I don't believe it is sufficiently true:

"The cumulative statistics system records only the locally generated activity of the cluster, including while in recovery.  Activity happening only due to the replay of WAL is not considered local."

But to apply the WAL we have to fetch blocks from the local filesystem and write them back out.  That is local activity happening due to the replay of WAL which sounds like it is and should be reported ("All...blocks...", and the example given being logical DDL oriented).

I cannot think of a better paragraph at the moment, the minimal change is good, and the detail it contains presently seems like the right amount, if indeed my interpretation of it is correct (i.e., the standby records physical stats, not logical ones).  It still has wording issues around "on primary" and maybe a better example choice than a disallowed-in-recovery-anyway insert.

David J.

pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: pg14 psql broke \d datname.nspname.relname
Next
From: Andres Freund
Date:
Subject: Re: shared-memory based stats collector - v70