Thread: Naming of the different stats systems / "stats collector"

Naming of the different stats systems / "stats collector"

From
Andres Freund
Date:
Hi,

One thing I'm not yet happy around the shared memory stats patch is
naming. Currently a lot of comments say things like:

  * [...] We convert to
  * microseconds in PgStat_Counter format when transmitting to the collector.

or

# - Query and Index Statistics Collector -

or

/* ----------
 * pgstat_report_subscription_drop() -
 *
 *  Tell the collector about dropping the subscription.
 * ----------
 */


the immediate question for the patch is what to replace "collector" with.

"stats subsystem" is too general, because that could be about
backend_activity.c, or pg_statistic, or ...

"shared memory stats" seems too focussed on the manner of storage, rather than
the kind of stats.

The patch currently uses "activity statistics" in a number of places, but that
is confusing too, because pg_stat_activity is a different kind of stats.

Any ideas?


The postgresql.conf.sample section header seems particularly odd - "index
statistics"? We collect more data about tables etc.


A more general point: Our naming around different types of stats is horribly
confused. We have stats describing the current state (e.g. pg_stat_activity,
pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats
(pg_stat_user_tables, pg_stat_database, etc) in the same namespace.  Should we
try to move towards something more coherent, at least going forward?


Greetings,

Andres Freund



Re: Naming of the different stats systems / "stats collector"

From
"David G. Johnston"
Date:
On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:

One thing I'm not yet happy around the shared memory stats patch is
naming. Currently a lot of comments say things like:

  * [...] We convert to
  * microseconds in PgStat_Counter format when transmitting to the collector.

or

# - Query and Index Statistics Collector -

or

/* ----------
 * pgstat_report_subscription_drop() -
 *
 *  Tell the collector about dropping the subscription.
 * ----------
 */


the immediate question for the patch is what to replace "collector" with.


Not really following the broader context here so this came out of nowhere for me.  What is the argument for changing the status quo here?  Collector seems like good term.
 

The patch currently uses "activity statistics" in a number of places, but that
is confusing too, because pg_stat_activity is a different kind of stats.

Any ideas?

If the complaint is that not all of these statistics modules use the statistics collector then maybe we say each non-collector module defines an "Event Listener".  Or, and without looking at the source code, have the collector simply forward events like "reset now" to the appropriate module but keep the collector as the single point of message interchange for all.  And so "tell the collector about" is indeed the correct phrasing of what happens.
 

The postgresql.conf.sample section header seems particularly odd - "index
statistics"? We collect more data about tables etc.

No argument for bringing the header current.

A more general point: Our naming around different types of stats is horribly
confused. We have stats describing the current state (e.g. pg_stat_activity,
pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats
(pg_stat_user_tables, pg_stat_database, etc) in the same namespace.  Should we
try to move towards something more coherent, at least going forward?


I'm not sure trying to improve this going forward, and thus having at least three categories, is particularly desirable.  While it is unfortunate that we don't have separate pg_metric and pg_status namespaces (combining pg_stat with pg_status or pg_state, the two obvious choices, would be undesirable being they all have a shared leading character sequence) that is where we are today.  We are probably stuck with just using the pg_stat namespace and doing a better job of letting users know about the underlying implementation choice each pg_stat relation took in order to know whether what is being reported is considered reliable (self-managed shared memory) or not (leverages the unreliable collector).  In short, deal with this mainly in documentation/comments and implementation details but leave the public facing naming alone.

David J.

Re: Naming of the different stats systems / "stats collector"

From
Kyotaro Horiguchi
Date:
At Tue, 8 Mar 2022 15:55:04 -0700, "David G. Johnston" <david.g.johnston@gmail.com> wrote in 
> On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
> > the immediate question for the patch is what to replace "collector" with.
> >
> >
> Not really following the broader context here so this came out of nowhere
> for me.  What is the argument for changing the status quo here?  Collector
> seems like good term.

The name "stats collector" is tied with the story that "there is a
process that only collects stats data that arrive from working
proceses.".  We have such modules like bgwriter, checkpointer,
walwriter and so on.  On the other hand we have many features with no
dedicate process instead work on shared storage area as a part of
working prcesses. table/column statistics, XLOG, heap, SLUR and so on.

In the world where every working process writes statitics to shared
meomry area by its own, no such process exists.  I think we no longer
name it "stats collector".

> > The patch currently uses "activity statistics" in a number of places, but
> > that
> > is confusing too, because pg_stat_activity is a different kind of stats.
> >
> > Any ideas?
> >
> 
> If the complaint is that not all of these statistics modules use the
> statistics collector then maybe we say each non-collector module defines an
> "Event Listener".  Or, and without looking at the source code, have the
> collector simply forward events like "reset now" to the appropriate module
> but keep the collector as the single point of message interchange for all.
> And so "tell the collector about" is indeed the correct phrasing of what
> happens.

So the collector as a process is going to die. We need alternative
name for the non-collector.  Metrics, as you mentioned below, sounds
good to me.  The name "activity stat(istics)?s" is an answer to my
desire to discriminate it from "table/column statistics" but I have to
admit that it is still not great.

> > The postgresql.conf.sample section header seems particularly odd - "index
> > statistics"? We collect more data about tables etc.
> >
> 
> No argument for bringing the header current.
> 
> >
> > A more general point: Our naming around different types of stats is
> > horribly
> > confused. We have stats describing the current state (e.g.
> > pg_stat_activity,
> > pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats
> > (pg_stat_user_tables, pg_stat_database, etc) in the same namespace.
> > Should we
> > try to move towards something more coherent, at least going forward?
> >
> >
> I'm not sure trying to improve this going forward, and thus having at least
> three categories, is particularly desirable.  While it is unfortunate that
> we don't have separate pg_metric and pg_status namespaces (combining
> pg_stat with pg_status or pg_state, the two obvious choices, would be
> undesirable being they all have a shared leading character sequence) that
> is where we are today.  We are probably stuck with just using the pg_stat
> namespace and doing a better job of letting users know about the underlying
> implementation choice each pg_stat relation took in order to know whether
> what is being reported is considered reliable (self-managed shared memory)
> or not (leverages the unreliable collector).  In short, deal with this
> mainly in documentation/comments and implementation details but leave the
> public facing naming alone.
> 
> David J.

If we could, I like the namings like pg_metrics.process,
pg_metrics.replication, pg_progress.vacuum, pg_progress.basebackup,
and pg_stats.database, pg_stats.user_tables..  With such eyes, it
looks somewhat odd that pg_stat_* views are belonging to the
pg_catalog namespace.

If we had system table-aliases, people who insist on the good-old
names can live with that.  Even if there isn't, we can instead provide
views with the old names.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Naming of the different stats systems / "stats collector"

From
Andres Freund
Date:
On 2022-03-08 15:55:04 -0700, David G. Johnston wrote:
> On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
> > One thing I'm not yet happy around the shared memory stats patch is
> > naming. Currently a lot of comments say things like:
> >
> >   * [...] We convert to
> >   * microseconds in PgStat_Counter format when transmitting to the
> > collector.
> >
> > or
> >
> > # - Query and Index Statistics Collector -
> >
> > or
> >
> > /* ----------
> >  * pgstat_report_subscription_drop() -
> >  *
> >  *  Tell the collector about dropping the subscription.
> >  * ----------
> >  */
> >
> >
> > the immediate question for the patch is what to replace "collector" with.
> >
> >
> Not really following the broader context here so this came out of nowhere
> for me.  What is the argument for changing the status quo here?  Collector
> seems like good term.

Sorry, probably should have shared a bit more context. The shared memory stats
patch removes the stats collector process - which seems to make 'collector'
not descriptive anymore...

It's still lossy in the sense that a crash will result in stats being lost and
inprecise in that counter updates can be delayed, but there won't be lost
stats due to UDP messages being thrown away under load anymore.

Greetings,

Andres Freund



Re: Naming of the different stats systems / "stats collector"

From
"David G. Johnston"
Date:
On Tue, Mar 8, 2022 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-03-08 15:55:04 -0700, David G. Johnston wrote:
> On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
> > One thing I'm not yet happy around the shared memory stats patch is
> > naming. Currently a lot of comments say things like:
> >
> >   * [...] We convert to
> >   * microseconds in PgStat_Counter format when transmitting to the
> > collector.
> >

"...format for writing to the statistics datastore"
 
> > or
> >
> > # - Query and Index Statistics Collector -

"...Statistics Collection"

 
> >
> > or
> >
> > /* ----------
> >  * pgstat_report_subscription_drop() -
> >  *
> >  *  Tell the collector about dropping the subscription.
> >  * ----------
> >  */

I would expect that either the function gets renamed or just goes away.  Just changing the word "collector" isn't going to be a good change, the new description should describe whatever the new behavior is.


> >
> > the immediate question for the patch is what to replace "collector" with.
> >
> >
> Not really following the broader context here so this came out of nowhere
> for me.  What is the argument for changing the status quo here?  Collector
> seems like good term.

Sorry, probably should have shared a bit more context. The shared memory stats
patch removes the stats collector process - which seems to make 'collector'
not descriptive anymore...


As shown above I don't see that there is a single word that will simply replace "collector".  We are changing a core design of the system and each dependent system will need to be tweaked in a context-appropriate manner.

As the process goes away we are now dealing directly with a conceptual datastore.  And instead of referring to the implementation detail of how statistics are collected we can just refer to the "collection" behavior generically.  Whether we funnel through a process or write directly to the datastore it is still statistics collection.

David J.

Re: Naming of the different stats systems / "stats collector"

From
Andres Freund
Date:
Hi,

On 2022-03-08 19:13:45 -0700, David G. Johnston wrote:
> On Tue, Mar 8, 2022 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
>
> > On 2022-03-08 15:55:04 -0700, David G. Johnston wrote:
> > > On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
> > > > One thing I'm not yet happy around the shared memory stats patch is
> > > > naming. Currently a lot of comments say things like:
> > > >
> > > >   * [...] We convert to
> > > >   * microseconds in PgStat_Counter format when transmitting to the
> > > > collector.
> > > >
> >
>
> "...format for writing to the statistics datastore"

That could also describe pg_statistic. Nor are we writing during normal
operation (stats are first accumulated locally and subsequently a shared
memory hash table is updated), the on-disk file is only written at shutdown.

Which is my problem with this - we need a descriptive term / shorthand that
describes the type of statistics we currently send to the stats collector.

"cumulative stats subsystem"?


> > > > /* ----------
> > > >  * pgstat_report_subscription_drop() -
> > > >  *
> > > >  *  Tell the collector about dropping the subscription.
> > > >  * ----------
> > > >  */
> >
>
> I would expect that either the function gets renamed or just goes away.
> Just changing the word "collector" isn't going to be a good change, the new
> description should describe whatever the new behavior is.

It currently has the same signature in the patch, and I don't forsee that
changing.


> As the process goes away we are now dealing directly with a conceptual
> datastore.  And instead of referring to the implementation detail of how
> statistics are collected we can just refer to the "collection" behavior
> generically.  Whether we funnel through a process or write directly to the
> datastore it is still statistics collection.

We have many other types of stats that we collect, so yes, it's statistic
collection, but that's not descriptive enough imo. "Stats collector" somewhat
worked because the fact that the collector process was involved served to
distinguish from other types of stats.

Greetings,

Andres Freund



Re: Naming of the different stats systems / "stats collector"

From
"David G. Johnston"
Date:
On Tue, Mar 8, 2022 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
we need a descriptive term / shorthand that
describes the type of statistics we currently send to the stats collector.

"cumulative stats subsystem"?


I'm growing fond of "cumulative".  It is more precise (and restrictive) than "metric" but that is beneficial here so long as it is indeed true (which a quick skim of Table 28.2. Collected Statistics Views [1] leads me to believe it is).

I'd be concerned that subsystem implies a collection process in a manner similar to how you associated datastore with a physical file on disk.  But I'd pick subsystem over datastore here in any case.

David J.


Re: Naming of the different stats systems / "stats collector"

From
Andres Freund
Date:
Hi,

On 2022-03-08 20:17:06 -0700, David G. Johnston wrote:
> On Tue, Mar 8, 2022 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > we need a descriptive term / shorthand that
> > describes the type of statistics we currently send to the stats collector.
> >
> > "cumulative stats subsystem"?
> >
> >
> I'm growing fond of "cumulative".  It is more precise (and restrictive)
> than "metric" but that is beneficial here so long as it is indeed true
> (which a quick skim of Table 28.2. Collected Statistics Views [1] leads me
> to believe it is).

I did go for that one - I think it looks better than the other
alternatives. Should you be interested, I posted a version using that name at
[1]. The majority of the changes related to the naming are in 0005, 0026,
0028...

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220404041516.cctrvpadhuriawlq%40alap3.anarazel.de