Thread: Naming of the different stats systems / "stats collector"
Hi, One thing I'm not yet happy around the shared memory stats patch is naming. Currently a lot of comments say things like: * [...] We convert to * microseconds in PgStat_Counter format when transmitting to the collector. or # - Query and Index Statistics Collector - or /* ---------- * pgstat_report_subscription_drop() - * * Tell the collector about dropping the subscription. * ---------- */ the immediate question for the patch is what to replace "collector" with. "stats subsystem" is too general, because that could be about backend_activity.c, or pg_statistic, or ... "shared memory stats" seems too focussed on the manner of storage, rather than the kind of stats. The patch currently uses "activity statistics" in a number of places, but that is confusing too, because pg_stat_activity is a different kind of stats. Any ideas? The postgresql.conf.sample section header seems particularly odd - "index statistics"? We collect more data about tables etc. A more general point: Our naming around different types of stats is horribly confused. We have stats describing the current state (e.g. pg_stat_activity, pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats (pg_stat_user_tables, pg_stat_database, etc) in the same namespace. Should we try to move towards something more coherent, at least going forward? Greetings, Andres Freund
On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
One thing I'm not yet happy around the shared memory stats patch is
naming. Currently a lot of comments say things like:
* [...] We convert to
* microseconds in PgStat_Counter format when transmitting to the collector.
or
# - Query and Index Statistics Collector -
or
/* ----------
* pgstat_report_subscription_drop() -
*
* Tell the collector about dropping the subscription.
* ----------
*/
the immediate question for the patch is what to replace "collector" with.
Not really following the broader context here so this came out of nowhere for me. What is the argument for changing the status quo here? Collector seems like good term.
The patch currently uses "activity statistics" in a number of places, but that
is confusing too, because pg_stat_activity is a different kind of stats.
Any ideas?
If the complaint is that not all of these statistics modules use the statistics collector then maybe we say each non-collector module defines an "Event Listener". Or, and without looking at the source code, have the collector simply forward events like "reset now" to the appropriate module but keep the collector as the single point of message interchange for all. And so "tell the collector about" is indeed the correct phrasing of what happens.
The postgresql.conf.sample section header seems particularly odd - "index
statistics"? We collect more data about tables etc.
No argument for bringing the header current.
A more general point: Our naming around different types of stats is horribly
confused. We have stats describing the current state (e.g. pg_stat_activity,
pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats
(pg_stat_user_tables, pg_stat_database, etc) in the same namespace. Should we
try to move towards something more coherent, at least going forward?
I'm not sure trying to improve this going forward, and thus having at least three categories, is particularly desirable. While it is unfortunate that we don't have separate pg_metric and pg_status namespaces (combining pg_stat with pg_status or pg_state, the two obvious choices, would be undesirable being they all have a shared leading character sequence) that is where we are today. We are probably stuck with just using the pg_stat namespace and doing a better job of letting users know about the underlying implementation choice each pg_stat relation took in order to know whether what is being reported is considered reliable (self-managed shared memory) or not (leverages the unreliable collector). In short, deal with this mainly in documentation/comments and implementation details but leave the public facing naming alone.
David J.
At Tue, 8 Mar 2022 15:55:04 -0700, "David G. Johnston" <david.g.johnston@gmail.com> wrote in > On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote: > > the immediate question for the patch is what to replace "collector" with. > > > > > Not really following the broader context here so this came out of nowhere > for me. What is the argument for changing the status quo here? Collector > seems like good term. The name "stats collector" is tied with the story that "there is a process that only collects stats data that arrive from working proceses.". We have such modules like bgwriter, checkpointer, walwriter and so on. On the other hand we have many features with no dedicate process instead work on shared storage area as a part of working prcesses. table/column statistics, XLOG, heap, SLUR and so on. In the world where every working process writes statitics to shared meomry area by its own, no such process exists. I think we no longer name it "stats collector". > > The patch currently uses "activity statistics" in a number of places, but > > that > > is confusing too, because pg_stat_activity is a different kind of stats. > > > > Any ideas? > > > > If the complaint is that not all of these statistics modules use the > statistics collector then maybe we say each non-collector module defines an > "Event Listener". Or, and without looking at the source code, have the > collector simply forward events like "reset now" to the appropriate module > but keep the collector as the single point of message interchange for all. > And so "tell the collector about" is indeed the correct phrasing of what > happens. So the collector as a process is going to die. We need alternative name for the non-collector. Metrics, as you mentioned below, sounds good to me. The name "activity stat(istics)?s" is an answer to my desire to discriminate it from "table/column statistics" but I have to admit that it is still not great. > > The postgresql.conf.sample section header seems particularly odd - "index > > statistics"? We collect more data about tables etc. > > > > No argument for bringing the header current. > > > > > A more general point: Our naming around different types of stats is > > horribly > > confused. We have stats describing the current state (e.g. > > pg_stat_activity, > > pg_stat_replication, pg_stat_progress_*, ...) and accumulated stats > > (pg_stat_user_tables, pg_stat_database, etc) in the same namespace. > > Should we > > try to move towards something more coherent, at least going forward? > > > > > I'm not sure trying to improve this going forward, and thus having at least > three categories, is particularly desirable. While it is unfortunate that > we don't have separate pg_metric and pg_status namespaces (combining > pg_stat with pg_status or pg_state, the two obvious choices, would be > undesirable being they all have a shared leading character sequence) that > is where we are today. We are probably stuck with just using the pg_stat > namespace and doing a better job of letting users know about the underlying > implementation choice each pg_stat relation took in order to know whether > what is being reported is considered reliable (self-managed shared memory) > or not (leverages the unreliable collector). In short, deal with this > mainly in documentation/comments and implementation details but leave the > public facing naming alone. > > David J. If we could, I like the namings like pg_metrics.process, pg_metrics.replication, pg_progress.vacuum, pg_progress.basebackup, and pg_stats.database, pg_stats.user_tables.. With such eyes, it looks somewhat odd that pg_stat_* views are belonging to the pg_catalog namespace. If we had system table-aliases, people who insist on the good-old names can live with that. Even if there isn't, we can instead provide views with the old names. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2022-03-08 15:55:04 -0700, David G. Johnston wrote: > On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote: > > One thing I'm not yet happy around the shared memory stats patch is > > naming. Currently a lot of comments say things like: > > > > * [...] We convert to > > * microseconds in PgStat_Counter format when transmitting to the > > collector. > > > > or > > > > # - Query and Index Statistics Collector - > > > > or > > > > /* ---------- > > * pgstat_report_subscription_drop() - > > * > > * Tell the collector about dropping the subscription. > > * ---------- > > */ > > > > > > the immediate question for the patch is what to replace "collector" with. > > > > > Not really following the broader context here so this came out of nowhere > for me. What is the argument for changing the status quo here? Collector > seems like good term. Sorry, probably should have shared a bit more context. The shared memory stats patch removes the stats collector process - which seems to make 'collector' not descriptive anymore... It's still lossy in the sense that a crash will result in stats being lost and inprecise in that counter updates can be delayed, but there won't be lost stats due to UDP messages being thrown away under load anymore. Greetings, Andres Freund
On Tue, Mar 8, 2022 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-03-08 15:55:04 -0700, David G. Johnston wrote:
> On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote:
> > One thing I'm not yet happy around the shared memory stats patch is
> > naming. Currently a lot of comments say things like:
> >
> > * [...] We convert to
> > * microseconds in PgStat_Counter format when transmitting to the
> > collector.
> >
"...format for writing to the statistics datastore"
> > or
> >
> > # - Query and Index Statistics Collector -
"...Statistics Collection"
> >
> > or
> >
> > /* ----------
> > * pgstat_report_subscription_drop() -
> > *
> > * Tell the collector about dropping the subscription.
> > * ----------
> > */
I would expect that either the function gets renamed or just goes away. Just changing the word "collector" isn't going to be a good change, the new description should describe whatever the new behavior is.
> >
> > the immediate question for the patch is what to replace "collector" with.
> >
> >
> Not really following the broader context here so this came out of nowhere
> for me. What is the argument for changing the status quo here? Collector
> seems like good term.
Sorry, probably should have shared a bit more context. The shared memory stats
patch removes the stats collector process - which seems to make 'collector'
not descriptive anymore...
As shown above I don't see that there is a single word that will simply replace "collector". We are changing a core design of the system and each dependent system will need to be tweaked in a context-appropriate manner.
As the process goes away we are now dealing directly with a conceptual datastore. And instead of referring to the implementation detail of how statistics are collected we can just refer to the "collection" behavior generically. Whether we funnel through a process or write directly to the datastore it is still statistics collection.
David J.
Hi, On 2022-03-08 19:13:45 -0700, David G. Johnston wrote: > On Tue, Mar 8, 2022 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2022-03-08 15:55:04 -0700, David G. Johnston wrote: > > > On Tue, Mar 8, 2022 at 1:54 PM Andres Freund <andres@anarazel.de> wrote: > > > > One thing I'm not yet happy around the shared memory stats patch is > > > > naming. Currently a lot of comments say things like: > > > > > > > > * [...] We convert to > > > > * microseconds in PgStat_Counter format when transmitting to the > > > > collector. > > > > > > > > "...format for writing to the statistics datastore" That could also describe pg_statistic. Nor are we writing during normal operation (stats are first accumulated locally and subsequently a shared memory hash table is updated), the on-disk file is only written at shutdown. Which is my problem with this - we need a descriptive term / shorthand that describes the type of statistics we currently send to the stats collector. "cumulative stats subsystem"? > > > > /* ---------- > > > > * pgstat_report_subscription_drop() - > > > > * > > > > * Tell the collector about dropping the subscription. > > > > * ---------- > > > > */ > > > > I would expect that either the function gets renamed or just goes away. > Just changing the word "collector" isn't going to be a good change, the new > description should describe whatever the new behavior is. It currently has the same signature in the patch, and I don't forsee that changing. > As the process goes away we are now dealing directly with a conceptual > datastore. And instead of referring to the implementation detail of how > statistics are collected we can just refer to the "collection" behavior > generically. Whether we funnel through a process or write directly to the > datastore it is still statistics collection. We have many other types of stats that we collect, so yes, it's statistic collection, but that's not descriptive enough imo. "Stats collector" somewhat worked because the fact that the collector process was involved served to distinguish from other types of stats. Greetings, Andres Freund
On Tue, Mar 8, 2022 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
we need a descriptive term / shorthand that
describes the type of statistics we currently send to the stats collector.
"cumulative stats subsystem"?
I'm growing fond of "cumulative". It is more precise (and restrictive) than "metric" but that is beneficial here so long as it is indeed true (which a quick skim of Table 28.2. Collected Statistics Views [1] leads me to believe it is).
I'd be concerned that subsystem implies a collection process in a manner similar to how you associated datastore with a physical file on disk. But I'd pick subsystem over datastore here in any case.
David J.
Hi, On 2022-03-08 20:17:06 -0700, David G. Johnston wrote: > On Tue, Mar 8, 2022 at 7:32 PM Andres Freund <andres@anarazel.de> wrote: > > > we need a descriptive term / shorthand that > > describes the type of statistics we currently send to the stats collector. > > > > "cumulative stats subsystem"? > > > > > I'm growing fond of "cumulative". It is more precise (and restrictive) > than "metric" but that is beneficial here so long as it is indeed true > (which a quick skim of Table 28.2. Collected Statistics Views [1] leads me > to believe it is). I did go for that one - I think it looks better than the other alternatives. Should you be interested, I posted a version using that name at [1]. The majority of the changes related to the naming are in 0005, 0026, 0028... Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20220404041516.cctrvpadhuriawlq%40alap3.anarazel.de