Thread: shared memory stats: high level design decisions: consistency, dropping

shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
the goal of getting it into a shape that I'd be happy to commit.  That
thread is quite long and most are probably skipping over new messages in
it.

There are two high-level design decisions / questions that I think
warrant a wider audience (I'll keep lower level discussion in the other
thread).

In case it is not obvious, the goal of the shared memory stats patch is
to replace the existing statistics collector, to which new stats are
reported via a UDP socket, and where clients read data from the stats
collector by reading the entire database's stats from disk.

The replacement is to put the statistics into a shared memory
segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity,
etc) are stored directly in a struct in memory. Stats for objects of
which a variable number exists, e.g. tables, are addressed via a dshash
table that points to the stats, which are in turn allocated using dsa.h.


1) What kind of consistency do we want from the pg_stats_* views?

Right now the first access to stats in a transaction will trigger a read
of both the global and per-database stats from disk. If the on-disk
state is too old, we'll ask the stats collector to write out a new file
a couple times.

For the rest of the transaction that in-memory state is used unless
pg_stat_clear_snapshot() is called. Which means that multiple reads from
e.g. pg_stat_user_tables will show the same results as before [2].
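
To make the current behaviour concrete, a minimal psql example (the
table name 'foo' is just a stand-in):

 BEGIN;
 -- the first access builds the per-transaction stats snapshot
 SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'foo';
 -- repeated access returns the same numbers, even if 'foo' was scanned meanwhile
 SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'foo';
 -- discarding the snapshot forces a fresh read
 SELECT pg_stat_clear_snapshot();
 SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'foo';
 COMMIT;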

That makes stats accesses quite expensive if there are lots of
objects.

But it also means that separate stats accesses - which happen all the
time - return something repeatable and kind of consistent.

Now, the stats aren't really consistent in the sense that they are
really accurate: UDP messages can be lost, only some of the stats
generated by a TX might have been received so far, other transactions
haven't yet sent theirs, etc.


With the shared memory patch the concept of copying all stats for the
current database into local memory at the time of the first stats access
doesn't make sense to me.  Horiguchi-san had actually implemented that,
but I argued that that would be cargo-culting an efficiency hack
required by the old storage model forward.

The cost of doing this is substantial. On master, with a database that
contains 1 million empty tables, any stats access takes ~0.4s and
increases memory usage by 170MB.


1.1)

I hope everybody agrees that we do not need to require stats to be the
way they were at the time of the first stats access in a transaction,
even if that first access was to a different stats object than the
currently accessed one?


1.2)

Do we expect repeated accesses to the same stat to stay the same through
the transaction?  The easiest way to implement stats accesses is to
simply fetch the stats from shared memory ([3]). That would obviously
result in repeated accesses to the same stat potentially returning
changing results over time.

I think that's perfectly fine, desirable even, for pg_stat_*.


1.3)

What kind of consistency do we expect between columns of views like
pg_stat_all_tables?

Several of the stats views aren't based on SRFs or composite-type
returning functions, but instead fetch each stat separately:

E.g. pg_stat_all_tables:
 SELECT c.oid AS relid,
    n.nspname AS schemaname,
    c.relname,
    pg_stat_get_numscans(c.oid) AS seq_scan,
    pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
    sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
    sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
    pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
...
    pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
   FROM pg_class c
     LEFT JOIN pg_index i ON c.oid = i.indrelid
...

Which means that if we do not cache stats, additional stats updates
could have been applied between two stats accesses. E.g. the seq_scan
could be from before some pgstat_report_stat(), but the seq_tup_read
from after.

If we instead fetch all of a table's stats in one go, we would get
consistency between the columns. But obviously that'd require changing
all the stats views.

Horiguchi-san, in later iterations of the patch, attempted to address
this issue by adding one-entry caches below
pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc, which is
what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc use.


But I think that leads to very confusing results. Access stats for the
same relation multiple times in a row? Do not see updates. Switch
between e.g. a table and its indexes? See updates.


I personally think it's fine to have short-term divergences between the
columns. The stats aren't that accurate anyway, as we don't submit them
all the time.  And if we want consistency between columns, we should
instead replace the current view definitions with [set-of-]record
returning functions - everything else seems to lead to weird tradeoffs.
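
To illustrate what that could look like (pg_stat_get_table() here is a
hypothetical set-of-record returning function, not something that exists
today), the view body would become roughly:

 SELECT c.oid AS relid,
        n.nspname AS schemaname,
        c.relname,
        s.seq_scan,
        s.seq_tup_read,
        s.n_tup_ins
        -- ... remaining counters, all from the same consistent fetch ...
   FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace,
        LATERAL pg_stat_get_table(c.oid) AS s;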



2) How to remove stats of dropped objects?

In the stats collector world stats for dropped objects (tables, indexes,
functions, etc) are dropped after the fact, using a pretty expensive
process:

Each autovacuum worker cycle and each manual VACUUM does
pgstat_vacuum_stat() to detect since-dropped objects. It does that by
building hash-tables for all databases, tables and functions, and then
comparing that against a freshly loaded stats snapshot. All stats objects
not in pg_class etc. are dropped.

The patch currently copies that approach, although that adds some
complications, mainly around [3].


Accessing all database objects after each VACUUM, even if the table was
tiny, isn't great, performance-wise. In a fresh master database with 1
million functions, a VACUUM of an empty table takes ~0.5s, with 1
million tables it's ~1s.  Due to partitioning, databases with many
objects are of course getting more common.


There isn't really a better approach in the stats collector world. As
messages to the stats collector can get lost, we need to be able to
re-do dropping of dead stats objects.


But now we could instead schedule stats to be removed at commit
time. That's not trivial of course, as we'd need to handle cases where
the commit fails after the commit record, but before processing the
dropped stats.

But it seems that integrating the stats that need to be dropped into the
commit record would make a lot of sense, with benefits beyond the
[auto-]vacuum efficiency gains: it would e.g. integrate neatly with
streaming replication and even open the door to keeping stats across
crashes.


My gut feeling here is to try to fix the remaining issues in the
"collect oids" approach for 14 and to try to change the approach in
15. And, if that proves too hard, try to see how hard it'd be to
"accurately" drop.  But I'm also not sure - it might be smarter to go
full in, to avoid introducing a system that we'll just rip out again.


Comments?


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20180629.173418.190173462.horiguchi.kyotaro%40lab.ntt.co.jp

[2] Except that new tables will show up with lots of 0s

[3] There is a cache to avoid repeated dshash lookups for
    previously-accessed stats, to avoid contention. But that just points
    to the shared memory area with the stats.



Re: shared memory stats: high level design decisions: consistency, dropping

From
Hannu Krosing
Date:
> But now we could instead schedule stats to be removed at commit
> time. That's not trivial of course, as we'd need to handle cases where
> the commit fails after the commit record, but before processing the
> dropped stats.

We likely can not remove them at commit time, but only after the
oldest open snapshot moves past that commit?

Would an approach where we keep stats in a structure logically similar
to the MVCC we use for normal tables be completely unfeasible?

We would only need to keep one version per backend in a transaction.

---
Hannu






Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote:
> > But now we could instead schedule stats to be removed at commit
> > time. That's not trivial of course, as we'd need to handle cases where
> > the commit fails after the commit record, but before processing the
> > dropped stats.
> 
> We likely can not remove them at commit time, but only after the
> oldest open snapshot moves parts that commit ?

I don't see why? A dropped table is dropped, and cannot be accessed
anymore. Snapshots don't play a role here - the underlying data is gone
(minus a placeholder file to avoid reusing the oid, until the next
commit).  If you run a vacuum on some unrelated table in the same
database, the stats for a dropped table will already be removed long
before there's no relation that could theoretically open the table.

Note that table level locking would prevent a table from being dropped
if a long-running transaction has already accessed it.


> Would an approach where we keep stats in a structure logically similar
> to MVCC we use for normal tables be completely unfeasible ?

Yes, pretty unfeasible. Stats should work on standbys too...

Regards,

Andres



Re: shared memory stats: high level design decisions: consistency, dropping

From
Hannu Krosing
Date:
On Sat, Mar 20, 2021 at 1:21 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote:
> > > But now we could instead schedule stats to be removed at commit
> > > time. That's not trivial of course, as we'd need to handle cases where
> > > the commit fails after the commit record, but before processing the
> > > dropped stats.
> >
> > We likely can not remove them at commit time, but only after the
> > oldest open snapshot moves parts that commit ?
>
> I don't see why? A dropped table is dropped, and cannot be accessed
> anymore. Snapshots don't play a role here - the underlying data is gone
> (minus a placeholder file to avoid reusing the oid, until the next
> commit).  If you run a vacuum on some unrelated table in the same
> database, the stats for a dropped table will already be removed long
> before there's no relation that could theoretically open the table.
>
> Note that table level locking would prevent a table from being dropped
> if a long-running transaction has already accessed it.

Yeah, just checked. DROP TABLE waits until the reading transaction finishes.

>
> > Would an approach where we keep stats in a structure logically similar
> > to MVCC we use for normal tables be completely unfeasible ?
>
> Yes, pretty unfeasible. Stats should work on standbys too...

I did not mean actually using MVCC and real transaction ids but rather
a similar approach, where (potentially) different stats rows are kept
for each backend.

This of course only is a win in case multiple backends can use the
same stats row. Otherwise it is easier to copy the backend's version
into backend-local memory.

But I myself do not see any problem with stats rows changing all the time.

The only worry would be parts of the same row being out of sync. This
can of course be solved by locking, but for a large number of backends
with tiny transactions this locking itself could potentially become a
problem. Here alternating between two or more versions could help, and
then it also starts to make sense to keep the copies in shared memory.



> Regards,
>
> Andres



Re: shared memory stats: high level design decisions: consistency, dropping

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
> the goal of getting it into a shape that I'd be happy to commit.  That
> thread is quite long and most are probably skipping over new messages in
> it.

Awesome, +1.

> 1) What kind of consistency do we want from the pg_stats_* views?
>
> Right now the first access to stats in a transaction will trigger a read
> of both the global and per-database stats from disk. If the on-disk
> state is too old, we'll ask the stats collector to write out a new file
> a couple times.
>
> For the rest of the transaction that in-memory state is used unless
> pg_stat_clear_snapshot() is called. Which means that multiple reads from
> e.g. pg_stat_user_tables will show the same results as before [2].
>
> That makes stats accesses quite expensive if there are lots of
> objects.
>
> But it also means that separate stats accesses - which happen all the
> time - return something repeatable and kind of consistent.
>
> Now, the stats aren't really consistent in the sense that they are
> really accurate, UDP messages can be lost, or only some of the stats
> generated by a TX might not yet have been received, other transactions
> haven't yet sent them. Etc.
>
>
> With the shared memory patch the concept of copying all stats for the
> current database into local memory at the time of the first stats access
> doesn't make sense to me.  Horiguchi-san had actually implemented that,
> but I argued that that would be cargo-culting an efficiency hack
> required by the old storage model forward.
>
> The cost of doing this is substantial. On master, with a database that
> contains 1 million empty tables, any stats access takes ~0.4s and
> increases memory usage by 170MB.
>
>
> 1.1)
>
> I hope everybody agrees with not requiring that stats don't need to be
> the way they were at the time of first stat access in a transaction,
> even if that first access was to a different stat object than the
> currently accessed stat?

Agreed, that doesn't seem necessary and blowing up backend memory usage
by copying all the stats into local memory seems pretty terrible.

> 1.2)
>
> Do we expect repeated accesses to the same stat to stay the same through
> the transaction?  The easiest way to implement stats accesses is to
> simply fetch the stats from shared memory ([3]). That would obviously
> result in repeated accesses to the same stat potentially returning
> changing results over time.
>
> I think that's perfectly fine, desirable even, for pg_stat_*.

This seems alright to me.

> 1.3)
>
> What kind of consistency do we expect between columns of views like
> pg_stat_all_tables?
>
> Several of the stats views aren't based on SRFs or composite-type
> returning functions, but instead fetch each stat separately:
>
> E.g. pg_stat_all_tables:
>  SELECT c.oid AS relid,
>     n.nspname AS schemaname,
>     c.relname,
>     pg_stat_get_numscans(c.oid) AS seq_scan,
>     pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
>     sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
>     sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
>     pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
> ...
>     pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
>    FROM pg_class c
>      LEFT JOIN pg_index i ON c.oid = i.indrelid
> ...
>
> Which means that if we do not cache stats, additional stats updates
> could have been applied between two stats accessors. E.g the seq_scan
> from before some pgstat_report_stat() but the seq_tup_read from after.
>
> If we instead fetch all of a table's stats in one go, we would get
> consistency between the columns. But obviously that'd require changing
> all the stats views.
>
> Horiguchi-san, in later iterations of the patch, attempted to address
> this issue by adding a one-entry caches below
> pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc, which is
> what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc use.
>
>
> But I think that leads to very confusing results. Access stats for the
> same relation multiple times in a row? Do not see updates. Switch
> between e.g. a table and its indexes? See updates.
>
>
> I personally think it's fine to have short-term divergences between the
> columns. The stats aren't that accurate anyway, as we don't submit them
> all the time.  And that if we want consistency between columns, we
> instead should replace the current view definitions with[set of] record
> returning function - everything else seems to lead to weird tradeoffs.

Agreed, doesn't seem like a huge issue to have short-term divergences
but if we want to fix them then flipping those all to SRFs would make
the most sense.

> 2) How to remove stats of dropped objects?
>
> In the stats collector world stats for dropped objects (tables, indexes,
> functions, etc) are dropped after the fact, using a pretty expensive
> process:
>
> Each autovacuum worker cycle and each manual VACUUM does
> pgstat_vacuum_stat() to detect since-dropped objects. It does that by
> building hash-tables for all databases, tables and functions, and then
> comparing that against a freshly loaded stats snapshot. All stats object
> not in pg_class etc are dropped.
>
> The patch currently copies that approach, although that adds some
> complications, mainly around [3].
>
>
> Accessing all database objects after each VACUUM, even if the table was
> tiny, isn't great, performance wise. In a fresh master database with 1
> million functions, a VACUUM of an empty table takes ~0.5s, with 1
> million tables it's ~1s.  Due to partitioning tables with many database
> objects are of course getting more common.
>
>
> There isn't really a better approach in the stats collector world. As
> messages to the stats collector can get lost, we need to to be able to
> re-do dropping of dead stats objects.
>
>
> But now we could instead schedule stats to be removed at commit
> time. That's not trivial of course, as we'd need to handle cases where
> the commit fails after the commit record, but before processing the
> dropped stats.
>
> But it seems that integrating the stats that need to be dropped into the
> commit message would make a lot of sense. With benefits beyond the
> [auto-]vacuum efficiency gains, e.g. neatly integrate into streaming
> replication and even opening the door to keep stats across crashes.
>
>
> My gut feeling here is to try to to fix the remaining issues in the
> "collect oids" approach for 14 and to try to change the approach in
> 15. And, if that proves too hard, try to see how hard it'd be to
> "accurately" drop.  But I'm also not sure - it might be smarter to go
> full in, to avoid introducing a system that we'll just rip out again.
>
> Comments?

The current approach sounds pretty terrible and propagating that forward
doesn't seem great.  Guess here I'd disagree with your gut feeling and
encourage trying to go 'full in', as you put it, or at least put enough
effort into it to get a feeling of if it's going to require a *lot* more
work or not and then reconsider if necessary.

Thanks,

Stephen

Re: shared memory stats: high level design decisions: consistency, dropping

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> 1) What kind of consistency do we want from the pg_stats_* views?

That's a hard choice to make.  But let me set the record straight:
when we did the initial implementation, the stats snapshotting behavior
was considered a FEATURE, not an "efficiency hack required by the old
storage model".

If I understand what you are proposing, all stats views would become
completely volatile, without even within-query consistency.  That really
is not gonna work.  As an example, you could get not-even-self-consistent
results from a join to a stats view if the planner decides to implement
it as a nestloop with the view on the inside.

I also believe that the snapshotting behavior has advantages in terms
of being able to perform multiple successive queries and get consistent
results from them.  Only the most trivial sorts of analysis don't need
that.

In short, what you are proposing sounds absolutely disastrous for
usability of the stats views, and I for one will not sign off on it
being acceptable.

I do think we could relax the consistency guarantees a little bit,
perhaps along the lines of only caching view rows that have already
been read, rather than grabbing everything up front.  But we can't
just toss the snapshot concept out the window.  It'd be like deciding
that nobody needs MVCC, or even any sort of repeatable read.

            regards, tom lane



Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

On 2021-03-21 11:41:30 -0400, Stephen Frost wrote:
> > 1.1)
> >
> > I hope everybody agrees with not requiring that stats don't need to be
> > the way they were at the time of first stat access in a transaction,
> > even if that first access was to a different stat object than the
> > currently accessed stat?
>
> Agreed, that doesn't seem necessary and blowing up backend memory usage
> by copying all the stats into local memory seems pretty terrible.

Yea. I've seen instances where most backends had several hundred MB of
stats loaded :(. Even leaving the timing overhead aside, that's really
not fun. Of course that application may not have had exactly the
greatest design, but ...


> > 1.2)
> >
> > Do we expect repeated accesses to the same stat to stay the same through
> > the transaction?  The easiest way to implement stats accesses is to
> > simply fetch the stats from shared memory ([3]). That would obviously
> > result in repeated accesses to the same stat potentially returning
> > changing results over time.
> >
> > I think that's perfectly fine, desirable even, for pg_stat_*.
>
> This seems alright to me.

Seems Tom disagrees :(


> > 1.3)
> >
> > What kind of consistency do we expect between columns of views like
> > pg_stat_all_tables?
> > [...]
> > I personally think it's fine to have short-term divergences between the
> > columns. The stats aren't that accurate anyway, as we don't submit them
> > all the time.  And that if we want consistency between columns, we
> > instead should replace the current view definitions with[set of] record
> > returning function - everything else seems to lead to weird tradeoffs.
>
> Agreed, doesn't seem like a huge issue to have short-term divergences
> but if we want to fix them then flipping those all to SRFs would make
> the most sense.

There's also a pretty good efficiency argument for going to SRFs. Doing
18 hashtable lookups + function calls just to return one row of
pg_stat_all_tables surely is a higher overhead than unnecessarily
returning columns that weren't needed by the user.

I do think it makes sense to get idx_scan/idx_tup_fetch via a join
though.


> > 2) How to remove stats of dropped objects?
> >
> > [...]
>
> The current approach sounds pretty terrible and propagating that forward
> doesn't seem great.  Guess here I'd disagree with your gut feeling and
> encourage trying to go 'full in', as you put it, or at least put enough
> effort into it to get a feeling of if it's going to require a *lot* more
> work or not and then reconsider if necessary.

I think my gut's argument is that it's already a huge patch, and that
it's better to have the very substantial memory and disk IO savings
with the crappy vacuum approach, than neither. And given the timeframe
there does seem to be a substantial danger of "neither" being the
outcome...  Anyway, I'm mentally sketching out what it'd take.

Greetings,

Andres Freund



Re: shared memory stats: high level design decisions: consistency, dropping

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency.  That really
> is not gonna work.  As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.
>
> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them.  Only the most trivial sorts of analysis don't need
> that.
>
> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.
>
> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front.  But we can't
> just toss the snapshot concept out the window.  It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

This isn't the same use-case as traditional tables or relational
concepts in general- there aren't any foreign keys for the fields that
would actually be changing across these accesses to the shared memory
stats- we're talking about gross stats numbers like the number of
inserts into a table, not an employee_id column.  In short, I don't
agree that this is a fair comparison.

Perhaps there's a good argument to try and cache all this info per
backend, but saying that it's because we need MVCC-like semantics for
this data because other things need MVCC isn't it and I don't know that
saying this is a justifiable case for requiring repeatable read is
reasonable either.

What specific, reasonable, analysis of the values that we're actually
talking about, which are already aggregates themselves, is going to
end up being utterly confused?

Thanks,

Stephen


Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > 1) What kind of consistency do we want from the pg_stats_* views?
>
> That's a hard choice to make.  But let me set the record straight:
> when we did the initial implementation, the stats snapshotting behavior
> was considered a FEATURE, not an "efficiency hack required by the old
> storage model".

Oh - sorry for misstating that then. I did try to look for the origins of the
approach, and all that I found was that it'd be too expensive to do multiple
stats file reads.


> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency.  That really
> is not gonna work.  As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.

I don't really think it's a problem that's worth incurring that much cost to
prevent. We already have that behaviour for a number of the pg_stat_* views,
e.g. pg_stat_xact_all_tables, pg_stat_replication.

If the cost were low - or we can find a reasonable way to get to low costs - I
think it'd be worth preserving for backward compatibility's sake alone.  From
an application perspective, I actually rarely want that behaviour for stats
views - I'm querying them to get the most recent information, not an older
snapshot. And in the cases I do want snapshots, I'd want them for longer than a
transaction.

There's just a huge difference between being able to access a table's stats in
O(1) time, or having a single stats access be O(database-objects).

And that includes accesses to things like pg_stat_bgwriter, pg_stat_database
(for IO over time stats etc) that often are monitored at a somewhat high
frequency - they also pay the price of reading in all object stats.  On my
example database with 1M tables it takes 0.4s to read pg_stat_database.


We currently also fetch the full stats in places like autovacuum.c. Where we
don't need repeated access to be consistent - we even explicitly force the
stats to be re-read for every single table that's getting vacuumed.

Even if we were to just cache already accessed stats, places like do_autovacuum()
would end up with a completely unnecessary cache of all tables, blowing up
memory usage by a large amount on systems with lots of relations.


> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them.  Only the most trivial sorts of analysis don't need
> that.

In most cases you'd not do that in a transaction tho, and you'd need to create
temporary tables with a snapshot of the stats anyway.
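
I.e. roughly something like (stats_snap being just an example name),
captured at whatever points the analysis needs:

 CREATE TEMP TABLE stats_snap AS
   SELECT now() AS captured_at, * FROM pg_stat_user_tables;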


> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.

:(

That's why I thought it'd be important to bring this up to a wider
audience. This has been discussed several times in the thread, and nobody
really chimed up wanting the "snapshot" behaviour...


> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front.  But we can't
> just toss the snapshot concept out the window.  It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

I think that'd still be a huge win - caching only what's been accessed rather than
everything will save a lot of memory in very common cases. I did bring it up as
one approach for that reason.

I do think it has a few usability quirks though. The time-skew between stats
objects accessed at different times seems like it could be quite confusing?
E.g. imagine looking at table stats and then later join to index stats and see
table / index stats not matching up at all.


I wonder if a reasonable way out could be to have pg_stat_make_snapshot()
(accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager
data load. But not use any snapshot / caching behaviour without that?

It'd be a fair bit of code to have that, but I think I can see a way to
have it not be too bad?
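
Usage could look roughly like this (pg_stat_make_snapshot() being the
proposed, not-yet-existing function):

 BEGIN;
 SELECT pg_stat_make_snapshot();  -- eagerly materialize all stats for this transaction
 SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables;  -- consistent, repeatable
 SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables;  -- same results as above
 SELECT pg_stat_clear_snapshot();  -- back to live values
 COMMIT;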


Greetings,

Andres Freund



Re: shared memory stats: high level design decisions: consistency, dropping

From
Greg Stark
Date:
On Sun, 21 Mar 2021 at 18:16, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them.  Only the most trivial sorts of analysis don't need
> > that.
> >
> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
> >
> > I do think we could relax the consistency guarantees a little bit,
> > perhaps along the lines of only caching view rows that have already
> > been read, rather than grabbing everything up front.  But we can't
> > just toss the snapshot concept out the window.  It'd be like deciding
> > that nobody needs MVCC, or even any sort of repeatable read.
>
> This isn't the same use-case as traditional tables or relational
> concepts in general- there aren't any foreign keys for the fields that
> would actually be changing across these accesses to the shared memory
> stats- we're talking about gross stats numbers like the number of
> inserts into a table, not an employee_id column.  In short, I don't
> agree that this is a fair comparison.

I use these stats quite a bit and do lots of slicing and dicing with
them. I don't think it's as bad as Tom says but I also don't think we
can be quite as loosey-goosey as I think Andres or Stephen might be
proposing either (though I note that they haven't said they don't want
any consistency at all).

The cases where the consistency really matters for me are when I'm
doing math involving more than one statistic.

Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide
seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look
at the ratio between n_tup_upd and n_tup_hot_upd.
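
In SQL terms that's roughly:

 SELECT relname,
        seq_tup_read::float8 / nullif(seq_scan, 0)   AS tup_per_seq_scan,
        n_tup_hot_upd::float8 / nullif(n_tup_upd, 0) AS hot_update_ratio
   FROM pg_stat_user_tables;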

And no, it doesn't help that these are often large numbers after a
long time because I'm actually working with the first derivative of
these numbers using snapshots or a time series database. So if you
have the seq_tup_read incremented but not seq_scan incremented you
could get a wildly incorrect calculation of "tup read per seq scan"
which actually matters.

I don't think I've ever done math across stats for different objects.
I mean, I've plotted them together and looked at which was higher but
I don't think that's affected by some plots having peaks slightly out
of sync with the other. I suppose you could look at the ratio of
access patterns between two tables and know that they're only ever
accessed by a single code path at the same time and therefore the
ratios would be meaningful. But I don't think users would be surprised
to find they're not consistent that way either.

So I think we need to ensure that at least all the values for a single
row representing a single object are consistent. Or at least that
there's *some* correct way to retrieve a consistent row and that the
standard views use that. I don't think we need to guarantee that every
possible plan will always be consistent even if you access the row
multiple times in a self-join or use the lookup function on individual
columns separately.

-- 
greg



Re: shared memory stats: high level design decisions: consistency, dropping

From
Alvaro Herrera
Date:
On 2021-Mar-21, Andres Freund wrote:

> We currently also fetch the full stats in places like autovacuum.c. Where we
> don't need repeated access to be consistent - we even explicitly force the
> stats to be re-read for every single table that's getting vacuumed.
> 
> Even if we to just cache already accessed stats, places like do_autovacuum()
> would end up with a completely unnecessary cache of all tables, blowing up
> memory usage by a large amount on systems with lots of relations.

It's certainly not the case that autovacuum needs to keep fully
consistent stats.  That's just the way that seemed easier (?) to do at
the time.  Unless I misunderstand things severely, we could just have
autovacuum grab all necessary numbers for one database at the start of a
run, not cache anything, then re-read the numbers for one table as it
rechecks that table.

Resetting before re-reading was obviously necessary because the
built-in snapshotting made it impossible to freshen up the numbers at
the recheck step.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W



Re: shared memory stats: high level design decisions: consistency, dropping

From
Magnus Hagander
Date:
On Tue, Mar 23, 2021 at 4:21 AM Greg Stark <stark@mit.edu> wrote:
>
> On Sun, 21 Mar 2021 at 18:16, Stephen Frost <sfrost@snowman.net> wrote:
> >
> > Greetings,
> >
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > > I also believe that the snapshotting behavior has advantages in terms
> > > of being able to perform multiple successive queries and get consistent
> > > results from them.  Only the most trivial sorts of analysis don't need
> > > that.
> > >
> > > In short, what you are proposing sounds absolutely disastrous for
> > > usability of the stats views, and I for one will not sign off on it
> > > being acceptable.
> > >
> > > I do think we could relax the consistency guarantees a little bit,
> > > perhaps along the lines of only caching view rows that have already
> > > been read, rather than grabbing everything up front.  But we can't
> > > just toss the snapshot concept out the window.  It'd be like deciding
> > > that nobody needs MVCC, or even any sort of repeatable read.
> >
> > This isn't the same use-case as traditional tables or relational
> > concepts in general- there aren't any foreign keys for the fields that
> > would actually be changing across these accesses to the shared memory
> > stats- we're talking about gross stats numbers like the number of
> > inserts into a table, not an employee_id column.  In short, I don't
> > agree that this is a fair comparison.
>
> I use these stats quite a bit and do lots of slicing and dicing with
> them. I don't think it's as bad as Tom says but I also don't think we
> can be quite as loosy-goosy as I think Andres or Stephen might be
> proposing either (though I note that haven't said they don't want any
> consistency at all).
>
> The cases where the consistency really matter for me is when I'm doing
> math involving more than one statistic.
>
> Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide
> seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look
> at the ratio between n_tup_upd and n_tup_hot_upd.
>
> And no, it doesn't help that these are often large numbers after a
> long time because I'm actually working with the first derivative of
> these numbers using snapshots or a time series database. So if you
> have the seq_tup_read incremented but not seq_scan incremented you
> could get a wildly incorrect calculation of "tup read per seq scan"
> which actually matters.
>
> I don't think I've ever done math across stats for different objects.
> I mean, I've plotted them together and looked at which was higher but
> I don't think that's affected by some plots having peaks slightly out
> of sync with the other. I suppose you could look at the ratio of
> access patterns between two tables and know that they're only ever
> accessed by a single code path at the same time and therefore the
> ratios would be meaningful. But I don't think users would be surprised
> to find they're not consistent that way either.

Yeah, it's important to differentiate if things can be inconsistent
within a single object, or just between objects. And I agree that in a
lot of cases, just having per-object consistent data is probably
enough.

Normally when you graph things for example, your peaks will look
across >1 sample point anyway, and in that case it doesn't much matter
does it?

But if we said we try to offer per-object consistency only, then for
example the idx_scans value in the tables view may see changes to some
but not all indexes on that table. Would that be acceptable?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: shared memory stats: high level design decisions: consistency, dropping

From
Magnus Hagander
Date:
On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > 1) What kind of consistency do we want from the pg_stats_* views?
> >
> > That's a hard choice to make.  But let me set the record straight:
> > when we did the initial implementation, the stats snapshotting behavior
> > was considered a FEATURE, not an "efficiency hack required by the old
> > storage model".
>
> Oh - sorry for misstating that then. I did try to look for the origins of the
> approach, and all that I found was that it'd be too expensive to do multiple
> stats file reads.
>
>
> > If I understand what you are proposing, all stats views would become
> > completely volatile, without even within-query consistency.  That really
> > is not gonna work.  As an example, you could get not-even-self-consistent
> > results from a join to a stats view if the planner decides to implement
> > it as a nestloop with the view on the inside.
>
> I don't really think it's a problem that's worth incuring that much cost to
> prevent. We already have that behaviour for a number of of the pg_stat_* views,
> e.g. pg_stat_xact_all_tables, pg_stat_replication.

Aren't those both pretty bad examples though?

pg_stat_xact_all_tables surely is within-query consistent, and would
be pretty useless if it was within-transaction consistent?

pg_stat_replication is a snapshot of what things are right now (like
pg_stat_activity), and not collected statistics.

Maybe there's inconsistency in that they should've had a different
name to separate it out, but fundamentally having xact consistent
views there would be a bad thing, no?


> If the cost were low - or we can find a reasonable way to get to low costs - I
> think it'd be worth preserving for backward compatibility's sake alone.  From
> an application perspective, I actually rarely want that behaviour for stats
> views - I'm querying them to get the most recent information, not an older
> snapshot. And in the cases I do want snapshots, I'd want them for longer than a
> transaction.

I agree in general, but I'd want them to be *query-consistent*, not
*transaction-consistent*. But the question is as you say, am I willing
to pay for that. Less certain of that.


> There's just a huge difference between being able to access a table's stats in
> O(1) time, or having a single stats access be O(database-objects).
>
> And that includes accesses to things like pg_stat_bgwriter, pg_stat_database
> (for IO over time stats etc) that often are monitored at a somewhat high
> frequency - they also pay the price of reading in all object stats.  On my
> example database with 1M tables it takes 0.4s to read pg_stat_database.

IMV, singling things out into "larger groups" would be one perfectly
acceptable compromise. That is, say that pg_stat_user_tables can be
inconsistent with pg_stat_bgwriter, but it cannot be inconsistent
with itself.

Basically anything that's "global" seems like it could be treated that
way, independent of each other.

For relations and such, having a way to get just a single relation's
stats, or a number of them that will be consistent with each other,
without getting all of them, could also be a reasonable optimization.
Maybe an SRF that takes an array of oids as a parameter and returns
consistent data across those, without having to copy/mess with the
rest?
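
Something like this, say (function name and signature entirely made up,
't1'/'t2' being placeholder relations):

 -- hypothetical: one consistent fetch covering only the listed relations
 SELECT * FROM pg_stat_get_tables(ARRAY['t1'::regclass::oid, 't2'::regclass::oid]);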


> We currently also fetch the full stats in places like autovacuum.c. Where we
> don't need repeated access to be consistent - we even explicitly force the
> stats to be re-read for every single table that's getting vacuumed.
>
> Even if we to just cache already accessed stats, places like do_autovacuum()
> would end up with a completely unnecessary cache of all tables, blowing up
> memory usage by a large amount on systems with lots of relations.

autovacuum is already dealing with things being pretty fuzzy though,
so it shouldn't matter much there?

But autovacuum might also deserve its own interface to access the
data directly and doesn't have to follow the same one as the stats
views in this new scheme, perhaps?


> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them.  Only the most trivial sorts of analysis don't need
> > that.
>
> In most cases you'd not do that in a transaction tho, and you'd need to create
> temporary tables with a snapshot of the stats anyway.

I'd say in most cases this analysis happens in snapshots anyway, and
those are snapshots unrelated to what we do in pg_stat. It's either
snapshotted to tables, or to storage in a completely separate
database.


> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
>
> :(
>
> That's why I thought it'd be important to bring this up to a wider
> audience. This has been discussed several times in the thread, and nobody
> really chimed up wanting the "snapshot" behaviour...

I can chime in with the ones saying I don't think I need  that kind of
snapshot behaviour.

I would *like* to have query-level consistent views. But I may be able
to compromise on that one for the sake of performance as well.

I definitely need there to be object-level consistent views.


> > I do think we could relax the consistency guarantees a little bit,
> > perhaps along the lines of only caching view rows that have already
> > been read, rather than grabbing everything up front.  But we can't
> > just toss the snapshot concept out the window.  It'd be like deciding
> > that nobody needs MVCC, or even any sort of repeatable read.
>
> I think that'd still a huge win - caching only what's been accessed rather than
> everything will save a lot of memory in very common cases. I did bring it up as
> one approach for that reason.
>
> I do think it has a few usability quirks though. The time-skew between stats
> objects accessed at different times seems like it could be quite confusing?
> E.g. imagine looking at table stats and then later join to index stats and see
> table / index stats not matching up at all.
>
>
> I wonder if a reasonable way out could be to have pg_stat_make_snapshot()
> (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager
> data load. But not use any snapshot / caching behaviour without that?

I think that's a pretty good idea.

I bet the vast majority of all queries against the pg_stat views are
done by automated tools, and they don't care about the snapshot
behaviour, and thus wouldn't have to pay the overhead. In the more
rare cases when you do the live-analysis, you can explicitly request
it.

Another idea could be a per-user GUC of "stats_snapshots" or so, and
then if it's on, force the snapshots at all times. That way a DBA who
wants the snapshots could set it on their own user but keep it off for
the automated jobs for example. (It'd basically be the same except
automatically calling pg_stat_make_snapshot() the first time stats are
queried)
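
I.e. something like (GUC name as above, role names just examples):

 ALTER ROLE dba_alice SET stats_snapshots = on;    -- interactive analysis keeps a snapshot
 ALTER ROLE monitoring SET stats_snapshots = off;  -- automated polling reads live values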

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

On 2021-03-24 14:42:11 +0100, Magnus Hagander wrote:
> On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > If I understand what you are proposing, all stats views would become
> > > completely volatile, without even within-query consistency.  That really
> > > is not gonna work.  As an example, you could get not-even-self-consistent
> > > results from a join to a stats view if the planner decides to implement
> > > it as a nestloop with the view on the inside.
> >
> > I don't really think it's a problem that's worth incuring that much cost to
> > prevent. We already have that behaviour for a number of of the pg_stat_* views,
> > e.g. pg_stat_xact_all_tables, pg_stat_replication.
> 
> Aren't those both pretty bad examples though?
> 
> pg_stat_xact_all_tables surely is within-query consistent, and would
> be pretty useless if it wwas within-transaction consistent?

It's not within-query consistent:

postgres[1182102][1]=# SELECT pg_stat_get_xact_numscans('pg_class'::regclass)
  UNION ALL SELECT count(*) FROM pg_class
  UNION ALL SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
┌───────────────────────────┐
│ pg_stat_get_xact_numscans │
├───────────────────────────┤
│                         0 │
│                       397 │
│                         1 │
└───────────────────────────┘
(3 rows)


> pg_stat_replication is a snapshot of what things are right now (like
> pg_stat_activity), and not collected statistics.

However, pg_stat_activity does have snapshot semantics...


> Maybe there's inconsistency in that they should've had a different
> name to separate it out, but fundamentally having xact consistent
> views there would be a bad thing, no?

True. One weird thing around the _xact_ versions that we, at best,
*hint* at in the docs, but also contradict, is that _xact_ views are
actually not tied to the transaction.  It's really about unsubmitted
stats. E.g. if executing the following via copy-paste

SELECT count(*) FROM pg_class;
SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
SELECT count(*) FROM pg_class;
SELECT pg_stat_get_xact_numscans('pg_class'::regclass);

will most of the time return
<count>
0
<count>
1

because after the transaction for the first count(*) commits, we'll not
have submitted stats for more than PGSTAT_STAT_INTERVAL. But after the
second count(*) it'll be shorter, therefore the stats won't be
submitted...


> > There's just a huge difference between being able to access a table's stats in
> > O(1) time, or having a single stats access be O(database-objects).
> >
> > And that includes accesses to things like pg_stat_bgwriter, pg_stat_database
> > (for IO over time stats etc) that often are monitored at a somewhat high
> > frequency - they also pay the price of reading in all object stats.  On my
> > example database with 1M tables it takes 0.4s to read pg_stat_database.
> 
> IMV, singeling things out into "larger groups" would be one perfectly
> acceptable compromise. That is, say that pg_stat_user_tables can be
> inconsistent vs with pg_stat_bgwriter, but it cannot be inconsistent
> with itself.

I don't think that buys us all that much though. It still is a huge
issue that we need to cache the stats for all relations even though we
only access the stats for one table.


> > We currently also fetch the full stats in places like autovacuum.c. Where we
> > don't need repeated access to be consistent - we even explicitly force the
> > stats to be re-read for every single table that's getting vacuumed.
> >
> > Even if we to just cache already accessed stats, places like do_autovacuum()
> > would end up with a completely unnecessary cache of all tables, blowing up
> > memory usage by a large amount on systems with lots of relations.
> 
> autovacuum is already dealing with things being pretty fuzzy though,
> so it shouldn't matter much there?
> 
> But autovacuum might also deserve its own interface to access the
> data directly and doesn't have to follow the same one as the stats
> views in this new scheme, perhaps?

Yes, we can do that now.


> > > I also believe that the snapshotting behavior has advantages in terms
> > > of being able to perform multiple successive queries and get consistent
> > > results from them.  Only the most trivial sorts of analysis don't need
> > > that.
> >
> > In most cases you'd not do that in a transaction tho, and you'd need to create
> > temporary tables with a snapshot of the stats anyway.
> 
> I'd say in most cases this analysis happens in snapshots anyway, and
> those are snapshots unrelated to what we do in pg_stat. It's either
> snapshotted to tables, or to storage in a completely separate
> database.

Agreed. I wonder if some of that work would be made easier if we added a
function to export all the data in the current snapshot as a json
document or such? If we add configurable caching (see below) that'd
really not be a lot of additional work.
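
Just to illustrate the shape of the idea, something like this can already
be approximated over the existing views today (a built-in export function
would of course dump the raw snapshot rather than going through the view
layer):

-- rough illustration only: the current per-table stats as one JSON document
SELECT json_agg(row_to_json(s)) AS stats_snapshot
FROM pg_stat_all_tables AS s;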


> I would *like* to have query-level consistent views. But I may be able
> to compromise on that one for the sake of performance as well.
> 
> I definitely need there to be object-level consistent views.

That'd be free if we didn't use all those separate function calls for
each row in pg_stat_all_tables etc...


> > I wonder if a reasonable way out could be to have pg_stat_make_snapshot()
> > (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager
> > data load. But not use any snapshot / caching behaviour without that?
> 
> I think that's a pretty good idea.

It's what I am leaning towards right now.
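
From SQL that would look roughly like the following; note that
pg_stat_make_snapshot() is only the name floated above, not something
that exists today:

SELECT pg_stat_make_snapshot();     -- hypothetical: eagerly snapshot all stats
SELECT * FROM pg_stat_user_tables;  -- served from that snapshot
SELECT * FROM pg_stat_user_tables;  -- repeats the same numbers
SELECT pg_stat_clear_snapshot();    -- existing function: back to live reads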


> Another idea could be a per-user GUC of "stats_snapshots" or so, and
> then if it's on force the snapshots all times. That way a DBA who
> wants the snapshots could set it on their own user but keep it off for
> the automated jobs for example. (It'd basically be the same except
> automatically calling pg_stat_make_snapshot() the first time stats are
> queried)

Yea, I was thinking similar. We could e.g. have
stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'.

With 'none' there's no consistency beyond that a single pg_stat_get_*
function call will give consistent results.

With 'query' we cache each object on first access (i.e. there can be
inconsistency between different objects if their accesses are further
apart, but not within a stats object).

With 'snapshot' we cache all the stats at the first access, for the
duration of the transaction.


However: I think 'query' is surprisingly hard to implement. There can be
multiple overlapping queries ongoing at the same time. And even
if there aren't multiple ongoing queries, there's really nothing great
to hook a reset into.

So I'm inclined to instead have 'access', which caches individual stats
objects on first access, but lasts longer than the query.


Tom, if we defaulted to 'access' would that satisfy your concerns? Or
would even 'none' as a default do? I'd rather not have SELECT * FROM
pg_stat_all_tables; accumulate an unnecessary copy of most of the
cluster's stats in memory just to immediately throw it away again.


> I bet the vast majority of all queries against the pg_stat views are
> done by automated tools, and they don't care about the snapshot
> behaviour, and thus wouldn't have to pay the overhead. In the more
> rare cases when you do the live-analysis, you can explicitly request
> it.

I don't think I'd often want it in that case either, but ymmv.

Greetings,

Andres Freund



Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

On 2021-03-21 14:53:50 -0700, Andres Freund wrote:
> On 2021-03-21 11:41:30 -0400, Stephen Frost wrote:
> > > 2) How to remove stats of dropped objects?
> > >
> > > [...]
> >
> > The current approach sounds pretty terrible and propagating that forward
> > doesn't seem great.  Guess here I'd disagree with your gut feeling and
> > encourage trying to go 'full in', as you put it, or at least put enough
> > effort into it to get a feeling of if it's going to require a *lot* more
> > work or not and then reconsider if necessary.
> 
> I think my gut's argument is that it's already a huge patch, and that
> it's better to have the very substantial memory and disk IO savings
> with the crappy vacuum approach, than neither. And given the timeframe
> there does seem to be a substantial danger of "neither" being the
> outcome...  Anyway, I'm mentally sketching out what it'd take.

I implemented this. Far from polished, but it does survive the
regression tests, including new tests in stats.sql.  Function stats,
2PC and lots of naming issues (xl_xact_dropped_stats with members
ndropped and dropped_stats) are yet to be addressed - but those are not
architecturally relevant.

I think there are three hairy corner-cases that I haven't thought
sufficiently about:

- It seems likely that there's a relatively narrow window where a crash
  could end up not dropping stats.

  The xact.c integration is basically parallel to smgrGetPendingDeletes
  etc.  I don't see what prevents a checkpoint from happening after
  RecordTransactionCommit() (which does smgrGetPendingDeletes), but
  before smgrDoPendingDeletes(). Which means that if we crash in that
  window, nothing would clean up those files (and thus also not the
  dropped stats that I implemented very similarly).

  Closing that window seems doable (put the pending file/stat deletion
  list in shared memory after logging the WAL record, but before setting
  MyProc->delayChkpt = false, and do something with that in
  CheckPointGuts()), but not trivial.

  If we had a good way to run pgstat_vacuum_stat() on a very
  *occasional* basis via autovac, that'd be ok. But it'd be a lot nicer
  if it were bulletproof.

- Does temporary table cleanup after a crash properly deal with their
  stats?

- I suspect there are a few cases where pending stats in one connection
  could "revive" an already dropped stat. It'd not be hard to address in
  an inefficient way in the shmstats patch, but it'd come with some
  efficiency penalty - though that might be an irrelevant difference.


Definitely for later, but if we got this ironed out, we probably could
stop throwing stats away after crashes, instead storing stats snapshots
alongside redo LSNs or such. It'd be nice if immediate shutdowns etc
wouldn't lead to autovacuum not doing any vacuuming for quite a while.

Greetings,

Andres Freund



Re: shared memory stats: high level design decisions: consistency, dropping

From
Robert Haas
Date:
On Sun, Mar 21, 2021 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency.  That really
> is not gonna work.  As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.
>
> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them.  Only the most trivial sorts of analysis don't need
> that.
>
> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.
>
> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front.  But we can't
> just toss the snapshot concept out the window.  It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

So, just as a data point, the output of pg_locks is not stable within
a transaction. In fact, the pg_locks output could technically include
the same exact lock more than once, if it's being moved from the
fast-path table to the main table just as you are reading all the
data. In theory, that could produce the same kinds of problems that
you're concerned about here, and I suspect sometimes it does. But I
haven't seen a lot of people yelling and screaming about it. The
situation isn't ideal, but it's not disastrous either.
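
A minimal demonstration (exact counts will vary, but the two readings
routinely differ even inside a single transaction):

BEGIN;
SELECT count(*) FROM pg_locks;    -- some number N
SELECT 1 FROM pg_class LIMIT 1;   -- acquires AccessShareLock on pg_class
SELECT count(*) FROM pg_locks;    -- larger than N: no snapshot semantics
COMMIT;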

I think it's really hard for us as developers to predict what kinds of
effects these kinds of decisions will have in real-world
deployments. All of us have probably had the experience of making some
behavioral change that we thought would not be too big a deal and it
actually pissed off a bunch of users who were relying on the old
behavior. I know I have. Conversely, I've reluctantly made changes
that seemed rather dangerous to me and heard nary a peep. If somebody
takes the position that changing this behavior is scary because we
don't know how many users will be inconvenienced or how badly, I can
only agree. But saying that it's tantamount to deciding that nobody
needs MVCC is completely over the top. This is statistical data, not
user data, and there are good reasons to think that people don't have
the same expectations in both cases, starting with the fact that we
have some stuff that works like that already.

More than that, there's a huge problem with the way this works today
that can't be fixed without making some compromises. In the test case
Andres mentioned upthread, the stats snapshot burned through 170MB of
RAM. Now, you might dismiss that as not much memory in 2021, but if
you have a lot of backends accessing the stats, that value could be
multiplied by a two-digit or even three-digit number, and that is
*definitely* a lot of memory, even in 2021. But even if it's not
multiplied by anything, we're shipping with a default work_mem of just
4MB. So, the position we're implicitly taking today is: if you join
two 5MB tables, it's too risky to put the entire contents of one of
them into a single in-memory hash table, because we might run the
machine out of RAM. But if you have millions of objects in your
database and touch the statistics for one of those objects, once, it's
absolutely OK to slurp tens or even hundreds of megabytes of data into
backend-private memory to avoid the possibility that you might later
access another one of those counters and expect snapshot semantics.

To be honest, I don't find either of those positions very believable.
I do not think it likely that the typical user really wants a 5MB hash
join to be done in batches to save memory, and I think it equally
unlikely that everybody wants to read and cache tens or hundreds of
megabytes of data to get MVCC semantics for volatile statistics. I
think there are probably some cases where having that information be
stable across a transaction lifetime is really useful, so if we can
provide that as an optional behavior, I think that would be a pretty
good idea. But I don't think it's reasonable for that to be the only
behavior, and I'm doubtful about whether it should even be the
default. I bet there are a lot of cases where somebody just wants to
take a quick glance at some of the values as they exist right now, and
has no intention of running any more queries that might examine the
same data again later. Not only does caching all the data use a lot of
memory, but having to read all the data in order to cache it is
potentially a lot slower than just reading the data actually
requested. I'm unwilling to dismiss that as a negligible problem.

In short, I agree that there's stuff to worry about here, but I don't
agree that a zero-consistency model is a crazy idea, even though I
also think it would be nice if we can make stronger consistency
available upon request.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: shared memory stats: high level design decisions: consistency, dropping

From
Magnus Hagander
Date:
On Thu, Mar 25, 2021 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-03-24 14:42:11 +0100, Magnus Hagander wrote:
> > On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > > If I understand what you are proposing, all stats views would become
> > > > completely volatile, without even within-query consistency.  That really
> > > > is not gonna work.  As an example, you could get not-even-self-consistent
> > > > results from a join to a stats view if the planner decides to implement
> > > > it as a nestloop with the view on the inside.
> > >
> > > I don't really think it's a problem that's worth incurring that much cost to
> > > prevent. We already have that behaviour for a number of the pg_stat_* views,
> > > e.g. pg_stat_xact_all_tables, pg_stat_replication.
> >
> > Aren't those both pretty bad examples though?
> >
> > pg_stat_xact_all_tables surely is within-query consistent, and would
> > be pretty useless if it was within-transaction consistent?
>
> It's not within-query consistent:
>
> postgres[1182102][1]=# SELECT pg_stat_get_xact_numscans('pg_class'::regclass) UNION ALL SELECT count(*) FROM pg_class
> UNION ALL SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
> ┌───────────────────────────┐
> │ pg_stat_get_xact_numscans │
> ├───────────────────────────┤
> │                         0 │
> │                       397 │
> │                         1 │
> └───────────────────────────┘
> (3 rows)

Heh. OK, I admit I didn't consider a UNION query like that -- I only
considered it being present *once* in a query :)

That said, if wanted, that can be dealt with using WITH MATERIALIZED as
long as it's in the same query, no?
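
I.e. something along these lines, mirroring the UNION ALL example above;
the CTE is evaluated once and both references reuse that result:

WITH x AS MATERIALIZED (
    SELECT pg_stat_get_xact_numscans('pg_class'::regclass) AS numscans
)
SELECT numscans FROM x
UNION ALL SELECT count(*) FROM pg_class
UNION ALL SELECT numscans FROM x;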


> > pg_stat_replication is a snapshot of what things are right now (like
> > pg_stat_activity), and not collected statistics.
>
> However, pg_stat_activity does have snapshot semantics...

Yeah, yay consistency.


> > Maybe there's inconsistency in that they should've had a different
> > name to separate it out, but fundamentally having xact consistent
> > views there would be a bad thing, no?
>
> True. One weird thing around the _xact_ versions that we, at best,
> *hint* at in the docs, but also contradict, is that _xact_ views are
> actually not tied to the transaction.  It's really about unsubmitted
> stats. E.g. if executing the following via copy-paste
>
> SELECT count(*) FROM pg_class;
> SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
> SELECT count(*) FROM pg_class;
> SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
>
> will most of the time return
> <count>
> 0
> <count>
> 1
>
> because by the time the transaction for the first count(*) commits, more
> than PGSTAT_STAT_INTERVAL will have passed since stats were last
> submitted, so its stats get flushed. But after the second count(*) the
> interval will be shorter, therefore the stats won't be submitted yet...

That's... cute. I hadn't realized that part, but then I've never
actually had use for the _xact_ views.


> > > There's just a huge difference between being able to access a table's stats in
> > > O(1) time, or having a single stats access be O(database-objects).
> > >
> > > And that includes accesses to things like pg_stat_bgwriter, pg_stat_database
> > > (for IO over time stats etc) that often are monitored at a somewhat high
> > > frequency - they also pay the price of reading in all object stats.  On my
> > > example database with 1M tables it takes 0.4s to read pg_stat_database.
> >
> > IMV, singling things out into "larger groups" would be one perfectly
> > acceptable compromise. That is, say that pg_stat_user_tables can be
> > inconsistent with pg_stat_bgwriter, but it cannot be inconsistent
> > with itself.
>
> I don't think that buys us all that much though. It still is a huge
> issue that we need to cache the stats for all relations even though we
> only access the stats for a single table.

Well, you yourself just mentioned that accesses to bgwriter and db stats
are often sampled at a higher frequency.

That said, this can often include *individual* tables as well, but
maybe not all at once.


> > > > I also believe that the snapshotting behavior has advantages in terms
> > > > of being able to perform multiple successive queries and get consistent
> > > > results from them.  Only the most trivial sorts of analysis don't need
> > > > that.
> > >
> > > In most cases you'd not do that in a transaction tho, and you'd need to create
> > > temporary tables with a snapshot of the stats anyway.
> >
> > I'd say in most cases this analysis happens in snapshots anyway, and
> > those are snapshots unrelated to what we do in pg_stat. It's either
> > snapshotted to tables, or to storage in a completely separate
> > database.
>
> Agreed. I wonder if some of that work would be made easier if we added a
> function to export all the data in the current snapshot as a json
> document or such? If we add configurable caching (see below) that'd
> really not be a lot of additional work.

I'd assume if you want the snapshot in the database, you'd want it in
a database format. That is, if you want the snapshot in the db, you
actually *want* it in a table or similar. I'm not sure json format
would really help that much?

What would probably be more interesting in that case is if we could
build ourselves, either built in or as an extension, a background worker
that would listen and export openmetrics format, talking directly to
the stats and bypassing the need to have an exporter running that
connects as a regular user and ends up converting things between
many different formats on the way.


> > I would *like* to have query-level consistent views. But I may be able
> > to compromise on that one for the sake of performance as well.
> >
> > I definitely need there to be object-level consistent views.
>
> That'd be free if we didn't use all those separate function calls for
> each row in pg_stat_all_tables etc...

So.. We should fix that?

We could still keep those separate functions for backwards
compatibility if we wanted of course, but move the view to use an SRF
and make that SRF also directly callable with useful parameters.

I mean, it's been over 10 years since we did that for pg_stat_activity
in 9.1, so it's perhaps time to do another view? :)
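
As a rough sketch of that direction - the function name, its parameter
and the column list here are all made up, the real thing would mirror
how pg_stat_activity sits on top of pg_stat_get_activity():

-- hypothetical: one SRF returning a full stats row per relation
CREATE VIEW pg_stat_all_tables AS
SELECT s.relid,
       n.nspname AS schemaname,
       c.relname,
       s.seq_scan,
       s.seq_tup_read,
       s.idx_scan
       -- ... remaining counters ...
FROM pg_stat_get_all_tables(NULL) AS s  -- NULL = all relations, or one OID
JOIN pg_catalog.pg_class c ON c.oid = s.relid
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace;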


> > > I wonder if a reasonable way out could be to have pg_stat_make_snapshot()
> > > (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager
> > > data load. But not use any snapshot / caching behaviour without that?
> >
> > I think that's a pretty good idea.
>
> It's what I am leaning towards right now.
>
>
> > Another idea could be a per-user GUC of "stats_snapshots" or so, and
> > then if it's on force the snapshots all times. That way a DBA who
> > wants the snapshots could set it on their own user but keep it off for
> > the automated jobs for example. (It'd basically be the same except
> > automatically calling pg_stat_make_snapshot() the first time stats are
> > queried)
>
> Yea, I was thinking similar. We could e.g. have
> stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'.
>
> With 'none' there's no consistency beyond that a single pg_stat_get_*
> function call will give consistent results.
>
> With 'query' we cache each object on first access (i.e. there can be
> inconsistency between different objects if their accesses are further
> apart, but not within a stats object).
>
> With 'snapshot' we cache all the stats at the first access, for the
> duration of the transaction.
>
>
> However: I think 'query' is surprisingly hard to implement. There can be
> multiple overlapping queries ongoing at the same time. And even
> if there aren't multiple ongoing queries, there's really nothing great
> to hook a reset into.
>
> So I'm inclined to instead have 'access', which caches individual stats
> objects on first access, but lasts longer than the query.

I guess the naive way to do 'query' would be to just, ahem, lock the
stats while a query is ongoing. But that would probably be *really*
terrible in the case of big stats, yes :)


> Tom, if we defaulted to 'access' would that satisfy your concerns? Or
> would even 'none' as a default do? I'd rather not have SELECT * FROM
> pg_stat_all_tables; accumulate an unnecessary copy of most of the
> cluster's stats in memory just to immediately throw it away again.

But in your "cache individual stats object on first access", it would
copy most of the stats over when you did a "SELECT *", no? It would
only be able to avoid that if you had a WHERE clause on it limiting
what you got?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: shared memory stats: high level design decisions: consistency, dropping

From
Andres Freund
Date:
Hi,

I did end up implementing the configurable fetch consistency. Seems to
work well, and it's not that much code after a bunch of other
cleanups. See below the quoted part at the bottom.

I just posted the latest iteration of that code at
https://www.postgresql.org/message-id/20210405092914.mmxqe7j56lsjfsej%40alap3.anarazel.de

Copying from that mail:

I've spent most of the last 2 1/2 weeks on this now. Unfortunately I think
that, while it has gotten a lot closer, it's still about a week's worth of
work away from being committable.


My main concerns are:


- Test Coverage:

  I've added a fair number of tests, but coverage is still pretty bad. There
  were a lot of easy-to-hit bugs in earlier versions that nevertheless passed
  the tests just fine.

  Due to the addition of pg_stat_force_next_flush(), and that there's no need
  to wait for the stats collector to write out files, it's now a lot more
  realistic to have proper testing of a lot of the pgstat.c code.

- Architectural Review

  I rejiggered the patchset pretty significantly, and I think it needs more
  review than I see as realistic in the next two days. In particular I don't
  think

- Performance Testing

  I did a small amount, but given that this touches just about every query
  etc, I think that's not enough. My changes unfortunately are substantial
  enough to invalidate Horiguchi-san's earlier tests.

- Currently there's a corner case in which function (but not table!) stats
  for a dropped function may not be removed. That possibly is not too bad.

- Too many FIXMEs still open


It is quite disappointing to not have the patch go into v14 :(. But I just
don't quite see the path to it right now. Maybe I am just too tired,
and it'll look better tomorrow (err, today, in a few hours).


One aspect making this particularly annoying is that there's a number of
stats additions in v14 that'd be easier to make robust with the shared
memory based approach. But I don't think I can get it into a committable
shape in 2 days. Nor is it clear to me that it'd be a good idea to
commit it, even if I could just about make it, given that pretty much
everything involves pgstat somewhere.


On 2021-03-25 23:52:58 +0100, Magnus Hagander wrote:
> > Yea, I was thinking similar. We could e.g. have
> > stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'.
> >
> > With 'none' there's no consistency beyond that a single pg_stat_get_*
> > function call will give consistent results.
> >
> > With 'query' we cache each object on first access (i.e. there can be
> > inconsistency between different objects if their accesses are further
> > apart, but not within a stats object).
> >
> > With 'snapshot' we cache all the stats at the first access, for the
> > duration of the transaction.
> >
> >
> > However: I think 'query' is surprisingly hard to implement. There can be
> > multiple overlapping queries ongoing at the same time. And even
> > if there aren't multiple ongoing queries, there's really nothing great
> > to hook a reset into.
> >
> > So I'm inclined to instead have 'access', which caches individual stats
> > objects on first access, but lasts longer than the query.

I went with stats_fetch_consistency = {snapshot, cache, none}; 'cache'
seems a bit more obvious than 'access'. I haven't yet changed [auto]vacuum
so it uses 'none' unconditionally - but that shouldn't be hard.
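
To make the behaviour concrete, usage with the patch applied would look
roughly like this ('some_table' is just a placeholder):

SET stats_fetch_consistency = 'snapshot';
BEGIN;
SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'some_table';
-- repeated reads within the transaction return the same numbers
SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'some_table';
SELECT pg_stat_clear_snapshot();  -- discard the snapshot early, if wanted
COMMIT;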

Greetings,

Andres Freund