Thread: shared memory stats: high level design decisions: consistency, dropping
Hi,

I am working on Kyotaro Horiguchi's shared memory stats patch [1] with the goal of getting it into a shape that I'd be happy to commit. That thread is quite long and most are probably skipping over new messages in it.

There are two high-level design decisions / questions that I think warrant a wider audience (I'll keep lower-level discussion in the other thread).

In case it is not obvious, the goal of the shared memory stats patch is to replace the existing statistics collector, to which new stats are reported via a UDP socket, and where clients read data from the stats collector by reading the entire database's stats from disk.

The replacement is to put the statistics into a shared memory segment. Fixed-size stats (e.g. bgwriter, checkpointer, WAL activity, etc.) are stored directly in a struct in memory. Stats for objects that exist in variable numbers, e.g. tables, are addressed via a dshash table that points to the stats, which are in turn allocated using dsa.h.


1) What kind of consistency do we want from the pg_stats_* views?

Right now the first access to stats in a transaction will trigger a read of both the global and per-database stats from disk. If the on-disk state is too old, we'll ask the stats collector to write out a new file a couple of times.

For the rest of the transaction that in-memory state is used unless pg_stat_clear_snapshot() is called. Which means that multiple reads from e.g. pg_stat_user_tables will show the same results as before [2].

That makes stats accesses quite expensive if there are lots of objects.

But it also means that separate stats accesses - which happen all the time - return something repeatable and kind of consistent.

Now, the stats aren't really consistent in the sense that they are really accurate: UDP messages can be lost, some of the stats generated by a transaction might not yet have been received, other transactions haven't yet sent theirs, etc.

With the shared memory patch the concept of copying all stats for the current database into local memory at the time of the first stats access doesn't make sense to me. Horiguchi-san had actually implemented that, but I argued that it would be cargo-culting forward an efficiency hack required by the old storage model.

The cost of doing this is substantial. On master, with a database that contains 1 million empty tables, any stats access takes ~0.4s and increases memory usage by 170MB.


1.1)

I hope everybody agrees that we should not require stats to be the way they were at the time of the first stats access in a transaction, even if that first access was to a different stats object than the currently accessed one?


1.2)

Do we expect repeated accesses to the same stat to stay the same through the transaction? The easiest way to implement stats accesses is to simply fetch the stats from shared memory ([3]). That would obviously result in repeated accesses to the same stat potentially returning changing results over time.

I think that's perfectly fine, desirable even, for pg_stat_*.


1.3)

What kind of consistency do we expect between columns of views like pg_stat_all_tables?

Several of the stats views aren't based on SRFs or composite-type returning functions, but instead fetch each stat separately. E.g. pg_stat_all_tables:

    SELECT c.oid AS relid,
           n.nspname AS schemaname,
           c.relname,
           pg_stat_get_numscans(c.oid) AS seq_scan,
           pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
           sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
           sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
           pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
           ...
           pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
      FROM pg_class c
           LEFT JOIN pg_index i ON c.oid = i.indrelid
           ...

Which means that if we do not cache stats, additional stats updates could have been applied between two stats accesses. E.g. the seq_scan from before some pgstat_report_stat() but the seq_tup_read from after.

If we instead fetched all of a table's stats in one go, we would get consistency between the columns. But obviously that'd require changing all the stats views.

Horiguchi-san, in later iterations of the patch, attempted to address this issue by adding one-entry caches below pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc., which is what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc. use.

But I think that leads to very confusing results. Access stats for the same relation multiple times in a row? Do not see updates. Switch between e.g. a table and its indexes? See updates.

I personally think it's fine to have short-term divergences between the columns. The stats aren't that accurate anyway, as we don't submit them all the time. And if we want consistency between columns, we should instead replace the current view definitions with [set of] record returning functions - everything else seems to lead to weird tradeoffs.


2) How to remove stats of dropped objects?

In the stats collector world, stats for dropped objects (tables, indexes, functions, etc.) are dropped after the fact, using a pretty expensive process:

Each autovacuum worker cycle and each manual VACUUM does pgstat_vacuum_stat() to detect since-dropped objects. It does that by building hash tables for all databases, tables and functions, and then comparing that against a freshly loaded stats snapshot. All stats objects not in pg_class etc. are dropped.

The patch currently copies that approach, although that adds some complications, mainly around [3].

Accessing all database objects after each VACUUM, even if the table was tiny, isn't great, performance-wise. In a fresh master database with 1 million functions, a VACUUM of an empty table takes ~0.5s; with 1 million tables it's ~1s. Due to partitioning, databases with many objects are of course getting more common.

There isn't really a better approach in the stats collector world. As messages to the stats collector can get lost, we need to be able to re-do dropping of dead stats objects.

But now we could instead schedule stats to be removed at commit time. That's not trivial of course, as we'd need to handle cases where the commit fails after the commit record, but before processing the dropped stats. But it seems that integrating the stats that need to be dropped into the commit record would make a lot of sense. With benefits beyond the [auto-]vacuum efficiency gains, e.g. integrating neatly into streaming replication and even opening the door to keeping stats across crashes.

My gut feeling here is to try to fix the remaining issues in the "collect oids" approach for 14 and to try to change the approach in 15. And, if that proves too hard, try to see how hard it'd be to "accurately" drop. But I'm also not sure - it might be smarter to go full in, to avoid introducing a system that we'll just rip out again.

Comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20180629.173418.190173462.horiguchi.kyotaro%40lab.ntt.co.jp

[2] Except that new tables will show up with lots of 0s

[3] There is a cache to avoid repeated dshash lookups for previously-accessed stats, to avoid contention. But that just points to the shared memory area with the stats.
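For readers less familiar with the current behaviour described in 1), here is a minimal sketch of how today's transaction-level stats snapshot looks from a client session (the table name is made up; the point is only that repeated reads return the same numbers until pg_stat_clear_snapshot() is called):

    BEGIN;
    -- First access reads the stats file and materializes a snapshot
    -- for this transaction.
    SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'some_table';
    -- Repeated reads return the same numbers, even if other sessions
    -- have scanned the table in the meantime.
    SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'some_table';
    -- Discard the snapshot to force a fresh read on the next access.
    SELECT pg_stat_clear_snapshot();
    SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 'some_table';
    COMMIT;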
Re: shared memory stats: high level design decisions: consistency, dropping
From: Hannu Krosing
> But now we could instead schedule stats to be removed at commit time.
> That's not trivial of course, as we'd need to handle cases where the
> commit fails after the commit record, but before processing the dropped
> stats.

We likely can not remove them at commit time, but only after the oldest open snapshot moves past that commit?

Would an approach where we keep stats in a structure logically similar to the MVCC we use for normal tables be completely unfeasible? We would only need to keep one version per backend in a transaction.

---
Hannu
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi,

On 2021-03-20 01:16:31 +0100, Hannu Krosing wrote:
> > But now we could instead schedule stats to be removed at commit time.
> > That's not trivial of course, as we'd need to handle cases where the
> > commit fails after the commit record, but before processing the
> > dropped stats.
>
> We likely can not remove them at commit time, but only after the
> oldest open snapshot moves past that commit?

I don't see why? A dropped table is dropped, and cannot be accessed anymore. Snapshots don't play a role here - the underlying data is gone (minus a placeholder file to avoid reusing the oid, until the next commit). If you run a vacuum on some unrelated table in the same database, the stats for a dropped table will already be removed long before there's no relation that could theoretically open the table.

Note that table level locking would prevent a table from being dropped if a long-running transaction has already accessed it.

> Would an approach where we keep stats in a structure logically similar
> to the MVCC we use for normal tables be completely unfeasible?

Yes, pretty unfeasible. Stats should work on standbys too...

Regards,

Andres
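To illustrate the locking point, a minimal two-session sketch (the table name is made up):

    -- Session 1: a long-running transaction that has accessed the table.
    BEGIN;
    SELECT count(*) FROM some_table;   -- acquires ACCESS SHARE and holds it

    -- Session 2: blocks until session 1 commits or aborts, because
    -- DROP TABLE needs ACCESS EXCLUSIVE, which conflicts with ACCESS SHARE.
    DROP TABLE some_table;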
Re: shared memory stats: high level design decisions: consistency, dropping
From: Hannu Krosing
On Sat, Mar 20, 2021 at 1:21 AM Andres Freund <andres@anarazel.de> wrote:
> I don't see why? A dropped table is dropped, and cannot be accessed
> anymore. [...]
>
> Note that table level locking would prevent a table from being dropped
> if a long-running transaction has already accessed it.

Yeah, just checked. DROP TABLE waits until the reading transaction finishes.

> > Would an approach where we keep stats in a structure logically similar
> > to the MVCC we use for normal tables be completely unfeasible?
>
> Yes, pretty unfeasible. Stats should work on standbys too...

I did not mean actually using MVCC and real transaction ids, but rather a similar approach, where (potentially) different stats rows are kept for each backend.

This of course is only a win in case multiple backends can use the same stats row. Else it is easier to copy the backend's version into backend-local memory.

But I myself do not see any problem with stats rows changing all the time. The only worry would be parts of the same row being out of sync. This can of course be solved by locking, but for a large number of backends with tiny transactions the locking itself could potentially become a problem. Here alternating between two or more versions could help, and then it also starts to make sense to keep the copies in shared memory.
Re: shared memory stats: high level design decisions: consistency, dropping
From: Stephen Frost
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
> the goal of getting it into a shape that I'd be happy to commit.

Awesome, +1.

> 1) What kind of consistency do we want from the pg_stats_* views?
> [...]
> 1.1)
>
> I hope everybody agrees that we should not require stats to be the way
> they were at the time of the first stats access in a transaction, even
> if that first access was to a different stats object than the currently
> accessed one?

Agreed, that doesn't seem necessary, and blowing up backend memory usage by copying all the stats into local memory seems pretty terrible.

> 1.2)
>
> Do we expect repeated accesses to the same stat to stay the same through
> the transaction?
> [...]
> I think that's perfectly fine, desirable even, for pg_stat_*.

This seems alright to me.

> 1.3)
>
> What kind of consistency do we expect between columns of views like
> pg_stat_all_tables?
> [...]
> I personally think it's fine to have short-term divergences between the
> columns. The stats aren't that accurate anyway, as we don't submit them
> all the time. And if we want consistency between columns, we should
> instead replace the current view definitions with [set of] record
> returning functions - everything else seems to lead to weird tradeoffs.

Agreed, doesn't seem like a huge issue to have short-term divergences, but if we want to fix them then flipping those all to SRFs would make the most sense.

> 2) How to remove stats of dropped objects?
> [...]
> My gut feeling here is to try to fix the remaining issues in the
> "collect oids" approach for 14 and to try to change the approach in
> 15. And, if that proves too hard, try to see how hard it'd be to
> "accurately" drop. But I'm also not sure - it might be smarter to go
> full in, to avoid introducing a system that we'll just rip out again.

The current approach sounds pretty terrible and propagating that forward doesn't seem great. Guess here I'd disagree with your gut feeling and encourage trying to go 'full in', as you put it, or at least put enough effort into it to get a feeling of if it's going to require a *lot* more work or not and then reconsider if necessary.

Thanks,

Stephen
Re: shared memory stats: high level design decisions: consistency, dropping
From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> 1) What kind of consistency do we want from the pg_stats_* views?

That's a hard choice to make. But let me set the record straight: when we did the initial implementation, the stats snapshotting behavior was considered a FEATURE, not an "efficiency hack required by the old storage model".

If I understand what you are proposing, all stats views would become completely volatile, without even within-query consistency. That really is not gonna work. As an example, you could get not-even-self-consistent results from a join to a stats view if the planner decides to implement it as a nestloop with the view on the inside.

I also believe that the snapshotting behavior has advantages in terms of being able to perform multiple successive queries and get consistent results from them. Only the most trivial sorts of analysis don't need that.

In short, what you are proposing sounds absolutely disastrous for usability of the stats views, and I for one will not sign off on it being acceptable.

I do think we could relax the consistency guarantees a little bit, perhaps along the lines of only caching view rows that have already been read, rather than grabbing everything up front. But we can't just toss the snapshot concept out the window. It'd be like deciding that nobody needs MVCC, or even any sort of repeatable read.

regards, tom lane
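As a concrete illustration of the kind of query being discussed (a sketch, not taken from the thread): if the planner puts the stats view on the inner side of a nested loop, the view's underlying functions can be re-evaluated for every outer row, so without a snapshot different rows of the result could reflect different points in time.

    SELECT c.relname, s.seq_scan, s.n_tup_ins
      FROM pg_class c
      JOIN pg_stat_user_tables s ON s.relid = c.oid
     WHERE c.relkind = 'r';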
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi,

On 2021-03-21 11:41:30 -0400, Stephen Frost wrote:
> > 1.1)
> > [...]
> Agreed, that doesn't seem necessary and blowing up backend memory usage
> by copying all the stats into local memory seems pretty terrible.

Yea. I've seen instances where most backends had several hundred MB of stats loaded :(. Even leaving the timing overhead aside, that's really not fun. Of course that application may not have had exactly the greatest design, but ...

> > 1.2)
> >
> > Do we expect repeated accesses to the same stat to stay the same through
> > the transaction?
> > [...]
> > I think that's perfectly fine, desirable even, for pg_stat_*.
>
> This seems alright to me.

Seems Tom disagrees :(

> > 1.3)
> >
> > What kind of consistency do we expect between columns of views like
> > pg_stat_all_tables?
> > [...]
> Agreed, doesn't seem like a huge issue to have short-term divergences
> but if we want to fix them then flipping those all to SRFs would make
> the most sense.

There's also a pretty good efficiency argument for going to SRFs. Doing 18 hashtable lookups + function calls just to return one row of pg_stat_all_tables surely is a higher overhead than unnecessarily returning columns that weren't needed by the user. I do think it makes sense to get idx_scan/idx_tup_fetch via a join though.

> > 2) How to remove stats of dropped objects?
> > [...]
> The current approach sounds pretty terrible and propagating that forward
> doesn't seem great. Guess here I'd disagree with your gut feeling and
> encourage trying to go 'full in', as you put it, or at least put enough
> effort into it to get a feeling of if it's going to require a *lot* more
> work or not and then reconsider if necessary.

I think my gut's argument is that it's already a huge patch, and that it's better to have the very substantial memory and disk IO savings with the crappy vacuum approach than neither. And given the timeframe there does seem to be a substantial danger of "neither" being the outcome...

Anyway, I'm mentally sketching out what it'd take.

Greetings,

Andres Freund
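A sketch of the SRF idea being discussed; pg_stat_get_table_counters() is a hypothetical name used only for illustration, not an existing function. The point is that every column of a returned row would come from a single read of that table's stats entry:

    -- Hypothetical: one call fetches all counters for a table, so
    -- seq_scan and seq_tup_read are guaranteed to come from the same
    -- stats entry read rather than 18 separate lookups.
    SELECT *
      FROM pg_stat_get_table_counters('some_table'::regclass)
           AS t(seq_scan bigint, seq_tup_read bigint,
                n_tup_ins bigint, n_tup_upd bigint);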
Re: shared memory stats: high level design decisions: consistency, dropping
From: Stephen Frost
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency. That really
> is not gonna work. As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.
>
> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them. Only the most trivial sorts of analysis don't need
> that.
>
> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.
>
> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front. But we can't
> just toss the snapshot concept out the window. It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

This isn't the same use-case as traditional tables or relational concepts in general - there aren't any foreign keys for the fields that would actually be changing across these accesses to the shared memory stats; we're talking about gross stats numbers like the number of inserts into a table, not an employee_id column. In short, I don't agree that this is a fair comparison.

Perhaps there's a good argument to try and cache all this info per backend, but saying that it's because we need MVCC-like semantics for this data because other things need MVCC isn't it, and I don't know that saying this is a justifiable case for requiring repeatable read is reasonable either.

What specific, reasonable analysis of the values that we're actually talking about, which are already aggregates themselves, is going to end up being utterly confused?

Thanks,

Stephen
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi,

On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > 1) What kind of consistency do we want from the pg_stats_* views?
>
> That's a hard choice to make. But let me set the record straight:
> when we did the initial implementation, the stats snapshotting behavior
> was considered a FEATURE, not an "efficiency hack required by the old
> storage model".

Oh - sorry for misstating that then. I did try to look for the origins of the approach, and all that I found was that it'd be too expensive to do multiple stats file reads.

> If I understand what you are proposing, all stats views would become
> completely volatile, without even within-query consistency. That really
> is not gonna work. As an example, you could get not-even-self-consistent
> results from a join to a stats view if the planner decides to implement
> it as a nestloop with the view on the inside.

I don't really think it's a problem that's worth incurring that much cost to prevent. We already have that behaviour for a number of the pg_stat_* views, e.g. pg_stat_xact_all_tables, pg_stat_replication.

If the cost were low - or we can find a reasonable way to get to low costs - I think it'd be worth preserving for backward compatibility's sake alone. From an application perspective, I actually rarely want that behaviour for stats views - I'm querying them to get the most recent information, not an older snapshot. And in the cases I do want snapshots, I'd want them for longer than a transaction.

There's just a huge difference between being able to access a table's stats in O(1) time, or having a single stats access be O(database-objects).

And that includes accesses to things like pg_stat_bgwriter, pg_stat_database (for IO-over-time stats etc.) that often are monitored at a somewhat high frequency - they also pay the price of reading in all object stats. On my example database with 1M tables it takes 0.4s to read pg_stat_database.

We currently also fetch the full stats in places like autovacuum.c, where we don't need repeated access to be consistent - we even explicitly force the stats to be re-read for every single table that's getting vacuumed.

Even if we were to just cache already accessed stats, places like do_autovacuum() would end up with a completely unnecessary cache of all tables, blowing up memory usage by a large amount on systems with lots of relations.

> I also believe that the snapshotting behavior has advantages in terms
> of being able to perform multiple successive queries and get consistent
> results from them. Only the most trivial sorts of analysis don't need
> that.

In most cases you'd not do that in a transaction tho, and you'd need to create temporary tables with a snapshot of the stats anyway.

> In short, what you are proposing sounds absolutely disastrous for
> usability of the stats views, and I for one will not sign off on it
> being acceptable.

:(

That's why I thought it'd be important to bring this up to a wider audience. This has been discussed several times in the thread, and nobody really chimed in wanting the "snapshot" behaviour...

> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front. But we can't
> just toss the snapshot concept out the window. It'd be like deciding
> that nobody needs MVCC, or even any sort of repeatable read.

I think that'd still be a huge win - caching only what's been accessed rather than everything will save a lot of memory in very common cases. I did bring it up as one approach for that reason.

I do think it has a few usability quirks though. The time-skew between stats objects accessed at different times seems like it could be quite confusing? E.g. imagine looking at table stats and then later joining to index stats and seeing table / index stats not matching up at all.

I wonder if a reasonable way out could be to have pg_stat_make_snapshot() (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager data load, but not use any snapshot / caching behaviour without that?

It'd be a fair bit of code to have that, but I think I can see a way to have it not be too bad?

Greetings,

Andres Freund
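A sketch of how the proposed opt-in snapshot could look from a session; pg_stat_make_snapshot() is the function proposed above and does not exist today, while pg_stat_clear_snapshot() already does:

    SELECT pg_stat_make_snapshot();      -- proposed: eagerly snapshot all stats
    SELECT * FROM pg_stat_user_tables;   -- reads from the snapshot
    SELECT * FROM pg_stat_user_indexes;  -- consistent with the table stats above
    SELECT pg_stat_clear_snapshot();     -- existing: discard the snapshot again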
Re: shared memory stats: high level design decisions: consistency, dropping
From: Greg Stark

On Sun, 21 Mar 2021 at 18:16, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them. Only the most trivial sorts of analysis don't need
> > that.
> > [...]
> This isn't the same use-case as traditional tables or relational
> concepts in general - there aren't any foreign keys for the fields that
> would actually be changing across these accesses to the shared memory
> stats; we're talking about gross stats numbers like the number of
> inserts into a table, not an employee_id column. In short, I don't
> agree that this is a fair comparison.

I use these stats quite a bit and do lots of slicing and dicing with them. I don't think it's as bad as Tom says, but I also don't think we can be quite as loosey-goosey as I think Andres or Stephen might be proposing either (though I note they haven't said they don't want any consistency at all).

The cases where the consistency really matters for me are when I'm doing math involving more than one statistic.

Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide seq_tup_read by seq_scan or idx_tup_* by idx_scan. I also often look at the ratio between n_tup_upd and n_tup_hot_upd.

And no, it doesn't help that these are often large numbers after a long time, because I'm actually working with the first derivative of these numbers using snapshots or a time series database. So if you have seq_tup_read incremented but not seq_scan incremented, you could get a wildly incorrect calculation of "tuples read per seq scan", which actually matters. An example of that kind of calculation follows below.

I don't think I've ever done math across stats for different objects. I mean, I've plotted them together and looked at which was higher, but I don't think that's affected by some plots having peaks slightly out of sync with the others. I suppose you could look at the ratio of access patterns between two tables and know that they're only ever accessed by a single code path at the same time, and therefore the ratios would be meaningful. But I don't think users would be surprised to find they're not consistent that way either.

So I think we need to ensure that at least all the values for a single row representing a single object are consistent. Or at least that there's *some* correct way to retrieve a consistent row and that the standard views use it. I don't think we need to guarantee that every possible plan will always be consistent, even if you access the row multiple times in a self-join or use the lookup functions on individual columns separately.

--
greg
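For illustration, the kind of per-row arithmetic described above, which only needs the columns of a single row to be mutually consistent (NULLIF guards against division by zero):

    SELECT relname,
           seq_tup_read::numeric / NULLIF(seq_scan, 0)   AS tup_per_seq_scan,
           n_tup_hot_upd::numeric / NULLIF(n_tup_upd, 0) AS hot_update_ratio
      FROM pg_stat_user_tables;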
Re: shared memory stats: high level design decisions: consistency, dropping
From: Alvaro Herrera
On 2021-Mar-21, Andres Freund wrote:
> We currently also fetch the full stats in places like autovacuum.c,
> where we don't need repeated access to be consistent - we even
> explicitly force the stats to be re-read for every single table that's
> getting vacuumed.
>
> Even if we were to just cache already accessed stats, places like
> do_autovacuum() would end up with a completely unnecessary cache of all
> tables, blowing up memory usage by a large amount on systems with lots
> of relations.

It's certainly not the case that autovacuum needs to keep fully consistent stats. That's just the way that seemed easier (?) to do at the time. Unless I misunderstand things severely, we could just have autovacuum grab all necessary numbers for one database at the start of a run, not cache anything, then re-read the numbers for one table as it rechecks that table. Resetting before re-reading was obviously necessary because the built-in snapshotting made it impossible to freshen up the numbers at the recheck step.

--
Álvaro Herrera       39°49'30"S 73°17'W
Re: shared memory stats: high level design decisions: consistency, dropping
From: Magnus Hagander
On Tue, Mar 23, 2021 at 4:21 AM Greg Stark <stark@mit.edu> wrote:
> I use these stats quite a bit and do lots of slicing and dicing with
> them. I don't think it's as bad as Tom says, but I also don't think we
> can be quite as loosey-goosey as I think Andres or Stephen might be
> proposing either (though I note they haven't said they don't want any
> consistency at all).
> [...]
> I don't think I've ever done math across stats for different objects.
> I mean, I've plotted them together and looked at which was higher, but
> I don't think that's affected by some plots having peaks slightly out
> of sync with the others. I suppose you could look at the ratio of
> access patterns between two tables and know that they're only ever
> accessed by a single code path at the same time, and therefore the
> ratios would be meaningful. But I don't think users would be surprised
> to find they're not consistent that way either.

Yeah, it's important to differentiate whether things can be inconsistent within a single object, or just between objects. And I agree that in a lot of cases, just having per-object consistent data is probably enough. Normally when you graph things, for example, your peaks will look across >1 sample point anyway, and in that case it doesn't much matter, does it?

But if we said we only try to offer per-object consistency, then for example the idx_scan value in the tables view may see changes to some but not all indexes on that table. Would that be acceptable?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: shared memory stats: high level design decisions: consistency, dropping
From: Magnus Hagander
On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> > If I understand what you are proposing, all stats views would become
> > completely volatile, without even within-query consistency. That really
> > is not gonna work. [...]
>
> I don't really think it's a problem that's worth incurring that much
> cost to prevent. We already have that behaviour for a number of the
> pg_stat_* views, e.g. pg_stat_xact_all_tables, pg_stat_replication.

Aren't those both pretty bad examples though?

pg_stat_xact_all_tables surely is within-query consistent, and would be pretty useless if it was within-transaction consistent?

pg_stat_replication is a snapshot of what things are right now (like pg_stat_activity), and not collected statistics. Maybe there's an inconsistency in that they should've had a different name to separate them out, but fundamentally having xact-consistent views there would be a bad thing, no?

> If the cost were low - or we can find a reasonable way to get to low
> costs - I think it'd be worth preserving for backward compatibility's
> sake alone. From an application perspective, I actually rarely want
> that behaviour for stats views - I'm querying them to get the most
> recent information, not an older snapshot. And in the cases I do want
> snapshots, I'd want them for longer than a transaction.

I agree in general, but I'd want them to be *query-consistent*, not *transaction-consistent*. But the question is, as you say, am I willing to pay for that. Less certain of that.

> There's just a huge difference between being able to access a table's
> stats in O(1) time, or having a single stats access be
> O(database-objects).
>
> And that includes accesses to things like pg_stat_bgwriter,
> pg_stat_database (for IO-over-time stats etc.) that often are monitored
> at a somewhat high frequency - they also pay the price of reading in
> all object stats. On my example database with 1M tables it takes 0.4s
> to read pg_stat_database.

IMV, singling things out into "larger groups" would be one perfectly acceptable compromise. That is, say that pg_stat_user_tables can be inconsistent with pg_stat_bgwriter, but it cannot be inconsistent with itself. Basically anything that's "global" seems like it could be treated that way, independent of each other.

For relations and such, having a way to get just a single relation's stats, or a number of them that will be consistent with each other, without getting all of them, could also be a reasonable optimization. Maybe an SRF that takes an array of oids as a parameter and returns consistent data across those, without having to copy/mess with the rest?

> We currently also fetch the full stats in places like autovacuum.c,
> where we don't need repeated access to be consistent [...]
>
> Even if we were to just cache already accessed stats, places like
> do_autovacuum() would end up with a completely unnecessary cache of all
> tables, blowing up memory usage by a large amount on systems with lots
> of relations.

autovacuum is already dealing with things being pretty fuzzy, so it shouldn't matter much there?

But autovacuum might also deserve its own interface to access the data directly, and doesn't have to follow the same one as the stats views in this new scheme, perhaps?

> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them. Only the most trivial sorts of analysis don't need
> > that.
>
> In most cases you'd not do that in a transaction tho, and you'd need to
> create temporary tables with a snapshot of the stats anyway.

I'd say in most cases this analysis happens in snapshots anyway, and those are snapshots unrelated to what we do in pg_stat. It's either snapshotted to tables, or to storage in a completely separate database.

> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
>
> :(
>
> That's why I thought it'd be important to bring this up to a wider
> audience. This has been discussed several times in the thread, and
> nobody really chimed in wanting the "snapshot" behaviour...

I can chime in with the ones saying I don't think I need that kind of snapshot behaviour. I would *like* to have query-level consistent views. But I may be able to compromise on that one for the sake of performance as well. I definitely need there to be object-level consistent views.

> I do think we could relax the consistency guarantees a little bit,
> perhaps along the lines of only caching view rows that have already
> been read, rather than grabbing everything up front. [...]
>
> I think that'd still be a huge win - caching only what's been accessed
> rather than everything will save a lot of memory in very common cases.
> I did bring it up as one approach for that reason.
>
> I do think it has a few usability quirks though. The time-skew between
> stats objects accessed at different times seems like it could be quite
> confusing? E.g. imagine looking at table stats and then later joining
> to index stats and seeing table / index stats not matching up at all.
>
> I wonder if a reasonable way out could be to have
> pg_stat_make_snapshot() (accompanying the existing
> pg_stat_clear_snapshot()) that'd do the full eager data load, but not
> use any snapshot / caching behaviour without that?

I think that's a pretty good idea.

I bet the vast majority of all queries against the pg_stat views are done by automated tools, and they don't care about the snapshot behaviour, and thus wouldn't have to pay the overhead. In the more rare cases when you do the live analysis, you can explicitly request it.

Another idea could be a per-user GUC of "stats_snapshots" or so, and then if it's on, force the snapshots at all times. That way a DBA who wants the snapshots could set it on their own user but keep it off for the automated jobs, for example. (It'd basically be the same, except automatically calling pg_stat_make_snapshot() the first time stats are queried.)

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
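A sketch of how the per-user idea might look, assuming a hypothetical boolean GUC named stats_snapshots (not an existing setting) and made-up role names:

    ALTER ROLE dba_alice  SET stats_snapshots = on;   -- interactive analysis gets snapshots
    ALTER ROLE monitoring SET stats_snapshots = off;  -- automated polling skips the overhead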
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi, On 2021-03-24 14:42:11 +0100, Magnus Hagander wrote: > On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote: > > > If I understand what you are proposing, all stats views would become > > > completely volatile, without even within-query consistency. That really > > > is not gonna work. As an example, you could get not-even-self-consistent > > > results from a join to a stats view if the planner decides to implement > > > it as a nestloop with the view on the inside. > > > > I don't really think it's a problem that's worth incuring that much cost to > > prevent. We already have that behaviour for a number of of the pg_stat_* views, > > e.g. pg_stat_xact_all_tables, pg_stat_replication. > > Aren't those both pretty bad examples though? > > pg_stat_xact_all_tables surely is within-query consistent, and would > be pretty useless if it wwas within-transaction consistent? It's not within-query consistent: postgres[1182102][1]=# SELECT pg_stat_get_xact_numscans('pg_class'::regclass) UNION ALL SELECT count(*) FROM pg_class UNIONALL SELECT pg_stat_get_xact_numscans('pg_class'::regclass); ┌───────────────────────────┐ │ pg_stat_get_xact_numscans │ ├───────────────────────────┤ │ 0 │ │ 397 │ │ 1 │ └───────────────────────────┘ (3 rows) > pg_stat_replication is a snapshot of what things are right now (like > pg_stat_activity), and not collected statistics. However, pg_stat_activity does have snapshot semantics... > Maybe there's inconsistency in that they should've had a different > name to separate it out, but fundamentally having xact consistent > views there would be a bad thing, no? True. One weird thing around the _xact_ versions that we, at best, *hint* at in the docs, but also contradict, is that _xact_ views are actually not tied to the transaction. It's really about unsubmitted stats. E.g. if executing the following via copy-paste SELECT count(*) FROM pg_class; SELECT pg_stat_get_xact_numscans('pg_class'::regclass); SELECT count(*) FROM pg_class; SELECT pg_stat_get_xact_numscans('pg_class'::regclass); will most of the time return <count> 0 <count> 1 because after the transaction for the first count(*) commits, we'll not have submitted stats for more than PGSTAT_STAT_INTERVAL. But after the second count(*) it'll be shorter, therefore the stats won't be submitted... > > There's just a huge difference between being able to access a table's stats in > > O(1) time, or having a single stats access be O(database-objects). > > > > And that includes accesses to things like pg_stat_bgwriter, pg_stat_database > > (for IO over time stats etc) that often are monitored at a somewhat high > > frequency - they also pay the price of reading in all object stats. On my > > example database with 1M tables it takes 0.4s to read pg_stat_database. > > IMV, singeling things out into "larger groups" would be one perfectly > acceptable compromise. That is, say that pg_stat_user_tables can be > inconsistent vs with pg_stat_bgwriter, but it cannot be inconsistent > with itself. I don't think that buys us all that much though. It still is a huge issue that we need to cache the stats for all relations even though we only access the stats for table. > > We currently also fetch the full stats in places like autovacuum.c. Where we > > don't need repeated access to be consistent - we even explicitly force the > > stats to be re-read for every single table that's getting vacuumed. 
> > > > Even if we to just cache already accessed stats, places like do_autovacuum() > > would end up with a completely unnecessary cache of all tables, blowing up > > memory usage by a large amount on systems with lots of relations. > > autovacuum is already dealing with things being pretty fuzzy though, > so it shouldn't matter much there? > > But autovacuum might also deserve it's own interface to access the > data directly and doesn't have to follow the same one as the stats > views in this new scheme, perhaps? Yes, we can do that now. > > > I also believe that the snapshotting behavior has advantages in terms > > > of being able to perform multiple successive queries and get consistent > > > results from them. Only the most trivial sorts of analysis don't need > > > that. > > > > In most cases you'd not do that in a transaction tho, and you'd need to create > > temporary tables with a snapshot of the stats anyway. > > I'd say in most cases this analysis happens in snapshots anyway, and > those are snapshots unrelated to what we do in pg_stat. It's either > snapshotted to tables, or to storage in a completely separate > database. Agreed. I wonder some of that work would be made easier if we added a function to export all the data in the current snapshot as a json document or such? If we add configurable caching (see below) that'd really not be a lot of additional work. > I would *like* to have query-level consistent views. But I may be able > to compromise on that one for the sake of performance as well. > > I definitely need there to be object-level consistent views. That'd be free if we didn't use all those separate function calls for each row in pg_stat_all_tables etc... > > I wonder if a reasonable way out could be to have pg_stat_make_snapshot() > > (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager > > data load. But not use any snapshot / caching behaviour without that? > > I think that's a pretty good idea. It's what I am leaning towards right now. > Another idea could be a per-user GUC of "stats_snapshots" or so, and > then if it's on force the snapshots all times. That way a DBA who > wants the snapshots could set it on their own user but keep it off for > the automated jobs for example. (It'd basically be the same except > automatically calling pg_stat_make_snapshot() the first time stats are > queried) Yea, I was thinking similar. We could e.g. have stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'. With 'none' there's no consistency beyond that a single pg_stat_get_* function call will give consistent results. With 'query' we cache each object on first access (i.e. there can be inconsistency between different objects if their accesses are further appart, but not within a stats object). With 'snapshot' we cache all the stats at the first access, for the duration of the transaction. However: I think 'query' is surprisingly hard to implement. There can be multiple ongoing overlapping queries ongoing at the same time. And even if there aren't multiple ongoing queries, there's really nothing great to hook a reset into. So I'm inclined to instead have 'access', which caches individual stats object on first access. But lasts longer than the query. Tom, if we defaulted to 'access' would that satisfy your concerns? Or would even 'none' as a default do? I'd rather not have SELECT * FROM pg_stat_all_tables; accumulate an unnecessary copy of most of the cluster's stats in memory just to immediately throw it away again. 
> I bet the vast majority of all queries against the pg_stat views are done by automated tools, and they don't care about the snapshot behaviour, and thus wouldn't have to pay the overhead. In the more rare cases when you do the live-analysis, you can explicitly request it.

I don't think I'd often want it in that case either, but ymmv.

Greetings,

Andres Freund
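A usage sketch of the idea discussed above, assuming a pg_stat_make_snapshot() function gets added alongside the existing pg_stat_clear_snapshot(); the function name, its exact behaviour, and the table name below are placeholders taken from this discussion, not committed interfaces:

BEGIN;
-- Eagerly import all of the current database's stats, so that every later
-- stats access in this transaction sees the same values.
SELECT pg_stat_make_snapshot();

SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE relname = 'some_table';
-- Further analysis queries in the same transaction would see identical numbers.

-- Throw the snapshot away if fresher values are wanted before commit.
SELECT pg_stat_clear_snapshot();
COMMIT;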
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi,

On 2021-03-21 14:53:50 -0700, Andres Freund wrote:
> On 2021-03-21 11:41:30 -0400, Stephen Frost wrote:
> > > 2) How to remove stats of dropped objects?
> > >
> > > [...]
> >
> > The current approach sounds pretty terrible and propagating that forward doesn't seem great. Guess here I'd disagree with your gut feeling and encourage trying to go 'full in', as you put it, or at least put enough effort into it to get a feeling of whether it's going to require a *lot* more work or not and then reconsider if necessary.
>
> I think my gut's argument is that it's already a huge patch, and that it's better to have the very substantial memory and disk IO savings with the crappy vacuum approach, than neither. And given the timeframe there does seem to be a substantial danger of "neither" being the outcome... Anyway, I'm mentally sketching out what it'd take.

I implemented this. Far from polished, but it does survive the regression tests, including new tests in stats.sql. Function stats, 2PC and lots of naming issues (xl_xact_dropped_stats with members ndropped and dropped_stats) are yet to be addressed - but those are not architecturally relevant.

I think there are three hairy corner-cases that I haven't thought sufficiently about:

- It seems likely that there's a relatively narrow window where a crash could end up not dropping stats. The xact.c integration is basically parallel to smgrGetPendingDeletes etc. I don't see what prevents a checkpoint from happening after RecordTransactionCommit() (which does smgrGetPendingDeletes), but before smgrDoPendingDeletes(). Which means that if we crash in that window, nothing would clean up those files (and thus also not the dropped stats that I implemented very similarly). Closing that window seems doable (put the pending file/stat deletion list in shared memory after logging the WAL record, but before MyProc->delayChkpt = false, and do something with that in CheckPointGuts()), but not trivial. If we had a good way to run pgstat_vacuum_stat() on a very *occasional* basis via autovac, that'd be ok. But it'd be a lot nicer if it were bulletproof.

- Does temporary table cleanup after a crash properly deal with their stats?

- I suspect there are a few cases where pending stats in one connection could "revive" an already dropped stat. It'd not be hard to address in an inefficient way in the shmstats patch, but it'd come with some efficiency penalty - though that might be an irrelevant efficiency difference.

Definitely for later, but if we got this ironed out, we probably could stop throwing stats away after crashes, and instead store stats snapshots alongside redo LSNs or such. It'd be nice if immediate shutdowns etc wouldn't lead to autovacuum not doing any vacuuming for quite a while.

Greetings,

Andres Freund
Re: shared memory stats: high level design decisions: consistency, dropping
From: Robert Haas

On Sun, Mar 21, 2021 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If I understand what you are proposing, all stats views would become completely volatile, without even within-query consistency. That really is not gonna work. As an example, you could get not-even-self-consistent results from a join to a stats view if the planner decides to implement it as a nestloop with the view on the inside.
>
> I also believe that the snapshotting behavior has advantages in terms of being able to perform multiple successive queries and get consistent results from them. Only the most trivial sorts of analysis don't need that.
>
> In short, what you are proposing sounds absolutely disastrous for usability of the stats views, and I for one will not sign off on it being acceptable.
>
> I do think we could relax the consistency guarantees a little bit, perhaps along the lines of only caching view rows that have already been read, rather than grabbing everything up front. But we can't just toss the snapshot concept out the window. It'd be like deciding that nobody needs MVCC, or even any sort of repeatable read.

So, just as a data point, the output of pg_locks is not stable within a transaction. In fact, the pg_locks output could technically include the same exact lock more than once, if it's being moved from the fast-path table to the main table just as you are reading all the data. In theory, that could produce the same kinds of problems that you're concerned about here, and I suspect sometimes it does. But I haven't seen a lot of people yelling and screaming about it. The situation isn't ideal, but it's not disastrous either.

I think it's really hard for us as developers to predict what kinds of effects these kinds of decisions will have in real-world deployments. All of us have probably had the experience of making some behavioral change that we thought would not be too big a deal and it actually pissed off a bunch of users who were relying on the old behavior. I know I have. Conversely, I've reluctantly made changes that seemed rather dangerous to me and heard nary a peep. If somebody takes the position that changing this behavior is scary because we don't know how many users will be inconvenienced or how badly, I can only agree. But saying that it's tantamount to deciding that nobody needs MVCC is completely over the top. This is statistical data, not user data, and there are good reasons to think that people don't have the same expectations in both cases, starting with the fact that we have some stuff that works like that already.

More than that, there's a huge problem with the way this works today that can't be fixed without making some compromises. In the test case Andres mentioned upthread, the stats snapshot burned through 170MB of RAM. Now, you might dismiss that as not much memory in 2021, but if you have a lot of backends accessing the stats, that value could be multiplied by a two digit or even three digit number, and that is *definitely* a lot of memory, even in 2021. But even if it's not multiplied by anything, we're shipping with a default work_mem of just 4MB. So, the position we're implicitly taking today is: if you join two 5MB tables, it's too risky to put the entire contents of one of them into a single in-memory hash table, because we might run the machine out of RAM.
But if you have millions of objects in your database and touch the statistics for one of those objects, once, it's absolutely OK to slurp tens or even hundreds of megabytes of data into backend-private memory to avoid the possibility that you might later access another one of those counters and expect snapshot semantics.

To be honest, I don't find either of those positions very believable. I do not think it likely that the typical user really wants a 5MB hash join to be done in batches to save memory, and I think it equally unlikely that everybody wants to read and cache tens or hundreds of megabytes of data to get MVCC semantics for volatile statistics.

I think there are probably some cases where having that information be stable across a transaction lifetime is really useful, so if we can provide that as an optional behavior, I think that would be a pretty good idea. But I don't think it's reasonable for that to be the only behavior, and I'm doubtful about whether it should even be the default. I bet there are a lot of cases where somebody just wants to take a quick glance at some of the values as they exist right now, and has no intention of running any more queries that might examine the same data again later. Not only does caching all the data use a lot of memory, but having to read all the data in order to cache it is potentially a lot slower than just reading the data actually requested. I'm unwilling to dismiss that as a negligible problem.

In short, I agree that there's stuff to worry about here, but I don't agree that a zero-consistency model is a crazy idea, even though I also think it would be nice if we can make stronger consistency available upon request.

--
Robert Haas
EDB: http://www.enterprisedb.com
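The nestloop hazard Tom raises can be pictured with a simple self-join; this is only an illustration of the failure mode under the zero-consistency model sketched in this thread, not output from the patch:

-- With no snapshot or caching, each rescan of the inner side re-reads shared
-- memory, so the two seq_scan values for the same relation can disagree
-- within a single query if the planner chooses a nested loop.
SELECT o.relname,
       o.seq_scan AS seq_scan_outer,
       i.seq_scan AS seq_scan_inner
FROM pg_stat_user_tables AS o
JOIN pg_stat_user_tables AS i USING (relid);

An optional snapshot or per-object cache, as discussed above, would make the two columns agree again.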
Re: shared memory stats: high level design decisions: consistency, dropping
From: Magnus Hagander
On Thu, Mar 25, 2021 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2021-03-24 14:42:11 +0100, Magnus Hagander wrote:
> > On Sun, Mar 21, 2021 at 11:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > > If I understand what you are proposing, all stats views would become completely volatile, without even within-query consistency. That really is not gonna work. As an example, you could get not-even-self-consistent results from a join to a stats view if the planner decides to implement it as a nestloop with the view on the inside.
> > >
> > > I don't really think it's a problem that's worth incurring that much cost to prevent. We already have that behaviour for a number of the pg_stat_* views, e.g. pg_stat_xact_all_tables, pg_stat_replication.
> >
> > Aren't those both pretty bad examples though?
> >
> > pg_stat_xact_all_tables surely is within-query consistent, and would be pretty useless if it was within-transaction consistent?
>
> It's not within-query consistent:
>
> postgres[1182102][1]=# SELECT pg_stat_get_xact_numscans('pg_class'::regclass) UNION ALL SELECT count(*) FROM pg_class UNION ALL SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
> ┌───────────────────────────┐
> │ pg_stat_get_xact_numscans │
> ├───────────────────────────┤
> │                         0 │
> │                       397 │
> │                         1 │
> └───────────────────────────┘
> (3 rows)

Heh. OK, I admit I didn't consider a UNION query like that -- I only considered it being present *once* in a query :)

That said, if wanted, that can be dealt with using WITH MATERIALIZED as long as it's in the same query, no?

> > pg_stat_replication is a snapshot of what things are right now (like pg_stat_activity), and not collected statistics.
>
> However, pg_stat_activity does have snapshot semantics...

Yeah, yay consistency.

> > Maybe there's inconsistency in that they should've had a different name to separate it out, but fundamentally having xact consistent views there would be a bad thing, no?
>
> True. One weird thing around the _xact_ versions that we, at best, *hint* at in the docs, but also contradict, is that the _xact_ views are actually not tied to the transaction. It's really about unsubmitted stats. E.g. executing the following via copy-paste
>
> SELECT count(*) FROM pg_class;
> SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
> SELECT count(*) FROM pg_class;
> SELECT pg_stat_get_xact_numscans('pg_class'::regclass);
>
> will most of the time return
>
> <count>
> 0
> <count>
> 1
>
> because after the transaction for the first count(*) commits, we'll not have submitted stats for more than PGSTAT_STAT_INTERVAL. But after the second count(*) it'll be shorter, therefore the stats won't be submitted...

That's... cute. I hadn't realized that part, but then I've never actually had use for the _xact_ views.

> > > There's just a huge difference between being able to access a table's stats in O(1) time, or having a single stats access be O(database-objects).
> > >
> > > And that includes accesses to things like pg_stat_bgwriter, pg_stat_database (for IO over time stats etc) that often are monitored at a somewhat high frequency - they also pay the price of reading in all object stats. On my example database with 1M tables it takes 0.4s to read pg_stat_database.
> >
> > IMV, singling things out into "larger groups" would be one perfectly acceptable compromise.
> > That is, say that pg_stat_user_tables can be inconsistent with pg_stat_bgwriter, but it cannot be inconsistent with itself.
>
> I don't think that buys us all that much though. It still is a huge issue that we need to cache the stats for all relations even though we only access the stats for one table.

Well, you yourself just mentioned that access to bgwriter and db stats are often sampled at a higher frequency. That said, this can often include *individual* tables as well, but maybe not all at once.

> > > > I also believe that the snapshotting behavior has advantages in terms of being able to perform multiple successive queries and get consistent results from them. Only the most trivial sorts of analysis don't need that.
> > >
> > > In most cases you'd not do that in a transaction tho, and you'd need to create temporary tables with a snapshot of the stats anyway.
> >
> > I'd say in most cases this analysis happens in snapshots anyway, and those are snapshots unrelated to what we do in pg_stat. It's either snapshotted to tables, or to storage in a completely separate database.
>
> Agreed. I wonder if some of that work would be made easier if we added a function to export all the data in the current snapshot as a json document or such? If we add configurable caching (see below) that'd really not be a lot of additional work.

I'd assume if you want the snapshot in the database, you'd want it in a database format. That is, if you want the snapshot in the db, you actually *want* it in a table or similar. I'm not sure json format would really help that much?

What would probably be more interesting in that case is if we could build ourselves, either builtin or as an extension, a background worker that would listen and export OpenMetrics format, talking directly to the stats and bypassing the need to have an exporter running that connects as a regular user and ends up converting the things between many different formats on the way.

> > I would *like* to have query-level consistent views. But I may be able to compromise on that one for the sake of performance as well.
> >
> > I definitely need there to be object-level consistent views.
>
> That'd be free if we didn't use all those separate function calls for each row in pg_stat_all_tables etc...

So... we should fix that? We could still keep those separate functions for backwards compatibility if we wanted of course, but move the view to use an SRF and make that SRF also directly callable with useful parameters.

I mean, it's been over 10 years since we did that for pg_stat_activity in 9.1, so it's perhaps time to do another view? :)

> > > I wonder if a reasonable way out could be to have pg_stat_make_snapshot() (accompanying the existing pg_stat_clear_snapshot()) that'd do the full eager data load. But not use any snapshot / caching behaviour without that?
> >
> > I think that's a pretty good idea.
>
> It's what I am leaning towards right now.
>
> > Another idea could be a per-user GUC of "stats_snapshots" or so, and then if it's on force the snapshots at all times. That way a DBA who wants the snapshots could set it on their own user but keep it off for the automated jobs for example. (It'd basically be the same except automatically calling pg_stat_make_snapshot() the first time stats are queried)
>
> Yea, I was thinking similar. We could e.g. have stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'.
> With 'none' there's no consistency beyond that a single pg_stat_get_* function call will give consistent results.
>
> With 'query' we cache each object on first access (i.e. there can be inconsistency between different objects if their accesses are further apart, but not within a stats object).
>
> With 'snapshot' we cache all the stats at the first access, for the duration of the transaction.
>
> However: I think 'query' is surprisingly hard to implement. There can be multiple overlapping queries ongoing at the same time. And even if there aren't multiple ongoing queries, there's really nothing great to hook a reset into.
>
> So I'm inclined to instead have 'access', which caches individual stats objects on first access, but lasts longer than the query.

I guess the naive way to do query would be to just, ahem, lock the stats while a query is ongoing. But that would probably be *really* terrible in the case of big stats, yes :)

> Tom, if we defaulted to 'access' would that satisfy your concerns? Or would even 'none' as a default do? I'd rather not have SELECT * FROM pg_stat_all_tables; accumulate an unnecessary copy of most of the cluster's stats in memory just to immediately throw it away again.

But in your "cache individual stats object on first access", it would copy most of the stats over when you did a "SELECT *", no? It would only be able to avoid that if you had a WHERE clause on it limiting what you got?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
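The "move the view to an SRF" idea Magnus mentions above could look roughly like the sketch below. pg_stat_get_table_stats() is a hypothetical record-returning function, not an existing one, and the column list is abbreviated; the point is that all of a relation's counters are fetched in one call, so the columns of a row stay mutually consistent without any extra caching:

-- Hypothetical reworked view definition: one function call per relation
-- instead of one function call per column.
CREATE VIEW pg_stat_all_tables AS
SELECT c.oid AS relid,
       n.nspname AS schemaname,
       c.relname,
       s.*                              -- seq_scan, seq_tup_read, n_tup_ins, ...
FROM pg_class c
     JOIN pg_namespace n ON n.oid = c.relnamespace,
     LATERAL pg_stat_get_table_stats(c.oid) AS s
WHERE c.relkind IN ('r', 't', 'm');     -- simplified; index counters omitted

The function could also be called directly, e.g. SELECT * FROM pg_stat_get_table_stats('some_table'::regclass), which is the "directly callable with useful parameters" part of the suggestion.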
Re: shared memory stats: high level design decisions: consistency, dropping
From: Andres Freund
Hi,

I did end up implementing the configurable fetch consistency. Seems to work well, and it's not that much code after a bunch of other cleanups. See below the quoted part at the bottom.

I just posted the latest iteration of that code at https://www.postgresql.org/message-id/20210405092914.mmxqe7j56lsjfsej%40alap3.anarazel.de

Copying from that mail:

I've spent most of the last 2 1/2 weeks on this now. Unfortunately I think that, while it has gotten a lot closer, it's still about a week's worth of work away from being committable. My main concerns are:

- Test Coverage: I've added a fair bit of tests, but it's still pretty bad. There were a lot of easy-to-hit bugs in earlier versions that nevertheless passed the tests just fine. Due to the addition of pg_stat_force_next_flush(), and the fact that there's no longer a need to wait for the stats collector to write out files, it's now a lot more realistic to have proper testing of a lot of the pgstat.c code.

- Architectural Review: I rejiggered the patchset pretty significantly, and I think it needs more review than I see as realistic in the next two days. In particular I don't think

- Performance Testing: I did a small amount, but given that this touches just about every query etc, I think that's not enough. My changes unfortunately are substantial enough to invalidate Horiguchi-san's earlier tests.

- Currently there's a corner case in which function (but not table!) stats for a dropped function may not be removed. That possibly is not too bad.

- Too many FIXMEs still open.

It is quite disappointing to not have the patch go into v14 :(. But I just don't quite see the path right now. But maybe I am just too tired right now, and it'll look better tomorrow (err, today, in a few hours).

One aspect making this particularly annoying is that there's a number of stats additions in v14 that'd be easier to make robust with the shared memory based approach. But I don't think I can get it into a committable shape in 2 days. Nor is it clear to me that it'd be a good idea to commit it, even if I could just about make it, given that pretty much everything involves pgstat somewhere.

On 2021-03-25 23:52:58 +0100, Magnus Hagander wrote:
> > Yea, I was thinking similar. We could e.g. have stats_snapshot_consistency, an enum of 'none', 'query', 'snapshot'.
> >
> > With 'none' there's no consistency beyond that a single pg_stat_get_* function call will give consistent results.
> >
> > With 'query' we cache each object on first access (i.e. there can be inconsistency between different objects if their accesses are further apart, but not within a stats object).
> >
> > With 'snapshot' we cache all the stats at the first access, for the duration of the transaction.
> >
> > However: I think 'query' is surprisingly hard to implement. There can be multiple overlapping queries ongoing at the same time. And even if there aren't multiple ongoing queries, there's really nothing great to hook a reset into.
> >
> > So I'm inclined to instead have 'access', which caches individual stats objects on first access, but lasts longer than the query.

I went with stats_fetch_consistency = {snapshot, cache, none}; 'cache' seems a bit more obvious a name than 'access'.

I haven't yet changed [auto]vacuum so it uses 'none' unconditionally - but that shouldn't be hard.

Greetings,

Andres Freund
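To make the end result tangible, here is how the setting described in this mail might be used, assuming stats_fetch_consistency lands with the values named above (the table name is a placeholder):

-- Monitoring tools that only sample current values can skip all caching.
SET stats_fetch_consistency = 'none';
SELECT seq_scan, n_tup_ins, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'pgbench_accounts';

-- Interactive analysis that wants repeatable numbers within a transaction
-- can request a full snapshot instead.
BEGIN;
SET LOCAL stats_fetch_consistency = 'snapshot';
SELECT sum(seq_scan) FROM pg_stat_user_tables;
SELECT sum(seq_scan) FROM pg_stat_user_tables;  -- same result as the previous query
COMMIT;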