shared memory stats: high level design decisions: consistency, dropping - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | shared memory stats: high level design decisions: consistency, dropping
Msg-id | 20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de
List | pgsql-hackers
Hi,

I am working on Kyotaro Horiguchi's shared memory stats patch [1] with the goal of getting it into a shape that I'd be happy to commit. That thread is quite long and most people are probably skipping over new messages in it. There are two high-level design decisions / questions that I think warrant a wider audience (I'll keep the lower-level discussion in the other thread).

In case it is not obvious: the goal of the shared memory stats patch is to replace the existing statistics collector, to which new stats are reported via a UDP socket, and from which clients read data by loading the entire database's stats from disk. The replacement is to put the statistics into a shared memory segment. Fixed-size stats (e.g. bgwriter, checkpointer, WAL activity, etc.) are stored directly in a struct in memory. Stats for objects of which a variable number exist, e.g. tables, are addressed via a dshash table that points to the stats, which in turn are allocated using dsa.h.


1) What kind of consistency do we want from the pg_stats_* views?

Right now the first access to stats in a transaction triggers a read of both the global and per-database stats from disk. If the on-disk state is too old, we ask the stats collector to write out a new file a couple of times. For the rest of the transaction that in-memory state is used unless pg_stat_clear_snapshot() is called, which means that multiple reads from e.g. pg_stat_user_tables will show the same results as before [2].

That makes stats accesses quite expensive if there are lots of objects. But it also means that separate stats accesses - which happen all the time - return something repeatable and kind of consistent. Now, the stats aren't really consistent in the sense of being accurate: UDP messages can be lost, only some of the stats generated by a transaction might have been received so far, other transactions might not have sent theirs yet, etc.

With the shared memory patch, the concept of copying all stats for the current database into local memory at the time of the first stats access doesn't make sense to me. Horiguchi-san had actually implemented that, but I argued that it would be cargo-culting forward an efficiency hack required by the old storage model. The cost of doing this is substantial: on master, with a database that contains 1 million empty tables, any stats access takes ~0.4s and increases memory usage by 170MB.

1.1) I hope everybody agrees that we do not need to guarantee that stats look the way they did at the time of the first stats access in a transaction, even if that first access was to a different stats object than the one currently being accessed?

1.2) Do we expect repeated accesses to the same stat to stay the same throughout the transaction?

The easiest way to implement stats accesses is to simply fetch the stats from shared memory ([3]). That would obviously mean that repeated accesses to the same stat can return changing results over time. I think that's perfectly fine, desirable even, for pg_stat_*. A small illustration of the behavioral difference is sketched below.
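To make 1.2 concrete, here is a minimal illustration, assuming the no-snapshot behavior described above; the table name and the counter values are made up:

    BEGIN;
    -- first stats access in this transaction
    SELECT pg_stat_get_numscans('pgbench_accounts'::regclass);   -- returns, say, 10
    -- ... meanwhile another backend seq-scans the table and reports its stats ...
    SELECT pg_stat_get_numscans('pgbench_accounts'::regclass);
    -- master: still 10, because the first access built a transaction-local
    --         snapshot (pg_stat_clear_snapshot() would be needed to refresh it)
    -- shmem patch, no per-transaction caching: could already return 11
    COMMIT;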
1.3) What kind of consistency do we expect between columns of views like pg_stat_all_tables?

Several of the stats views aren't based on SRFs or composite-type returning functions, but instead fetch each stat separately. E.g. pg_stat_all_tables:

    SELECT c.oid AS relid,
           n.nspname AS schemaname,
           c.relname,
           pg_stat_get_numscans(c.oid) AS seq_scan,
           pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
           sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
           sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint +
               pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
           pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
           ...
           pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
      FROM pg_class c
           LEFT JOIN pg_index i ON c.oid = i.indrelid
      ...

Which means that, if we do not cache stats, additional stats updates could be applied between two stats accesses. E.g. seq_scan could come from before some pgstat_report_stat() and seq_tup_read from after it. If we instead fetched all of a table's stats in one go, we would get consistency between the columns - but obviously that'd require changing all the stats views.

Horiguchi-san, in later iterations of the patch, attempted to address this issue by adding one-entry caches below pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc., which is what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc. use. But I think that leads to very confusing results: access stats for the same relation multiple times in a row, and you do not see updates; switch between e.g. a table and its indexes, and you do.

I personally think it's fine to have short-term divergences between the columns. The stats aren't that accurate anyway, since we don't submit them continuously. And if we do want consistency between columns, we should instead replace the current view definitions with [a set of] record returning functions - everything else seems to lead to weird tradeoffs. A sketch of what that could look like follows below.
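To make that alternative concrete, here is a rough sketch of the shape it could take. The function name pg_stat_get_table_stats() and the column subset are made up; a real implementation would be a C-level function fetching the whole stats entry in a single access - the SQL wrapper over the existing per-counter functions shown here would not, by itself, buy any consistency, it only illustrates the view-level structure:

    -- Hypothetical: one function returning (a subset of) a table's counters as
    -- a single record.  The real thing would be a C function doing one fetch
    -- of the table's stats entry.
    CREATE FUNCTION pg_stat_get_table_stats(relid oid,
                                            OUT seq_scan bigint,
                                            OUT seq_tup_read bigint,
                                            OUT n_tup_ins bigint,
                                            OUT n_tup_upd bigint,
                                            OUT n_tup_del bigint)
    RETURNS record LANGUAGE sql AS $$
        SELECT pg_stat_get_numscans(relid),
               pg_stat_get_tuples_returned(relid),
               pg_stat_get_tuples_inserted(relid),
               pg_stat_get_tuples_updated(relid),
               pg_stat_get_tuples_deleted(relid);
    $$;

    -- The view would then expand that record once per relation:
    SELECT c.oid AS relid, n.nspname AS schemaname, c.relname, s.*
      FROM pg_class c
           LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
           CROSS JOIN LATERAL pg_stat_get_table_stats(c.oid) AS s
     WHERE c.relkind IN ('r', 't', 'm', 'p');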
2) How to remove stats of dropped objects?

In the stats collector world, stats for dropped objects (tables, indexes, functions, etc.) are dropped after the fact, using a pretty expensive process: each autovacuum worker cycle and each manual VACUUM runs pgstat_vacuum_stat() to detect since-dropped objects. It does that by building hash tables of all databases, tables and functions, and then comparing them against a freshly loaded stats snapshot; all stats objects not in pg_class etc. are dropped.

The patch currently copies that approach, although that adds some complications, mainly around [3]. Accessing all database objects after each VACUUM, even if the table was tiny, isn't great performance-wise. In a fresh master database with 1 million functions, a VACUUM of an empty table takes ~0.5s; with 1 million tables it's ~1s. Due to partitioning, databases with many objects are of course getting more common.

There isn't really a better approach in the stats collector world: as messages to the stats collector can get lost, we need to be able to re-do the dropping of dead stats objects. But now we could instead schedule stats to be removed at commit time. That's not trivial of course, as we'd need to handle cases where the commit fails after the commit record, but before processing the dropped stats. But it seems that integrating the stats that need to be dropped into the commit record would make a lot of sense, with benefits beyond the [auto-]vacuum efficiency gains, e.g. integrating neatly into streaming replication and even opening the door to keeping stats across crashes.

My gut feeling here is to try to fix the remaining issues in the "collect oids" approach for 14, and to try to change the approach in 15. And, if that proves too hard, to see how hard it'd be to drop "accurately".

But I'm also not sure - it might be smarter to go all in, to avoid introducing a system that we'll just rip out again.

Comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20180629.173418.190173462.horiguchi.kyotaro%40lab.ntt.co.jp
[2] Except that new tables will show up with lots of 0s.
[3] There is a cache to avoid repeated dshash lookups for previously-accessed stats, to avoid contention. But that just points to the shared memory area with the stats.