shared memory stats: high level design decisions: consistency, dropping - Mailing list pgsql-hackers

Hi,

I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
the goal of getting it into a shape that I'd be happy to commit.  That
thread is quite long and most are probably skipping over new messages in
it.

There are two high-level design decisions / questions that I think
warrant a wider audience (I'll keep lower level discussion in the other
thread).

In case it is not obvious, the goal of the shared memory stats patch is
to replace the existing statistics collector, to which new stats are
reported via an UDP socket, and where clients read data from the stats
collector by reading the entire database's stats from disk.

The replacement is to put the statistics into a shared memory
segment. Fixed-size stats (e.g. bgwriter, checkpointer, wal activity,
etc) being stored directly in a struct in memory. Stats for objects
where a variable number exists, e.g. tables, are addressed via a dshash.
table that points to the stats that are in turn allocated using dsa.h.


1) What kind of consistency do we want from the pg_stats_* views?

Right now the first access to stats in a transaction will trigger a read
of both the global and per-database stats from disk. If the on-disk
state is too old, we'll ask the stats collector to write out a new file
a couple times.

For the rest of the transaction that in-memory state is used unless
pg_stat_clear_snapshot() is called. Which means that multiple reads from
e.g. pg_stat_user_tables will show the same results as before [2].

That makes stats accesses quite expensive if there are lots of
objects.

But it also means that separate stats accesses - which happen all the
time - return something repeatable and kind of consistent.

Now, the stats aren't really consistent in the sense that they are
really accurate, UDP messages can be lost, or only some of the stats
generated by a TX might not yet have been received, other transactions
haven't yet sent them. Etc.


With the shared memory patch the concept of copying all stats for the
current database into local memory at the time of the first stats access
doesn't make sense to me.  Horiguchi-san had actually implemented that,
but I argued that that would be cargo-culting an efficiency hack
required by the old storage model forward.

The cost of doing this is substantial. On master, with a database that
contains 1 million empty tables, any stats access takes ~0.4s and
increases memory usage by 170MB.


1.1)

I hope everybody agrees with not requiring that stats don't need to be
the way they were at the time of first stat access in a transaction,
even if that first access was to a different stat object than the
currently accessed stat?


1.2)

Do we expect repeated accesses to the same stat to stay the same through
the transaction?  The easiest way to implement stats accesses is to
simply fetch the stats from shared memory ([3]). That would obviously
result in repeated accesses to the same stat potentially returning
changing results over time.

I think that's perfectly fine, desirable even, for pg_stat_*.


1.3)

What kind of consistency do we expect between columns of views like
pg_stat_all_tables?

Several of the stats views aren't based on SRFs or composite-type
returning functions, but instead fetch each stat separately:

E.g. pg_stat_all_tables:
 SELECT c.oid AS relid,
    n.nspname AS schemaname,
    c.relname,
    pg_stat_get_numscans(c.oid) AS seq_scan,
    pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
    sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
    sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
    pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
...
    pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
   FROM pg_class c
     LEFT JOIN pg_index i ON c.oid = i.indrelid
...

Which means that if we do not cache stats, additional stats updates
could have been applied between two stats accessors. E.g the seq_scan
from before some pgstat_report_stat() but the seq_tup_read from after.

If we instead fetch all of a table's stats in one go, we would get
consistency between the columns. But obviously that'd require changing
all the stats views.

Horiguchi-san, in later iterations of the patch, attempted to address
this issue by adding a one-entry caches below
pgstat_fetch_stat_tabentry(), pgstat_fetch_stat_dbentry() etc, which is
what pg_stat_get_numscans(), pg_stat_get_db_tuples_updated() etc use.


But I think that leads to very confusing results. Access stats for the
same relation multiple times in a row? Do not see updates. Switch
between e.g. a table and its indexes? See updates.


I personally think it's fine to have short-term divergences between the
columns. The stats aren't that accurate anyway, as we don't submit them
all the time.  And that if we want consistency between columns, we
instead should replace the current view definitions with[set of] record
returning function - everything else seems to lead to weird tradeoffs.



2) How to remove stats of dropped objects?

In the stats collector world stats for dropped objects (tables, indexes,
functions, etc) are dropped after the fact, using a pretty expensive
process:

Each autovacuum worker cycle and each manual VACUUM does
pgstat_vacuum_stat() to detect since-dropped objects. It does that by
building hash-tables for all databases, tables and functions, and then
comparing that against a freshly loaded stats snapshot. All stats object
not in pg_class etc are dropped.

The patch currently copies that approach, although that adds some
complications, mainly around [3].


Accessing all database objects after each VACUUM, even if the table was
tiny, isn't great, performance wise. In a fresh master database with 1
million functions, a VACUUM of an empty table takes ~0.5s, with 1
million tables it's ~1s.  Due to partitioning tables with many database
objects are of course getting more common.


There isn't really a better approach in the stats collector world. As
messages to the stats collector can get lost, we need to to be able to
re-do dropping of dead stats objects.


But now we could instead schedule stats to be removed at commit
time. That's not trivial of course, as we'd need to handle cases where
the commit fails after the commit record, but before processing the
dropped stats.

But it seems that integrating the stats that need to be dropped into the
commit message would make a lot of sense. With benefits beyond the
[auto-]vacuum efficiency gains, e.g. neatly integrate into streaming
replication and even opening the door to keep stats across crashes.


My gut feeling here is to try to to fix the remaining issues in the
"collect oids" approach for 14 and to try to change the approach in
15. And, if that proves too hard, try to see how hard it'd be to
"accurately" drop.  But I'm also not sure - it might be smarter to go
full in, to avoid introducing a system that we'll just rip out again.


Comments?


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20180629.173418.190173462.horiguchi.kyotaro%40lab.ntt.co.jp

[2] Except that new tables with show up with lots of 0s

[3] There is a cache to avoid repeated dshash lookups for
    previously-accessed stats, to avoid contention. But that just points
    to the shared memory area with the stats.



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: fdatasync performance problem with large number of DB files
Next
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Custom compression methods