Re: shared memory stats: high level design decisions: consistency, dropping - Mailing list pgsql-hackers

From: Stephen Frost
Subject: Re: shared memory stats: high level design decisions: consistency, dropping
Msg-id: 20210321154130.GG20766@tamriel.snowman.net
In response to: shared memory stats: high level design decisions: consistency, dropping  (Andres Freund <andres@anarazel.de>)
Responses: Re: shared memory stats: high level design decisions: consistency, dropping  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> I am working on Kyotaro Horiguchi's shared memory stats patch [1] with
> the goal of getting it into a shape that I'd be happy to commit.  That
> thread is quite long and most people are probably skipping over new
> messages in it.

Awesome, +1.

> 1) What kind of consistency do we want from the pg_stats_* views?
>
> Right now the first access to stats in a transaction will trigger a read
> of both the global and per-database stats from disk. If the on-disk
> state is too old, we'll ask the stats collector to write out a new file
> a couple times.
>
> For the rest of the transaction that in-memory state is used unless
> pg_stat_clear_snapshot() is called. Which means that multiple reads from
> e.g. pg_stat_user_tables will show the same results as before [2].
>
> That makes stats accesses quite expensive if there are lots of
> objects.
>
> But it also means that separate stats accesses - which happen all the
> time - return something repeatable and kind of consistent.
>
> Now, the stats aren't really consistent in the sense of being accurate
> anyway: UDP messages can be lost, some of the stats generated by a
> transaction might not yet have been received, and other transactions
> might not have sent theirs yet. Etc.
>
>
> With the shared memory patch the concept of copying all stats for the
> current database into local memory at the time of the first stats access
> doesn't make sense to me.  Horiguchi-san had actually implemented that,
> but I argued that that would be cargo-culting an efficiency hack
> required by the old storage model forward.
>
> The cost of doing this is substantial. On master, with a database that
> contains 1 million empty tables, any stats access takes ~0.4s and
> increases memory usage by 170MB.
>
>
> 1.1)
>
> I hope everybody agrees that we don't need to require stats to look the
> way they did at the time of the first stats access in the transaction,
> even if that first access was to a different stats object than the one
> currently being accessed?

Agreed, that doesn't seem necessary and blowing up backend memory usage
by copying all the stats into local memory seems pretty terrible.
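
For anyone following along, the current snapshot behavior is easy to see
from psql; roughly something like this (assuming some table 'foo', counter
values purely illustrative):

    BEGIN;
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'foo';
    -- say this returns 10
    -- ... meanwhile other sessions scan foo and their stats get reported ...
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'foo';
    -- still 10: the whole transaction reuses the snapshot taken above
    SELECT pg_stat_clear_snapshot();
    SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'foo';
    -- the snapshot is re-read, so newer activity can now show up
    COMMIT;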

> 1.2)
>
> Do we expect repeated accesses to the same stat to stay the same through
> the transaction?  The easiest way to implement stats accesses is to
> simply fetch the stats from shared memory ([3]). That would obviously
> result in repeated accesses to the same stat potentially returning
> changing results over time.
>
> I think that's perfectly fine, desirable even, for pg_stat_*.

This seems alright to me.
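
Just to spell out what that means in practice (again assuming a table
'foo' that other sessions are scanning; values illustrative):

    BEGIN;
    SELECT pg_stat_get_numscans('foo'::regclass);  -- e.g. 10
    -- another session scans foo and flushes its stats
    SELECT pg_stat_get_numscans('foo'::regclass);  -- could now return 11
    COMMIT;

i.e. the second read may legitimately differ from the first within the
same transaction.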

> 1.3)
>
> What kind of consistency do we expect between columns of views like
> pg_stat_all_tables?
>
> Several of the stats views aren't based on SRFs or composite-type
> returning functions, but instead fetch each stat separately:
>
> E.g. pg_stat_all_tables:
>  SELECT c.oid AS relid,
>     n.nspname AS schemaname,
>     c.relname,
>     pg_stat_get_numscans(c.oid) AS seq_scan,
>     pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
>     sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
>     sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
>     pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
> ...
>     pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
>    FROM pg_class c
>      LEFT JOIN pg_index i ON c.oid = i.indrelid
> ...
>
> Which means that if we do not cache stats, additional stats updates
> could have been applied between two stats accesses. E.g. the seq_scan
> from before some pgstat_report_stat(), but the seq_tup_read from after.
>
> If we instead fetch all of a table's stats in one go, we would get
> consistency between the columns. But obviously that'd require changing
> all the stats views.
>
> Horiguchi-san, in later iterations of the patch, attempted to address
> this issue by adding one-entry caches below pgstat_fetch_stat_tabentry(),
> pgstat_fetch_stat_dbentry(), etc., which are what pg_stat_get_numscans(),
> pg_stat_get_db_tuples_updated(), etc. use.
>
>
> But I think that leads to very confusing results. Access stats for the
> same relation multiple times in a row? You see no updates. Switch
> between e.g. a table and its indexes? You see updates.
>
>
> I personally think it's fine to have short-term divergences between the
> columns. The stats aren't that accurate anyway, as we don't submit them
> all the time.  And if we want consistency between columns, we should
> instead replace the current view definitions with [set of] record
> returning functions - everything else seems to lead to weird tradeoffs.

Agreed, doesn't seem like a huge issue to have short-term divergences
but if we want to fix them then flipping those all to SRFs would make
the most sense.
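
To sketch what that record-returning-function direction could look like
(the function below is entirely made up for illustration - here it just
wraps the existing per-column functions, whereas a real implementation
would read all of a table's counters in one consistent fetch):

    CREATE FUNCTION pg_stat_get_table_sketch(relid oid,
                                             OUT seq_scan bigint,
                                             OUT seq_tup_read bigint,
                                             OUT n_tup_ins bigint)
    RETURNS record
    LANGUAGE sql STABLE
    AS $$
      SELECT pg_stat_get_numscans(relid),
             pg_stat_get_tuples_returned(relid),
             pg_stat_get_tuples_inserted(relid);
    $$;

    -- the views would then make one call per relation instead of calling
    -- a separate pg_stat_get_*() function per column:
    SELECT c.oid AS relid, n.nspname AS schemaname, c.relname, s.*
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
      CROSS JOIN LATERAL pg_stat_get_table_sketch(c.oid) s;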

> 2) How to remove stats of dropped objects?
>
> In the stats collector world stats for dropped objects (tables, indexes,
> functions, etc) are dropped after the fact, using a pretty expensive
> process:
>
> Each autovacuum worker cycle and each manual VACUUM does
> pgstat_vacuum_stat() to detect since-dropped objects. It does that by
> building hash-tables for all databases, tables and functions, and then
> comparing those against a freshly loaded stats snapshot. All stats
> objects not in pg_class etc. are dropped.
>
> The patch currently copies that approach, although that adds some
> complications, mainly around [3].
>
>
> Accessing all database objects after each VACUUM, even if the table was
> tiny, isn't great performance-wise. In a fresh master database with 1
> million functions, a VACUUM of an empty table takes ~0.5s; with 1
> million tables it's ~1s.  Due to partitioning, databases with many
> objects are of course getting more common.
>
>
> There isn't really a better approach in the stats collector world. As
> messages to the stats collector can get lost, we need to be able to
> re-do dropping of dead stats objects.
>
>
> But now we could instead schedule stats to be removed at commit
> time. That's not trivial of course, as we'd need to handle cases where
> the commit fails after the commit record, but before processing the
> dropped stats.
>
> But it seems that integrating the stats that need to be dropped into the
> commit record would make a lot of sense, with benefits beyond the
> [auto-]vacuum efficiency gains: e.g. it would integrate neatly with
> streaming replication and even open the door to keeping stats across
> crashes.
>
>
> My gut feeling here is to try to fix the remaining issues in the
> "collect oids" approach for 14 and to try to change the approach in
> 15. And, if that proves too hard, try to see how hard it'd be to
> "accurately" drop.  But I'm also not sure - it might be smarter to go
> full in, to avoid introducing a system that we'll just rip out again.
>
> Comments?

The current approach sounds pretty terrible and propagating that forward
doesn't seem great.  I guess here I'd disagree with your gut feeling and
encourage trying to go 'full in', as you put it, or at least put enough
effort into it to get a feeling for whether it's going to require a *lot*
more work or not, and then reconsider if necessary.
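
For anyone wanting to reproduce the numbers you quoted, something along
these lines should show the effect (untested sketch; the table names are
made up, and creating that many objects itself takes a good while):

    -- in a scratch database, create a large number of empty tables
    SELECT format('CREATE TABLE bloat_%s ()', i)
      FROM generate_series(1, 1000000) AS i
    \gexec

    CREATE TABLE tiny ();
    \timing on
    -- the VACUUM itself is trivial; the time goes into pgstat_vacuum_stat()
    -- walking the stats for every object in the database
    VACUUM tiny;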

Thanks,

Stephen
