Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Autovacuum on partitioned table (autoanalyze)
Date
Msg-id 20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
In response to Re: Autovacuum on partitioned table (autoanalyze)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Autovacuum on partitioned table (autoanalyze)  (Andres Freund <andres@anarazel.de>)
Re: Autovacuum on partitioned table (autoanalyze)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: Autovacuum on partitioned table (autoanalyze)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
Hi,

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> On 2021-Apr-07, Alvaro Herrera wrote:
>
> > OK, I bit the bullet and re-did the logic in the way I had proposed
> > earlier in the thread: do the propagation on the collector's side, by
> > sending only the list of ancestors: the collector can read the tuple
> > change count by itself, to add it to each ancestor.  This seems less
> > wasteful.  Attached is v16 which does it that way and seems to work
> > nicely under my testing.
>
> Pushed with this approach.  Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

1) Somehow it seems like a violation to do stuff like
   get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
   it feels a bit off. Would likely not be too hard to address, e.g. by just
   putting some of pgstat_report_anl_ancestors in partition.c instead.

2) Why does it make sense that autovacuum sends a stats message for every
   partition in the system that had any changes since the last autovacuum
   cycle? On a database with a good number of objects / a short naptime we'll
   often end up sending messages for the same set of tables from separate
   workers, because they don't yet see the concurrent
   tabentry->changes_since_analyze_reported.

3) What is the goal of the autovac_refresh_stats() after the loop doing
   pgstat_report_anl_ancestors()? I think it'll be common that the stats
   collector hasn't even processed the incoming messages by that point, not to
   speak of actually having written out a new stats file. If it took less than
   10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
   backend_read_statsfile() will not wait for a new stats file to be written
   out, and we'll just re-read the same state we read before.

   It's pretty expensive to re-read the stats file in some workloads, so I'm a
   bit concerned that we end up significantly increasing the amount of stats
   updates/reads, without actually gaining anything reliable?

4) In the shared mem stats patch I went to a fair bit of trouble to try to get
   rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
   systems). For that to work pending stats can only be "staged" while holding
   a lock on a relation that prevents the relation from being concurrently
   dropped (pending stats increment a refcount for the shared stats object,
   which ensures that we don't lose track of the fact that a stats object has
   been dropped, even when stats only get submitted later).

   I'm not yet clear on how to make this work for
   pgstat_report_anl_ancestors() - but I probably can find a way. But it does
   feel a bit off to issue stats stuff for tables we're not sure still exist.


I'll go and read through the thread, but my first thought is that having a
hashtable in do_autovacuum() that contains stats for partitioned tables would
be a good bit more efficient than the current approach? We already have a
hashtable for toast tables; compared to that, having one for partitioned
tables doesn't seem like it'd be a problem?

With a small bit of extra work that could even avoid the need for the
additional pass through pg_class. Do the partitioned table data-gathering as
part of the "collect main tables to vacuum" pass, and then do one of

a) make the perform-analyze decision purely off the contents of that
   partitioned-table hash
b) fetch the RELOID syscache entry by oid and then decide based on that
c) handle partitioned tables as part of the "check TOAST tables" pass - it's
   not like we gain a meaningful amount of efficiency by using a ScanKey to
   filter for RELKIND_TOASTVALUE, given that there's no index, and that an
   index wouldn't commonly be useful given the percentage of toast tables in
   pg_class

Partitioning makes it a bigger issue that do_autovacuum() does multiple passes
through pg_class (as it makes scenarios in which pg_class is large more
common), so I don't think it's great that partitioning also increases the
number of passes through pg_class.

Greetings,

Andres Freund


