Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 
Msg-id: 20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
In response to: Re: Autovacuum on partitioned table (autoanalyze) (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses:
  Re: Autovacuum on partitioned table (autoanalyze)
  Re: Autovacuum on partitioned table (autoanalyze)
  Re: Autovacuum on partitioned table (autoanalyze)
List: pgsql-hackers
Hi,

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> On 2021-Apr-07, Alvaro Herrera wrote:
> > OK, I bit the bullet and re-did the logic in the way I had proposed
> > earlier in the thread: do the propagation on the collector's side, by
> > sending only the list of ancestors: the collector can read the tuple
> > change count by itself, to add it to each ancestor. This seems less
> > wasteful. Attached is v16 which does it that way and seems to work
> > nicely under my testing.
>
> Pushed with this approach. Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

1) Somehow it seems like a violation to do stuff like
get_partition_ancestors() in pgstat.c. It's nothing I can't live with,
but it feels a bit off. It would likely not be too hard to address,
e.g. by putting some of pgstat_report_anl_ancestors in partition.c
instead.

2) Why does it make sense for autovacuum to send a stats message for
every partition in the system that had any changes since the last
autovacuum cycle? On a database with a good number of objects and a
short naptime we'll often end up sending messages for the same set of
tables from separate workers, because they don't yet see the concurrent
tabentry->changes_since_analyze_reported.

3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point,
let alone actually written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be
written out, and we'll just re-read the state we previously had.
It's pretty expensive to re-read the stats file in some workloads, so
I'm a bit concerned that we end up significantly increasing the number
of stats updates/reads without actually gaining anything reliable.

4) In the shared memory stats patch I went to a fair bit of trouble to
try to get rid of pgstat_vacuum_stat() (which scales extremely poorly
to larger systems). For that to work, pending stats can only be
"staged" while holding a lock on a relation that prevents the relation
from being concurrently dropped (pending stats increment a refcount for
the shared stats object, which ensures that we don't lose track of the
fact that a stats object has been dropped, even when stats only get
submitted later). I'm not yet clear on how to make this work for
pgstat_report_anl_ancestors() - but I can probably find a way. It does
feel a bit off, though, to issue stats messages for tables we're not
sure still exist.

I'll go and read through the thread, but my first thought is that
having a hashtable in do_autovacuum() that contains stats for
partitioned tables would be a good bit more efficient than the current
approach? We already have a hashtable for toast tables; compared to
that, having a hashtable for partitioned tables doesn't seem like it'd
be a problem. With a small bit of extra work that could even avoid the
need for the additional pass through pg_class.
Do the partitioned-table data gathering as part of the "collect main
tables to vacuum" pass, and then do one of:

a) make the perform-analyze decision purely off the contents of that
   partitioned-table hash

b) fetch the RELOID syscache entry by oid and then decide based on that

c) handle partitioned tables as part of the "check TOAST tables" pass -
   it's not like we gain a meaningful amount of efficiency by using a
   ScanKey to filter for RELKIND_TOASTVALUE, given that there's no
   index, and that an index wouldn't commonly be useful given the
   percentage of toast tables in pg_class

Partitioning makes it a bigger issue that do_autovacuum() does multiple
passes through pg_class (as it makes scenarios in which pg_class is
large more common), so I don't think it's great that partitioning also
increases the number of passes through pg_class.

Greetings,

Andres Freund