Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date:
Msg-id: 840e6c56-4e05-8cdd-5478-9f8ad8130c63@enterprisedb.com
In response to: Re: Autovacuum on partitioned table (autoanalyze) (yuzuko <yuzukohosoya@gmail.com>)
Responses: Re: Autovacuum on partitioned table (autoanalyze)
List: pgsql-hackers
Hi, I took a look at this patch. It does not apply because of 5f8727f5a67, so a rebase is needed. But I want to discuss the general approach here, so that does not matter much. The thread is fairly long, both in terms of the number of messages and time (it started in 2019), so let me restate my understanding of the problem and what the patch aims to do.

The problem is that autovacuum never analyzes non-leaf relations in partition hierarchies, because they never get modified directly and so their changes_since_analyze value remains 0. This applies both to inheritance-based partitioning and to the new fancy declarative partitioning. The consequence is that we never have accurate statistics (MCV, histograms and so on) for the parent, which may lead to poor query plans in cases where we don't use the child statistics for some reason.

The patch aims to fix that by propagating changes_since_analyze to the parent relations, so that autovacuum properly considers whether those non-leaf relations need an analyze.

I think the goal is right, and propagating changes_since_analyze is the right solution, but as coded the patch has a couple of issues that may cause trouble in practice.

Firstly, the patch propagates the changes_since_analyze values from do_analyze_rel, i.e. from the worker after it analyzes the relation. That may easily lead to unnecessary analyzes - consider a partitioned table with 4 child relations:

    p1 [reltuples=1M, changes_since_analyze=400k]
    p2 [reltuples=1M, changes_since_analyze=90k]
    p3 [reltuples=1M, changes_since_analyze=90k]
    p4 [reltuples=1M, changes_since_analyze=90k]

With the default analyze threshold (10%) this means an autoanalyze of p1, and then (in the next round) an analyze of the whole partitioned table, because 400k is 10% of 4M. So far so good - we're now in this state:

    p1 [reltuples=1M, changes_since_analyze=0]
    p2 [reltuples=1M, changes_since_analyze=90k]
    p3 [reltuples=1M, changes_since_analyze=90k]
    p4 [reltuples=1M, changes_since_analyze=90k]

Let's do ~310k more modifications to p2:

    p1 [reltuples=1M, changes_since_analyze=0]
    p2 [reltuples=1M, changes_since_analyze=400k]
    p3 [reltuples=1M, changes_since_analyze=90k]
    p4 [reltuples=1M, changes_since_analyze=90k]

Now p2 gets analyzed, and the value gets propagated to the parent, triggering another analyze. But that's bogus - we've already seen 90k of those rows in the last analyze, so the "actual" changes_since_analyze is just 310k, and that should not have triggered the analyze. I could invent much more extreme examples with more partitions, and/or with multiple autovacuum workers processing the leaf rels concurrently.

This is quite a serious issue, because analyzes on partitioned tables sample all the partitions, which is rather expensive. That is not an issue introduced by this patch, of course, but it's good to keep it in mind and not make matters worse. Note: I do have some ideas about how to improve that, and I've started a separate thread about it [1].

IMHO the primary issue is that the patch reports the counts from the workers, and does so incrementally, after the fact. It tries to prevent the issue by not propagating the counts to parents analyzed in the same round, but that does not seem sufficient:

- There's no guarantee how long ago the parent was analyzed. Maybe it was 1 second ago, or maybe it was 24h ago and there have been many new modifications since then.

- The hash table is per worker, so who knows what the other autovacuum workers did?

So not really a good solution, I'm afraid.
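For reference, here is a minimal standalone sketch of the threshold arithmetic used in the example above. The helper name and the hard-coded values are mine, purely for illustration - the actual check lives in relation_needs_vacanalyze:

    #include <stdio.h>
    #include <stdbool.h>

    /* analyze once changes exceed: base threshold + scale factor * reltuples */
    static bool
    needs_autoanalyze(double reltuples, double changes_since_analyze,
                      double base_threshold, double scale_factor)
    {
        return changes_since_analyze > base_threshold + scale_factor * reltuples;
    }

    int
    main(void)
    {
        double base = 50, scale = 0.1;      /* the default settings */

        /* leaf p1: 1M tuples, 400k changes -> gets analyzed */
        printf("p1: %s\n",
               needs_autoanalyze(1e6, 400000, base, scale) ? "yes" : "no");

        /* parent: 4M tuples, so the cutoff is 400,050 (rounded to "10% of 4M"
         * above); once the propagated count crosses it, the whole partitioned
         * table gets analyzed in the next round */
        printf("parent threshold: %.0f\n", base + scale * 4e6);

        return 0;
    }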
I propose a different approach - instead of propagating the counts in do_analyze_rel for individual leaf tables, let's do that in bulk in relation_needs_vacanalyze. Before the (existing) first pass over pg_class, we can add a new one, propagating counts from leaf tables to parents. I'd imagine something like this:

    while ((tuple = heap_getnext(relScan, ...)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        ... find all ancestors for classForm ...

        pgstat_propagate_changes(classForm, ancestors);
    }

The pgstat_propagate_changes() call simply informs the pgstat collector that classForm has the given ancestors, and the collector propagates the value to all of them.

There's a problem, though - we can't reset the value for the leaf table, because we still need it to decide whether the leaf itself needs an analyze, but we also don't want to send the same changes again next time. We can, however, add another counter, tracking the part of changes_since_analyze we have already propagated, and propagate only the difference. That is, we now have this:

    PgStat_Counter changes_since_analyze;
    PgStat_Counter changes_since_analyze_reported;

So for example we start with

    changes_since_analyze = 10000;
    changes_since_analyze_reported = 0;

and we propagate 10k to the parents:

    changes_since_analyze = 10000;
    changes_since_analyze_reported = 10000;

but we don't analyze anything, and we accumulate 5k more changes:

    changes_since_analyze = 15000;
    changes_since_analyze_reported = 10000;

so now we propagate only the 5k delta. And so on. (A small compilable sketch of this delta bookkeeping follows after the signature.)

It's not exactly an atomic change (we still do this per relation), but it's "bulk" in the sense that we force the propagation and don't wait until the leaf happens to be analyzed. It might need to reread the stats file, I think, to get the incremented values, but that seems acceptable. We may need to "sync" the counts for individual relations in a couple of places (e.g. after the worker is done with the leaf, it should propagate the remaining delta before resetting the values to 0). Maybe multi-level partitioning needs some additional handling, not sure.

regards

[1] https://commitfest.postgresql.org/33/3052/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
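As mentioned above, here is a small compilable sketch of the delta bookkeeping, just to illustrate the idea. The struct and function names are made up for this example - they are not the actual pgstat structures or API:

    #include <stdio.h>

    typedef long long PgStat_Counter;

    typedef struct LeafTabStats
    {
        PgStat_Counter changes_since_analyze;          /* total since last analyze */
        PgStat_Counter changes_since_analyze_reported; /* already sent to parents */
    } LeafTabStats;

    /* Return the not-yet-propagated part and remember that it was sent. */
    static PgStat_Counter
    propagate_delta(LeafTabStats *leaf)
    {
        PgStat_Counter delta = leaf->changes_since_analyze -
                               leaf->changes_since_analyze_reported;

        leaf->changes_since_analyze_reported = leaf->changes_since_analyze;
        return delta;
    }

    int
    main(void)
    {
        LeafTabStats leaf = {10000, 0};

        printf("first pass:  %lld\n", propagate_delta(&leaf));  /* 10000 */

        leaf.changes_since_analyze += 5000;     /* 5k more modifications */
        printf("second pass: %lld\n", propagate_delta(&leaf));  /* 5000 */

        /* once the leaf itself is analyzed, both counters go back to 0 */
        leaf.changes_since_analyze = 0;
        leaf.changes_since_analyze_reported = 0;

        return 0;
    }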