Thread: Re: Autovacuum on partitioned table (autoanalyze)
At Tue, 25 Aug 2020 14:28:20 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in
> > I attach the latest patch that solves the above Werror.
> > Could you please check it again?
>
> This version now passes the tests in the Travis pipeline as can be seen in the
> link below, and is ready to be reviewed in the upcoming commitfest:
>
> http://cfbot.cputube.org/yuzuko-hosoya.html

At Mon, 6 Jul 2020 19:35:37 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in
> I think there are other approaches like Tom's idea that Justin previously
> referenced, but this patch works the same way as previous patches.
> (tracks updated/inserted/deleted tuples and checks whether the partitioned
> tables need auto-analyze, same as nonpartitioned tables)
> Because I wanted to be able to analyze partitioned tables by autovacuum
> as a first step, and I think this approach is the simplest way to do it.

I'm not sure whether anything bad happens if parent and children do not
agree on statistics.

The requirement suggested here seems to be:

- We want to update the parent's stats when any of its children gets its
  stats updated. This is crucial especially for time-series partitioning.

- However, we don't want to analyze the whole tree every time any of the
  children is analyzed.

To achieve both, stats merging seems to be the optimal solution.

Putting that aside, I had a brief look at the latest patch.

	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) ||
+		relkind == RELKIND_PARTITIONED_TABLE)
	{
		rel->pgstat_info = NULL;

RELKIND_HAS_STORAGE(RELKIND_PARTITIONED_TABLE) is already false.
Maybe you wanted to do "&& relkind !=" instead :p

+	/*
+	 * If this relation is partitioned, we store all ancestors' oid
+	 * to propagate its changed_tuples to their parents when this
+	 * transaction is committed.
+	 */
+	if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)

If the relation was detached then attached to another partition within
a transaction, the ancestor list would get stale and the succeeding
modifications to the relation would propagate into the wrong ancestors.

I think vacuum time is more appropriate for modifying ancestors' stats.
It seems to me that what Alvaro pointed out is the list-order-susceptible
manner of collecting children's modified tuples.

+				? 0 /* partitioned tables don't have any data, so it's 0 */

If the comment is true, we shouldn't have non-zero t_changed_tuples,
either. I think the reason for these lines is something different.

# Oops! Time's up now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
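The two requirements above (bump the parents' counters on child activity, without re-analyzing the whole tree) can be pictured with a toy model. This is not PostgreSQL's actual pgstat machinery; `RelStats` and `report_changes` are invented names for illustration only:

```c
#include <stddef.h>

/* Hypothetical in-memory stand-in for a pgstat table entry: each
 * relation tracks changes_since_analyze and links to its parent
 * partition (NULL at the partition root). */
typedef struct RelStats
{
    long changes_since_analyze;
    struct RelStats *parent;
} RelStats;

/* Record tuple changes on a leaf and bump every ancestor's counter,
 * so parents accumulate activity without a whole-tree analyze. */
static void
report_changes(RelStats *leaf, long nchanges)
{
    leaf->changes_since_analyze += nchanges;
    for (RelStats *p = leaf->parent; p != NULL; p = p->parent)
        p->changes_since_analyze += nchanges;
}
```

Each parent then trips its own analyze threshold independently, which is the behavior the two bullets ask for.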
Horiguchi-san,

Thank you for reviewing.

On Tue, Sep 15, 2020 at 7:01 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Tue, 25 Aug 2020 14:28:20 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in
> > > I attach the latest patch that solves the above Werror.
> > > Could you please check it again?
> >
> > This version now passes the tests in the Travis pipeline as can be seen in the
> > link below, and is ready to be reviewed in the upcoming commitfest:
> >
> > http://cfbot.cputube.org/yuzuko-hosoya.html
>
> At Mon, 6 Jul 2020 19:35:37 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in
> > I think there are other approaches like Tom's idea that Justin previously
> > referenced, but this patch works the same way as previous patches.
> > (tracks updated/inserted/deleted tuples and checks whether the partitioned
> > tables need auto-analyze, same as nonpartitioned tables)
> > Because I wanted to be able to analyze partitioned tables by autovacuum
> > as a first step, and I think this approach is the simplest way to do it.
>
> I'm not sure whether anything bad happens if parent and children do not
> agree on statistics.
>
> The requirement suggested here seems to be:
>
> - We want to update the parent's stats when any of its children gets its
>   stats updated. This is crucial especially for time-series partitioning.
>
> - However, we don't want to analyze the whole tree every time any of the
>   children is analyzed.
>
> To achieve both, stats merging seems to be the optimal solution.
>
> Putting that aside, I had a brief look at the latest patch.
>
> 	/* We only count stats for things that have storage */
> -	if (!RELKIND_HAS_STORAGE(relkind))
> +	if (!RELKIND_HAS_STORAGE(relkind) ||
> +		relkind == RELKIND_PARTITIONED_TABLE)
> 	{
> 		rel->pgstat_info = NULL;
>
> RELKIND_HAS_STORAGE(RELKIND_PARTITIONED_TABLE) is already false.
> Maybe you wanted to do "&& relkind !=" instead :p
>
Oh, indeed. I'll fix it.

> +	/*
> +	 * If this relation is partitioned, we store all ancestors' oid
> +	 * to propagate its changed_tuples to their parents when this
> +	 * transaction is committed.
> +	 */
> +	if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
>
> If the relation was detached then attached to another partition within
> a transaction, the ancestor list would get stale and the succeeding
> modifications to the relation would propagate into the wrong ancestors.
>
> I think vacuum time is more appropriate for modifying ancestors' stats.
> It seems to me that what Alvaro pointed out is the list-order-susceptible
> manner of collecting children's modified tuples.
>
I previously proposed a patch that modified ancestors' stats at vacuum time.
At that time, following comments from Alvaro and Amit, I tried to update
the parents' changes_since_analyze in every ANALYZE. However, in that case,
the problem mentioned in [1] occurred, and I could not find a way to avoid
it. I think it can be solved by updating the parents' changes_since_analyze
only in the case of auto analyze, but what do you think?

> > +	? 0 /* partitioned tables don't have any data, so it's 0 */
>
> If the comment is true, we shouldn't have non-zero t_changed_tuples,
> either. I think the reason for these lines is something different.
>
Yes, surely. I think updating the values of live_tuples and dead_tuples
is confusing for users. I'll consider another comment.

[1] https://www.postgresql.org/message-id/CAKkQ50-bwFEDMBGb1JmDXffXsiU8xk-hN6kJK9CKjdBa7r%3DHdw%40mail.gmail.com

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Hello,

I reconsidered a way based on the v5 patch in line with
Horiguchi-san's comment.

This approach is as follows:

- A partitioned table is checked for whether it needs analyze, like a
  plain table, in relation_needs_vacanalyze(). To do this, we should
  store the partitioned table's stats (changes_since_analyze).

- A partitioned table's changes_since_analyze is updated when a leaf
  partition is analyzed, by propagating the partition's
  changes_since_analyze. At the next scheduled analyze time, it is used
  in the above check. That is, the partitioned table is analyzed behind
  its leaf partitions.

- The propagation process differs between autoanalyze and plain ANALYZE.
  In autoanalyze, a leaf partition's changes_since_analyze is propagated
  to *all* ancestors. Whereas in a plain ANALYZE of an inheritance tree,
  it propagates only to ancestors not included in the tree, to avoid
  needless counting.

I attach the latest patch to this email.
Could you check it again, please?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
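The first bullet leans on the usual autoanalyze trigger condition. A minimal self-contained sketch of that check, assuming the standard formula (analyze threshold = base threshold + scale factor * reltuples); the function name and signature are illustrative stand-ins, not the real relation_needs_vacanalyze() interface:

```c
#include <stdbool.h>

/* Hypothetical mirror of the analyze part of relation_needs_vacanalyze():
 * a relation needs analyze once changes_since_analyze exceeds
 * base threshold + scale factor * reltuples.  For a partitioned table,
 * reltuples would be a roll-up over its leaf partitions. */
static bool
needs_autoanalyze(double reltuples, long changes_since_analyze,
                  int anl_base_thresh, double anl_scale_factor)
{
    double anlthresh = (double) anl_base_thresh +
                       anl_scale_factor * reltuples;

    return (double) changes_since_analyze > anlthresh;
}
```

With the defaults (base threshold 50, scale factor 0.1), a 1000-tuple partition tree would be re-analyzed after roughly 150 propagated changes.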
Thank you for the new version.

At Fri, 23 Oct 2020 15:12:51 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in
> Hello,
>
> I reconsidered a way based on the v5 patch in line with
> Horiguchi-san's comment.
>
> This approach is as follows:
> - A partitioned table is checked for whether it needs analyze, like a
>   plain table, in relation_needs_vacanalyze(). To do this, we should
>   store the partitioned table's stats (changes_since_analyze).
> - A partitioned table's changes_since_analyze is updated when a leaf
>   partition is analyzed, by propagating the partition's
>   changes_since_analyze. At the next scheduled analyze time, it is used
>   in the above check. That is, the partitioned table is analyzed behind
>   its leaf partitions.
> - The propagation process differs between autoanalyze and plain ANALYZE.
>   In autoanalyze, a leaf partition's changes_since_analyze is propagated
>   to *all* ancestors. Whereas in a plain ANALYZE of an inheritance tree,
>   it propagates only to ancestors not included in the tree, to avoid
>   needless counting.
>
> I attach the latest patch to this email.
> Could you check it again, please?

+	/*
+	 * Get its all ancestors to propagate changes_since_analyze count.
+	 * However, when ANALYZE inheritance tree, we get ancestors of
+	 * toprel_oid to avoid needless counting.
+	 */
+	if (!OidIsValid(toprel_oid))
+		ancestors = get_partition_ancestors(RelationGetRelid(rel));
+	else
+		ancestors = get_partition_ancestors(toprel_oid);

This comment doesn't explain what the code intends, only what the
code does.

The reason for the difference is that if we have a valid toprel_oid,
we analyze all descendants of the relation this time, and if we
propagated the number to the descendants of the top relation, the next
analyze on those relations could happen sooner than expected.

-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* if this is a partitioned table, skip modifying */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* if this is a partitioned table, skip modifying */
+		: deadtuples;

Two successive branches on the same condition look odd. And we need an
explanation of *why* we don't set the values for partitioned tables.

+	foreach(lc, ancestors)
+	{
+		Oid			parentOid = lfirst_oid(lc);
+		Relation	parentrel;
+
+		parentrel = table_open(parentOid, AccessShareLock);

I'm not sure, but not all of the ancestors can be a parent (in other
words, a parent of a parent of mine is not a parent of mine). Isn't
just "rel" sufficient?

-	 * Report ANALYZE to the stats collector, too. However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats. Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too. Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,

This still rejects traditional-inheritance nonleaf relations. But if
we completely remove the description about that in the comment above,
we should support traditional inheritance parents here. I think we can
do that as far as we need to propagate only per-tuple stats (that
means not per-attribute stats) like changes_since_analyze.

Whichever way we take, do we need a description of the behavior in the
documentation?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
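On the two-ternaries point, a hypothetical reshaping with a single branch, which also gives the "why" comment a natural home. `AnalyzeMsg` and `fill_analyze_msg` are invented stand-ins, not the actual `PgStat_MsgAnalyze` code path:

```c
#include <stdbool.h>

/* Toy stand-in for the tuple-count fields of PgStat_MsgAnalyze. */
typedef struct AnalyzeMsg
{
    long m_live_tuples;
    long m_dead_tuples;
} AnalyzeMsg;

/* One branch instead of two ternaries on the same condition.  A
 * partitioned table stores no tuples itself, so its live/dead counts
 * are reported as zero; only changes_since_analyze (handled elsewhere)
 * is meaningful for it. */
static void
fill_analyze_msg(AnalyzeMsg *msg, bool is_partitioned,
                 long livetuples, long deadtuples)
{
    if (is_partitioned)
        livetuples = deadtuples = 0;

    msg->m_live_tuples = livetuples;
    msg->m_dead_tuples = deadtuples;
}
```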
On Fri, Oct 23, 2020 at 03:12:51PM +0900, yuzuko wrote:
> Hello,
>
> I reconsidered a way based on the v5 patch in line with
> Horiguchi-san's comment.
>
> This approach is as follows:
> - A partitioned table is checked for whether it needs analyze, like a
>   plain table, in relation_needs_vacanalyze(). To do this, we should
>   store the partitioned table's stats (changes_since_analyze).
> - A partitioned table's changes_since_analyze is updated when a leaf
>   partition is analyzed, by propagating the partition's
>   changes_since_analyze. At the next scheduled analyze time, it is used
>   in the above check. That is, the partitioned table is analyzed behind
>   its leaf partitions.
> - The propagation process differs between autoanalyze and plain ANALYZE.
>   In autoanalyze, a leaf partition's changes_since_analyze is propagated
>   to *all* ancestors. Whereas in a plain ANALYZE of an inheritance tree,
>   it propagates only to ancestors not included in the tree, to avoid
>   needless counting.

+	 * Get its all ancestors to propagate changes_since_analyze count.
+	 * However, when ANALYZE inheritance tree, we get ancestors of
+	 * toprel_oid to avoid needless counting.

=> I don't understand that comment.

+	/* Find all members of inheritance set taking AccessShareLock */
+	children = find_all_inheritors(relid, AccessShareLock, NULL);

=> Do you know that this returns the table itself? And in pg14dev, each
partitioned table has reltuples = -1, not zero...

+		/* Skip foreign partitions */
+		if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+			continue;

=> Michael's suggestion is to use RELKIND_HAS_STORAGE to skip both foreign
and partitioned tables.

Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't
think you need to copy it anyway - just call ReleaseSysCache().

Regarding the counters in pg_stat_all_tables: maybe some of these should
be null rather than zero? Or else you should make an 0001 patch to fully
implement this view, with all relevant counters, not just
n_mod_since_analyze, last_*analyze, and *analyze_count. These are
specifically misleading:

 last_vacuum        |
 last_autovacuum    |
 n_ins_since_vacuum | 0
 vacuum_count       | 0
 autovacuum_count   | 0

--
Justin
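Both observations above (find_all_inheritors() includes the parent itself, and reltuples is -1 on pg14dev until a table is first vacuumed/analyzed) matter for the reltuples roll-up. A self-contained sketch of how the summation could guard against both; `ChildInfo` and `sum_partition_reltuples` are invented names, not the patch's actual code:

```c
#include <stdbool.h>

/* Toy stand-in for the pg_class fields the roll-up consults. */
typedef struct ChildInfo
{
    bool   has_storage;  /* false for partitioned/foreign children */
    double reltuples;    /* -1 means "not yet vacuumed/analyzed" (pg14) */
} ChildInfo;

/* Sum children's reltuples for a partitioned parent.  Skipping every
 * entry without storage also skips the parent itself, which
 * find_all_inheritors() returns as the first member; a child whose
 * reltuples is still -1 contributes nothing rather than subtracting. */
static double
sum_partition_reltuples(const ChildInfo *children, int nchildren)
{
    double total = 0;

    for (int i = 0; i < nchildren; i++)
    {
        if (!children[i].has_storage)
            continue;           /* partitioned or foreign child */
        if (children[i].reltuples < 0)
            continue;           /* unknown, not yet analyzed */
        total += children[i].reltuples;
    }
    return total;
}
```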
Horiguchi-san,

Thank you for your comments.

On Fri, Oct 23, 2020 at 8:23 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Thank you for the new version.
>
> At Fri, 23 Oct 2020 15:12:51 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in
> > Hello,
> >
> > I reconsidered a way based on the v5 patch in line with
> > Horiguchi-san's comment.
> >
> > This approach is as follows:
> > - A partitioned table is checked for whether it needs analyze, like a
> >   plain table, in relation_needs_vacanalyze(). To do this, we should
> >   store the partitioned table's stats (changes_since_analyze).
> > - A partitioned table's changes_since_analyze is updated when a leaf
> >   partition is analyzed, by propagating the partition's
> >   changes_since_analyze. At the next scheduled analyze time, it is used
> >   in the above check. That is, the partitioned table is analyzed behind
> >   its leaf partitions.
> > - The propagation process differs between autoanalyze and plain ANALYZE.
> >   In autoanalyze, a leaf partition's changes_since_analyze is propagated
> >   to *all* ancestors. Whereas in a plain ANALYZE of an inheritance tree,
> >   it propagates only to ancestors not included in the tree, to avoid
> >   needless counting.
> >
> > I attach the latest patch to this email.
> > Could you check it again, please?
>
> +	/*
> +	 * Get its all ancestors to propagate changes_since_analyze count.
> +	 * However, when ANALYZE inheritance tree, we get ancestors of
> +	 * toprel_oid to avoid needless counting.
> +	 */
> +	if (!OidIsValid(toprel_oid))
> +		ancestors = get_partition_ancestors(RelationGetRelid(rel));
> +	else
> +		ancestors = get_partition_ancestors(toprel_oid);
>
> This comment doesn't explain what the code intends, only what the
> code does.
>
> The reason for the difference is that if we have a valid toprel_oid,
> we analyze all descendants of the relation this time, and if we
> propagated the number to the descendants of the top relation, the next
> analyze on those relations could happen sooner than expected.
>
I modified this comment according to your advice.

> -	msg.m_live_tuples = livetuples;
> -	msg.m_dead_tuples = deadtuples;
> +	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +		? 0 /* if this is a partitioned table, skip modifying */
> +		: livetuples;
> +	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> +		? 0 /* if this is a partitioned table, skip modifying */
> +		: deadtuples;
>
> Two successive branches on the same condition look odd. And we need an
> explanation of *why* we don't set the values for partitioned tables.
>
I moved this part to the previous block where livetuples and deadtuples
are set. Actually, I think the reason those counters are set to 0 when the
given relation is a partitioned table is that such a table doesn't have
any data. As for the changes_since_analyze counter, we should support it
exceptionally in order to check whether partitioned tables need auto
analyze. I added this description to the comment of this function.

> +	foreach(lc, ancestors)
> +	{
> +		Oid			parentOid = lfirst_oid(lc);
> +		Relation	parentrel;
> +
> +		parentrel = table_open(parentOid, AccessShareLock);
>
> I'm not sure, but not all of the ancestors can be a parent (in other
> words, a parent of a parent of mine is not a parent of mine). Isn't
> just "rel" sufficient?
>
I changed 'parentrel' to 'rel'.

> -	 * Report ANALYZE to the stats collector, too. However, if doing
> -	 * inherited stats we shouldn't report, because the stats collector only
> -	 * tracks per-table stats. Reset the changes_since_analyze counter only
> -	 * if we analyzed all columns; otherwise, there is still work for
> -	 * auto-analyze to do.
> +	 * Report ANALYZE to the stats collector, too. Reset the
> +	 * changes_since_analyze counter only if we analyzed all columns;
> +	 * otherwise, there is still work for auto-analyze to do.
> 	 */
> -	if (!inh)
> +	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
>
> This still rejects traditional-inheritance nonleaf relations. But if
> we completely remove the description about that in the comment above,
> we should support traditional inheritance parents here. I think we can
> do that as far as we need to propagate only per-tuple stats (that
> means not per-attribute stats) like changes_since_analyze.
>
Regarding manual ANALYZE (not auto ANALYZE): when analyzing a declaratively
partitioned table, all of its children are also analyzed at the same time.
However, in the case of traditional inheritance, we need to run that
command on each child table individually; they are not analyzed all
together by ANALYZE. So I tried to support auto analyze only for
declarative partitioning for now. I added to that comment that we only
support declarative partitioning.

> Whichever way we take, do we need a description of the behavior in the
> documentation?
>
I added a description about this patch to the documentation.

I attach the latest patch to this email.
Could you please check it again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Hi Justin,

Thank you for your comments.
I attached the latest patch (v11) to the previous email.

> +	 * Get its all ancestors to propagate changes_since_analyze count.
> +	 * However, when ANALYZE inheritance tree, we get ancestors of
> +	 * toprel_oid to avoid needless counting.
>
> => I don't understand that comment.
>
I fixed that comment.

> +	/* Find all members of inheritance set taking AccessShareLock */
> +	children = find_all_inheritors(relid, AccessShareLock, NULL);
>
> => Do you know that this returns the table itself? And in pg14dev, each
> partitioned table has reltuples = -1, not zero...
>
> +		/* Skip foreign partitions */
> +		if (childclass->relkind == RELKIND_FOREIGN_TABLE)
> +			continue;
>
> => Michael's suggestion is to use RELKIND_HAS_STORAGE to skip both foreign
> and partitioned tables.
>
I overlooked that. I revised it according to your comments.

> Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't
> think you need to copy it anyway - just call ReleaseSysCache().
>
Fixed it.

> Regarding the counters in pg_stat_all_tables: maybe some of these should
> be null rather than zero? Or else you should make an 0001 patch to fully
> implement this view, with all relevant counters, not just
> n_mod_since_analyze, last_*analyze, and *analyze_count. These are
> specifically misleading:
>
>  last_vacuum        |
>  last_autovacuum    |
>  n_ins_since_vacuum | 0
>  vacuum_count       | 0
>  autovacuum_count   | 0
>
I haven't modified this part yet, but did you mean that we should set the
vacuum-related counters to null because partitioned tables are not
vacuumed?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
At Thu, 5 Nov 2020 16:03:12 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in
> Hi Justin,
>
> Thank you for your comments.
> I attached the latest patch (v11) to the previous email.
>
> > +	 * Get its all ancestors to propagate changes_since_analyze count.
> > +	 * However, when ANALYZE inheritance tree, we get ancestors of
> > +	 * toprel_oid to avoid needless counting.
> >
> > => I don't understand that comment.
> >
> I fixed that comment.

+	 * Get its all ancestors to propagate changes_since_analyze count.
+	 * However, when we have a valid toprel_oid, that is ANALYZE inheritance
+	 * tree, if we propagate the number to all ancestors, the next analyze
+	 * on partitioned tables in the tree could happen sooner than expected.
+	 * So we get ancestors of toprel_oid which are not analyzed this time.

On second thought about the reason for the "toprel_oid": it is perhaps to
avoid "wrongly" propagated values to ancestors after a manual ANALYZE on a
partitioned table. But the same happens after an autoanalyze iteration if
some of the ancestors of a leaf relation are analyzed before the leaf
relation in that iteration. That can trigger an unnecessary analyze of some
of the ancestors. So we need to do a similar thing for autovacuum. However..

  [1(root):analyze]-[2:DON'T analyze]-[3:analyze]-[leaf]

In this case toprel_oid is invalid (since it's autoanalyze), but we should
avoid propagating the count to 1 and 3 if they are processed *before* the
leaf, while still propagating to 2. toprel_oid doesn't work in that case.

So, to propagate the count properly, we need to either analyze relations in
leaf-to-root order, or propagate the counter only to ancestors that haven't
been processed in the current iteration. It seems a bit too complex to sort
the analyze targets in that order. The latter would be relatively simple.
See the attached for how it looks.

Anyway, whichever way we take, it is not pgstat.c's responsibility to do
that, since the former needs to rely heavily on what analyze does, and the
latter needs to know what analyze is doing.

> > Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't
> > think you need to copy it anyway - just call ReleaseSysCache().
> >
> Fixed it.

Mmm. Unfortunately, that fix leaks a cache reference when
!RELKIND_HAS_STORAGE.

> > Regarding the counters in pg_stat_all_tables: maybe some of these should
> > be null rather than zero? Or else you should make an 0001 patch to fully
> > implement this view, with all relevant counters, not just
> > n_mod_since_analyze, last_*analyze, and *analyze_count. These are
> > specifically misleading:
> >
> >  last_vacuum        |
> >  last_autovacuum    |
> >  n_ins_since_vacuum | 0
> >  vacuum_count       | 0
> >  autovacuum_count   | 0
> >
> I haven't modified this part yet, but did you mean that we should set the
> vacuum-related counters to null because partitioned tables are not
> vacuumed?

Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure what
the best way is here. Showing null seems reasonable, but I'm not sure that
doesn't break anything.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..50b55f5d01 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -816,6 +816,18 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
     since the last <command>ANALYZE</command>.
    </para>
 
+   <para>
+    For declaratively partitioned tables, only analyze is supported.
+    The same <quote>analyze threshold</quote> defined above is used,
+    but the number of tuples is the sum of their children's
+    <structname>pg_class</structname>.<structfield>reltuples</structfield>.
+    Also, the total number of tuples inserted, updated, or deleted since the last
+    <command>ANALYZE</command> that is compared with the threshold is calculated by
+    adding up the children's numbers of tuples analyzed in the previous
+    <command>ANALYZE</command>.  This is because partitioned tables don't have any
+    data.  So analyze on partitioned tables is one lap behind their children.
+   </para>
+
   <para>
    Temporary tables cannot be accessed by autovacuum.  Therefore,
    appropriate vacuum and analyze operations should be performed via
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index bc59a2d77d..d94caa4b7e 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1323,8 +1323,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
    If a table parameter value is set and the equivalent
    <literal>toast.</literal> parameter is not, the TOAST table
    will use the table's parameter value.
-   Specifying these parameters for partitioned tables is not supported,
-   but you may specify them for individual leaf partitions.
   </para>
 
   <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 	{
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 	{
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 	{
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..f1982d0f77 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -591,7 +591,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -611,7 +611,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..8758a24955 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -30,6 +30,7 @@
 #include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "catalog/indexing.h"
+#include "catalog/partition.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_namespace.h"
@@ -38,6 +39,7 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
+#include "common/hashfn.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
@@ -107,6 +109,45 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+typedef struct analyze_oident
+{
+	Oid			oid;
+	char		status;
+} analyze_oident;
+
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX analyze_oids
+#define SH_ELEMENT_TYPE analyze_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+#define ANALYZED_OIDS_HASH_SIZE 128
+analyze_oids_hash *analyzed_reloids = NULL;
+
+void
+analyze_init_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = analyze_oids_create(CurrentMemoryContext,
+										   ANALYZED_OIDS_HASH_SIZE, NULL);
+}
+
+void
+analyze_destroy_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = NULL;
+}
 
 /*
  * analyze_rel() -- analyze one relation
@@ -312,6 +353,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	bool		found;
 
 	if (inh)
 		ereport(elevel,
@@ -644,16 +686,67 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too. However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats. Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too. Regarding inherited stats,
+	 * we report only in the case of declarative partitioning. Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List	   *ancestors;
+		ListCell   *lc;
+		Datum		oiddatum = ObjectIdGetDatum(RelationGetRelid(onerel));
+		Datum		countdatum;
+		int64		change_count;
+
+		/*
+		 * Read the current value of n_mod_since_analyze of this relation.
+		 * This might be a bit stale but we don't need such correctness here.
+		 */
+		countdatum =
+			DirectFunctionCall1(pg_stat_get_mod_since_analyze, oiddatum);
+		change_count = DatumGetInt64(countdatum);
+
+		/* collect all ancestors of this relation */
+		ancestors = get_partition_ancestors(RelationGetRelid(onerel));
+
+		/*
+		 * To let partitioned relations be analyzed, we need to update
+		 * changes_since_analyze also for partitioned relations, which don't
+		 * have storage. We move the count of leaf relations to ancestors
+		 * before resetting. We could instead bump up the counter of all
+		 * ancestors every time leaf relations are updated but that is too
+		 * complex.
+		 */
+		foreach(lc, ancestors)
+		{
+			Oid			toreloid = lfirst_oid(lc);
+
+			/*
+			 * Don't propagate the count to ancestors that have already been
+			 * analyzed in this analyze command or this iteration of
+			 * autoanalyze.
+			 */
+			if (analyze_oids_lookup(analyzed_reloids, toreloid) == NULL)
+			{
+				Relation	rel;
+
+				rel = table_open(toreloid, AccessShareLock);
+				pgstat_report_partchanges(rel, change_count);
+				table_close(rel, AccessShareLock);
+			}
+		}
+
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
+		list_free(ancestors);
+	}
+
+	/* Record this relation as "analyzed" */
+	analyze_oids_insert(analyzed_reloids, onerel->rd_id, &found);
+
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
 	{
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..9cc7c1bb4f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -437,6 +437,7 @@ vacuum(List *relations, VacuumParams *params,
 	VacuumSharedCostBalance = NULL;
 	VacuumActiveNWorkers = NULL;
 
+	analyze_init_status();
 
 	/*
 	 * Loop to process each selected relation.
 	 */
@@ -487,6 +488,7 @@ vacuum(List *relations, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		analyze_destroy_status();
 	}
 	PG_END_TRY();
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..bb568f68b5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2052,11 +2053,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is
+	 * that during it we want to use the main relation's pg_class.reloptions
+	 * entry if the TOAST table does not have any, and we cannot obtain it
+	 * unless we know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2079,7 +2080,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2742,6 +2744,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3091,7 +3094,42 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up
+			 * children's reltuples.
+			 */
+			List	   *children;
+			ListCell   *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid			childOID = lfirst_oid(lc);
+				HeapTuple	childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+
+				ReleaseSysCache(childtuple);
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..633c5743fb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -360,6 +360,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1556,6 +1557,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 *
 * Caller must provide new live- and dead-tuples estimates, as well as a
 * flag indicating whether to reset the changes_since_analyze counter.
+ * As an exception, only changes_since_analyze is supported for partitioned
+ * tables, though they don't have any data.  This counter will tell us
+ * whether partitioned tables need autoanalyze or not.
 * --------
 */
void
@@ -1576,22 +1580,29 @@ pgstat_report_analyze(
 	 * off these counts from what we send to the collector now, else they'll
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
-	 * committing.)
+	 * committing.)  However, for partitioned tables, we will not report both
+	 * livetuples and deadtuples because those tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1605,6 +1616,30 @@ pgstat_report_analyze(
 	pgstat_send(&msg, sizeof(msg));
 }
 
+/* --------
+ * pgstat_report_partchanges() -
+ *
+ * Called when a leaf partition is analyzed to tell the collector about
+ * its parent's changed_tuples.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* --------
  * pgstat_report_recovery_conflict() -
  *
@@ -1921,7 +1956,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
 	}
@@ -4833,6 +4869,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6701,6 +6741,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..c6dcf23898 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -280,6 +280,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
 
 /* in commands/analyze.c */
+extern void analyze_init_status(void);
+extern void analyze_destroy_status(void);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..dd7faf9861 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -60,6 +60,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -419,6 +420,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges	Sent by the autovacuum daemon
+ *							after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -640,6 +653,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -1387,7 +1401,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 097ff5d111..e19e510245 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,7 +1804,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2172,7 +2172,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
I looked at both Yuzuko Hosoya's patch and Kyotaro Horiguchi's, and I think we're doing things in a quite complicated manner that could perhaps be done more easily.

Hosoya's patch has pgstat_report_analyze call pgstat_get_tab_entry() for the table being vacuumed, then obtains the list of ancestors, and then sends for each ancestor a new message containing the partition's changes_since_analyze for that ancestor.  When the stats collector receives that message, it adds the number to the ancestor's m_changed_tuples.  Horiguchi's does a similar thing, only differently: it is do_analyze_rel that reads the count from the collector (this time by calling the SQL function pg_stat_get_mod_since_analyze) and then sends the number back to the collector for each ancestor.

I suggest that a better way to do this is to forget about the new "partchanges" message completely.  Instead, let's add an array of ancestors to the analyze message (borrowing from PgStat_MsgFuncstat).  Something like this:

#define PGSTAT_NUM_ANCESTORENTRIES  \
	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(bool) -  \
	  sizeof(bool) - sizeof(TimestampTz) - sizeof(PgStat_Counter) -  \
	  sizeof(PgStat_Counter) - sizeof(int)) / sizeof(Oid))

typedef struct PgStat_MsgAnalyze
{
	PgStat_MsgHdr m_hdr;
	Oid			m_databaseid;
	Oid			m_tableoid;
	bool		m_autovacuum;
	bool		m_resetcounter;
	TimestampTz m_analyzetime;
	PgStat_Counter m_live_tuples;
	PgStat_Counter m_dead_tuples;
	int			m_nancestors;
	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
} PgStat_MsgAnalyze;

For non-partitions, m_nancestors would be 0, so the message would be handled as today.  For partitions, the array carries the OIDs of all ancestors.  When the collector receives this message, it first looks up the pgstat entry for each ancestor in the array and adds the partition's current changes_since_analyze to the ancestor's changes_since_analyze.  Then it proceeds as it does today, including resetting the changes_since_analyze counter for the partition.
The key point here is that we don't need to read the number from the collector into the backend executing the analyze.  We just *send* the data about the ancestors, and the collector knows what to do with it.

One possible complaint is: what if there are more ancestors than fit in the message?  I propose that this problem can be ignored, since in order to hit it you'd need (1000-8-4-4-1-1-8-8-8-4)/4 = 238 ancestors (if my math is right).  I doubt we'll hit the need for 238 levels of partitioning before a stats collector rewrite occurs ...

(It is possible to remove that restriction by doing more complicated things, such as sending the list of ancestors in a new type of message that can be sent several times prior to the analyze message itself, but I don't think this is worth the trouble.)
On 2020-Nov-10, Kyotaro Horiguchi wrote:

> In second thought about the reason for the "toprel_oid".  It is perhaps
> to avoid "wrongly" propagated values to ancestors after a manual
> ANALYZE on a partitioned table.  But the same happens after an
> autoanalyze iteration if some of the ancestors of a leaf relation are
> analyzed before the leaf relation in an autoanalyze iteration.  That
> can trigger an unnecessary analyzing for some of the ancestors.

I'm not sure I understand this point.  I think we should only trigger
this on analyzes of *leaf* partitions, not intermediate partitioned
relations.  That way you would never get these unnecessary analyzes.
Am I missing something?

(So with my proposal in the previous email, we would send the list of
ancestor relations after analyzing a leaf partition.  Whenever we
analyze a non-leaf, the list of ancestors is sent as an empty list.)

> > > Regarding the counters in pg_stat_all_tables: maybe some of these should be
> > > null rather than zero?  Or else you should make an 0001 patch to fully
> > > implement this view, with all relevant counters, not just n_mod_since_analyze,
> > > last_*analyze, and *analyze_count.  These are specifically misleading:
> > >
> > >   last_vacuum         |
> > >   last_autovacuum     |
> > >   n_ins_since_vacuum  | 0
> > >   vacuum_count        | 0
> > >   autovacuum_count    | 0
> > >
> > I haven't modified this part yet, but you meant that we should set
> > null to the counters about vacuum because partitioned tables are not
> > vacuumed?
>
> Perhaps because partitioned tables *cannot* be vacuumed.  I'm not sure
> what is the best way here.  Showing null seems reasonable but I'm not
> sure that doesn't break anything.

I agree that showing NULLs for the vacuum columns is better.
Perhaps the most reasonable way to do this is to use -1 as an indicator that NULL ought to be returned from pg_stat_get_vacuum_count() et al, and add a boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum", that says to store -1 instead of 0 in pgstat_recv_tabstat.
Hello Alvaro,

Thank you for your comments.

>
> > In second thought about the reason for the "toprel_oid".  It is perhaps
> > to avoid "wrongly" propagated values to ancestors after a manual
> > ANALYZE on a partitioned table.  But the same happens after an
> > autoanalyze iteration if some of the ancestors of a leaf relation are
> > analyzed before the leaf relation in an autoanalyze iteration.  That
> > can trigger an unnecessary analyzing for some of the ancestors.
>
> I'm not sure I understand this point.  I think we should only trigger
> this on analyzes of *leaf* partitions, not intermediate partitioned
> relations.  That way you would never get these unnecessary analyzes.
> Am I missing something?
>
> (So with my proposal in the previous email, we would send the list of
> ancestor relations after analyzing a leaf partition.  Whenever we
> analyze a non-leaf, the list of ancestors is sent as an empty list.)
>

The problem Horiguchi-san mentioned is as follows:

create table p1 (i int) partition by range(i);
create table p1_1 partition of p1 for values from (0) to (500)
  partition by range(i);
create table p1_1_1 partition of p1_1 for values from (0) to (300);
insert into p1 select generate_series(0,299);

-- After auto analyze (first time)
postgres=# select relname, n_mod_since_analyze, last_autoanalyze
             from pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                 300 |
 p1_1    |                 300 |
 p1_1_1  |                   0 | 2020-12-02 22:24:18.753574+09
(3 rows)

-- Insert more rows
postgres=# insert into p1 select generate_series(0,199);
postgres=# select relname, n_mod_since_analyze, last_autoanalyze
             from pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                 300 |
 p1_1    |                 300 |
 p1_1_1  |                 200 | 2020-12-02 22:24:18.753574+09
(3 rows)

-- After auto analyze (second time)
postgres=# select relname, n_mod_since_analyze, last_autoanalyze
             from pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                   0 | 2020-12-02 22:25:18.857248+09
 p1_1    |                 200 | 2020-12-02 22:25:18.661932+09
 p1_1_1  |                   0 | 2020-12-02 22:25:18.792078+09
(3 rows)

After the 2nd auto analyze, all relations' n_mod_since_analyze should be
0, but p1_1's is not.  This is because p1_1 was analyzed before p1_1_1, so
p1_1 will be analyzed again in the 3rd auto analyze.  That is, propagating
changes_since_analyze to *all ancestors* may cause unnecessary analyzes
even if we do this only when analyzing leaf partitions.  So I think we
should track ancestors which are not analyzed in the current iteration, as
Horiguchi-san proposed.

Regarding your idea:

> typedef struct PgStat_MsgAnalyze
> {
> 	PgStat_MsgHdr m_hdr;
> 	Oid			m_databaseid;
> 	Oid			m_tableoid;
> 	bool		m_autovacuum;
> 	bool		m_resetcounter;
> 	TimestampTz m_analyzetime;
> 	PgStat_Counter m_live_tuples;
> 	PgStat_Counter m_dead_tuples;
> 	int			m_nancestors;
> 	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
> } PgStat_MsgAnalyze;

I'm not sure, but how about storing only the ancestors that aren't
analyzed in the current iteration in m_ancestors[PGSTAT_NUM_ANCESTORENTRIES]?

> > > > Regarding the counters in pg_stat_all_tables: maybe some of these should be
> > > > null rather than zero?  Or else you should make an 0001 patch to fully
> > > > implement this view, with all relevant counters, not just n_mod_since_analyze,
> > > > last_*analyze, and *analyze_count.  These are specifically misleading:
> > > >
> > > >   last_vacuum         |
> > > >   last_autovacuum     |
> > > >   n_ins_since_vacuum  | 0
> > > >   vacuum_count        | 0
> > > >   autovacuum_count    | 0
> > > >
> > > I haven't modified this part yet, but you meant that we should set
> > > null to the counters about vacuum because partitioned tables are not
> > > vacuumed?
> >
> > Perhaps because partitioned tables *cannot* be vacuumed.  I'm not sure
> > what is the best way here.  Showing null seems reasonable but I'm not
> > sure that doesn't break anything.
>
> I agree that showing NULLs for the vacuum columns is better.  Perhaps
> the most reasonable way to do this is to use -1 as an indicator that NULL
> ought to be returned from pg_stat_get_vacuum_count() et al, and add a
> boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum",
> that says to store -1 instead of 0 in pgstat_recv_tabstat.

Thank you for the advice.  I'll fix it based on this idea.

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

> The problem Horiguchi-san mentioned is as follows:
> [explanation]

Hmm, I see.  So the problem is that if some ancestor is analyzed first,
then analyze of one of its partitions will cause a redundant analyze of
the ancestor, because the number of tuples that is propagated from the
partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor.  Then the order is not guaranteed (and each
process will get no effect from remembering whether it did that one or
not).  Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.
Hello Alvaro,

On Thu, Dec 3, 2020 at 10:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hello Yuzuko,
>
> On 2020-Dec-02, yuzuko wrote:
>
> > The problem Horiguchi-san mentioned is as follows:
> > [explanation]
>
> Hmm, I see.  So the problem is that if some ancestor is analyzed first,
> then analyze of one of its partitions will cause a redundant analyze of
> the ancestor, because the number of tuples that is propagated from the
> partition represents a set that had already been included in the
> ancestor's analysis.
>
> If the problem was just that, then I think it would be very simple to
> solve: just make sure to sort the tables to vacuum so that all leaves
> are vacuumed first, and then all ancestors, sorted from the bottom up.
> Problem solved.
>

Indeed.  When I discussed this with Horiguchi-san before, he mentioned
the same approach:

> So, to propagate the count properly, we need to analyze relations
> leaf-to-root order, or propagate the counter only to ancestors that
> haven't been processed in the current iteration.  It seems a bit too
> complex to sort analyze relations in that order.

but we didn't select it because of its complexity, as you also said.

> But I'm not sure that that's the whole story, for two reasons: one, two
> workers can run simultaneously, where one analyzes the partition and the
> other analyzes the ancestor.  Then the order is not guaranteed (and each
> process will get no effect from remembering whether it did that one or
> not).  Second, manual analyzes can occur in any order.
>
> Maybe it's more useful to think about this in terms of remembering that
> partition P had changed_tuples set to N when we analyzed ancestor A.
> Then, when we analyze partition P, we send the message listing A as
> ancestor; on receipt of that message, we see M+N changed tuples in P,
> but we know that we had already seen N, so we only record M.
>
> I'm not sure how to implement this idea however, since on analyze of
> ancestor A we don't have the list of partitions, so we can't know the N
> for each partition.
>

I thought about it for a while, but I couldn't come up with a way to
implement it.  Also, I think the other way Horiguchi-san suggested in [1]
would be simpler to solve the problem we are facing.  I attach a new
patch based on his patch.  What do you think?

[1] https://www.postgresql.org/message-id/20201110.203557.1420746510378864931.horikyota.ntt%40gmail.com

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Attachment
Hi,

I took a look at this patch.  It does not apply because of 5f8727f5a67,
so a rebase is needed.  But I want to talk about the general approach
here, so that does not matter.

The thread is fairly long, both in terms of the number of messages and
of time (it started in 2019), so let me restate my understanding of the
problem and what the patch aims to do.

The problem is that autovacuum never analyzes non-leaf relations in
partition hierarchies, because they never get modified and so the value
of changes_since_analyze remains 0.  This applies both to partitioning
based on inheritance and to the new fancy declarative partitioning.  The
consequence is that we never have accurate statistics (MCVs, histograms
and so on) for the parent, which may lead to poor query plans in cases
when we don't use the child statistics for some reason.

The patch aims to fix that by propagating changes_since_analyze to the
parent relations, so that autovacuum properly considers whether those
non-leaf relations need analyze.

I think the goal is right, and propagating changes_since_analyze is the
right solution, but as coded it has a couple of issues that may cause
trouble in practice.

Firstly, the patch propagates the changes_since_analyze values from
do_analyze_rel, i.e. from the worker after it analyzes the relation.
That may easily lead to cases with unnecessary analyzes - consider a
partitioned table with 4 child relations:

  p1 [reltuples=1M, changes_since_analyze=400k]
  p2 [reltuples=1M, changes_since_analyze=90k]
  p3 [reltuples=1M, changes_since_analyze=90k]
  p4 [reltuples=1M, changes_since_analyze=90k]

With the default analyze threshold (10%) this means autoanalyze of p1,
and then (in the next round) analyze of the whole partitioned table,
because 400k is 10% of 4M.
So far so good - we're now in this state:

  p1 [reltuples=1M, changes_since_analyze=0]
  p2 [reltuples=1M, changes_since_analyze=90k]
  p3 [reltuples=1M, changes_since_analyze=90k]
  p4 [reltuples=1M, changes_since_analyze=90k]

Let's do ~310k more modifications to p2:

  p1 [reltuples=1M, changes_since_analyze=0]
  p2 [reltuples=1M, changes_since_analyze=400k]
  p3 [reltuples=1M, changes_since_analyze=90k]
  p4 [reltuples=1M, changes_since_analyze=90k]

Now p2 gets analyzed, and the value gets propagated to p, triggering the
analyze.  But that's bogus - we've already seen 90k of those rows in the
last analyze, the "actual" changes_since_analyze is just 310k, and that
should not have triggered the analyze.

I could invent much more extreme examples with more partitions, and/or
with multiple autovacuum workers processing the leaf rels concurrently.

This seems like a quite serious issue, because analyzes on partitioned
tables sample all the partitions, which seems rather expensive.  That is
not an issue introduced by this patch, of course, but it's good to keep
that in mind and not make matters worse.

Note: I do have some ideas about how to improve that; I've started a
separate thread about it [1].

IMHO the primary issue is that the patch is trying to report the counts
from the workers, and it's done incrementally, after the fact.  It tries
to prevent the issue by not propagating the counts to parents analyzed
in the same round, but that doesn't seem sufficient:

- There's no guarantee how long ago the parent was analyzed.  Maybe it
  was 1 second ago, or maybe it was 24h ago and there have been many new
  modifications since then?

- The hash table is per worker, so who knows what the other autovacuum
  workers did?

So not really a good solution, I'm afraid.  I propose a different
approach - instead of propagating the counts in do_analyze_rel for
individual leaf tables, let's do that in bulk in
relation_needs_vacanalyze.
Before the (existing) first pass over pg_class, we can add a new one,
propagating counts from leaf tables to parents.  I'd imagine something
like this:

    while ((tuple = heap_getnext(relScan, ... != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        ... find all ancestors for classForm ...

        pgstat_propagate_changes(classForm, ancestors);
    }

The pgstat_propagate_changes() simply informs the pgstat collector that
classForm has certain ancestors, and it propagates the value to all of
them.

There's a problem, though - we can't reset the value for the leaf table,
because we need to check if it needs analyze, but we also don't want to
send it again next time.  But we can add another counter, tracking the
part of changes_since_analyze we already propagated, and propagate only
the difference.  That is, we now have this:

    PgStat_Counter changes_since_analyze;
    PgStat_Counter changes_since_analyze_reported;

So for example we start with

    changes_since_analyze = 10000;
    changes_since_analyze_reported = 0;

and we propagate 10k to the parents:

    changes_since_analyze = 10000;
    changes_since_analyze_reported = 10000;

but we don't analyze anything, and we accumulate 5k more changes:

    changes_since_analyze = 15000;
    changes_since_analyze_reported = 10000;

so now we propagate only the 5k delta.  And so on.

It's not exactly an atomic change (we still do this per relation), but
it's "bulk" in the sense that we force the propagation and don't wait
until after the leaf happens to be analyzed.  It might need to reread
the stats file, I think, to get the incremented values, but that seems
acceptable.

We may need to "sync" the counts for individual relations in a couple of
places (e.g. after the worker is done with the leaf, it should propagate
the remaining delta before resetting the values to 0).  Maybe
multi-level partitioning needs some additional handling, not sure.
regards

[1] https://commitfest.postgresql.org/33/3052/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/30/21 4:09 AM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> We may need to "sync" the counts for individual relations in a couple
> places (e.g. after the worker is done with the leaf, it should propagate
> the remaining delta before resetting the values to 0). Maybe multi-level
> partitioning needs some additional handling, not sure.
>

I forgot to mention one additional thing yesterday - I wonder if we need
to do something similar after a partition is attached/detached.  That
can also change the parent's statistics significantly, so maybe we
should handle all the partition's rows as changes_since_analyze?  Not
necessarily something this patch has to handle, but it might be related.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Tomas,

Thank you for reviewing the patch.

> Firstly, the patch propagates the changes_since_analyze values from
> do_analyze_rel, i.e. from the worker after it analyzes the relation.
> That may easily lead to cases with unnecessary analyzes - consider a
> partitioned table with 4 child relations:
> [ explanation ]

I didn't realize that till now.  Indeed, this approach increments the
parent's changes_since_analyze counter according to its leaf partition's
counter when the leaf partition is analyzed, so it will cause unnecessary
ANALYZE on partitioned tables as you described.

> I propose a different approach - instead of propagating the counts in
> do_analyze_rel for individual leaf tables, let's do that in bulk in
> relation_needs_vacanalyze.  Before the (existing) first pass over
> pg_class, we can add a new one, propagating counts from leaf tables to
> parents.

Thank you for your suggestion.  I think it could solve all the issues
you mentioned.  I modified the patch based on this approach:

- Create a new counter, PgStat_Counter changes_since_analyze_reported,
  to track the changes_since_analyze we have already propagated to
  ancestors.  This is used for internal processing and users may not
  need to know it, so this counter is not displayed in the
  pg_stat_all_tables view for now.

- Create a new function, pgstat_propagate_changes(), which propagates
  the changes_since_analyze counter to all ancestors and saves
  changes_since_analyze_reported.  This function is called in
  do_autovacuum() before relation_needs_vacanalyze().

> Note: I do have some ideas about how to improve that, I've started a
> separate thread about it [1].

I'm also interested in merging children's statistics for partitioned
tables because it will make ANALYZE on inheritance trees more efficient,
so I'll check it later.

> I forgot to mention one additional thing yesterday - I wonder if we need
> to do something similar after a partition is attached/detached.
> That can also change the parent's statistics significantly, so maybe we
> should handle all the partition's rows as changes_since_analyze?  Not
> necessarily something this patch has to handle, but it might be related.

Regarding attached/detached partitions, I think we should update the
statistics of partitioned tables according to the new inheritance tree.
The latest patch doesn't handle this case yet, but I'll give it a try
soon.

I attach the v13 patch to this email.  Could you please check it again?

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Attachment
Thanks for the quick rework. I like this design much better and I think this is pretty close to committable. Here's a rebased copy with some small cleanups (most notably, avoid calling pgstat_propagate_changes when the partition doesn't have a tabstat entry; also, free the lists that are allocated in a couple of places). I didn't actually verify that it works. -- Álvaro Herrera Valdivia, Chile "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Attachment
On 4/3/21 9:42 PM, Alvaro Herrera wrote: > Thanks for the quick rework. I like this design much better and I think > this is pretty close to committable. Here's a rebased copy with some > small cleanups (most notably, avoid calling pgstat_propagate_changes > when the partition doesn't have a tabstat entry; also, free the lists > that are allocated in a couple of places). > > I didn't actually verify that it works. > Yeah, this approach seems much simpler, I think. That being said, I think there's a couple of issues: 1) I still don't understand why inheritance and declarative partitioning are treated differently. Seems unnecessary and surprising, but maybe there's a good reason? 2) pgstat_recv_tabstat Should it really reset changes_since_analyze_reported in both branches? AFAICS if the "found" branch does this tabentry->changes_since_analyze_reported = 0; that means we lose the counter any time tabstats are received, no? That'd be wrong, because we'd propagate the changes repeatedly. 3) pgstat_recv_analyze Shouldn't it propagate the counters before resetting them? I understand that for the just-analyzed relation we can't do better, but why not propagate the counters to the parents? (Not necessarily from this place in the stat collector, maybe the analyze process should do that.) 4) pgstat_recv_reportedchanges I think this needs to be more careful when updating the value - the stats collector might have received other messages modifying those counters (including e.g. PGSTAT_MTYPE_ANALYZE with a reset), so maybe we can get into a situation with (changes_since_analyze_reported > changes_since_analyze) if we just blindly increment the value. I'd bet that would lead to funny stuff. So maybe this should be careful to never exceed that value? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Apr-04, Tomas Vondra wrote: > 1) I still don't understand why inheritance and declarative partitioning > are treated differently. Seems unnecessary nad surprising, but maybe > there's a good reason? I suppose the rationale is that for inheritance we have always done it that way -- similar things have been done that way for inheritance historically, to avoid messing with long-standing behavior. We have done that in a bunch of places -- DDL behavior, FKs, etc. Maybe in this case it's not justified. It *will* change behavior, in the sense that we are going to capture stats that have never been captured before. That might or might not affect query plans for designs using regular inheritance. But it seems reasonable to think that those changes will be for the good; and if it does break plans for some people and they want to revert to the original behavior, they can just set autovacuum_enabled to off for the parent tables. So, I agree that we should enable this new feature for inheritance parents too. I can't comment on the other issues. I hope to give this a closer look tomorrow my time; with luck Hosoya-san will have commented by then. -- Álvaro Herrera 39°49'30"S 73°17'W "La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
On 4/4/21 10:05 PM, Alvaro Herrera wrote: > On 2021-Apr-04, Tomas Vondra wrote: > >> 1) I still don't understand why inheritance and declarative partitioning >> are treated differently. Seems unnecessary and surprising, but maybe >> there's a good reason? > > I suppose the rationale is that for inheritance we have always done it > that way -- similar things have been done that way for inheritance > historically, to avoid messing with long-standing behavior. We have > done that in a bunch of places -- DDL behavior, FKs, etc. Maybe in this > case it's not justified. It *will* change behavior, in the sense that > we are going to capture stats that have never been captured before. > That might or might not affect query plans for designs using regular > inheritance. But it seems reasonable to think that those changes will > be for the good; and if it does break plans for some people and they > want to revert to the original behavior, they can just set > autovacuum_enabled to off for the parent tables. > > So, I agree that we should enable this new feature for inheritance > parents too. > Not sure. AFAICS the missing stats on parents are an issue both for inheritance and partitioning. Maybe there is a reason to maintain the current behavior with inheritance, but I don't see it. With the other features, I think the reason for not implementing that for inheritance was that it'd be more complex, compared to declarative partitioning (which has stricter limitations on the partitions, etc.). But in this case I think there's no difference in complexity; the same code can handle both cases. In fact, one of the first posts in this thread links to this: https://www.postgresql.org/message-id/4823.1262132964%40sss.pgh.pa.us i.e. Tom actually proposed doing something like this back in 2009, so presumably he thought it was desirable back then. OTOH he argued against adding another per-table counter and proposed essentially what the patch did before, i.e. 
propagating the counter after analyze. But we know that may trigger analyze too often ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Apr-04, Tomas Vondra wrote: > In fact, one of the first posts in this threads links to this: > > https://www.postgresql.org/message-id/4823.1262132964%40sss.pgh.pa.us > > i.e. Tom actually proposed doing something like this back in 2009, so > presumably he though it's desirable back then. > > OTOH he argued against adding another per-table counter and proposed > essentially what the patch did before, i.e. propagating the counter > after analyze. But we know that may trigger analyze too often ... Yeah, I think that's a doomed approach. The reason to avoid another column is to avoid bloat, which is good but if we end up with an unworkable design then we know we have to backtrack on it. I was thinking that we could get away with having a separate pgstat struct for partitioned tables, to avoid enlarging the struct for all tables, but if we're going to also include legacy inheritance in the feature clearly that doesn't work. -- Álvaro Herrera Valdivia, Chile "After a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was amazing when I first started using it at 7.2, and I'm continually astounded by learning new features and techniques made available by the continuing work of the development team." Berend Tober, http://archives.postgresql.org/pgsql-hackers/2007-08/msg01009.php
On 4/4/21 9:08 PM, Tomas Vondra wrote: > On 4/3/21 9:42 PM, Alvaro Herrera wrote: >> Thanks for the quick rework. I like this design much better and I think >> this is pretty close to committable. Here's a rebased copy with some >> small cleanups (most notably, avoid calling pgstat_propagate_changes >> when the partition doesn't have a tabstat entry; also, free the lists >> that are allocated in a couple of places). >> >> I didn't actually verify that it works. >>> ... > > 3) pgstat_recv_analyze > > Shouldn't it propagate the counters before resetting them? I understand > that for the just-analyzed relation we can't do better, but why not > propagate the counters to the parents? (Not necessarily from this place in > the stat collector, maybe the analyze process should do that.) > FWIW the scenario I had in mind is something like this:

create table t (a int, b int) partition by hash (a);
create table p0 partition of t for values with (modulus 2, remainder 0);
create table p1 partition of t for values with (modulus 2, remainder 1);
insert into t select i, i from generate_series(1,1000000) s(i);

test=# select relname, n_mod_since_analyze from pg_stat_user_tables;
 relname | n_mod_since_analyze
---------+---------------------
 t       |                   0
 p0      |              499375
 p1      |              500625
(3 rows)

test=# analyze p0, p1;
ANALYZE
test=# select relname, n_mod_since_analyze from pg_stat_user_tables;
 relname | n_mod_since_analyze
---------+---------------------
 t       |                   0
 p0      |                   0
 p1      |                   0
(3 rows)

This may seem a bit silly - who would analyze the hash partitions directly? However, with other partitioning schemes (list, range) it's quite plausible that people load data directly into a partition. They can analyze the parent explicitly too, but with multi-level partitioning that probably requires analyzing all the ancestors. The other possible scenario is about rows inserted while p0/p1 are being processed by autoanalyze. 
That may actually take quite a bit of time, depending on the vacuum cost limit. So I still think we should propagate the delta after the analyze, before we reset the counters. I also realized relation_needs_vacanalyze is not really doing what I suggested - it propagates the counts, but does so in the existing loop which checks which relations need vacuum/analyze. That means we may skip the parent table in the *current* round, because it'll see the old (not yet updated) counts. It's likely to be processed in the next autovacuum round, but that may actually not happen. The trouble is the reltuples for the parent is calculated using *current* child reltuples values, but we're comparing it to the *old* value of changes_since_analyze. So e.g. if enough rows were inserted into the partitions, it may still be below the analyze threshold. What I proposed is adding a separate loop that *only* propagates the counts, then re-reading the current stats (perhaps only if we actually propagated anything), and then deciding which relations need analyze. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Apr-04, Tomas Vondra wrote: > 1) I still don't understand why inheritance and declarative partitioning > are treated differently. Seems unnecessary and surprising, but maybe > there's a good reason? I think there is a good reason to treat them the same: pgstat does not have a provision to keep stats both of the table with children, and the table without children. It can only have one of those. For partitioning that doesn't matter: since the table-without-children doesn't have anything on its own (no scans, no tuples, no nothing) then we can just use the entry to store the table-with-children data. But for the inheritance case, the parent can have its own tuples and counts its own scans and so on; so if we change things, we'll overwrite the stats. Maybe in the long-term we should allow pgstat to differentiate those cases, but that seems not in scope for this patch. I'm working on the code to fix the other issues. -- Álvaro Herrera 39°49'30"S 73°17'W
Hi, On 2021-04-06 16:56:49 -0400, Alvaro Herrera wrote: > I think there is a good reason to treat them the same: pgstat does not > have a provision to keep stats both of the table with children, and the > table without children. It can only have one of those. For > partitioning that doesn't matter: since the table-without-children > doesn't have anything on its own (no scans, no tuples, no nothing) then > we can just use the entry to store the table-with-children data. But > for the inheritance case, the parent can have its own tuples and counts > its own scans and so on; so if we change things, we'll overwrite the > stats. Maybe in the long-term we should allow pgstat to differentiate > those cases, but that seems not in scope for this patch. FWIW, I think it shouldn't be too hard to do that once the shared memory stats patch goes in (not 14, unfortunately). The hardest part will be to avoid exploding the number of interface functions, but I think we can figure out a way to deal with that. Greetings, Andres Freund
Hello, Thank you for reviewing. I'm working on fixing the patch according to the comments. I'll send it as soon as I can. > On 2021-04-06 16:56:49 -0400, Alvaro Herrera wrote: > > I think there is a good reason to treat them the same: pgstat does not > > have a provision to keep stats both of the table with children, and the > > table without children. It can only have one of those. For > > partitioning that doesn't matter: since the table-without-children > > doesn't have anything on its own (no scans, no tuples, no nothing) then > > we can just use the entry to store the table-with-children data. But > > for the inheritance case, the parent can have its own tuples and counts > > its own scans and so on; so if we change things, we'll overwrite the > > stats. Maybe in the long-term we should allow pgstat to differentiate > > those cases, but that seems not in scope for this patch. > > FWIW, I think it shouldn't be too hard to do that once the shared memory > stats patch goes in (not 14, unfortunately). The hardest part will be to > avoid exploding the number of interface functions, but I think we can > figure out a way to deal with that. > While thinking about traditional inheritance, I realized that we need additional handling to support it because, unlike declarative partitioning, parents may have some rows in the case of traditional inheritance, as Alvaro mentioned. So I think we should support only declarative partitioning in this patch for now, but what do you think? I'm not sure, but if we can solve this matter at low cost by using the shared memory stats patch, should we wait for that patch? -- Best regards, Yuzuko Hosoya NTT Open Source Software Center
On 2021-Apr-07, yuzuko wrote: > I'm working on fixing the patch according to the comments. > I'll send it as soon as I can. Thanks, I've been giving it a look too. > I've been thinking about traditional inheritance, I realized that we > need additional > handling to support them because unlike declarative partitioning, > parents may have > some rows in the case of traditional inheritance as Alvaro mentioned. > So I think we should support only declarative partitioning in this > patch for now, > but what do you think? Yeah, not fixable at present I think. > I'm not sure but if we can solve this matter at low cost by using the > shared memory stats patch, should we wait for the patch? Let's do that for 15. -- Álvaro Herrera 39°49'30"S 73°17'W "The problem with the future is that it keeps turning into the present" (Hobbes)
Hi, I fixed the patch according to the following comments, and attach the latest version. It is based on the v14 patch Alvaro attached before. On Mon, Apr 5, 2021 at 4:08 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 4/3/21 9:42 PM, Alvaro Herrera wrote: > > Thanks for the quick rework. I like this design much better and I think > > this is pretty close to committable. Here's a rebased copy with some > > small cleanups (most notably, avoid calling pgstat_propagate_changes > > when the partition doesn't have a tabstat entry; also, free the lists > > that are allocated in a couple of places). > > > > I didn't actually verify that it works. > > > > Yeah, this approach seems much simpler, I think. That being said, I > think there's a couple of issues: > > 1) I still don't understand why inheritance and declarative partitioning > are treated differently. Seems unnecessary and surprising, but maybe > there's a good reason? > As we discussed in this thread, this patch should handle only declarative partitioning for now. > > 2) pgstat_recv_tabstat > > Should it really reset changes_since_analyze_reported in both branches? > AFAICS if the "found" branch does this > > tabentry->changes_since_analyze_reported = 0; > > that means we lose the counter any time tabstats are received, no? > That'd be wrong, because we'd propagate the changes repeatedly. > I changed the changes_since_analyze_reported counter so that it is no longer reset there. > > 3) pgstat_recv_analyze > > Shouldn't it propagate the counters before resetting them? I understand > that for the just-analyzed relation we can't do better, but why not > propagate the counters to the parents? (Not necessarily from this place in > the stat collector, maybe the analyze process should do that.) > I realized that we should propagate the counters for manual ANALYZE too, thanks to the examples you offered in another email. I fixed that for manual ANALYZE. 
> > 4) pgstat_recv_reportedchanges > > I think this needs to be more careful when updating the value - the > stats collector might have received other messages modifying those > counters (including e.g. PGSTAT_MTYPE_ANALYZE with a reset), so maybe we > can get into a situation with > > (changes_since_analyze_reported > changes_since_analyze) > > if we just blindly increment the value. I'd bet that would lead to funny > stuff. So maybe this should be careful to never exceed this? > pgstat_propagate_changes() calls pgstat_report_reportedchanges() only if (changes_since_analyze_reported < changes_since_analyze). So I think we won't get into such a situation, i.e. (changes_since_analyze_reported > changes_since_analyze), but am I missing something? > I also realized relation_needs_vacanalyze is not really doing what I > suggested - it propagates the counts, but does so in the existing loop > which checks which relations need vacuum/analyze. > > That means we may skip the parent table in the *current* round, because > it'll see the old (not yet updated) counts. It's likely to be processed > in the next autovacuum round, but that may actually not happen. The > trouble is the reltuples for the parent is calculated using *current* > child reltuples values, but we're comparing it to the *old* value of > changes_since_analyze. So e.g. if enough rows were inserted into the > partitions, it may still be below the analyze threshold. > Indeed, the partitioned table was not analyzed at the same time as its leaf partitions, due to the delay in propagating the counters. Following your proposal, I added a separate loop to propagate the counters before collecting a list of relations to vacuum/analyze. -- Best regards, Yuzuko Hosoya NTT Open Source Software Center
Attachment
OK, I bit the bullet and re-did the logic in the way I had proposed earlier in the thread: do the propagation on the collector's side, by sending only the list of ancestors: the collector can read the tuple change count by itself, to add it to each ancestor. This seems less wasteful. Attached is v16 which does it that way and seems to work nicely under my testing. However, I just noticed there is a huge problem, which is that the new code in relation_needs_vacanalyze() is doing find_all_inheritors(), and we don't necessarily have a snapshot that lets us do that. While adding a snapshot acquisition at that spot is a very easy fix, I hesitate to fix it that way, because the whole idea there seems quite wasteful: we have to look up, open and lock every single partition, on every single autovacuum iteration through the database. That seems bad. I'm inclined to think that a better idea may be to store reltuples for the partitioned table in pg_class.reltuples, instead of having to add up the reltuples of each partition. I haven't checked if this is likely to break anything. (Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a partition, then we repeatedly propagate the counts to the parent table, so we would cause the parent to be analyzed more times than it should. Sounds like we should not send the ancestor list when a column list is given to manual analyze. I haven't verified this, however.) -- Álvaro Herrera Valdivia, Chile Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
Attachment
On 2021-Apr-07, Alvaro Herrera wrote: > OK, I bit the bullet and re-did the logic in the way I had proposed > earlier in the thread: do the propagation on the collector's side, by > sending only the list of ancestors: the collector can read the tuple > change count by itself, to add it to each ancestor. This seems less > wasteful. Attached is v16 which does it that way and seems to work > nicely under my testing. Pushed with this approach. Thanks for persisting with this. -- Álvaro Herrera Valdivia, Chile
On 2021-Apr-07, Alvaro Herrera wrote: > However, I just noticed there is a huge problem, which is that the new > code in relation_needs_vacanalyze() is doing find_all_inheritors(), and > we don't necessarily have a snapshot that lets us do that. While adding > a snapshot acquisition at that spot is a very easy fix, I hesitate to > fix it that way, because the whole idea there seems quite wasteful: we > have to look up, open and lock every single partition, on every single > autovacuum iteration through the database. That seems bad. I'm > inclined to think that a better idea may be to store reltuples for the > partitioned table in pg_class.reltuples, instead of having to add up the > reltuples of each partition. I haven't checked if this is likely to > break anything. I forgot to comment on this aspect. First, I was obviously mistaken about there not being an active snapshot. I mean, it's correct that there isn't. The issue is that it's really a bug to require that there is one; it just hasn't failed before because partially detached partitions aren't very common. So I patched that as a bug in a preliminary patch. Next, the idea of storing the number of tuples in pg_class.reltuples is a nice one, and I think we should consider it in the long run. However, while it can be done as a quick job (shown in the attached, which AFAICT works fine) there are side-effects -- for example, TRUNCATE doesn't clear the value, which is surely wrong. I suspect that if I try to handle it in this way, it would blow up in some corner case I forgot to consider. So, I decided not to go that way, at least for now. -- Álvaro Herrera Valdivia, Chile
Attachment
On 4/8/21 5:22 AM, Alvaro Herrera wrote: > OK, I bit the bullet and re-did the logic in the way I had proposed > earlier in the thread: do the propagation on the collector's side, by > sending only the list of ancestors: the collector can read the tuple > change count by itself, to add it to each ancestor. This seems less > wasteful. Attached is v16 which does it that way and seems to work > nicely under my testing. > > However, I just noticed there is a huge problem, which is that the new > code in relation_needs_vacanalyze() is doing find_all_inheritors(), and > we don't necessarily have a snapshot that lets us do that. While adding > a snapshot acquisition at that spot is a very easy fix, I hesitate to > fix it that way, because the whole idea there seems quite wasteful: we > have to look up, open and lock every single partition, on every single > autovacuum iteration through the database. That seems bad. I'm > inclined to think that a better idea may be to store reltuples for the > partitioned table in pg_class.reltuples, instead of having to add up the > reltuples of each partition. I haven't checked if this is likely to > break anything. > How would that value get updated, for the parent? > (Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a > partition, then we repeatedly propagate the counts to the parent table, > so we would cause the parent to be analyzed more times than it should. > Sounds like we should not send the ancestor list when a column list is > given to manual analyze. I haven't verified this, however.) > Are you sure? I haven't tried, but shouldn't this be prevented by only sending the delta between the current and last reported value? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Apr-08, Tomas Vondra wrote: > On 4/8/21 5:22 AM, Alvaro Herrera wrote: > > However, I just noticed there is a huge problem, which is that the new > > code in relation_needs_vacanalyze() is doing find_all_inheritors(), and > > we don't necessarily have a snapshot that lets us do that. While adding > > a snapshot acquisition at that spot is a very easy fix, I hesitate to > > fix it that way, because the whole idea there seems quite wasteful: we > > have to look up, open and lock every single partition, on every single > > autovacuum iteration through the database. That seems bad. I'm > > inclined to think that a better idea may be to store reltuples for the > > partitioned table in pg_class.reltuples, instead of having to add up the > > reltuples of each partition. I haven't checked if this is likely to > > break anything. > > How would that value get updated, for the parent? Same as for any other relation: ANALYZE would set it, after it's done scanning the table. We would need to make sure that nothing resets it to empty, though, and that it doesn't cause issues elsewhere. (The patch I sent contains the minimal change to make it work, but of course that's missing having other pieces of code maintain it.) > > (Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a > > partition, then we repeatedly propagate the counts to the parent table, > > so we would cause the parent to be analyzed more times than it should. > > Sounds like we should not send the ancestor list when a column list is > > given to manual analyze. I haven't verified this, however.) > > Are you sure? I haven't tried, but shouldn't this be prevented by only > sending the delta between the current and last reported value? I did try, and yes it behaves as you say. -- Álvaro Herrera Valdivia, Chile Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'. 
After collecting 500 such letters, he mused, a university somewhere in Arizona would probably grant him a degree. (Don Knuth)
On 4/8/21 5:27 PM, Alvaro Herrera wrote: > On 2021-Apr-08, Tomas Vondra wrote: > >> On 4/8/21 5:22 AM, Alvaro Herrera wrote: > >>> However, I just noticed there is a huge problem, which is that the new >>> code in relation_needs_vacanalyze() is doing find_all_inheritors(), and >>> we don't necessarily have a snapshot that lets us do that. While adding >>> a snapshot acquisition at that spot is a very easy fix, I hesitate to >>> fix it that way, because the whole idea there seems quite wasteful: we >>> have to look up, open and lock every single partition, on every single >>> autovacuum iteration through the database. That seems bad. I'm >>> inclined to think that a better idea may be to store reltuples for the >>> partitioned table in pg_class.reltuples, instead of having to add up the >>> reltuples of each partition. I haven't checked if this is likely to >>> break anything. >> >> How would that value get updated, for the parent? > > Same as for any other relation: ANALYZE would set it, after it's done > scanning the table. We would need to make sure that nothing resets it to > empty, though, and that it doesn't cause issues elsewhere. (The patch I > sent contains the minimal change to make it work, but of course that's > missing having other pieces of code maintain it.) > So ANALYZE would inspect the child relations, sum the reltuples and set it for the parent? IMO that'd be problematic because it'd mean we're comparing the current number of changes with a reltuples value which may be arbitrarily stale (if we haven't analyzed the parent for a while). That's essentially the issue I described when explaining why I think the code needs to propagate the changes, reread the stats and then evaluate which relations need vacuuming. It's similar to the issue of comparing old changes_since_analyze vs. current reltuples, which is why the code is rereading the stats before checking the thresholds. This time it's the opposite direction - the reltuples might be stale. 
FWIW I think the current refresh logic is not quite correct, because autovac_refresh_stats does some throttling (STATS_READ_DELAY). It probably needs a "force" parameter to ensure it actually reads the current stats in this one case. >>> (Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a >>> partition, then we repeatedly propagate the counts to the parent table, >>> so we would cause the parent to be analyzed more times than it should. >>> Sounds like we should not send the ancestor list when a column list is >>> given to manual analyze. I haven't verified this, however.) >> >> Are you sure? I haven't tried, but shouldn't this be prevented by only >> sending the delta between the current and last reported value? > > I did try, and yes it behaves as you say. > OK, good. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Apr-08, Tomas Vondra wrote: > On 4/8/21 5:27 PM, Alvaro Herrera wrote: > > > Same as for any other relation: ANALYZE would set it, after it's done > > scanning the table. We would need to make sure that nothing resets it to > > empty, though, and that it doesn't cause issues elsewhere. (The patch I > > sent contains the minimal change to make it work, but of course that's > > missing having other pieces of code maintain it.) > > So ANALYZE would inspect the child relations, sum the reltuples and set > it for the parent? IMO that'd be problematic because it'd mean we're > comparing the current number of changes with a reltuples value which may > be arbitrarily stale (if we haven't analyzed the parent for a while). What? Not at all. reltuples would be set by ANALYZE on one run, and then the value is available for the future autovacuum run. That's how it works for regular tables too, so I'm not sure what problem you have with that. The (possibly stale) reltuples value is multiplied by the scale factor, and added to the analyze_threshold value, and that's compared with the current changes_since_analyze to determine whether to analyze or not. > That's essentially the issue I described when explaining why I think the > code needs to propagate the changes, reread the stats and then evaluate > which relations need vacuuming. It's similar to the issue of comparing > old changes_since_analyze vs. current reltuples, which is why the code > is rereading the stats before checking the thresholds. This time it's > the opposite direction - the reltuples might be stale. Well, I don't think the issue is the same. reltuples is always stale, even for regular tables, because that's just how it works. changes_since_analyze is not stale for regular tables, and that's why it makes sense to propagate it from partitions to ancestors prior to checking the analyze condition. 
> FWIW I think the current refresh logic is not quite correct, because > autovac_refresh_stats does some throttling (STATS_READ_DELAY). It > probably needs a "force" parameter to ensure it actually reads the > current stats in this one case. Hmm ... good catch, but actually that throttling only applies to the launcher. do_autovacuum runs in the worker, so there's no throttling. -- Álvaro Herrera 39°49'30"S 73°17'W
Hi, On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote: > On 2021-Apr-07, Alvaro Herrera wrote: > > > OK, I bit the bullet and re-did the logic in the way I had proposed > > earlier in the thread: do the propagation on the collector's side, by > > sending only the list of ancestors: the collector can read the tuple > > change count by itself, to add it to each ancestor. This seems less > > wasteful. Attached is v16 which does it that way and seems to work > > nicely under my testing. > > Pushed with this approach. Thanks for persisting with this. I'm looking at this in the context of rebasing & polishing the shared memory stats patch. I have a few questions / concerns: 1) Somehow it seems like a violation to do stuff like get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but it feels a bit off. Would likely not be too hard to address, e.g. by just putting some of pgstat_report_anl_ancestors in partition.c instead. 2) Why does it make sense that autovacuum sends a stats message for every partition in the system that had any changes since the last autovacuum cycle? On a database with a good number of objects / a short naptime we'll often end up sending messages for the same set of tables from separate workers, because they don't yet see the concurrent tabentry->changes_since_analyze_reported. 3) What is the goal of the autovac_refresh_stats() after the loop doing pgstat_report_anl_ancestors()? I think it'll be common that the stats collector hasn't even processed the incoming messages by that point, not to speak of actually having written out a new stats file. If it took less than 10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(), backend_read_statsfile() will not wait for a new stats file to be written out, and we'll just re-read the same state we read previously. 
It's pretty expensive to re-read the stats file in some workloads, so I'm a bit concerned that we end up significantly increasing the amount of stats updates/reads, without actually gaining anything reliable?

4) In the shared mem stats patch I went to a fair bit of trouble to try to get rid of pgstat_vacuum_stat() (which scales extremely poorly to larger systems). For that to work, pending stats can only be "staged" while holding a lock on a relation that prevents the relation from being concurrently dropped (pending stats increment a refcount for the shared stats object, which ensures that we don't lose track of the fact that a stats object has been dropped, even when stats only get submitted later).

I'm not yet clear on how to make this work for pgstat_report_anl_ancestors() - but I probably can find a way. But it does feel a bit off to issue stats stuff for tables we're not sure still exist.

I'll go and read through the thread, but my first thought is that having a hashtable in do_autovacuum() that contains stats for partitioned tables would be a good bit more efficient than the current approach? We already have a hashtable for each toast table; compared to that, having a hashtable for each partitioned table doesn't seem like it'd be a problem?

With a small bit of extra work that could even avoid the need for the additional pass through pg_class.
Do the partitioned table data-gathering as part of the "collect main tables to vacuum" pass, and then do one of:

a) make the perform-analyze decision purely off the contents of that partitioned-table hash

b) fetch the RELOID syscache entry by oid and then decide based on that

c) handle partitioned tables as part of the "check TOAST tables" pass - it's not like we gain a meaningful amount of efficiency by using a ScanKey to filter for RELKIND_TOASTVALUE, given that there's no index, and that an index wouldn't commonly be useful given the percentage of toast tables in pg_class

Partitioning makes it a bigger issue that do_autovacuum() does multiple passes through pg_class (as it makes scenarios in which pg_class is large more common), so I don't think it's great that partitioning also increases the number of passes through pg_class.

Greetings,

Andres Freund
Hi,

On 2021-07-22 13:54:58 -0700, Andres Freund wrote:
> On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> > On 2021-Apr-07, Alvaro Herrera wrote:
> >
> > > OK, I bit the bullet and re-did the logic in the way I had proposed
> > > earlier in the thread: do the propagation on the collector's side, by
> > > sending only the list of ancestors: the collector can read the tuple
> > > change count by itself, to add it to each ancestor. This seems less
> > > wasteful. Attached is v16 which does it that way and seems to work
> > > nicely under my testing.
> >
> > Pushed with this approach. Thanks for persisting with this.
>
> I'm looking at this in the context of rebasing & polishing the shared
> memory stats patch.
>
> I have a few questions / concerns:

Another one, and I think this might warrant thinking about for v14:

Isn't this going to create a *lot* of redundant sampling? Especially if you have any sort of nested partition tree. In the most absurd case a partition with n parents will get sampled n times, solely due to changes to itself.

Look at the following example:

BEGIN;
DROP TABLE if exists p;
CREATE TABLE p (i int) partition by range(i);
CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM (   0) to (5000) partition by range(i);
CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM (   0) to (1000);
CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000);
CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000);
CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000);
CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000);
-- create some initial data
INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps;
COMMIT;

UPDATE p_0_4 SET i = i;

Whenever the update is executed, all partitions will be sampled at least twice (once for p and once for p_0), with p_0_4 sampled three times.
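To make the multiplier concrete, here is a small back-of-the-envelope model in plain Python (not PostgreSQL code; the tree shape and the "analyzing a table samples every leaf beneath it" rule are assumptions of this sketch): a change to p_0_4 queues an analyze of the leaf and of each of its two ancestors, and every ancestor analyze re-samples all leaves under it.

```python
# Toy model of the partition tree from the example above; the names and the
# "analyze samples every leaf beneath the target" rule are assumptions of
# this sketch, not PostgreSQL internals.
parents = {
    "p_0": "p",
    "p_0_0": "p_0", "p_0_1": "p_0", "p_0_2": "p_0",
    "p_0_3": "p_0", "p_0_4": "p_0",
}

def ancestors(rel):
    """Walk up the parent links, nearest ancestor first."""
    chain = []
    while rel in parents:
        rel = parents[rel]
        chain.append(rel)
    return chain

def leaves_under(rel):
    """All leaf partitions in the subtree rooted at rel (rel itself if leaf)."""
    kids = [c for c, p in parents.items() if p == rel]
    if not kids:
        return [rel]
    return [leaf for k in kids for leaf in leaves_under(k)]

# A change to p_0_4 triggers an analyze of the leaf itself plus every ancestor;
# count how often each leaf ends up being sampled as a consequence.
sample_count = {}
for target in ["p_0_4"] + ancestors("p_0_4"):   # p_0_4, p_0, p
    for leaf in leaves_under(target):
        sample_count[leaf] = sample_count.get(leaf, 0) + 1

print(sample_count)   # p_0_4 sampled 3 times, every other leaf twice
```

With a deeper tree the per-leaf count grows with the number of ancestors, which is the "sampled n times" effect described above.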
Of course, this is an extreme example, but it's not hard to imagine cases where v14 will cause the number of auto-analyzes to increase sufficiently to bog down autovacuum to a problematic degree.

Additionally, while a partitioned table is being analyzed, all of its child partitions are AccessShareLock'ed at once. If a partition hierarchy has more than one level, it actually is likely that multiple autovacuum workers will end up processing the ancestors separately. This seems like it might contribute to lock exhaustion issues with larger partition hierarchies?

Greetings,

Andres Freund
Hi, CCing RMT because I think we need to do something about this for v14. On 2021-07-27 19:23:42 -0700, Andres Freund wrote: > On 2021-07-22 13:54:58 -0700, Andres Freund wrote: > > On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote: > > > On 2021-Apr-07, Alvaro Herrera wrote: > > > > > > > OK, I bit the bullet and re-did the logic in the way I had proposed > > > > earlier in the thread: do the propagation on the collector's side, by > > > > sending only the list of ancestors: the collector can read the tuple > > > > change count by itself, to add it to each ancestor. This seems less > > > > wasteful. Attached is v16 which does it that way and seems to work > > > > nicely under my testing. > > > > > > Pushed with this approach. Thanks for persisting with this. > > > > I'm looking at this in the context of rebasing & polishing the shared > > memory stats patch. > > > > I have a few questions / concerns: > > Another one, and I think this might warrant thinking about for v14: > > Isn't this going to create a *lot* of redundant sampling? Especially if you > have any sort of nested partition tree. In the most absurd case a partition > with n parents will get sampled n times, solely due to changes to itself. 
> > Look at the following example: > > BEGIN; > DROP TABLE if exists p; > CREATE TABLE p (i int) partition by range(i); > CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM ( 0) to (5000) partition by range(i); > CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM ( 0) to (1000); > CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000); > CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000); > CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000); > CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000); > -- create some initial data > INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps; > COMMIT; > > UPDATE p_0_4 SET i = i; > > > Whenever the update is executed, all partitions will be sampled at least twice > (once for p and once for p_0), with p_0_4 sampled three times. > > Of course, this is an extreme example, but it's not hard to imagine cases > where v14 will cause the number of auto-analyzes increase sufficiently to bog > down autovacuum to a problematic degree. > > > Additionally, while analyzing all child partitions for a partitioned tables > are AccessShareLock'ed at once. If a partition hierarchy has more than one > level, it actually is likely that multiple autovacuum workers will end up > processing the ancestors separately. This seems like it might contribute to > lock exhaustion issues with larger partition hierarchies? I started to write a patch rejiggering autovacuum.c portion of this change. While testing it I hit the case of manual ANALYZEs leaving changes_since_analyze for partitioned tables in a bogus state - without a minimally invasive way to fix that. 
After a bit of confused staring I realized that the current code has a very similar problem. Using the same setup as above:

INSERT INTO p VALUES (0); /* repeat as many times as desired */
ANALYZE p_0_0;

At this point the system will have lost track of the changes to p_0_0, unless an autovacuum worker was launched between the INSERTs and the ANALYZE (which would cause pgstat_report_anl_ancestors() to report the change count upwards).

There appears to be code trying to address that, but I don't see how it ever does anything meaningful?

    /*
     * Now report ANALYZE to the stats collector.  For regular tables, we do
     * it only if not doing inherited stats.  For partitioned tables, we only
     * do it for inherited stats.  (We're never called for not-inherited stats
     * on partitioned tables anyway.)
     *
     * Reset the changes_since_analyze counter only if we analyzed all
     * columns; otherwise, there is still work for auto-analyze to do.
     */
    if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
                              (va_cols == NIL));

    /*
     * If this is a manual analyze of all columns of a permanent leaf
     * partition, and not doing inherited stats, also let the collector know
     * about the ancestor tables of this partition.  Autovacuum does the
     * equivalent of this at the start of its run, so there's no reason to do
     * it there.
     */
    if (!inh && !IsAutoVacuumWorkerProcess() &&
        (va_cols == NIL) &&
        onerel->rd_rel->relispartition &&
        onerel->rd_rel->relkind == RELKIND_RELATION &&
        onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
    {
        pgstat_report_anl_ancestors(RelationGetRelid(onerel));
    }

The pgstat_report_analyze() triggers pgstat_recv_analyze() to reset the counter that pgstat_recv_anl_ancestors() would use to report changes upwards:

    /*
     * If commanded, reset changes_since_analyze to zero.  This forgets any
     * changes that were committed while the ANALYZE was in progress, but we
     * have no good way to estimate how many of those there were.
     */
    if (msg->m_resetcounter)
    {
        tabentry->changes_since_analyze = 0;
        tabentry->changes_since_analyze_reported = 0;
    }

And if one instead inverts the order of pgstat_report_analyze() and pgstat_report_anl_ancestors(), one gets a slightly different problem: A manual ANALYZE of the partition root results in the partition root having a non-zero changes_since_analyze afterwards. expand_vacuum() causes child partitions to be added to the list of relations, which *first* causes the partition root to be analyzed, and *then* partitions. The partitions then report their changes_since_analyze upwards.

I don't think the code as is is fit for v14. It looks like it was rewritten with a new approach just before the freeze ([1]), and as far as I can tell the concerns I quoted above weren't even discussed in the whole thread. Alvaro, any comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql
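A toy Python model of the interaction described here (the counter semantics are paraphrased from the snippets above; the function names merely mimic pgstat_recv_analyze()/pgstat_recv_anl_ancestors(), this is not the real collector code):

```python
# Toy model (assumed semantics, not the real pgstat code) of why a manual
# ANALYZE of a leaf loses its changes_since_analyze: report_analyze() zeroes
# both counters before the ancestor report gets a chance to use them.
class TabEntry:
    def __init__(self):
        self.changes_since_analyze = 0
        self.changes_since_analyze_reported = 0

stats = {"p": TabEntry(), "p_0": TabEntry(), "p_0_0": TabEntry()}
ancestor_map = {"p_0_0": ["p_0", "p"]}

def report_analyze(rel):                 # mimics pgstat_recv_analyze()
    stats[rel].changes_since_analyze = 0
    stats[rel].changes_since_analyze_reported = 0

def report_anl_ancestors(rel):           # mimics pgstat_recv_anl_ancestors()
    e = stats[rel]
    delta = e.changes_since_analyze - e.changes_since_analyze_reported
    for anc in ancestor_map.get(rel, []):
        stats[anc].changes_since_analyze += delta
    e.changes_since_analyze_reported = e.changes_since_analyze

# 10 inserts into the leaf, then a manual "ANALYZE p_0_0" in the v14 order:
stats["p_0_0"].changes_since_analyze += 10
report_analyze("p_0_0")        # counter zeroed first ...
report_anl_ancestors("p_0_0")  # ... so there is nothing left to propagate
lost_order_parent_count = stats["p"].changes_since_analyze   # 0: changes lost

# With the calls inverted, the 10 changes do reach the ancestors:
stats["p_0_0"].changes_since_analyze += 10
report_anl_ancestors("p_0_0")
report_analyze("p_0_0")
print(lost_order_parent_count, stats["p"].changes_since_analyze)
```

The inverted order fixes the loss for leaf ANALYZE, but, as noted above, it creates the mirror-image problem when the partition root itself was just analyzed and the children then report upwards into the freshly reset root.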
On 7/29/21 9:03 PM, Andres Freund wrote: > Hi, > > CCing RMT because I think we need to do something about this for v14. Thanks. We are now aware of it. [...] > I don't think the code as is is fit for v14. It looks like it was rewritten > with a new approach just before the freeze ([1]), and as far as I can tell the > concerns I quoted above weren't even discussed in the whole thread. Alvaro, > any comments? > I discussed this briefly with Alvaro late last night. He's now aware of the issue, but I believe he's away for some days, and probably won't be able to respond until his return. Sorry I don't have more news, but I didn't want anyone thinking this was being ignored. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
At Thu, 29 Jul 2021 18:03:55 -0700, Andres Freund <andres@anarazel.de> wrote in
> And if one instead inverts the order of pgstat_report_analyze() and
> pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
> ANALYZE of the partition root results in the partition root having a non-zero
> changes_since_analyze afterwards. expand_vacuum() causes child partitions to be
> added to the list of relations, which *first* causes the partition root to be
> analyzed, and *then* partitions. The partitions then report their
> changes_since_analyze upwards.

For the last behavior, as Andres suggested, the scan order needs to be reversed (or to be in the same order as autovacuum). Since find_all_inheritors() scans breadth-first, just reversing its result works. Breadth-first ordering is currently not part of that function's interface contract; I suppose we can add such a contract?

Finally, I ended up with the attached:

- reverse the relation order within a tree
- reverse the order of pgstat_report_anl_ancestors and pgstat_report_analyze

Inheritance expansion is performed on a per-tree basis, so this works fine even if multiple relations are given to vacuum().

> I don't think the code as is is fit for v14. It looks like it was rewritten
> with a new approach just before the freeze ([1]), and as far as I can tell the
> concerns I quoted above weren't even discussed in the whole thread. Alvaro,
> any comments?
>
> Greetings,
>
> Andres Freund
>
> [1] https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql

FYI: this behaves as follows.
CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (200) PARTITION BY RANGE(a);
CREATE TABLE c11 PARTITION OF c1 FOR VALUES FROM (0) TO (100);
CREATE TABLE c12 PARTITION OF c1 FOR VALUES FROM (100) TO (200);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (200) TO (400) PARTITION BY RANGE(a);
CREATE TABLE c21 PARTITION OF c2 FOR VALUES FROM (200) TO (300);
CREATE TABLE c22 PARTITION OF c2 FOR VALUES FROM (300) TO (400);
INSERT INTO p (SELECT a FROM generate_series(0, 400 - 1) a, generate_series(0, 10) b);
INSERT INTO p (SELECT 200 FROM generate_series(0, 99));

SELECT relid, relname, n_mod_since_analyze FROM pg_stat_user_tables ORDER BY relid;

 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                   0
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                   0
 16441 | c21     |                 100
 16444 | c22     |                   0
 16447 | sa      |                   0
(8 rows)

After "ANALYZE c21;":

 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                 100
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                 100
 16441 | c21     |                   0
 16444 | c22     |                   0
 16447 | sa      |                   0

After "ANALYZE c2;":

 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                 100
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                   0
 16441 | c21     |                   0
 16444 | c22     |                   0
 16447 | sa      |                   0

After "ANALYZE p;": (all zero)

However, this gives a strange-looking side effect, which affected regression results:

=# VACUUM ANALYZE p(a, a);
ERROR:  column "a" of relation "c22" appears more than once

(Previously it complained about p.)

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

From 16f7602f1b7755f288c508f1e57e0eae3c305813 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 4 Aug 2021 13:40:59 +0900
Subject: [PATCH] Fix changes_since_analyze's motion on manual analyze on
 partitioned tables

The analyze-stats machinery assumed bottom-to-top-ordered relation
scans, but manual ANALYZE actually scans in the opposite order.  In
addition, the current code tries to propagate changes_since_analyze to
parents after stats reporting, which resets the number to propagate.
As a result, when doing a manual ANALYZE on a partition,
changes_since_analyze vanishes instead of being propagated to parents.
On the other hand, when doing that on a partitioned table, the leaf
relations end up having bogus stats values.

To fix this, reverse the order of relations when running a manual
ANALYZE and move stats reporting after stats propagation.
---
 src/backend/catalog/pg_inherits.c    |  3 ++-
 src/backend/commands/analyze.c       | 26 +++++++++++++-------------
 src/backend/commands/vacuum.c        | 25 +++++++++++++++----------
 src/test/regress/expected/vacuum.out | 18 +++++++++---------
 4 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index ae990d4877..3451578580 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -239,7 +239,8 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 /*
  * find_all_inheritors -
  *		Returns a list of relation OIDs including the given rel plus
- *		all relations that inherit from it, directly or indirectly.
+ *		all relations that inherit from it, directly or indirectly, in
+ *		breadth-first ordering.
  *		Optionally, it also returns the number of parents found for
  *		each such relation within the inheritance tree rooted at the
  *		given rel.
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0c9591415e..414da6630b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -682,19 +682,6 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 							in_outer_xact);
 	}
 
-	/*
-	 * Now report ANALYZE to the stats collector.  For regular tables, we do
-	 * it only if not doing inherited stats.  For partitioned tables, we only
-	 * do it for inherited stats.  (We're never called for not-inherited stats
-	 * on partitioned tables anyway.)
-	 *
-	 * Reset the changes_since_analyze counter only if we analyzed all
-	 * columns; otherwise, there is still work for auto-analyze to do.
-	 */
-	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
-
 	/*
 	 * If this is a manual analyze of all columns of a permanent leaf
 	 * partition, and not doing inherited stats, also let the collector know
@@ -711,6 +698,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
 	}
 
+	/*
+	 * Now report ANALYZE to the stats collector.  For regular tables, we do
+	 * it only if not doing inherited stats.  For partitioned tables, we only
+	 * do it for inherited stats.  (We're never called for not-inherited stats
+	 * on partitioned tables anyway.)
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+							  (va_cols == NIL));
+
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
 	 *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b44..0a378487fb 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -826,13 +826,15 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 		ReleaseSysCache(tuple);
 
 		/*
-		 * If it is, make relation list entries for its partitions.  Note that
-		 * the list returned by find_all_inheritors() includes the passed-in
-		 * OID, so we have to skip that.  There's no point in taking locks on
-		 * the individual partitions yet, and doing so would just add
-		 * unnecessary deadlock risk.  For this last reason we do not check
-		 * yet the ownership of the partitions, which get added to the list to
-		 * process.  Ownership will be checked later on anyway.
+		 * If it is, make relation list entries for its partitions in a
+		 * bottom-to-top manner.  Note that the list returned by
+		 * find_all_inheritors() is in top-to-bottom ordering and includes the
+		 * passed-in OID, so we have to reverse the order and skip the
+		 * passed-in OID.  There's no point in taking locks on the individual
+		 * partitions yet, and doing so would just add unnecessary deadlock
+		 * risk.  For this last reason we do not check yet the ownership of
+		 * the partitions, which get added to the list to process.  Ownership
+		 * will be checked later on anyway.
 		 */
 		if (include_parts)
 		{
@@ -852,9 +854,12 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 			 * later.
 			 */
 			oldcontext = MemoryContextSwitchTo(vac_context);
-			vacrels = lappend(vacrels, makeVacuumRelation(NULL,
-														  part_oid,
-														  vrel->va_cols));
+
+			/* Make the order reversed */
+			vacrels = lcons(makeVacuumRelation(NULL,
+											   part_oid,
+											   vrel->va_cols),
+							vacrels);
 			MemoryContextSwitchTo(oldcontext);
 		}
 	}
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 3e70e4c788..5b9726225e 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -196,9 +196,9 @@ VACUUM (FULL) vacparted;
 VACUUM (FREEZE) vacparted;
 -- check behavior with duplicate column mentions
 VACUUM ANALYZE vacparted(a,b,a);
-ERROR:  column "a" of relation "vacparted" appears more than once
+ERROR:  column "a" of relation "vacparted1" appears more than once
 ANALYZE vacparted(a,b,b);
-ERROR:  column "b" of relation "vacparted" appears more than once
+ERROR:  column "b" of relation "vacparted1" appears more than once
 -- partitioned table with index
 CREATE TABLE vacparted_i (a int primary key, b varchar(100))
   PARTITION BY HASH (a);
@@ -239,7 +239,7 @@ ANALYZE vacparted (b), vactst;
 ANALYZE vactst, does_not_exist, vacparted;
 ERROR:  relation "does_not_exist" does not exist
 ANALYZE vactst (i), vacparted (does_not_exist);
-ERROR:  column "does_not_exist" of relation "vacparted" does not exist
+ERROR:  column "does_not_exist" of relation "vacparted1" does not exist
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
 ANALYZE vactst, vactst;
@@ -319,24 +319,24 @@ WARNING:  skipping "pg_authid" --- only superuser can vacuum it
 -- independently.
 VACUUM vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can analyze it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
@@ -389,22 +389,22 @@ ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
 ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
 SET ROLE regress_vacuum;
 VACUUM vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
--
2.27.0
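The ordering trick the patch relies on can be sketched in plain Python: find_all_inheritors() returns a breadth-first, parents-first list, so simply reversing it (which is what the lcons() change in expand_vacuum_rel() effectively does) yields an order in which every child precedes its ancestors. The tree below is a toy mirror of the example tables, not the patch code.

```python
from collections import deque

# Hypothetical partition tree mirroring the p/c1/c2 example above.
children = {"p": ["c1", "c2"], "c1": ["c11", "c12"], "c2": ["c21", "c22"]}

def find_all_inheritors(root):
    """Breadth-first, parents before children, root included --
    the ordering the catalog function is documented (by the patch) to return."""
    order, queue = [], deque([root])
    while queue:
        rel = queue.popleft()
        order.append(rel)
        queue.extend(children.get(rel, []))
    return order

bfs = find_all_inheritors("p")
print(bfs)          # ['p', 'c1', 'c2', 'c11', 'c12', 'c21', 'c22']

# Reversing a breadth-first list guarantees every child precedes its parent,
# so leaves report changes_since_analyze before their ancestors are analyzed.
processing = list(reversed(bfs))
print(processing)   # ['c22', 'c21', 'c12', 'c11', 'c2', 'c1', 'p']
```

Note that this only works because the list is breadth-first; reversing a depth-first preorder would not, in general, keep every child ahead of all of its ancestors' siblings in the same clean layer-by-layer fashion.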
Hi On 2021-Jul-27, Andres Freund wrote: > Isn't this going to create a *lot* of redundant sampling? Especially if you > have any sort of nested partition tree. In the most absurd case a partition > with n parents will get sampled n times, solely due to changes to itself. It seems to me that you're barking up the wrong tree on this point. This problem you describe is not something that was caused by this patch; ANALYZE has always worked like this. We have discussed the idea of avoiding redundant sampling, but it's clear that it is not a simple problem, and solving it was not in scope for this patch. > Additionally, while analyzing all child partitions for a partitioned tables > are AccessShareLock'ed at once. If a partition hierarchy has more than one > level, it actually is likely that multiple autovacuum workers will end up > processing the ancestors separately. This seems like it might contribute to > lock exhaustion issues with larger partition hierarchies? I agree this seems a legitimate problem. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Hi, On 2021-08-09 16:02:33 -0400, Alvaro Herrera wrote: > On 2021-Jul-27, Andres Freund wrote: > > > Isn't this going to create a *lot* of redundant sampling? Especially if you > > have any sort of nested partition tree. In the most absurd case a partition > > with n parents will get sampled n times, solely due to changes to itself. > > It seems to me that you're barking up the wrong tree on this point. > This problem you describe is not something that was caused by this > patch; ANALYZE has always worked like this. We have discussed the idea > of avoiding redundant sampling, but it's clear that it is not a simple > problem, and solving it was not in scope for this patch. I don't agree. There's a difference between this happening after a manual ANALYZE on partition roots, and this continuously happening in production workloads due to auto-analyzes... Greetings, Andres Freund
Hello,

On 2021-Jul-22, Andres Freund wrote:

> 1) Somehow it seems like a violation to do stuff like
> get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
> it feels a bit off. Would likely not be too hard to address, e.g. by just
> putting some of pgstat_report_anl_ancestors in partition.c instead.

I understand the complaint about this being a modularity violation -- the point being that pgstat.c has no business accessing system catalogs at all. Before this function, all pgstat_report_* functions were just assembling a message from counters accumulated somewhere and sending the bytes to the collector, and this new function is a deviation from that.

It seems that we could improve this by having a function (maybe in partition.c as you propose), something like

static void
report_partition_ancestors(Oid relid)
{
    List       *ancestors = get_partition_ancestors( ... );
    Oid        *array = palloc(sizeof(Oid) * list_length(ancestors));
    ListCell   *lc;
    int         i = 0;

    foreach(lc, ancestors)
    {
        array[i++] = lfirst_oid(lc);
    }
    pgstat_report_partition_ancestors(relid, array);
}

and then pgstat.c works with the given array without having to consult system catalogs.

> 2) Why does it make sense that autovacuum sends a stats message for every
> partition in the system that had any [changes] since the last autovacuum
> cycle? On a database with a good number of objects / a short naptime we'll
> often end up sending messages for the same set of tables from separate
> workers, because they don't yet see the concurrent
> tabentry->changes_since_analyze_reported.

The traffic could be large, yeah, and I agree it seems undesirable. If the collector kept a record of the list of ancestors of each table, then we wouldn't need to do this (we would have to know whether the collector knows a particular partition or not, though ... I have no ideas on that.)

> 3) What is the goal of the autovac_refresh_stats() after the loop doing
> pgstat_report_anl_ancestors()?
> I think it'll be common that the stats
> collector hasn't even processed the incoming messages by that point, not to
> speak of actually having written out a new stats file. If it took less than
> 10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
> backend_read_statsfile() will not wait for a new stats file to be written
> out, and we'll just re-read the state we previously did.
>
> It's pretty expensive to re-read the stats file in some workloads, so I'm a
> bit concerned that we end up significantly increasing the amount of stats
> updates/reads, without actually gaining anything reliable?

This is done once per autovacuum run, and the point is precisely to let the next block absorb the updates that were sent. In manual ANALYZE we do it to inform future autovacuum runs.

Note that the PGSTAT_RETRY_DELAY limit is used by the autovac launcher only, and this code is running in the worker; we do flush out the old data. Yes, it's expensive, but we're not doing it once per table, just once per worker run.

> 4) In the shared mem stats patch I went to a fair bit of trouble to try to get
> rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
> systems). For that to work pending stats can only be "staged" while holding
> a lock on a relation that prevents the relation from being concurrently
> dropped (pending stats increment a refcount for the shared stats object,
> which ensures that we don't lose track of the fact that a stats object has
> been dropped, even when stats only get submitted later).
>
> I'm not yet clear on how to make this work for
> pgstat_report_anl_ancestors() - but I probably can find a way. But it does
> feel a bit off to issue stats stuff for tables we're not sure still exist.

I assume you refer to locking the *partition*, right? You're not talking about locking the ancestor mentioned in the message.
I don't know how the shmem collector works, but it shouldn't be a problem that an ancestor goes away (ALTER TABLE parent DETACH; DROP TABLE parent); as long as you've kept a lock on the partition, it should be fine. Or am I misinterpreting what you mean?

> I'll go and read through the thread, but my first thought is that having a
> hashtable in do_autovacuum() that contains stats for partitioned tables would
> be a good bit more efficient than the current approach? We already have a
> hashtable for each toast table, compared to that having a hashtable for each
> partitioned table doesn't seem like it'd be a problem?
> With a small bit of extra work that could even avoid the need for the
> additional pass through pg_class. Do the partitioned table data-gathering as
> part of the "collect main tables to vacuum" pass, and then do one of

I'll have to re-read the thread to remember why I made it a separate pass. I think I did it that way because otherwise there was a requirement on the pg_class scan order. (Some earlier version of the patch did not have a separate pass and there was some problem or other. Maybe you're right that a hash table is sufficient.)

--
Álvaro Herrera       39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"We're here to devour each other alive"            (Hobbes)
On 2021-Aug-09, Andres Freund wrote:

> I don't agree. There's a difference between this happening after a manual
> ANALYZE on partition roots, and this continuously happening in production
> workloads due to auto-analyzes...

Hmm. That's not completely untrue.

I bring a radical proposal that may be sufficient to close this particular hole. What if changes to a partition only caused its top-level parent to become auto-analyzed, and not any intermediate ancestors? Any intermediate partitioned partitions could be analyzed manually if the user wished, and perhaps some reloption could enable autovacuum to do it (with the caveat that it'd cause multiple sampling of partitions). I don't yet have a clear picture of how to implement this, but I'll explore it while waiting for opinions on the idea.

--
Álvaro Herrera       Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)
On 2021-Aug-10, Alvaro Herrera wrote: > I bring a radical proposal that may be sufficient to close this > particular hole. What if we made partition only affected their > top-level parents to become auto-analyzed, and not any intermediate > ancestors? Any intermediate partitioned partitions could be analyzed > manually if the user wished, and perhaps some reloption could enable > autovacuum to do it (with the caveat that it'd cause multiple sampling > of partitions). I don't yet have a clear picture on how to implement > this, but I'll explore it while waiting for opinions on the idea. So, with this patch (a quick and dirty job) we no longer sample all partitions twice; we no longer propagate the tuple counts to p_0. We don't have stats on p_0 anymore, only on p and on the individual partitions. I didn't move the new #include to a more decent place because 1. that stuff is going to move to partition.c as a new function, including the new include; 2. that new function also needs to read the reloptions for p_0 to allow the user to enable stat acquisition for p_0 with "alter table p_0 set (autovacuum_enabled=1)"; 3. need to avoid reporting ancestors of a partition repeatedly, which forestalls the performance objection about reading reloptions too frequently. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Attachment
On 2021-Aug-09, Alvaro Herrera wrote: > > 3) What is the goal of the autovac_refresh_stats() after the loop doing > > pgstat_report_anl_ancestors()? I think it'll be common that the stats > > collector hasn't even processed the incoming messages by that point, not to > > speak of actually having written out a new stats file. If it took less than > > 10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(), > > backend_read_statsfile() will not wait for a new stats file to be written > > out, and we'll just re-read the state we previously did. > > > > It's pretty expensive to re-read the stats file in some workloads, so I'm a > > bit concerned that we end up significantly increasing the amount of stats > > updates/reads, without actually gaining anything reliable? > > This is done once per autovacuum run and the point is precisely to let > the next block absorb the updates that were sent. In manual ANALYZE we > do it to inform future autovacuum runs. > > Note that the PGSTAT_RETRY_DELAY limit is used by the autovac launcher > only, and this code is running in the worker; we do flush out the old > data. Yes, it's expensive, but we're not doing it once per table, just > once per worker run. I misunderstood what you were talking about here -- I thought it was about the delay in autovac_refresh_stats (STATS_READ_DELAY, 1s). Now that I look at this again I realize what your point is, and you're right, there isn't sufficient time for the collector to absorb the messages we sent before the next pg_class scan starts. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)
After thinking about the described issues for a while, my proposal is to completely revamp the way this feature works. See below. Now, the proposal seems awfully invasive, but it's *the* way I see to avoid the pgstat traffic. For pg14, maybe we can live with it, and just use the smaller patches that Horiguchi-san and I have posted, which solve the other issues; also, Euler Taveira suggested that we could add a reloption to turn the feature off completely for some tables (maybe make it off by default and have a reloption to turn it on for specific partition hierarchies), so that it doesn't cause undue pain for people with large partitioning hierarchies. * PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is the OID of a single partitioned ancestor, to which the changed-tuple counts are propagated up. Normally this is the topmost ancestor; but if the user wishes some intermediate ancestor to receive the counts they can use ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on). * Corollary 1: for the normal case of single-level partitioning, the parent partitioned table behaves as currently. * Corollary 2: for multi-level partitioning with no especially configured intermediate ancestors, only the leaf partitions and the top-level partitioned table will be analyzed. Intermediate ancestors are ignored by autovacuum. * Corollary 3: for multi-level partitioning with some intermediate ancestor(s) marked as autovacuum_enabled=on, that ancestor will receive all the counts from all of its partitions, so it will get analyzed itself; and it'll also forward those counts up to its report-ancestor. * On ALTER TABLE .. ATTACH PARTITION or CREATE TABLE PARTITION AS, we send a message to collector with the analyze-ancestor OID.
* Backends running manual ANALYZE as well as autovacuum will examine each table's "relispartition" flag and its pgstat table entry; if it is a partition and doesn't have reportAncestorOid set, determine which ancestor the analyze counts should be reported to; include this OID in the regular PgStat_MsgAnalyze. This fixes the situation after a crash or other stats reset. Also, it's not unduly expensive to do, because it's only in the rare case that the value sent by ATTACH was lost. * A possible race condition in the previous step may cause multiple backends to send the same info. Not a serious problem, so we don't try to handle it. * When tuple change counts for a partition are received by pgstat_recv_tabstat, they are propagated up to the indicated parent table in addition to being saved in the table itself. (Bonus points: when a table is attached or detached as a partition, the live tuples count is propagated to the newly acquired parent.) What do people think of this? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
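A tiny Python sketch of the ancestor-selection rule in the proposal above (hypothetical names; the real logic would be C in the stats machinery). Per the corollaries, counts go to the nearest ancestor explicitly marked autovacuum_enabled=on, and to the topmost ancestor otherwise; the forwarding an enabled intermediate does to its own report-ancestor is not shown.

```python
def choose_report_ancestor(ancestors, autovacuum_enabled):
    """ancestors: ancestor chain ordered from immediate parent up to the
    topmost partitioned table; autovacuum_enabled: {oid: bool} reflecting
    the per-table reloption. Returns the ancestor whose stats entry
    should receive this partition's changed-tuple counts."""
    if not ancestors:
        return None  # not a partition: nothing to report
    for a in ancestors:  # nearest explicitly-enabled intermediate wins
        if autovacuum_enabled.get(a, False):
            return a
    return ancestors[-1]  # default: topmost ancestor
```

So a leaf under `mid` under `top` reports to `top` by default, and to `mid` once `ALTER TABLE mid SET (autovacuum_enabled=on)` is applied.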
Hi, On 2021-08-11 18:33:07 -0400, Alvaro Herrera wrote: > After thinking about the described issues for a while, my proposal is to > completely revamp the way this feature works. See below. > > Now, the proposal seems awfully invasive, but it's *the* way I see to > avoid the pgstat traffic. For pg14, maybe we can live with it, and just > use the smaller patches that Horiguchi-san and I have posted, which > solve the other issues; also, Euler Taveira suggested that we could add > a reloption to turn the feature off completely for some tables (maybe > make it off by default and have a reloption to turn it on for specific > partition hierarchies), so that it doesn't cause unduly pain for people > with large partitioning hierarchies. I think we should revert the changes for 14 - to me the feature clearly isn't mature enough to be released. > * PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is > the OID of a single partitioned ancestor, to which the changed-tuple > counts are propagated up. > Normally this is the topmost ancestor; but if the user wishes some > intermediate ancestor to receive the counts they can use > ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on). > > * Corollary 1: for the normal case of single-level partitioning, the > parent partitioned table behaves as currently. > > * Corollary 2: for multi-level partitioning with no especially > configured intermediate ancestors, only the leaf partitions and the > top-level partitioned table will be analyzed. Intermediate ancestors > are ignored by autovacuum. > > * Corollary 3: for multi-level partitioning with some intermediate > ancestor(s) marked as autovacuum_enabled=on, that ancestor will > receive all the counts from all of its partitions, so it will get > analyzed itself; and it'll also forward those counts up to its > report-ancestor. This seems awfully confusing to me. 
One fundamental issue here is that we separately build stats for partitioned tables and partitions. Can we instead tackle this by reusing the stats for partitions for the inheritance stats? I think it's a bit easier to do that for partitioned tables than for old school inheritance roots, because there's no other rows in partitioned tables. Greetings, Andres Freund
Here is a proposal for 14. This patch has four main changes: * The mod counts are only propagated to the topmost parent, not to each ancestor. This means that we'll only analyze the topmost partitioned table and not each intermediate partitioned table; seems a good compromise to avoid sampling all partitions multiple times per round. * One pgstat message is sent containing many partition/parent pairs, not just one. This reduces the number of messages sent. 123 partitions fit in one message (messages are 1000 bytes). This is done once per autovacuum worker run, so it shouldn't be too bad. * There's a sleep between sending the message and re-reading stats. It would be great to have a mechanism by which the pgstat collector says "I've received and processed up to this point", but we don't have that; what we can do is sleep PGSTAT_STAT_INTERVAL and then reread the file, so we're certain that the file we read is at least as new as that time. This is far longer than it takes to process the messages. Note that if the messages do take longer than that to be processed by the collector, it's not a big loss anyway; those tables will be processed by the next autovacuum run. * I changed vacuum_expand_rel to put the main-rel OID at the end. (This is a variation of Horiguchi-san's proposed patch; instead of making the complete list be in the opposite order, it's just that one OID that appears at the other end). This has the same effect as his patch: any error reports thrown by vacuum/analyze mention the first partition rather than the main table. This part is in 0002 and I'm not totally convinced it's a sane idea. Minor changes: * I reduced autovacuum from three passes over pg_class to two passes, per your observation that we can acquire toast association together with processing partitions, and then use that in the second pass to collect everything. * I moved the catalog-accessing code to partition.c, so we don't need to have pgstat.c doing it.
Some doc changes are pending, and some more commentary in parts of the code, but I think this is much more sensible. I do lament the lack of a syscache for pg_inherits.
Attachment
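The batching arithmetic in the patch description above can be sketched like so (Python for illustration only; the names are hypothetical and the real code packs Oids into a fixed-size C pgstat message): with messages of roughly 1000 bytes, a fixed number of partition/parent pairs fits per message, so the reported pairs are simply chunked.

```python
def batch_ancestor_pairs(pairs, pairs_per_message=123):
    """Chunk (partition_oid, parent_oid) pairs into message-sized
    batches; 123 pairs per ~1000-byte message, per the description
    above. Returns a list of batches, each sent as one pgstat message."""
    return [pairs[i:i + pairs_per_message]
            for i in range(0, len(pairs), pairs_per_message)]
```

For 250 partitions this yields three messages (123 + 123 + 4 pairs) per autovacuum worker run, rather than 250 single-pair messages.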
On 2021-Aug-13, Álvaro Herrera wrote: > Some doc changes are pending, and some more commentary in parts of the > code, but I think this is much more sensible. I do lament the lack of > a syscache for pg_inherits. Thinking about this again, this one here is the killer problem, I think; this behaves pretty horribly if you have more than one partition level, because it'll have to do one indexscan *per level per partition*. (For example, five partitions two levels down mean ten index scans). There's no cache for this, and no way to disable it. So for situations with a lot of partitions, it could be troublesome. Granted, it only needs to be done for partitions with DML changes since the previous autovacuum worker run in the affected database, but still it could be significant. Now we could perhaps have a hash table in partition_analyze_report_ancestors() to avoid the need for repeated indexscans for partitions of the same hierarchy (an open-coded cache to take the place of the missing pg_inherits syscache); and perhaps even use a single seqscan of pg_inherits to capture the whole story first and then filter down to the partitions that we were asked to process ... (so are we building a mini-optimizer to determine which strategy to use in each case?). That all sounds too much to be doing in the beta. So I'm leaning towards the idea that we need to revert the patch and start over for pg15. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)
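A Python sketch of the open-coded cache idea floated above (illustrative names only, not actual PostgreSQL code): capture the child-to-parent mapping once (the analogue of a single seqscan of pg_inherits) and memoize the ancestor walk, so each hierarchy is traversed once rather than paying one index scan per level per partition.

```python
def make_ancestor_lookup(parent_of):
    """parent_of: {child_oid: parent_oid}, as built from one scan of
    pg_inherits. Returns a memoized function mapping a relation to its
    ancestor chain (immediate parent first, topmost last)."""
    cache = {}

    def ancestors(rel):
        if rel not in cache:
            chain = []
            p = parent_of.get(rel)
            while p is not None:
                chain.append(p)
                p = parent_of.get(p)
            cache[rel] = chain  # later partitions in the same hierarchy hit the cache
        return cache[rel]

    return ancestors
```

With five partitions two levels down, the ten index scans mentioned above collapse into one pass over the mapping plus dictionary lookups.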
Here's the reversal patch for the 14 branch. (It applies cleanly to master, but the unused member of PgStat_StatTabEntry needs to be removed and catversion bumped). -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)
Attachment
Álvaro Herrera <alvherre@alvh.no-ip.org> writes: > Here's the reversal patch for the 14 branch. (It applies cleanly to > master, but the unused member of PgStat_StatTabEntry needs to be > removed and catversion bumped). I don't follow the connection to catversion? I agree that we probably don't want to change PgStat_StatTabEntry in v14 at this point. But it'd be a good idea to attach a comment to the entry saying it's unused but left there for ABI reasons. regards, tom lane
On 2021-Aug-16, Tom Lane wrote: > =?utf-8?Q?=C3=81lvaro?= Herrera <alvherre@alvh.no-ip.org> writes: > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > master, but the unused member of PgStat_StatTabEntry needs to be > > removed and catversion bumped). > > I don't follow the connection to catversion? Sorry, I misspoke -- I mean PGSTAT_FILE_FORMAT_ID. I shouldn't just change it, since if I do then the file is reported as corrupted and all counters are lost. So in the posted patch I did as you suggest: > I agree that we probably don't want to change PgStat_StatTabEntry in > v14 at this point. But it'd be a good idea to attach a comment to > the entry saying it's unused but left there for ABI reasons. It's only in branch master that I'd change the pgstat format version and remove the field. This is what I meant with the patch being for v14 and a tweak needed for this in master. A catversion bump would be required to change the definition of pg_stat_user_tables, which the patch being reverted originally changed to include relkind 'p'. A straight revert would remove that, but in my reversal patch I chose to keep it in place. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Pensar que el espectro que vemos es ilusorio no lo despoja de espanto, sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)
Another possible problem is that before the revert, we accept ALTER TABLE some_partitioned_table SET (autovacuum_enabled=on/off); (also autovacuum_analyze_scale_factor and autovacuum_analyze_threshold) but after the revert this will throw a syntax error. What do people think we should do about that? 1. Do nothing. If somebody finds themselves in that situation, they can use ALTER TABLE .. RESET ... to remove the settings. 2. Silently accept the option and do nothing. 3. Accept the option and throw a warning that it's a no-op. 4. Something else Opinions? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Officer Krupke, what are we to do? Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
On 2021-Aug-16, Álvaro Herrera wrote: > Here's the reversal patch for the 14 branch. (It applies cleanly to > master, but the unused member of PgStat_StatTabEntry needs to be > removed and catversion bumped). I have pushed this to both branches. (I did not remove the item from the release notes in the 14 branch.) It upsets me to have reverted it, but after spending so much time trying to correct the problems, I believe it just wasn't salvageable within the beta-period code freeze constraints. I described the issues I ran into in earlier messages; I think a good starting point to re-develop this is to revert the reversal commit, then apply my patch at https://postgr.es/m/0794d7ca-5183-486b-9c5e-6d434867cecd@www.fastmail.com then do something about the remaining problems that were complained about. (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so that the collector knows to propagate counts from children to ancestors when the upd/ins/del counts are received. However, consider developing it as follow-up to Horiguchi-san's shmem pgstat rather than current pgstat implementation.) Thanks -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote: > On 2021-Aug-16, Álvaro Herrera wrote: > > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > master, but the unused member of PgStat_StatTabEntry needs to be > > removed and catversion bumped). > > I have pushed this to both branches. (I did not remove the item from > the release notes in the 14 branch.) | I retained the addition of relkind 'p' to tables included by | pg_stat_user_tables, because reverting that would require a catversion | bump. Right now, on v15dev, it shows 0, which is misleading. Shouldn't it be null ? analyze_count | 0 Note that having analyze_count and last_analyze would be an independently useful change. Since parent tables aren't analyzed automatically, I have a script to periodically process them if they weren't processed recently. Right now, for partitioned tables, the best I could find is to check its partitions: | MIN(last_analyzed) FROM pg_stat_all_tables psat JOIN pg_inherits i ON psat.relid=i.inhrelid In 20200418050815.GE26953@telsasoft.com I wrote: |This patch includes partitioned tables in pg_stat_*_tables, which is great; I |complained awhile ago that they were missing [0]. It might be useful if that |part was split out into a separate 0001 patch (?). | [0] https://www.postgresql.org/message-id/20180601221428.GU5164%40telsasoft.com -- Justin
Hi, On 2021-08-16 17:42:48 -0400, Álvaro Herrera wrote: > On 2021-Aug-16, Álvaro Herrera wrote: > > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > master, but the unused member of PgStat_StatTabEntry needs to be > > removed and catversion bumped). > > I have pushed this to both branches. (I did not remove the item from > the release notes in the 14 branch.) > > It upsets me to have reverted it, but after spending so much time trying > to correct the problems, I believe it just wasn't salvageable within the > beta-period code freeze constraints. :( > I described the issues I ran into > in earlier messages; I think a good starting point to re-develop this is > to revert the reversal commit, then apply my patch at > https://postgr.es/m/0794d7ca-5183-486b-9c5e-6d434867cecd@www.fastmail.com > then do something about the remaining problems that were complained > about. (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so > that the collector knows to propagate counts from children to ancestors > when the upd/ins/del counts are received. My suspicion is that it'd be a lot easier to implement this efficiently if there were no propagation done outside of actually analyzing tables. I.e. have do_autovacuum() build a hashtable of (parent_table_id, count) and use that to make the analyze decisions. And then only propagate up the counts to parents of tables when a child is analyzed (and thus loses its changes_since_analyze value). Then we can use hashtable_value + changes_since_analyze for the analyze decisions of partitioned tables. I've prototyped this, and it does seem to make do_autovacuum() cheaper. I've attached that prototype, but note it's in a rough state. However, unless we change the way inheritance parents are stored, it still requires repetitive get_partition_ancestors() (or get_partition_parent()) calls in do_autovacuum(), which I think is problematic due to the index scans you pointed out as well.
The obvious way to address that would be to store parent oids in pg_class - I suspect duplicating parents in pg_class is the best way out, but pretty it is not. > However, consider developing it as follow-up to Horiguchi-san's shmem > pgstat rather than current pgstat implementation.) +1 It might be worth first tackling the reuse of a relation's children's samples when building inheritance stats. Either by storing the samples somewhere (not cheap) and reusing them, or by at least updating a partition's stats when analyzing the parent. Greetings, Andres Freund
Attachment
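A rough Python rendering of the propagation rule in Andres's prototype above (the attached prototype itself is C; these names are illustrative): a child's changes_since_analyze counter is folded into its parent's count only at the moment the child is analyzed, which is exactly when the child's own counter resets.

```python
class ChangeCounters:
    """Toy model of per-table changed-tuple counters where propagation
    to the parent happens only when a child is analyzed."""

    def __init__(self, parent_of):
        self.parent_of = parent_of          # {child: parent}
        self.changes_since_analyze = {}     # {rel: count}

    def record(self, rel, n):
        c = self.changes_since_analyze
        c[rel] = c.get(rel, 0) + n

    def analyze(self, rel):
        # Analyzing a table resets its counter; the accumulated value
        # is handed up to the parent at exactly that moment, so no
        # per-modification traffic to ancestors is needed.
        n = self.changes_since_analyze.pop(rel, 0)
        parent = self.parent_of.get(rel)
        if parent is not None:
            self.record(parent, n)
        return n
```

The attraction of this design is that ancestors accrue counts only as often as their children are analyzed, not on every stats report.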
Hi, On 2021-08-16 13:13:55 -0400, Álvaro Herrera wrote: > Another possible problem is that before the revert, we accept > ALTER TABLE some_partitioned_table SET (autovacuum_enabled=on/off); > (also autovacuum_analyze_scale_factor and autovacuum_analyze_threshold) > but after the revert this is will throw a syntax error. What do people > think we should do about that? > > 1. Do nothing. If somebody finds in that situation, they can use > ALTER TABLE .. RESET ... > to remove the settings. > > 2. Silently accept the option and do nothing. > 3. Accept the option and throw a warning that it's a no-op. > 4. Something else 1) seems OK to me. Greetings, Andres Freund
On Mon, Aug 16, 2021 at 05:28:10PM -0500, Justin Pryzby wrote: > On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote: > > On 2021-Aug-16, Álvaro Herrera wrote: > > > > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > > master, but the unused member of PgStat_StatTabEntry needs to be > > > removed and catversion bumped). > > > > I have pushed this to both branches. (I did not remove the item from > > the release notes in the 14 branch.) > > | I retained the addition of relkind 'p' to tables included by > | pg_stat_user_tables, because reverting that would require a catversion > | bump. > > Right now, on v15dev, it shows 0, which is misleading. > Shouldn't it be null ? > > analyze_count | 0 > > Note that having analyze_count and last_analyze would be an an independently > useful change. Since parent tables aren't analyzed automatically, I have a > script to periodically process them if they weren't processed recently. Right > now, for partitioned tables, the best I could find is to check its partitions: > | MIN(last_analyzed) FROM pg_stat_all_tables psat JOIN pg_inherits i ON psat.relid=i.inhrelid > > In 20200418050815.GE26953@telsasoft.com I wrote: > |This patch includes partitioned tables in pg_stat_*_tables, which is great; I > |complained awhile ago that they were missing [0]. It might be useful if that > |part was split out into a separate 0001 patch (?). > | [0] https://www.postgresql.org/message-id/20180601221428.GU5164%40telsasoft.com I suggest the attached (which partially reverts the revert), to allow showing correct data for analyze_count and last_analyzed. Arguably these should be reported as null in v14 for partitioned tables, since they're not "known to be zero", but rather "currently unpopulated". n_mod_since_analyze | 0 n_ins_since_vacuum | 0 Justin
Attachment
On 2021-Aug-17, Justin Pryzby wrote: > I suggest the attached (which partially reverts the revert), to allow showing > correct data for analyze_count and last_analyzed. Yeah, that makes sense and my keeping of the pg_stat_all_tables entries seems pretty useless without this change. I have pushed a slightly modified version of this to 14 and master. > Arguably these should be reported as null in v14 for partitioned tables, since > they're not "known to be zero", but rather "currently unpopulated". > > n_mod_since_analyze | 0 > n_ins_since_vacuum | 0 I don't disagree, but it's not easy to implement this at present. I think almost all counters should be nulls for partitioned tables. For some of them one could make a case that it'd be more convenient to propagate numbers up from partitions. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/