Thread: ANALYZE ONLY
Hello Postgres Hackers An application that I am developing uses Postgresql, and includes a fairly large number of partitioned tables which are used to store time series data. The tables are partitioned by time, and typically there is only one partition at a time - the current partition - that is actually being updated. Older partitions are available for query and eventually dropped. As per the documentation, partitioned tables are not analyzed by the autovacuum workers, although their partitions are. Statistics are needed on the partitioned table level for at least some query planning activities. The problem is that giving an ANALYZE command targeting a partitioned table causes it to update statistics for the partitioned table AND all the individual partitions. There is currently no option to prevent it from including the partitions. This is wasteful for our application: for one thing the autovacuum has already analyzed the individual partitions; for another most of the partitions will have had no changes, so they don't need to be analyzed repeatedly. I took some measurements when running ANALYZE on one of our tables. It took approx 4 minutes to analyze the partitioned table, then 29 minutes to analyze the partitions. We have hundreds of these tables, so the cost is very significant. For my use case at least it would be fantastic if we could add an ONLY option to ANALYZE, which would cause it to analyze the named table only and not descend into the partitions. I took a look at the source and it looks doable, but before launching into it I thought I would ask a few questions here. 1. Would such a feature be welcomed? Are there any traps I might not have thought of? 2. The existing ANALYZE command has the following structure: ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ] It would be easiest to add ONLY as another option, but that doesn't look quite right to me - surely the ONLY should be attached to the table name? An alternative would be: ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ] Any feedback or advice would be great. Regards Mike.
On Tue, 20 Aug 2024 at 07:52, Michael Harris <harmic@gmail.com> wrote: > 1. Would such a feature be welcomed? Are there any traps I might not > have thought of? I think this sounds like a reasonable feature. > 2. The existing ANALYZE command has the following structure: > > ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ] > > It would be easiest to add ONLY as another option, but that > doesn't look quite > right to me - surely the ONLY should be attached to the table name? > An alternative would be: > > ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ] > > Any feedback or advice would be great. Personally I'd prefer a new option to be added. But I agree ONLY isn't a good name then. Maybe something like SKIP_PARTITIONS.
On 20.8.24 10:42, Jelte Fennema-Nio wrote:
On Tue, 20 Aug 2024 at 07:52, Michael Harris <harmic@gmail.com> wrote:1. Would such a feature be welcomed? Are there any traps I might not have thought of?I think this sounds like a reasonable feature.2. The existing ANALYZE command has the following structure: ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ] It would be easiest to add ONLY as another option, but that doesn't look quite right to me - surely the ONLY should be attached to the table name? An alternative would be: ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ] Any feedback or advice would be great.Personally I'd prefer a new option to be added. But I agree ONLY isn't a good name then. Maybe something like SKIP_PARTITIONS.
Hi everyone,
Your proposal is indeed interesting, but I have a question: can't your issue be resolved by properly configuring autovacuum
instead of developing a new feature for ANALYZE
?
From my perspective, ANALYZE
is intended to forcibly gather statistics from all partitioned tables. If the goal is to ensure that statistics are updated at the right moment, adjusting the autovacuum_analyze_threshold
and autovacuum_analyze_scale_factor
parameters might be the solution.
-- Regards, Ilia Evdokimov, Tantor Labs LCC.
On Tue, 20 Aug 2024 at 23:25, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote: > Your proposal is indeed interesting, but I have a question: can't your issue be resolved by properly configuring autovacuuminstead of developing a new feature for ANALYZE? Basically, no. There's a "tip" in [1] which provides information on the limitation, namely: "The autovacuum daemon does not issue ANALYZE commands for partitioned tables. Inheritance parents will only be analyzed if the parent itself is changed - changes to child tables do not trigger autoanalyze on the parent table. If your queries require statistics on parent tables for proper planning, it is necessary to periodically run a manual ANALYZE on those tables to keep the statistics up to date." There is also some discussion about removing the limitation in [2]. While I agree that it would be nice to have autovacuum handle this, it's not clear how exactly it would work. Additionally, if we had that, it would still be useful if the ANALYZE command could be instructed to just gather statistics for the partitioned table only. David [1] https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-STATISTICS [2] https://www.postgresql.org/message-id/flat/CAKkQ508_PwVgwJyBY%3D0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA%40mail.gmail.com
Hi Michael,
Thanks for starting this thread. I've also spent a bit time on this after reading your first thread on this issue [1]
Michael Harris <harmic@gmail.com>, 20 Ağu 2024 Sal, 08:52 tarihinde şunu yazdı:
The problem is that giving an ANALYZE command targeting a partitioned table
causes it to update statistics for the partitioned table AND all the individual
partitions. There is currently no option to prevent it from including the
partitions.
This is wasteful for our application: for one thing the autovacuum
has already analyzed the individual partitions; for another most of
the partitions
will have had no changes, so they don't need to be analyzed repeatedly.
I agree that it's a waste to analyze partitions when they're already analyzed by autovacuum. It would be nice to have a way to run analyze only on a partitioned table without its partitions.
I took some measurements when running ANALYZE on one of our tables. It
took approx
4 minutes to analyze the partitioned table, then 29 minutes to analyze the
partitions. We have hundreds of these tables, so the cost is very significant.
I quickly tweaked the code a bit to exclude partitions when a partitioned table is being analyzed. I can confirm that there is a significant gain even on a simple case like a partitioned table with 10 partitions and 1M rows in each partition.
1. Would such a feature be welcomed? Are there any traps I might not
have thought of?
2. The existing ANALYZE command has the following structure:
ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
It would be easiest to add ONLY as another option, but that
doesn't look quite
right to me - surely the ONLY should be attached to the table name
An alternative would be:
ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]
I feel closer to adding this as an option instead of a new keyword in ANALYZE grammar. To me, it would be easier to have this option and then give the names of partitioned tables as opposed to typing ONLY before each partition table.
But we should think of another name as ONLY is used differently (attached to the table name as you mentioned) in other contexts.
I've been also thinking about how this new option should affect inheritance tables. Should it have just no impact on them or only analyze the parent table without taking child tables into account? There are two records for an inheritance parent table in pg_statistic, one row for only the parent table and a second row including children. We might only analyze the parent table if this new "ONLY" option is specified. I'm not sure if that would be something users would need or not, but I think this option should behave similarly for both partitioned tables and inheritance tables.
If we decide to go with only partition tables and not care about inheritance, then naming this option to SKIP_PARTITIONS as Jelte suggested sounds fine. But that name wouldn't work if this option will affect inheritance tables.
Thanks,
Melih Mutlu
Microsoft
Melih Mutlu <m.melihmutlu@gmail.com>, 20 Ağu 2024 Sal, 19:26 tarihinde şunu yazdı:
Hi Michael,Thanks for starting this thread. I've also spent a bit time on this after reading your first thread on this issue [1]
Forgot to add the reference [1]
On Tue, Aug 20, 2024 at 1:52 AM Michael Harris <harmic@gmail.com> wrote: > 2. The existing ANALYZE command has the following structure: > > ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ] > > It would be easiest to add ONLY as another option, but that > doesn't look quite > right to me - surely the ONLY should be attached to the table name? > An alternative would be: > > ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ] > > Any feedback or advice would be great. I like trying to use ONLY somehow. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote: > I like trying to use ONLY somehow. Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table; or as a table modifier like gram.y's extended_relation_expr? Making it a command option means that the option would apply to all tables listed, whereas if it was more like an extended_relation_expr, the option would be applied per table listed in the command. 1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its partitions but get stats on ptab2 and stats on its partitions too. 2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2 without doing that on any of their partitions. Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the behaviour of #2. If we did it as a per-table option, then we'd need to consider what should happen if someone did: "VACUUM ONLY parttab;". Probably silently doing nothing wouldn't be good. Maybe a warning, akin to what's done in: postgres=# analyze (skip_locked) a; WARNING: skipping analyze of "a" --- lock not available David
David Rowley <dgrowleyml@gmail.com> writes: > On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote: >> I like trying to use ONLY somehow. > Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table; > or as a table modifier like gram.y's extended_relation_expr? > Making it a command option means that the option would apply to all > tables listed, whereas if it was more like an extended_relation_expr, > the option would be applied per table listed in the command. > 1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its > partitions but get stats on ptab2 and stats on its partitions too. > 2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2 > without doing that on any of their partitions. FWIW, I think that's the right approach, for consistency with the way that ONLY works in DML. regards, tom lane
On Tue, Aug 20, 2024 at 6:53 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote: > > I like trying to use ONLY somehow. > > Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table; > or as a table modifier like gram.y's extended_relation_expr? The table modifier idea seems nice to me. > If we did it as a per-table option, then we'd need to consider what > should happen if someone did: "VACUUM ONLY parttab;". Probably > silently doing nothing wouldn't be good. Maybe a warning, akin to > what's done in: > > postgres=# analyze (skip_locked) a; > WARNING: skipping analyze of "a" --- lock not available Perhaps. I'm not sure about this part. -- Robert Haas EDB: http://www.enterprisedb.com
David Rowley <dgrowleyml@gmail.com>, 21 Ağu 2024 Çar, 01:53 tarihinde şunu yazdı:
On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
> I like trying to use ONLY somehow.
Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table;
or as a table modifier like gram.y's extended_relation_expr?
Making it a command option means that the option would apply to all
tables listed, whereas if it was more like an extended_relation_expr,
the option would be applied per table listed in the command.
1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its
partitions but get stats on ptab2 and stats on its partitions too.
2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2
without doing that on any of their partitions.
I believe we should go this route if we want this to be called "ONLY" so that it would be consistent with other places too.
Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the
behaviour of #2.
Havin it as an option would be easier to use when we have several partition tables. But I agree that if we call it "ONLY ", it may be confusing and the other approach would be appropriate.
If we did it as a per-table option, then we'd need to consider what
should happen if someone did: "VACUUM ONLY parttab;". Probably
silently doing nothing wouldn't be good. Maybe a warning, akin to
what's done in:
postgres=# analyze (skip_locked) a;
WARNING: skipping analyze of "a" --- lock not available
+1 to raising a warning message in that case instead of staying silent. We might also not allow ONLY if ANALYZE is not present in VACUUM query and raise an error. But that would require changes in grams.y and could complicate things. So it may not be necessary and we may be fine with just a warning.
Regards,
--Melih Mutlu
Microsoft
Thank you all for the replies & discussion. It sounds like more are in favour of using the ONLY syntax attached to the tables is the best way to go, with the main advantages being: - consistency with other commands - flexibility in allowing to specify whether to include partitions for individual tables when supplying a list of tables I will start working on an implementation along those lines. It looks like we can simply replace qualified_name with relation_expr in the production for vacuum_relation within gram.y. One other thing I noticed when reading the code. The function expand_vacuum_rel in vacuum.c seems to be responsible for adding the partitions. If I am reading it correctly, it only adds child tables in the case of a partitioned table, not in the case of an inheritance parent: include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE); .. if (include_parts) { .. add partitions .. This is a little different to some other contexts where the ONLY keyword is used, in that ONLY would be the default and only available mode of operation for an inheritance parent. Regards, Mike On Wed, 21 Aug 2024 at 20:04, Melih Mutlu <m.melihmutlu@gmail.com> wrote: > > David Rowley <dgrowleyml@gmail.com>, 21 Ağu 2024 Çar, 01:53 tarihinde şunu yazdı: >> >> On Wed, 21 Aug 2024 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote: >> > I like trying to use ONLY somehow. >> >> Do you mean as an ANALYZE command option, i.e. ANALYZE (only) table; >> or as a table modifier like gram.y's extended_relation_expr? >> >> Making it a command option means that the option would apply to all >> tables listed, whereas if it was more like an extended_relation_expr, >> the option would be applied per table listed in the command. >> >> 1. ANALYZE ONLY ptab, ptab2; -- gather stats on ptab but not on its >> partitions but get stats on ptab2 and stats on its partitions too. >> 2. ANALYZE ONLY ptab, ONLY ptab2; -- gather stats on ptab and ptab2 >> without doing that on any of their partitions. > > > I believe we should go this route if we want this to be called "ONLY" so that it would be consistent with other placestoo. > >> Whereas: "ANALYZE (ONLY) ptab, ptab2;" would always give you the >> behaviour of #2. > > > Havin it as an option would be easier to use when we have several partition tables. But I agree that if we call it "ONLY", it may be confusing and the other approach would be appropriate. > >> >> If we did it as a per-table option, then we'd need to consider what >> should happen if someone did: "VACUUM ONLY parttab;". Probably >> silently doing nothing wouldn't be good. Maybe a warning, akin to >> what's done in: >> >> postgres=# analyze (skip_locked) a; >> WARNING: skipping analyze of "a" --- lock not available > > > +1 to raising a warning message in that case instead of staying silent. We might also not allow ONLY if ANALYZE is notpresent in VACUUM query and raise an error. But that would require changes in grams.y and could complicate things. Soit may not be necessary and we may be fine with just a warning. > > Regards, > -- > Melih Mutlu > Microsoft
On Thu, 22 Aug 2024 at 11:32, Michael Harris <harmic@gmail.com> wrote: > One other thing I noticed when reading the code. The function > expand_vacuum_rel in vacuum.c seems to be responsible for adding the > partitions. If I am reading it correctly, it only adds child tables in > the case of a partitioned table, not in the case of an inheritance > parent: > > include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE); > .. > if (include_parts) > { > .. add partitions .. > > This is a little different to some other contexts where the ONLY > keyword is used, in that ONLY would be the default and only available > mode of operation for an inheritance parent. That's inconvenient and quite long-established behaviour. I had a look as far back as 9.2 and we only analyze parents there too. I'm keen on the ONLY syntax, but it would be strange if ONLY did the same thing as not using ONLY for inheritance parents. I feel like we might need to either bite the bullet and make ONLY work consistently with both, or think of another way to have ANALYZE not recursively gather stats for each partition on partitioned tables. Could we possibly get away with changing inheritance parent behaviour? David
David Rowley <dgrowleyml@gmail.com> writes: > I feel like we might need to either bite the bullet and make ONLY work > consistently with both, or think of another way to have ANALYZE not > recursively gather stats for each partition on partitioned tables. > Could we possibly get away with changing inheritance parent behaviour? Yeah, my thought was to change the behavior so it's consistent in that case too. This doesn't seem to me like a change that'd really cause anybody serious problems: ANALYZE without ONLY would do more than before, but it's work you probably want done anyway. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, my thought was to change the behavior so it's consistent in > that case too. This doesn't seem to me like a change that'd > really cause anybody serious problems: ANALYZE without ONLY would > do more than before, but it's work you probably want done anyway. Would we want to apply that change to VACUUM too? That might be a bit drastic, especially if you had a multi-level inheritance structure featuring large tables. It feels a bit like VACUUM and ANALYZE have opposite natural defaults here. For VACUUM it does not make much sense to vacuum only at the partitioned table level and not include the partitions, since it would do nothing - that might be why the existing code always adds the partitions.
On Thu, 22 Aug 2024 at 13:38, Michael Harris <harmic@gmail.com> wrote: > Would we want to apply that change to VACUUM too? That might be a > bit drastic, especially if you had a multi-level inheritance structure featuring > large tables. I think they'd need to work the same way as for "VACUUM (ANALYZE)", it would be strange to analyze some tables that you didn't vacuum. It's just a much bigger pill to swallow in terms of the additional effort. > It feels a bit like VACUUM and ANALYZE have opposite natural defaults here. > For VACUUM it does not make much sense to vacuum only at the partitioned > table level and not include the partitions, since it would do nothing > - that might > be why the existing code always adds the partitions. Yeah, I suspect that's exactly why it was coded that way. David
Hi Michael,
Thanks for the patch.
I quickly tried running some ANALYZE ONLY queries, it seems like it works fine.
-ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
+ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ [ ONLY ] <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
<para>
For partitioned tables, <command>ANALYZE</command> gathers statistics by
sampling rows from all partitions; in addition, it will recurse into each
partition and update its statistics. Each leaf partition is analyzed only
once, even with multi-level partitioning. No statistics are collected for
only the parent table (without data from its partitions), because with
partitioning it's guaranteed to be empty.
</para>
We may also want to update the above note in ANALYZE doc.
+-- ANALYZE ONLY / VACUUM ONLY on partitioned table
+CREATE TABLE only_parted (a int, b char) PARTITION BY LIST (a);
+CREATE TABLE only_parted1 PARTITION OF vacparted FOR VALUES IN (1);
+INSERT INTO only_parted1 VALUES (1, 'a');
Tests don't seem right to me.
I believe the above should be " PARTITION OF only_parted " instead of vacparted.
It may better to insert into partitioned table (only_parted) instead of the partition (only_parted1);
Also it may be a good idea to test VACUUM ONLY for inheritance tables the same way you test ANALYZE ONLY.
Lastly, the patch includes an unrelated file (compile_flags.txt) and has whitespace errors when I apply it.
Regards,
-- Melih Mutlu
Microsoft
Hi Michael,
Michael Harris <harmic@gmail.com>, 23 Ağu 2024 Cum, 13:01 tarihinde şunu yazdı:
V2 of the patch is attached.
-ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
+ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ [ ONLY ] <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
I believe moving "[ ONLY ]" to the definitions of table_and_columns below, as you did with "[ * ]", might be better to be consistent with other places (see [1])
+ if ((options & VACOPT_VACUUM) && is_partitioned_table && ! include_children)
There are also some issues with coding conventions in some places (e.g. the space between "!" and "include_children" abode). I think running pgindent would resolve such issue in most places.
Regards,
-- Melih Mutlu
Microsoft
Hi Michael, On 2024-08-23 19:01, Michael Harris wrote: > V2 of the patch is attached. Thanks for the proposal and the patch. You said this patch was a first draft version, so these may be too minor comments, but I will share them: -- https://www.postgresql.org/docs/devel/progress-reporting.html > Note that when ANALYZE is run on a partitioned table, all of its > partitions are also recursively analyzed. Should we also note this is the default, i.e. not using ONLY option behavior here? -- https://www.postgresql.org/docs/devel/ddl-partitioning.html > If you are using manual VACUUM or ANALYZE commands, don't forget that > you need to run them on each child table individually. A command like: > > ANALYZE measurement; > will only process the root table. This part also should be modified, shouldn't it? When running ANALYZE VERBOSE ONLY on a partition table, the INFO message is like this: =# ANALYZE VERBOSE ONLY only_parted; INFO: analyzing "public.only_parted" inheritance tree I may be wrong, but 'inheritance tree' makes me feel it includes child tables. Removing 'inheritance tree' and just describing the table name as below might be better: INFO: analyzing "public.only_parted" -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Thu, Aug 29, 2024 at 7:31 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > > Hi Michael, > > Michael Harris <harmic@gmail.com>, 23 Ağu 2024 Cum, 13:01 tarihinde şunu yazdı: >> >> V2 of the patch is attached. > > > Thanks for updating the patch. I have a few more minor feedbacks. > >> -ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_and_columns</replaceable>[, ...] ] >> +ANALYZE [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ [ ONLY ] <replaceable class="parameter">table_and_columns</replaceable>[, ...] ] > > > I believe moving "[ ONLY ]" to the definitions of table_and_columns below, as you did with "[ * ]", might be better tobe consistent with other places (see [1]) > >> + if ((options & VACOPT_VACUUM) && is_partitioned_table && ! include_children) > > I think you are right. ANALYZE [ ( option [, ...] ) ] [ [ ONLY ] table_and_columns [, ...] ] seems not explain commands like: ANALYZE ONLY only_parted(a), ONLY only_parted(b);
Hi Michael, On Fri, 23 Aug 2024 at 22:01, Michael Harris <harmic@gmail.com> wrote: > V2 of the patch is attached. (I understand this is your first PostgreSQL patch) I just wanted to make sure you knew about the commitfest cycle we use for the development of PostgreSQL. If you've not got one already, can you register a community account. That'll allow you to include this patch in the September commitfest that starts on the 1st. See https://commitfest.postgresql.org/49/ I understand there's now some sort of cool-off period for new community accounts, so if you have trouble adding the patch before the CF starts, let me know and I should be able to do it for you. The commitfest normally closes for new patches around 12:00 UTC on the 1st of the month. There's a bit more information in https://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission David
Thanks David I had not read that wiki page well enough, so many thanks for alerting me that I had to create a commitfest entry. I already had a community account so I was able to login and create it here: https://commitfest.postgresql.org/49/5226/ I was not sure what to put for some of the fields - for 'reviewer' should I have put the people who have provided feedback on this thread? Is there anything else I need to do? Thanks again Cheers Mike On Fri, 30 Aug 2024 at 18:44, David Rowley <dgrowleyml@gmail.com> wrote: > > Hi Michael, > > On Fri, 23 Aug 2024 at 22:01, Michael Harris <harmic@gmail.com> wrote: > > V2 of the patch is attached. > > (I understand this is your first PostgreSQL patch) > > I just wanted to make sure you knew about the commitfest cycle we use > for the development of PostgreSQL. If you've not got one already, can > you register a community account. That'll allow you to include this > patch in the September commitfest that starts on the 1st. See > https://commitfest.postgresql.org/49/ > > I understand there's now some sort of cool-off period for new > community accounts, so if you have trouble adding the patch before the > CF starts, let me know and I should be able to do it for you. The > commitfest normally closes for new patches around 12:00 UTC on the 1st > of the month. > > There's a bit more information in > https://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission > > David
On Sun, 1 Sept 2024 at 13:41, Michael Harris <harmic@gmail.com> wrote: > https://commitfest.postgresql.org/49/5226/ > > I was not sure what to put for some of the fields - for 'reviewer' should > I have put the people who have provided feedback on this thread? I think it's fairly common that only reviewers either add themselves or don't bother. I don't think it's common that patch authors add reviewers. Some reviewers like their reviews to be more informal and putting themselves as a reviewer can put other people off reviewing as they think it's already handled by someone else. That may result in the patch being neglected by reviewers. > Is there anything else I need to do? I see you set the target version as 17. That should be blank or "18". PG17 is about to go into release candidate, so it is too late for that. It's been fairly consistent that we open for new patches in early July and close before the 2nd week of April the following year. We're currently 2 months into PG18 dev. We don't back-patch new features. Nothing else aside from continuing to address reviews, as you are already. David
On 2024-09-01 10:31, Michael Harris wrote: > Hello Atsushi & Melih > > Thank you both for your further feedback. > > On Thu, 29 Aug 2024 at 21:31, Melih Mutlu <m.melihmutlu@gmail.com> > wrote: >> I believe moving "[ ONLY ]" to the definitions of table_and_columns >> below, as you did with "[ * ]", might be better to be consistent with >> other places (see [1]) > > Agreed. I have updated this. > >>> + if ((options & VACOPT_VACUUM) && is_partitioned_table && ! >>> include_children) >> >> There are also some issues with coding conventions in some places >> (e.g. the space between "!" and "include_children" abode). I think >> running pgindent would resolve such issue in most places. > > I fixed that extra space, and then ran pgindent. It did not report any > more problems. > > On Fri, 30 Aug 2024 at 16:45, torikoshia <torikoshia@oss.nttdata.com> > wrote: >> -- https://www.postgresql.org/docs/devel/progress-reporting.html >> > Note that when ANALYZE is run on a partitioned table, all of its >> > partitions are also recursively analyzed. >> >> Should we also note this is the default, i.e. not using ONLY option >> behavior here? > >> -- https://www.postgresql.org/docs/devel/ddl-partitioning.html >> > If you are using manual VACUUM or ANALYZE commands, don't forget that >> > you need to run them on each child table individually. A command like: >> > >> > ANALYZE measurement; >> > will only process the root table. >> >> This part also should be modified, shouldn't it? > > Agreed. I have updated both of these. Thanks! >> When running ANALYZE VERBOSE ONLY on a partition table, the INFO >> message >> is like this: >> >> =# ANALYZE VERBOSE ONLY only_parted; >> INFO: analyzing "public.only_parted" inheritance tree >> >> I may be wrong, but 'inheritance tree' makes me feel it includes child >> tables. >> Removing 'inheritance tree' and just describing the table name as >> below >> might be better: >> >> INFO: analyzing "public.only_parted" > > I'm not sure about that one. If I understand the source correctly, > that particular progress > message tells the user that the system is gathering stats from the > inheritance > tree in order to update the stats of the given table, not that it is > actually updating > the stats of the descendant tables. That makes sense. > When analyzing an inheritance structure with the ONLY you see > something like this: > > => ANALYZE VERBOSE ONLY only_inh_parent; > INFO: analyzing "public.only_inh_parent" > INFO: "only_inh_parent": scanned 0 of 0 pages, containing 0 live rows > and 0 dead rows; 0 rows in sample, 0 estimated total rows > INFO: analyzing "public.only_inh_parent" inheritance tree > INFO: "only_inh_child": scanned 1 of 1 pages, containing 3 live rows > and 0 dead rows; 3 rows in sample, 3 estimated total rows > ANALYZE > > The reason you don't see the first one for partitioned tables is that > it corresponds > to sampling the contents of the parent table itself, which in the case > of a partitioned > table is guaranteed to be empty, so it is skipped. > > I agree the text could be confusing, and in fact is probably confusing > even today > without the ONLY keyword, Yeah, it seems this isn't dependent on your proposal. (BTW I'm also wondering if the expression “inheritance" is appropriate when the target is a partitioned table, but this should be discussed in another thread) -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On Sun, 1 Sept 2024 at 13:32, Michael Harris <harmic@gmail.com> wrote: > v3 of the patch is attached, and I will submit it to the commitfest. I think this patch is pretty good. I only have a few things I'd point out: 1. The change to ddl.sgml, technically you're removing the caveat, but I think the choice you've made to mention the updated behaviour is likely a good idea still. So I agree with this change, but just wanted to mention it as someone else might think otherwise. 2. I think there's some mixing of tense in the changes to analyze.sgml: "If <literal>ONLY</literal> was not specified, it will also recurse into each partition and update its statistics." You've written "was" (past tense), but then the existing text uses "will" (future tense). I guess if the point in time is after parse and before work has been done, then that's correct, but I think using "is" instead of "was" is better. 3. In vacuum.sgml; "If <literal>ONLY</literal> is not specified, the table and all its descendant tables or partitions (if any) are vacuumed" Maybe "are also vacuumed" instead of "are vacuumed" is more clear? It's certainly true for inheritance parents, but I guess you could argue that's wrong for partitioned tables. 4. A very minor detail, but I think in vacuum.c the WARNING you've added should use RelationGetRelationName(). We seem to be very inconsistent with using that macro and I see it's not used just above for the lock warning, which I imagine you copied. Aside from those, that just leaves me with the behavioural change. I noted Tom was ok with the change in behaviour for ANALYZE (mentioned in [1]). Tom, wondering if you feel the same for VACUUM too? If we're doing this, I think we'd need to be quite clear about it on the release notes. David [1] https://postgr.es/m/1919201.1724289136@sss.pgh.pa.us
On 2024-09-09 16:56, Michael Harris wrote: Thanks for updating the patch. Here is a minor comment. > @@ -944,20 +948,32 @@ expand_vacuum_rel(VacuumRelation *vrel, > MemoryContext vac_context, > MemoryContextSwitchTo(oldcontext); > } .. > + * Unless the user has specified ONLY, make relation list > entries for > + * its partitions and/or descendant tables. Regarding the addition of partition descendant tables, should we also update the below comment on expand_vacuum_rel? Currently it refers only partitions: | * Given a VacuumRelation, fill in the table OID if it wasn't specified, | * and optionally add VacuumRelations for partitions of the table. Other than this and the following, it looks good to me. On Mon, Sep 9, 2024 at 10:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > Aside from those, that just leaves me with the behavioural change. I > noted Tom was ok with the change in behaviour for ANALYZE (mentioned > in [1]). Tom, wondering if you feel the same for VACUUM too? If we're > doing this, I think we'd need to be quite clear about it on the > release notes. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024-09-11 16:47, Michael Harris wrote: > Thanks for the feedback. > > On Tue, 10 Sept 2024 at 22:03, torikoshia <torikoshia@oss.nttdata.com> > wrote: >> Regarding the addition of partition descendant tables, should we also >> update the below comment on expand_vacuum_rel? Currently it refers >> only >> partitions: >> >> | * Given a VacuumRelation, fill in the table OID if it wasn't >> specified, >> | * and optionally add VacuumRelations for partitions of the table. >> > > Well spotted! I have updated the comment to read: > > * Given a VacuumRelation, fill in the table OID if it wasn't > specified, > * and optionally add VacuumRelations for partitions or descendant > tables > * of the table. Thanks for the update! I've switched the status on the commitfest to 'ready for committer'. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
hi. in https://www.postgresql.org/docs/current/ddl-inherit.html <<< Commands that do database maintenance and tuning (e.g., REINDEX, VACUUM) typically only work on individual, physical tables and do not support recursing over inheritance hierarchies. The respective behavior of each individual command is documented in its reference page (SQL Commands). <<< does the above para need some tweaks? in section, your patch output is <<<<<<<<< and table_and_columns is: [ ONLY ] table_name [ * ] [ ( column_name [, ...] ) ] <<<<<<<<< ANALYZE ONLY only_parted*; will fail. Maybe we need a sentence saying ONLY and * are mutually exclusive? >>> If the table being analyzed has inheritance children, ANALYZE gathers two sets of statistics: one on the rows of the parent table only, and a second including rows of both the parent table and all of its children. This second set of statistics is needed when planning queries that process the inheritance tree as a whole. The child tables themselves are not individually analyzed in this case. >>> is this still true for table inheritance. for example: drop table if exists only_inh_parent,only_inh_child; CREATE TABLE only_inh_parent (a int , b INT) with (autovacuum_enabled = false); CREATE TABLE only_inh_child () INHERITS (only_inh_parent) with (autovacuum_enabled = false); INSERT INTO only_inh_child(a,b) select g % 80, (g + 1) % 200 from generate_series(1,1000) g; select pg_table_size('only_inh_parent'); ANALYZE ONLY only_inh_parent; here, will "ANALYZE ONLY only_inh_parent;" will have minimal "effects". since the physical table only_inh_parent has zero storage. but select stadistinct,starelid::regclass,staattnum, stainherit from pg_statistic where starelid = ANY ('{only_inh_parent, only_inh_child}'::regclass[]); output is stadistinct | starelid | staattnum | stainherit -------------+-----------------+-----------+------------ 80 | only_inh_parent | 1 | t -0.2 | only_inh_parent | 2 | t I am not sure if this output and the manual description about table inheritance is consistent.
On Sun, 22 Sept 2024 at 23:09, David Rowley <dgrowleyml@gmail.com> wrote: > I'd like to push this soon, so if anyone has any last-minute feedback, > please let me know in the next few days. Many thanks for your updates and assistance. Looks good. Agreed, I was probably too conservative in some of the doc updates. Thanks to all reviewers for helping to get it to this stage! Cheers Mike
On Sun, Sep 22, 2024 at 9:09 PM David Rowley <dgrowleyml@gmail.com> wrote: > > v7-0002 is all my changes. > > I'd like to push this soon, so if anyone has any last-minute feedback, > please let me know in the next few days. > drop table if exists only_inh_parent,only_inh_child; CREATE TABLE only_inh_parent (a int , b INT) with (autovacuum_enabled = false); CREATE TABLE only_inh_child () INHERITS (only_inh_parent) with (autovacuum_enabled = false); INSERT INTO only_inh_child(a,b) select g % 80, (g + 1) % 200 from generate_series(1,1000) g; select pg_table_size('only_inh_parent'); ANALYZE ONLY only_inh_parent; select stadistinct,starelid::regclass,staattnum, stainherit from pg_statistic where starelid = ANY ('{only_inh_parent, only_inh_child}'::regclass[]); stadistinct | starelid | staattnum | stainherit -------------+-----------------+-----------+------------ 80 | only_inh_parent | 1 | t -0.2 | only_inh_parent | 2 | t --------------- catalog-pg-statistic.html stainherit bool If true, the stats include values from child tables, not just the values in the specified relation Normally there is one entry, with stainherit = false, for each table column that has been analyzed. If the table has inheritance children or partitions, a second entry with stainherit = true is also created. This row represents the column's statistics over the inheritance tree, i.e., statistics for the data you'd see with SELECT column FROM table*, whereas the stainherit = false row represents the results of SELECT column FROM ONLY table. I do understand what Michael Harris said in [1] --------------- Given the above context, I am still confused with this sentence in sql-analyze.html. "If ONLY is specified before the table name, only that table is analyzed." like in the above sql example, only_inh_parent's child is also being analyzed. [1] https://www.postgresql.org/message-id/CADofcAW43AD%3Dqqtj1cLkTyRpPM6JB5ZALUK7CA1KZZqpcouoYw%40mail.gmail.com
On Mon, 23 Sept 2024 at 15:29, jian he <jian.universality@gmail.com> wrote: > Given the above context, I am still confused with this sentence in > sql-analyze.html. > "If ONLY is specified before the table name, only that table is analyzed." > > like in the above sql example, only_inh_parent's child is also being analyzed. I guess it depends what you define "analyzed" to mean. In this context, it means gathering statistics specifically for a certain table. Would it be more clear if "only that table is analyzed." was changed to "then statistics are only gathered specifically for that table."? David
On Mon, Sep 23, 2024 at 12:46 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Mon, 23 Sept 2024 at 15:29, jian he <jian.universality@gmail.com> wrote: > > Given the above context, I am still confused with this sentence in > > sql-analyze.html. > > "If ONLY is specified before the table name, only that table is analyzed." > > > > like in the above sql example, only_inh_parent's child is also being analyzed. > > I guess it depends what you define "analyzed" to mean. In this > context, it means gathering statistics specifically for a certain > table. > > Would it be more clear if "only that table is analyzed." was changed > to "then statistics are only gathered specifically for that table."? > looking at expand_vacuum_rel, analyze_rel. if we --------- if (onerel->rd_rel->relhassubclass) do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages, true, in_outer_xact, elevel); change to if (onerel->rd_rel->relhassubclass && ((!relation || relation->inh) || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)) do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages, true, in_outer_xact, elevel); then the inheritance table will behave the way the doc says. for example: drop table if exists only_inh_parent,only_inh_child; CREATE TABLE only_inh_parent (a int , b INT) with (autovacuum_enabled = false); CREATE TABLE only_inh_child () INHERITS (only_inh_parent) with (autovacuum_enabled = false); INSERT INTO only_inh_child(a,b) select g % 80, (g + 1) % 200 from generate_series(1,1000) g; ANALYZE ONLY only_inh_parent; select stadistinct,starelid::regclass,staattnum, stainherit from pg_statistic where starelid = ANY ('{only_inh_parent, only_inh_child}'::regclass[]); will return zero rows, since the physical table only_inh_parent has no storage. sql-analyze.html For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions. Each leaf partition is also recursively analyzed and the statistics updated. This recursive part may be disabled by using the ONLY keyword, otherwise, leaf partitions are analyzed only once, even with multi-level partitioning. No statistics are collected for only the parent table (without data from its partitions), because with partitioning it's guaranteed to be empty. allow me to ask anenglish language question. here "otherwise" means specify ONLY or not? As far as i understand. if you specify ONLY, postgres will only do "For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions" if you not specify ONLY, postgres will do "For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions *AND* also recursively analyze each leaf partition" Is my understanding correct? I think there is a whitespace error in "ANALYZE ONLY vacparted(a,b); " in vacuum.out.
On Mon, Sep 23, 2024 at 6:04 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > If this is confusing, I think there's a bunch of detail that I tried > to keep that's just not that useful. The part about analyzing > partitions just once and the part about not collecting non-inheritance > stats for the partitioned table seems like extra detail that's either > obvious or just not that important. > > Can you have a look at the attached and let me know if it's easier to > understand now? > Now the regress test passed. <para> For partitioned tables, <command>ANALYZE</command> gathers statistics by sampling rows from all partitions. By default, <command>ANALYZE</command> will also recursively collect and update the statistics for each partition. The <literal>ONLY</literal> keyword may be used to disable this. </para> is very clear to me! <para> If the table being analyzed has inheritance children, <command>ANALYZE</command> gathers two sets of statistics: one on the rows of the parent table only, and a second including rows of both the parent table and all of its children. This second set of statistics is needed when planning queries that process the inheritance tree as a whole. The autovacuum daemon, however, will only consider inserts or updates on the parent table itself when deciding whether to trigger an automatic analyze for that table. If that table is rarely inserted into or updated, the inheritance statistics will not be up to date unless you run <command>ANALYZE</command> manually. By default, <command>ANALYZE</command> will also recursively collect and update the statistics for each inheritance child table. The <literal>ONLY</literal> keyword may be used to disable this. </para> looks fine. but maybe we can add the following information "if The <literal>ONLY</literal> is specified, the second set of statistics won't include each children individual statistics" I think that's the main difference between specifying ONLY or not? catalog-pg-statistic.html second paragraph seems very clear to me. Maybe we can link it somehow Other than that, it looks good to me.
On Mon, 23 Sept 2024 at 23:23, jian he <jian.universality@gmail.com> wrote: > looks fine. but maybe we can add the following information > "if The <literal>ONLY</literal> is specified, the second set of > statistics won't include each children individual statistics" > I think that's the main difference between specifying ONLY or not? Ok, I think you're not understanding this yet and I'm not sure what I can make more clear in the documents. Let me explain... For inheritance parent tables, ANALYZE ONLY will gather inheritance and non-inheritance statistics for ONLY the parent. Here's an example of that: drop table if exists parent,child; create table parent(a int); create table child () inherits (parent); insert into parent values(1); insert into child values(1); analyze ONLY parent; select starelid::regclass,stainherit,stadistinct from pg_statistic where starelid in ('parent'::regclass,'child'::regclass); starelid | stainherit | stadistinct ----------+------------+------------- parent | f | -1 <- this is the distinct estimate for SELECT * FROM ONLY parent; parent | t | -0.5 <- this is the distinct estimate for SELECT * FROM parent; (2 rows) For the stainherit==false stats, only 1 row is sampled here as that's the only row directly located in the "parent" table. For the stainherit==true stats, 2 rows are sampled, both of them have "a" == 1. The stadistinct reflects that fact. Note there have been no statistics recorded for "child". However, analyze did sample rows in that table as part of gathering sample rows for "parent" for the stainherit==true row. Now let's try again without ONLY. analyze parent; select starelid::regclass,stainherit,stadistinct from pg_statistic where starelid in ('parent'::regclass,'child'::regclass); starelid | stainherit | stadistinct ----------+------------+------------- parent | f | -1 parent | t | -0.5 child | f | -1 (3 rows) All of the above rows were re-calculated with the "analyze parent" command, the first two rows have the same values as nothing changed in the table, however, there are now statistics stored for the "child" table. > catalog-pg-statistic.html second paragraph seems very clear to me. > Maybe we can link it somehow I don't know what "link it" means in this context. David
On Mon, Sep 23, 2024 at 7:53 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Mon, 23 Sept 2024 at 23:23, jian he <jian.universality@gmail.com> wrote: > > looks fine. but maybe we can add the following information > > "if The <literal>ONLY</literal> is specified, the second set of > > statistics won't include each children individual statistics" > > I think that's the main difference between specifying ONLY or not? > > Ok, I think you're not understanding this yet and I'm not sure what I > can make more clear in the documents. > > Let me explain... For inheritance parent tables, ANALYZE ONLY will > gather inheritance and non-inheritance statistics for ONLY the parent. > > Here's an example of that: > > drop table if exists parent,child; > create table parent(a int); > create table child () inherits (parent); > insert into parent values(1); > insert into child values(1); > > analyze ONLY parent; > select starelid::regclass,stainherit,stadistinct from pg_statistic > where starelid in ('parent'::regclass,'child'::regclass); > starelid | stainherit | stadistinct > ----------+------------+------------- > parent | f | -1 <- this is the distinct estimate > for SELECT * FROM ONLY parent; > parent | t | -0.5 <- this is the distinct estimate > for SELECT * FROM parent; > (2 rows) > > For the stainherit==false stats, only 1 row is sampled here as that's > the only row directly located in the "parent" table. > For the stainherit==true stats, 2 rows are sampled, both of them have > "a" == 1. The stadistinct reflects that fact. > > Note there have been no statistics recorded for "child". However, > analyze did sample rows in that table as part of gathering sample rows > for "parent" for the stainherit==true row. > > Now let's try again without ONLY. > > analyze parent; > > select starelid::regclass,stainherit,stadistinct from pg_statistic > where starelid in ('parent'::regclass,'child'::regclass); > starelid | stainherit | stadistinct > ----------+------------+------------- > parent | f | -1 > parent | t | -0.5 > child | f | -1 > (3 rows) > > All of the above rows were re-calculated with the "analyze parent" > command, the first two rows have the same values as nothing changed in > the table, however, there are now statistics stored for the "child" > table. thanks for your explanation! now I don't have any questions about this patch. > > catalog-pg-statistic.html second paragraph seems very clear to me. > > Maybe we can link it somehow > > I don't know what "link it" means in this context. > i mean, change to: By default, <command>ANALYZE</command> will also recursively collect and update the statistics for each inheritance child table. The <literal>ONLY</literal> keyword may be used to disable this. You may also refer to catalog <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link> description about <literal>stainherit</literal>. but <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link> already mentioned once. maybe not a good idea.
On Mon, 23 Sept 2024 at 22:04, David Rowley <dgrowleyml@gmail.com> wrote: > Can you have a look at the attached and let me know if it's easier to > understand now? Pushed. David