Thread: ANALYZE on partitioned tables vs on individual partitions
Hello Experts,
Our application has a database with a large number of partitioned tables used to store time series data.
Each table is partitioned by time: new data is written into the current partition, old partitions eventually get dropped, and the partitions in between are largely static.
I had not realized until recently that the partitioned tables themselves are not analyzed by the autovacuum processes, and I am now wondering whether we should be analyzing them manually.
The problem is that they are enormous and take a long time to analyze: I tested one yesterday and it took ~30 minutes, and we have hundreds of them, so it might not be practical to analyze them regularly.
My questions are:
1. What are the consequences of not having good statistics at the partitioned-table level, if you do have good statistics at the partition level?
2. Which planning operations rely on partitioned-table-level statistics?
My reason for asking is that I'm trying to understand the trade-off between the I/O cost of analyzing frequently and the risk of non-optimal plans being chosen.
Thanks in advance, and sorry if this has all been covered before. I could not find anything definitive in the docs, and while I did find a few references by googling, none of them really answered these questions.
Regards
Mike
> On Aug 6, 2024, at 21:13, Michael Harris <harmic@gmail.com> wrote:
>
> 1. What are the consequences of not having good statistics at the partitioned-table level, if you do have good statistics at the partition level?

The child partitions are just tables, and all of the same issues that can arise from bad statistics on a table can arise on them: specifically, bad query plans. (There are also some consequences to leaving a table unvacuumed, even an append-only table.) The root table in declarative partitioning has no rows, so there really is no such thing as "statistics at the partitioned-table level."

> 2. Which planning operations rely on partitioned-table-level statistics?

Just about every query uses the information that ANALYZE gathers in some way. If the query can't conceivably be executed in any way other than a sequential scan, or if it is a simple lookup on a unique column, the statistics will not contribute to the plan, but essentially every other plan will use them in one way or another.

Child partitions should be autovacuumed and autoanalyzed just like any other table; they are not prohibited from autovacuum in any way by default. It's probably a good idea to investigate why they are not being picked up by autovacuum. If they are created by a bulk load process, it's not a bad idea to do a VACUUM ANALYZE on them once the bulk load is complete.
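For example, something along these lines shows when autovacuum last vacuumed/analyzed each partition and how many rows have changed since the last analyze (the table name pattern is made up; adjust it to your schema):

    -- When did autovacuum last touch each partition, and how much has
    -- changed since the last analyze?
    SELECT relname,
           last_autovacuum,
           last_autoanalyze,
           n_mod_since_analyze
    FROM   pg_stat_user_tables
    WHERE  relname LIKE 'measurements_%'
    ORDER  BY relname;

A manual pass after a bulk load can be as simple as:

    VACUUM ANALYZE measurements_2024_08;   -- hypothetical partition name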
On Wed, 7 Aug 2024 at 16:44, Christophe Pettus <xof@thebuild.com> wrote:
> Child partitions should be autovacuumed and autoanalyzed just like any other table; they are not prohibited from autovacuum in any way by default. It's probably a good idea to investigate why they are not being picked up by autovacuum. If they are created by a bulk load process, it's not a bad idea to do a VACUUM ANALYZE on them once the bulk load is complete.

I think the complaint was about no autovacuum on the partitioned table, not the partitions. This is expected, as we don't track the counters (in particular n_mod_since_analyze) shown in pg_stat_all_tables at the partitioned-table level, so the trigger points that normally cause autovacuum to analyze or vacuum a table just won't be triggered for a partitioned table. For VACUUM that's fine as, as you mentioned, no rows are stored. But for ANALYZE, that does present a problem.

As for the aspects of planning that rely on statistics of the partitioned table: basically anything above the Append or MergeAppend which combines the partition results together. That doesn't include the scans of each partition, or any quals that are pushed down to the scan level, as those are able to use the partition-level statistics. However, it does include things like joins, GROUP BY and DISTINCT, as those require n_distinct estimates for the partitioned table. It's not all bad, though, as the row estimates for each individual partition will be totalled up through the Append / MergeAppend simply by adding up the row estimates for each Append / MergeAppend child plan. So it's really only an estimation problem for any node that comes after a join node or a group-by node, as the output rows for those nodes will depend on a good n_distinct estimate for the partitioned table.

Partition-wise joins and aggregates do change things a bit, as those features permit moving those operations below the Append / MergeAppend, in which case the statistics for the individual partitions can be used.

You could consider manually setting the n_distinct_inherited estimates for the columns that you join on or group by in the partitioned table. You might find that you're able to choose a suitable value for that if you review the documentation for that setting. In particular, please review what is mentioned about using negative numbers: you may be able to choose a value that scales correctly with the row estimate and doesn't get outdated as you add more rows to the partitions. You'll need to determine that based on the data you're storing. (A rough sketch of those commands, with made-up names, is included below.)

David
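To illustrate (the table and column names are made up, and the value is just a placeholder), a negative, scaling n_distinct_inherited override on a join column could look like this:

    -- A negative setting means "number of distinct values = |value| * row count",
    -- so -0.05 claims the column has roughly 5% of the row count as distinct
    -- values, and the estimate keeps scaling as partitions are added.
    ALTER TABLE metrics ALTER COLUMN node_id SET (n_distinct_inherited = -0.05);

    -- The override is only applied by subsequent ANALYZEs of the partitioned table.
    ANALYZE metrics;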
Many thanks David for the comprehensive response.

> I think the complaint was about no autovacuum on the partitioned
> table, not the partitions.

Yes, exactly.

One other piece of information: these tables contain a lot of columns, of which only 4 are normally used in WHERE clauses or joins. The table I was experimenting with has 150 columns, 156026832 rows and occupies 166GB.

I found that running an ANALYZE specifying only those 4 columns took 5 minutes, compared to the 30 minutes for the whole table (a sketch of that form of the command is below). That was a bit of a surprise, as I had imagined that actually reading the table would take most of the time and would be the same regardless of the number of columns being analyzed, but I guess that is wrong.

Regards,
Mike
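For reference, a column-limited ANALYZE looks something like this (the table and column names are invented; ours are different):

    -- Only gather statistics for the columns that actually appear in
    -- WHERE clauses and joins.
    ANALYZE measurements (node_id, metric_id, site_id, ts);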
On Wed, 7 Aug 2024 at 19:20, Michael Harris <harmic@gmail.com> wrote:
> I found that running an ANALYZE specifying only those 4 columns took
> 5 minutes, compared to the 30 minutes for the whole table.
>
> That was a bit of a surprise, as I had imagined that actually reading the
> table would take most of the time and would be the same regardless of the
> number of columns being analyzed, but I guess that is wrong.

ANALYZE does do sampling of the data in the table. It would only read all of the rows for fairly small tables. The docs in [1] mention this:

"For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time."

I don't think most of the time is going into the ANALYZE of the partitioned table itself; gathering the partitioned-table statistics just means sampling rows from each partition, proportionally to each partition's size. It's more likely due to the fact that each partition is also analyzed and the statistics for each of those are updated. There is no "ANALYZE ONLY" command similar to "FROM ONLY" in SELECT queries.

You could probably do some sampling of the pg_stat_progress_analyze view [2] to figure out what's taking the most time (a sketch of such a query is included below). If you find that the majority of the time is spent analyzing the partitions and not the partitioned table, then maybe we should expand ANALYZE to add the ONLY option...

David

[1] https://www.postgresql.org/docs/current/sql-analyze.html
[2] https://www.postgresql.org/docs/current/progress-reporting.html
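A sampling query along these lines (assuming PostgreSQL 13 or later, where pg_stat_progress_analyze exists) shows which relation is being analyzed at any moment and how far through the sampling it is:

    SELECT pid,
           relid::regclass                      AS analyzing,
           phase,
           sample_blks_scanned,
           sample_blks_total,
           child_tables_done,
           child_tables_total,
           current_child_table_relid::regclass  AS current_child
    FROM   pg_stat_progress_analyze;

In psql, following the query with "\watch 30" re-runs it every 30 seconds.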
> You could probably do some sampling of the pg_stat_progress_analyze
> view to figure out what's taking the most time.

I did another run, sampling pg_stat_progress_analyze every 30s. For the first 4 minutes it was working on the partitioned table; after that it began analyzing all the partitions, which took approx. 29 minutes.

I think you are correct: an ONLY option for ANALYZE would be a huge benefit. In my use case the autovacuum processes are keeping the partitions analyzed, so there would seem to be little benefit in including them in the manual table-level ANALYZE.

Regards
Mike