Re: Autovacuum on partitioned table (autoanalyze) - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: Autovacuum on partitioned table (autoanalyze) |
Date | |
Msg-id | 20200318163039.GL26184@telsasoft.com Whole thread Raw |
In response to | Re: Autovacuum on partitioned table (yuzuko <yuzukohosoya@gmail.com>) |
Responses |
Re: Autovacuum on partitioned table (autoanalyze)
Re: Autovacuum on partitioned table (autoanalyze) |
List | pgsql-hackers |
Regarding this patch: + * the ANALYZE message as it resets the partition's changes_since_analze => analyze + * If the relation is a partitioned table, we must add up children's childrens' The approach in general: I see an issue for timeseries data, where only the most recent partition is being inserted into, and the histogram endpoint is being continuously extended (this is our use-case). The most recent partition will be analyzed pretty often, and I think it'll be problematic if its parent doesn't get similar treatment. Let's say there are 12 historic, monthly children with 1e6 tuples each, and the 13th child has 2e5 tuples (6 days into the month). It's analyzed when it grows by 20% (1.2 days), but at that point the parent has only grown by 12x less (~2%) and won't be analyzed until 12x further into the future (14 days). Its histogram is 12x longer (geometrically), but the histogram changed by just as much (arithmetically). That's an issue for a query over "the last few days"; if that's past the end of the histogram bound, the query planner will estimate about ~0 tuples, and tend to give cascades of nested loops. I'm biased, but I'm guessing that's too common a use case to answer that the proper fix is to set the parent's analyze_scale_factor=0.0005. I think that suggests that the parent might sometimes need to be analyzed every time any of its children are. In other cases (like probably any hash partitioning), that'd be excessive, and maybe the default settings shouldn't do that, but I think that behavior ought to be possible, and I think this patch doesn't allow that. In the past, I think there's was talk that maybe someone would invent a clever way to dynamically combine all the partitions' statistics, so analyzing the parent wasn't needed. I think that's easy enough for reltuples, MCV, and I think histogram, but ISTM that ndistinct is simultaneously important to get right and hard to do so. It depends on whether it's the partition key, which now can be an arbitrary expression. Extended stats further complicates it, even if we didn't aim to dynamically compute extended stats for a parent. While writing this, it occured to me that we could use "CREATE STATISTICS" as a way to mark a partitioned table (or certain columns) as needing to be handled by analyze. I understand "CREATE STATs" was intended to (eventually) allow implementing stats on expressions without using "create index" as a hack. So if it's excessive to automatically analyze a parent table when any of its children are analyzed, maybe it's less excessive to only do that for parents with a stats object, and only on the given colums. I realize this patch is alot less useful if it requires to do anything extra/nondefault, and it's desirable to work without creating a stats object at all. Also, using CREATE STATs would reduces the CPU cost of re-analyzing the entire heirarchy, but doesn't help to reduce the I/O cost, which is significant. -- Justin
pgsql-hackers by date: