Thread: Merging statistics from children instead of re-sampling everything
Hi,

While reviewing the thread about issues with auto-analyze on partitioned
tables [1] I remembered that the way we build statistics on the parent is
somewhat expensive, because it samples rows from the children again. It's
true we sample much smaller amounts of rows from each partition
(proportional to its size), so it's not as expensive as simply running
ANALYZE on all partitions individually. But it's not free either, and in
some cases (e.g. with partitioning by time) it may touch old data that is
not touched by regular queries, so likely not cached etc. That's a bit
against the idea of using partitioning to "segregate" old data.

One reason why the ANALYZE costs are not a huge issue in practice is that
we're not actually triggering it - changes_since_analyze is not updated
for the parent, so autoanalyze does not realize it needs to do anything,
and we never rebuild the statistics :-( But as shown in [2], that may
lead to poor estimates (and bad plans) in cases when we rely on the
parent's stats (probably due to not expanding the list of children early
enough).

(Note: I wonder if we might simply not rely on the parent stats at all,
and always consult the children's stats directly - but with many children
that's likely problematic/expensive, and it has the same issues as the
merging.)

The other thread [1] attempts to fix that by incrementing the counters
for the parent too, but that'll make this much worse. Firstly, with
multi-level partitioning we'd have to sample the children repeatedly,
essentially once for each level. Secondly, it's tricky to control when
exactly the counters are propagated to the parent, making the parent
analyzes more frequent. Not great. Even if we do a great job in [1] and
come up with smart heuristics, there will always be cases where it does
not work too well and we either analyze too late or too often.

Note: Propagating changes_since_analyze is only part of the story, as it
does not do anything about stats after attach/detach of a partition.

This attempts to approach the problem from the other end - instead of
tightly controlling when to analyze the parent, it makes the analyze much
cheaper. That means we don't need to worry too much about doing the
analyze too often, and we can update the stats more aggressively.

So, how does it work? Well, it simply fetches the statistics for all
children and merges them together. That's fairly simple for most
statistics types (see the sketches after this list):

1) stawidth, stanullfrac - Those are trivial to merge.

2) stacorrelation - Not sure, I've used a weighted average for now.
Perhaps we could store the "internal" counters (sumX, sumX2) which would
allow us to calculate the regular estimate for the parent.

3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed it up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
estimates based on a tiny sample of each child are any better. I suppose
we could introduce a column option determining how to combine ndistinct
(similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at the statistics target, and copy
the remaining bits to the histogram).

5) histograms - A bit more complicated, because it requires dealing with
overlapping bins, so we may have to "cut" and combine them in some way.
If we assume that cutting a bin in K parts means each part has 1/K tuples
(no matter where exactly we cut it), then it's trivial and it works just
fine in practice. That's because with N children, each bin actually
represents 1.0/(target*N) of the tuples, so the errors are quite limited.
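To make the merging rules above concrete, here is a minimal standalone
sketch of points 1-3 - weighted averages for nullfrac/width/correlation,
plus a plain sum for ndistinct. All names are hypothetical and everything
is reduced to doubles; this is an illustration of the idea, not the patch
code:

/*
 * Minimal standalone sketch of merging per-child scalar statistics,
 * assuming we know each child's row count. Hypothetical names,
 * not the patch code.
 */
#include <stdio.h>

typedef struct ChildStats
{
    double ntuples;        /* rows in the child */
    double stanullfrac;    /* fraction of NULLs */
    double stawidth;       /* average width, in bytes */
    double stacorrelation; /* physical vs. logical order correlation */
    double stadistinct;    /* ndistinct (positive form, for simplicity) */
} ChildStats;

static void
merge_stats(const ChildStats *child, int nchildren, ChildStats *parent)
{
    double total = 0.0;

    parent->stanullfrac = 0.0;
    parent->stawidth = 0.0;
    parent->stacorrelation = 0.0;
    parent->stadistinct = 0.0;

    for (int i = 0; i < nchildren; i++)
        total += child[i].ntuples;

    for (int i = 0; i < nchildren; i++)
    {
        double w = child[i].ntuples / total;

        /* nullfrac merges exactly as a weighted average */
        parent->stanullfrac += w * child[i].stanullfrac;

        /* width too (strictly it should be weighted by non-NULL rows) */
        parent->stawidth += w * child[i].stawidth;

        /* correlation: the weighted average is just a heuristic */
        parent->stacorrelation += w * child[i].stacorrelation;

        /* ndistinct: summing is the "safe" choice, ignoring overlaps */
        parent->stadistinct += child[i].stadistinct;
    }
    parent->ntuples = total;
}

int
main(void)
{
    ChildStats children[] = {
        {1000, 0.10, 8, 0.9, 500},
        {3000, 0.20, 4, -0.5, 800}
    };
    ChildStats parent;

    merge_stats(children, 2, &parent);
    printf("nullfrac=%.3f width=%.2f corr=%.3f ndistinct=%.0f\n",
           parent.stanullfrac, parent.stawidth,
           parent.stacorrelation, parent.stadistinct);
    return 0;
}

And a numeric-only toy for point 5: each child histogram yields a CDF by
interpolating linearly inside its bins (which is exactly the "1/K of the
tuples per part" assumption), the per-child CDFs are merged as a weighted
sum, and the parent's equi-height bounds are read off by inverting the
merged CDF. Again a hypothetical sketch, not the patch code:

/*
 * Numeric-only sketch of merging equi-height histograms under the
 * "uniform within a bin" assumption.
 */
#include <stdio.h>

typedef struct ChildHist
{
    const double *bounds;  /* nbounds ascending bin boundaries */
    int           nbounds; /* nbins = nbounds - 1 */
    double        ntuples; /* rows in this child */
} ChildHist;

/* Fraction of the child's rows below x, interpolating inside bins. */
static double
child_cdf(const ChildHist *h, double x)
{
    int nbins = h->nbounds - 1;

    if (x <= h->bounds[0])
        return 0.0;
    if (x >= h->bounds[h->nbounds - 1])
        return 1.0;
    for (int i = 0; i < nbins; i++)
        if (x < h->bounds[i + 1])
            return (i + (x - h->bounds[i]) /
                    (h->bounds[i + 1] - h->bounds[i])) / nbins;
    return 1.0;
}

/* Weighted CDF of the whole inheritance tree. */
static double
merged_cdf(const ChildHist *hists, int n, double total, double x)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += hists[i].ntuples * child_cdf(&hists[i], x);
    return sum / total;
}

int
main(void)
{
    double b1[] = {0, 25, 50, 75, 100};  /* child 1: uniform 0..100 */
    double b2[] = {50, 60, 70, 85, 100}; /* child 2: skewed 50..100 */
    ChildHist hists[] = {{b1, 5, 1000}, {b2, 5, 1000}};
    double total = 2000;
    int target = 4;                      /* bins in the parent histogram */

    printf("merged bounds:");
    for (int k = 0; k <= target; k++)
    {
        double q = (double) k / target;
        double lo = 0, hi = 100;         /* global min/max */

        /* invert the merged CDF by bisection */
        for (int it = 0; it < 60; it++)
        {
            double mid = (lo + hi) / 2;

            if (merged_cdf(hists, 2, total, mid) < q)
                lo = mid;
            else
                hi = mid;
        }
        printf(" %.1f", (lo + hi) / 2);
    }
    printf("\n");
    return 0;
}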
The attached patch is a PoC - it should work, but there's plenty of room
for improvement. It only deals with "regular" per-column statistics, not
with extended stats (but I don't see why it wouldn't work for them too).
It adds a new analyze option "MERGE" which does not sample the children
but instead just combines the statistics. So the example from [2] looks
like this:

======================================================================
create table p (i integer, j integer) partition by list (i);
create table p0 partition of p for values in (0);
create table p1 partition of p for values in (1);
insert into p select 0,generate_series(1,1000);
insert into p select 1,generate_series(1,1000);
analyze p0;
analyze p1;
create table q (i integer);
insert into q values (0);
analyze q;

test=# explain select * from q join p using (i) where j between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..49.82 rows=5 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)

test=# analyze (merge) p;
ANALYZE
test=# explain select * from q join p using (i) where j between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..54.77 rows=500 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)
======================================================================

FWIW I'm not sure we need the separate MERGE mode, but it's an easy way
to allow both the regular and "merge" approach, so that it's possible to
experiment and compare the behavior.

One issue is that this would require some coordination between the
parent/child analyzes. Essentially, any time a child is analyzed, the
parent should rebuild the stats (to merge from the new child stats).
This is similar to the issue of analyzing the parent too often because we
don't know when exactly the counters get updated, but it's much cheaper
to merge the stats, so it's much less problematic.

regards

[1] https://commitfest.postgresql.org/32/2492/
[2] https://www.postgresql.org/message-id/CAM-w4HO9hUHvJDVwQ8%3DFgm-znF9WNvQiWsfyBjCr-5FD7gWKGA%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for taking a fresh look at this.

As you've written it, this can apply to either/both partitioned tables or
inheritance. I imagine when "MERGE" goes away, this should apply only to
partitioned tables. (Actually, personally I would advocate to consider
applying it to *both*, but I don't think that's been the tendency over
the last 4 years. I wrote here about some arguably-gratuitous differences
between inheritance and partitioning.
https://www.postgresql.org/message-id/20180601221428.GU5164@telsasoft.com)

> +    if (*mcv_items > default_statistics_target)
> +            n = default_statistics_target;

It should use any non-default stats target of the parent's column.

> +     * ignore anything but valid leaf relatiins with data, but release

sp: relatiins.

> +                    elog(WARNING, "stats for %d %d not found",
> +                             RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum);

should be %u %d

This code duplication is undesirable:

> +    /* Log the action if appropriate */
> +     * Determine which columns to analyze
On 3/29/21 8:36 PM, Justin Pryzby wrote:
> Thanks for taking a fresh look at this.
>
> As you've written it, this can apply to either/both partitioned tables
> or inheritance. I imagine when "MERGE" goes away, this should apply
> only to partitioned tables. (Actually, personally I would advocate to
> consider applying it to *both*, but I don't think that's been the
> tendency over the last 4 years. I wrote here about some
> arguably-gratuitous differences between inheritance and partitioning.
> https://www.postgresql.org/message-id/20180601221428.GU5164@telsasoft.com)
>

Haven't thought about that too much at this point, but I don't see any
reason not to apply it to both cases. I might be missing something, but
the fact that with declarative partitioning we analyze the children,
while with inheritance we don't, seems a bit strange. Not sure.

>> +    if (*mcv_items > default_statistics_target)
>> +            n = default_statistics_target;
>
> It should use any non-default stats target of the parent's column.
>

Yeah. That's a simplification - the non-PoC code would have to look at
the per-column statistics target for the target relation / all the
children, and make some calculation based on that. I ignored that in the
PoC.

>> +     * ignore anything but valid leaf relatiins with data, but release
>
> sp: relatiins.
>
>> +                    elog(WARNING, "stats for %d %d not found",
>> +                             RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum);
>
> should be %u %d
>
> This code duplication is undesirable:
>> +    /* Log the action if appropriate */
>> +     * Determine which columns to analyze

Yeah. Fine for a PoC, but needs to be cleaned up in a future patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/29/21 9:24 PM, Tomas Vondra wrote:
>
> On 3/29/21 8:36 PM, Justin Pryzby wrote:
>> Thanks for taking a fresh look at this.
>>
>> As you've written it, this can apply to either/both partitioned tables
>> or inheritance. I imagine when "MERGE" goes away, this should apply
>> only to partitioned tables. (Actually, personally I would advocate to
>> consider applying it to *both*, but I don't think that's been the
>> tendency over the last 4 years. I wrote here about some
>> arguably-gratuitous differences between inheritance and partitioning.
>> https://www.postgresql.org/message-id/20180601221428.GU5164@telsasoft.com)
>>
>
> Haven't thought about that too much at this point, but I don't see any
> reason not to apply it to both cases. I might be missing something, but
> the fact that with declarative partitioning we analyze the children,
> while with inheritance we don't, seems a bit strange. Not sure.
>

BTW I'm not sure the "merge" mode will / should go away. What I'd expect
is that we'll keep it, and you can do either "ANALYZE" or "ANALYZE
(MERGE)". The former does regular sampling, while the latter does the
proposed merging of stats.

For autovacuum, I think a new autovacuum_analyze_merge GUC and reloption
would work - chances are people will want to set the default and then
maybe enable merging only for some cases. Not sure.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I'd like to point out two less obvious things about how this relates to
Tom's proposal [1] and patch [2] from 2015.

Tom approached the problem from a different direction, essentially
allowing a Var to be associated with a list of statistics instead of just
one. So it's a somewhat orthogonal solution, and it has pros and cons.
The pro is that it can ignore statistics for eliminated partitions, thus
producing better estimates. The con is that it requires all the places
dealing with VariableStatData to assume there's a list, not just one
entry, making the code more complex and more CPU expensive (with
sufficiently many partitions).

However, it seems to me we could easily combine those two things - we can
merge the statistics (the way I proposed here), so that each Var still
has just one VariableStatData. That'd mean the various places don't need
to start dealing with a list, and it'd still allow ignoring stats for
eliminated partitions.

Of course, that assumes the merge is cheaper than processing the list of
statistics, but I find that plausible, especially if the list needs to be
processed multiple times (e.g. when considering different join orders,
filters and so on).

Haven't tried, but might be worth exploring in the future.

regards

[1] https://www.postgresql.org/message-id/7363.1426537103@sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/22598.1425686096@sss.pgh.pa.us

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Sorry, I forgot to CC pgsql-hackers.

On 29/6/21 13:23, Tomas Vondra wrote:
> Because sampling is fairly expensive, especially if you have to do it
> for a large number of child relations. And you'd have to do that every
> time *any* child triggers autovacuum, pretty much. Merging the stats is
> way cheaper.
>
> See the other thread linked from the first message.

Maybe I couldn't describe my idea clearly. Partitioning is most commonly
used for large tables. I propose to store a sampling reservoir for each
partition, replace it on each update of statistics, and merge the
reservoirs to build statistics for the parent table. The reservoir can be
spilled into a tuplestore on disk, or stored in the parent table. In the
case of complex inheritance we can store sampling reservoirs only for the
leaves.

You can consider this idea as speculation, but the merging-statistics
approach has an extensibility problem with other types of statistics.

> On 6/29/21 9:01 AM, Andrey Lepikhov wrote:
>> On 30/3/21 03:51, Tomas Vondra wrote:
>>> Of course, that assumes the merge is cheaper than processing the list
>>> of statistics, but I find that plausible, especially if the list
>>> needs to be processed multiple times (e.g. when considering different
>>> join orders, filters and so on).
>> I think your approach has a chance. But I didn't understand: why do
>> you merge statistics? I think we could merge only the samples of each
>> child and build the statistics as usual. The error of such a
>> sample-merging procedure would be quite limited.

--
regards,
Andrey Lepikhov
Postgres Professional
On 6/30/21 2:55 PM, Andrey Lepikhov wrote:
> Sorry, I forgot to CC pgsql-hackers.
> On 29/6/21 13:23, Tomas Vondra wrote:
>> Because sampling is fairly expensive, especially if you have to do it
>> for a large number of child relations. And you'd have to do that every
>> time *any* child triggers autovacuum, pretty much. Merging the stats
>> is way cheaper.
>>
>> See the other thread linked from the first message.
> Maybe I couldn't describe my idea clearly. Partitioning is most
> commonly used for large tables. I propose to store a sampling reservoir
> for each partition, replace it on each update of statistics, and merge
> the reservoirs to build statistics for the parent table. The reservoir
> can be spilled into a tuplestore on disk, or stored in the parent
> table. In the case of complex inheritance we can store sampling
> reservoirs only for the leaves.
> You can consider this idea as speculation, but the merging-statistics
> approach has an extensibility problem with other types of statistics.
>

Well, yeah - we might try that too, of course. This is simply exploring
the "merge statistics" idea from [1], which is why it does not even
attempt to do what you suggested. We may explore the approach with
keeping per-partition samples, of course.

You're right that maintaining per-partition samples and merging those
might solve (or at least reduce) some of the problems, e.g. eliminating
most of the I/O that'd be needed for sampling. And yeah, it's not
entirely clear how to merge some of the statistics types (like
ndistinct). But for a lot of the basic stats it works quite nicely, I
think.

I'm sure there'll be some complexity due to handling large / toasted
values, etc. And we probably need to design this for large hierarchies
(IMHO it should work with 10k partitions, not just 100), in which case it
may still be quite a bit more expensive than merging the stats.

So maybe we should really support both, and combine them somehow?

regards

[1] https://www.postgresql.org/message-id/CAM-w4HO9hUHvJDVwQ8%3DFgm-znF9WNvQiWsfyBjCr-5FD7gWKGA%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/30/21 17:15, Tomas Vondra wrote:
> On 6/30/21 2:55 PM, Andrey Lepikhov wrote:
>> Maybe I couldn't describe my idea clearly. Partitioning is most
>> commonly used for large tables. I propose to store a sampling
>> reservoir for each partition, replace it on each update of statistics,
>> and merge the reservoirs to build statistics for the parent table. The
>> reservoir can be spilled into a tuplestore on disk, or stored in the
>> parent table. In the case of complex inheritance we can store sampling
>> reservoirs only for the leaves.
>> You can consider this idea as speculation, but the merging-statistics
>> approach has an extensibility problem with other types of statistics.
>>
>
> Well, yeah - we might try that too, of course. This is simply exploring
> the "merge statistics" idea from [1], which is why it does not even
> attempt to do what you suggested. We may explore the approach with
> keeping per-partition samples, of course.
>
> You're right that maintaining per-partition samples and merging those
> might solve (or at least reduce) some of the problems, e.g. eliminating
> most of the I/O that'd be needed for sampling. And yeah, it's not
> entirely clear how to merge some of the statistics types (like
> ndistinct). But for a lot of the basic stats it works quite nicely, I
> think.
>
> I'm sure there'll be some complexity due to handling large / toasted
> values, etc. And we probably need to design this for large hierarchies
> (IMHO it should work with 10k partitions, not just 100), in which case
> it may still be quite a bit more expensive than merging the stats.
>
> So maybe we should really support both, and combine them somehow?
>

I've been thinking about this PoC patch regularly since I submitted it a
year ago, and I still think merging the statistics is an interesting
idea. It may be a huge win in various scenarios, like:

1) Multi-level partitioning hierarchies, where an analyze of each level
has to re-sample all the leaf relations, causing a lot of I/O.

2) Partitioning with foreign/remote partitions, where analyze has to
retrieve significant amounts of data from the remote node over the
network (so a different kind of I/O).

These issues will only get worse as the number of partitions used by
systems in the wild grows - we continuously reduce the per-partition
overhead, so people are likely to leverage that by using more of them.

But I don't have a very good idea what to do about statistics that we
can't really merge. For some types of statistics it's rather tricky to
reasonably merge the results - ndistinct is a simple example, although we
could work around that by building and merging hyperloglog counters.

What's trickier are extended statistics - I can't quite imagine merging
of functional dependencies, mvdistinct, etc. So if there are extended
stats we'd have to do the sampling anyway. (Some of the extended
statistics can also be defined only on the parent relation, in which case
we have nothing to merge. But that seems like a rare case.)

In any case, I don't have a very good plan how to move this patch
forward, so unless people have some interesting ideas I'll mark it as
returned with feedback.
It's been lurking in the CF for ~1 year ...

However, it'd be a mistake to also discard the approach proposed by
Andrey Lepikhov - storing samples for individual relations, and then just
using those while analyzing the parent relations. That is more expensive,
but it does not have the issues with merging some types of statistics,
and so on.

It seems interesting also for the dynamic sampling PoC patch [1], which
does sampling as part of query planning. In that context the cost of
collecting the sample is clearly a major issue, and storing the sample
somewhere would help a lot. But the question is how/where to store it.
Joins are another issue, because we can't just build two random samples.
But let's discuss that in the other thread [1].

[1] https://commitfest.postgresql.org/36/3211/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21/1/2022 01:25, Tomas Vondra wrote:
> But I don't have a very good idea what to do about statistics that we
> can't really merge. For some types of statistics it's rather tricky to
> reasonably merge the results - ndistinct is a simple example, although
> we could work around that by building and merging hyperloglog counters.

I think, as a first step in this direction, we can reduce the number of
pulled tuples. We don't really need to pull all tuples from a remote
server - to construct a reservoir, we can pull only a tuple sample. The
reservoir method needs only a few parameters to return a sample, just as
when reading tuples locally. Also, to get such partial samples
asynchronously, we can fetch the size of each partition in a preliminary
step of the analysis.

In my opinion, even this solution can reduce the severity of the problem
drastically.

--
regards,
Andrey Lepikhov
Postgres Professional
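For reference, the reservoir method mentioned here boils down to a single
pass; a toy version of Vitter's Algorithm R (ints and rand() for brevity,
not actual FDW code) - the remote server could run this and ship only the
k sampled rows:

/*
 * Sketch of reservoir sampling (Algorithm R): one pass over n rows
 * leaves a uniform k-row sample in the reservoir.
 */
#include <stdio.h>
#include <stdlib.h>

static void
reservoir_sample(const int *stream, int n, int *sample, int k)
{
    for (int i = 0; i < n; i++)
    {
        if (i < k)
            sample[i] = stream[i];      /* fill the reservoir first */
        else
        {
            /* 0..i; modulo bias is fine for a sketch */
            int j = rand() % (i + 1);

            if (j < k)
                sample[j] = stream[i];  /* keep with probability k/(i+1) */
        }
    }
}

int
main(void)
{
    int stream[1000], sample[30];

    for (int i = 0; i < 1000; i++)
        stream[i] = i;
    reservoir_sample(stream, 1000, sample, 30);
    for (int i = 0; i < 30; i++)
        printf("%d ", sample[i]);
    printf("\n");
    return 0;
}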
On 2/10/22 12:50, Andrey Lepikhov wrote:
> On 21/1/2022 01:25, Tomas Vondra wrote:
>> But I don't have a very good idea what to do about statistics that we
>> can't really merge. For some types of statistics it's rather tricky to
>> reasonably merge the results - ndistinct is a simple example, although
>> we could work around that by building and merging hyperloglog
>> counters.
>
> I think, as a first step in this direction, we can reduce the number of
> pulled tuples. We don't really need to pull all tuples from a remote
> server - to construct a reservoir, we can pull only a tuple sample. The
> reservoir method needs only a few parameters to return a sample, just
> as when reading tuples locally. Also, to get such partial samples
> asynchronously, we can fetch the size of each partition in a
> preliminary step of the analysis.
> In my opinion, even this solution can reduce the severity of the
> problem drastically.
>

Oh, wow! I hadn't realized we're fetching all the rows from foreign
(postgres_fdw) partitions. For local partitions we already do that right,
because that uses the usual acquire function, with a reservoir
proportional to the partition size.

I had assumed we use TABLESAMPLE to fetch just a small fraction of rows
from FDW partitions, and I agree doing that would be a pretty huge
benefit. I actually tried hacking that together - there are a couple of
problems with that (e.g. determining what fraction to sample using
bernoulli/system), but in principle it seems quite doable. Some minor
changes to the FDW API may be necessary, not sure.

Not sure about the async execution - that seems way more complicated,
and the sampling reduces the total cost, while async merely parallelizes
it.

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling foreign
partitions doesn't really solve that - we'll still do the sampling over
and over.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
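Determining the sample fraction could look roughly like the following - a
hypothetical helper (the 1.5x oversampling factor is an arbitrary
placeholder), shown only to make the idea concrete, not the postgres_fdw
API:

/*
 * Sketch: pick a TABLESAMPLE percentage for a remote partition so that
 * roughly targrows rows come back.
 */
#include <stdio.h>

static double
sample_percent(double reltuples, double targrows)
{
    double pct;

    if (reltuples <= 0)
        return 100.0;   /* unknown size: fall back to fetching everything */
    pct = 100.0 * targrows / reltuples;
    /* oversample a bit, since BERNOULLI only hits the target on average */
    pct *= 1.5;
    return pct > 100.0 ? 100.0 : pct;
}

int
main(void)
{
    char sql[256];

    snprintf(sql, sizeof(sql),
             "SELECT * FROM public.p1 TABLESAMPLE BERNOULLI (%.4f)",
             sample_percent(1000000, 30000));
    printf("%s\n", sql);
    return 0;
}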
On 2/11/22 03:37, Tomas Vondra wrote:
> That being said, this thread was not really about foreign partitions,
> but about re-analyzing inheritance trees in general. And sampling
> foreign partitions doesn't really solve that - we'll still do the
> sampling over and over.

IMO, to solve the problem we should do two things:

1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when there are active changes in only a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so that multiple samples are built in one shot. It looks like
we need separate logic for the analysis of partitioned tables - we should
form and cache samples for each partition before an analysis. It requires
a prototype to understand the complexity of such a solution, and it can
be done separately from (2).

Task (2) is more difficult to solve. Here we can store samples from each
partition in the values[] field of pg_statistic, or in a dedicated table
which stores a 'most probable values' snapshot of each table. The most
difficult problem here, as you mentioned, is the ndistinct value. Is it
possible to store not an exactly calculated value of ndistinct, but an
'expected value', based on an analysis of the samples and histograms on
the partitions? Such a value could also solve the problem of estimating
the grouping of a SETOP result (joins of them, etc.), where we have
statistics only on the sources of the union.

--
regards,
Andrey Lepikhov
Postgres Professional
On 2/11/22 05:29, Andrey V. Lepikhov wrote:
> On 2/11/22 03:37, Tomas Vondra wrote:
>> That being said, this thread was not really about foreign partitions,
>> but about re-analyzing inheritance trees in general. And sampling
>> foreign partitions doesn't really solve that - we'll still do the
>> sampling over and over.
> IMO, to solve the problem we should do two things:
> 1. Avoid repeated partition scans in the case of an inheritance tree.
> 2. Avoid re-analyzing everything when there are active changes in only
> a small subset of partitions.
>
> For (1) I can imagine a solution like multiplexing: at the stage of
> defining which relations to scan, group them and prepare the scan
> parameters so that multiple samples are built in one shot. It looks
> like we need separate logic for the analysis of partitioned tables - we
> should form and cache samples for each partition before an analysis. It
> requires a prototype to understand the complexity of such a solution,
> and it can be done separately from (2).
>

I'm not sure I understand what you mean by multiplexing. The term usually
means "sending multiple signals at once" but I'm not sure how that
applies to this issue. Can you elaborate?

I assume you mean something like collecting a sample for a partition
once, and then keeping and reusing the sample for future ANALYZE runs,
until it's invalidated in some sense. Yeah, I agree that'd be useful -
and not just for partitions, actually. I've been pondering something like
that for regular tables, because the sample might be used for estimation
of clauses directly.

But it requires storing the sample somewhere, and I haven't found a good
and simple way to do that. We could serialize it into a bytea, or we
could create a new relation fork, or something, but what would that do
with oversized attributes (how would TOAST work for a fork) and/or large
samples (which might not fit into a 1GB bytea)?

> Task (2) is more difficult to solve. Here we can store samples from
> each partition in the values[] field of pg_statistic, or in a dedicated
> table which stores a 'most probable values' snapshot of each table.

I think storing samples in pg_statistic is problematic, because values[]
is subject to the 1GB limit etc. Not an issue for a small MCV list of a
single attribute, certainly an issue for larger samples. Even if the data
fit, the size of pg_statistic would explode.

> The most difficult problem here, as you mentioned, is the ndistinct
> value. Is it possible to store not an exactly calculated value of
> ndistinct, but an 'expected value', based on an analysis of the samples
> and histograms on the partitions? Such a value could also solve the
> problem of estimating the grouping of a SETOP result (joins of them,
> etc.), where we have statistics only on the sources of the union.
>

I think ndistinct is a problem only when merging final estimates. But if
we're able to calculate and store some intermediate results, we can
easily use HLL and merge that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
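To illustrate why hyperloglog sidesteps the overlap problem: two HLL
sketches merge by taking the per-register maximum, and the merged sketch
directly estimates the ndistinct of the union - exactly what summing
per-child estimates cannot do. A toy version (simplified hash and
constants, not a production-quality HLL):

/*
 * Toy HLL: 2^10 registers, each holding the max "trailing zeros + 1"
 * seen for hashes routed to it. Union = per-register max.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define NREG 1024                           /* m = 2^10 registers */

static void
hll_add(uint8_t *reg, uint64_t hash)
{
    uint32_t idx = hash & (NREG - 1);       /* low 10 bits pick a register */
    uint8_t  rho = 1;
    uint64_t rest = hash >> 10;

    while ((rest & 1) == 0 && rho < 64 - 10)
    {
        rho++;                              /* count trailing zeros + 1 */
        rest >>= 1;
    }
    if (rho > reg[idx])
        reg[idx] = rho;
}

static void
hll_merge(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < NREG; i++)
        if (src[i] > dst[i])
            dst[i] = src[i];                /* union = per-register max */
}

static double
hll_estimate(const uint8_t *reg)
{
    double sum = 0.0;

    for (int i = 0; i < NREG; i++)
        sum += pow(2.0, -(double) reg[i]);
    /* standard alpha_m correction for m >= 128 */
    return 0.7213 / (1 + 1.079 / NREG) * NREG * NREG / sum;
}

/* cheap mixing hash, a stand-in for a real one */
static uint64_t
mix(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

int
main(void)
{
    static uint8_t a[NREG], b[NREG];

    for (uint64_t v = 0; v < 50000; v++)
        hll_add(a, mix(v));                 /* child A: values 0..49999 */
    for (uint64_t v = 25000; v < 75000; v++)
        hll_add(b, mix(v));                 /* child B: overlaps child A */

    hll_merge(a, b);
    printf("estimated union ndistinct: %.0f (true: 75000)\n",
           hll_estimate(a));
    return 0;
}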
On Wed, Jun 30, 2021 at 11:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> You're right that maintaining per-partition samples and merging those
> might solve (or at least reduce) some of the problems, e.g. eliminating
> most of the I/O that'd be needed for sampling. And yeah, it's not
> entirely clear how to merge some of the statistics types (like
> ndistinct). But for a lot of the basic stats it works quite nicely, I
> think.

It feels like you might in some cases get very different answers.

Let's say you have 1000 partitions. In each of those partitions, there is
a particular value that appears in column X in 50% of the rows. This
value differs for every partition. So you can imagine for example that in
partition 1, X = 1 with probability 50%; in partition 2, X = 2 with
probability 50%, etc. There is also a value, let's say 0, which appears
in 0.5% of the rows in every partition.

It seems possible that 0 is not an MCV in any partition, or in only some
of them, but it might be more common overall than the #1 MCV of any
single partition: with 1000 equal-sized partitions, each partition's top
value covers only 0.05% of the parent's rows, a tenth of what 0 covers.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2/11/22 20:12, Tomas Vondra wrote:
> On 2/11/22 05:29, Andrey V. Lepikhov wrote:
>> IMO, to solve the problem we should do two things:
>> 1. Avoid repeated partition scans in the case of an inheritance tree.
>> 2. Avoid re-analyzing everything when there are active changes in only
>> a small subset of partitions.
>>
>> For (1) I can imagine a solution like multiplexing: at the stage of
>> defining which relations to scan, group them and prepare the scan
>> parameters so that multiple samples are built in one shot.
> I'm not sure I understand what you mean by multiplexing. The term
> usually means "sending multiple signals at once" but I'm not sure how
> that applies to this issue. Can you elaborate?

I propose to make a set of samples in one scan: one sample for the plain
table, another for its parent, and so on, according to the inheritance
tree. And cache these samples in memory. We can calculate all the
parameters of the reservoir method needed to do it.

> sample might be used for estimation of clauses directly.

You mean, to use them in difficult cases, such as the estimation of
grouping over an Append?

> But it requires storing the sample somewhere, and I haven't found a
> good and simple way to do that. We could serialize it into a bytea, or
> we could create a new relation fork, or something, but what would that
> do with oversized attributes (how would TOAST work for a fork) and/or
> large samples (which might not fit into a 1GB bytea)?

This feature looks like meta-information about a database, and it could
be stored in a separate relation. It is not obvious that we need to use
it for every relation - for example, with large samples. I think it can
be controlled by a table parameter.

--
regards,
Andrey Lepikhov
Postgres Professional
On 2/14/22 11:22, Andrey V. Lepikhov wrote:
> On 2/11/22 20:12, Tomas Vondra wrote:
>> I'm not sure I understand what you mean by multiplexing. The term
>> usually means "sending multiple signals at once" but I'm not sure how
>> that applies to this issue. Can you elaborate?
>
> I propose to make a set of samples in one scan: one sample for the
> plain table, another for its parent, and so on, according to the
> inheritance tree. And cache these samples in memory. We can calculate
> all the parameters of the reservoir method needed to do it.
>

I doubt keeping the samples just in memory is a good solution. Firstly,
there's the question of memory consumption. Imagine a large partitioned
table with 1-10k partitions. If we keep a "regular" sample (30k rows) per
partition, that's 30M-300M rows. If each row needs 100B, that's 3-30GB of
data.

Sure, maybe we could keep smaller per-partition samples, just large
enough to get a merged sample of 30k rows. But then you can also have
higher statistics target values, the rows can be larger, etc. So a couple
of GB per inheritance tree can easily happen. And this data may not be
used all that often, so keeping it in memory may be wasteful.

But maybe you have an idea how to optimize the sizes of per-partition
samples? In principle we need 30k * size(partition) / size(total) rows
for each partition, but the trouble is partitions may be detached, data
may be deleted from some partitions, etc.

Also, what would happen after a restart? If we lose the samples, we'll
have to resample everything anyway - and after a restart the system is
usually fairly busy, so that's not great timing.

So IMHO the samples need to be serialized, in some way.

>> sample might be used for estimation of clauses directly.
> You mean, to use them in difficult cases, such as the estimation of
> grouping over an Append?

That's one example, yes. But the sample might be used even to estimate
complex conditions on a single partition (there's plenty of stuff we
can't estimate from an MCV list / histogram).

>> But it requires storing the sample somewhere, and I haven't found a
>> good and simple way to do that. We could serialize it into a bytea, or
>> we could create a new relation fork, or something, but what would that
>> do with oversized attributes (how would TOAST work for a fork) and/or
>> large samples (which might not fit into a 1GB bytea)?
> This feature looks like meta-information about a database, and it could
> be stored in a separate relation. It is not obvious that we need to use
> it for every relation - for example, with large samples. I think it can
> be controlled by a table parameter.
>

Well, a separate catalog is one of the options. But I don't see how that
deals with large samples, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
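For scale, the arithmetic above plus the proportional-allocation formula,
in a trivially checkable form (the partition/total sizes are made-up
numbers):

/* Back-of-the-envelope check of the memory estimate and the
 * 30k * size(partition) / size(total) allocation. */
#include <stdio.h>

int
main(void)
{
    double rows_per_sample = 30000;
    double row_bytes = 100;

    for (int nparts = 1000; nparts <= 10000; nparts *= 10)
        printf("%5d partitions -> %.0f GB of samples\n", nparts,
               nparts * rows_per_sample * row_bytes / 1e9);

    /* rows to keep per partition for a merged 30k-row parent sample */
    double part_size = 2e6, total_size = 1e8;

    printf("per-partition share: %.0f rows\n",
           rows_per_sample * part_size / total_size);
    return 0;
}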
On 2/14/22 20:16, Tomas Vondra wrote:
> I doubt keeping the samples just in memory is a good solution. Firstly,
> there's the question of memory consumption. Imagine a large partitioned
> table with 1-10k partitions. If we keep a "regular" sample (30k rows)
> per partition, that's 30M-300M rows. If each row needs 100B, that's
> 3-30GB of data.

I'm talking about caching a sample only for the time it is needed by
this ANALYZE operation. Imagine three levels of a partitioned table. For
each partition you would create and keep three different samples (we can
do it in one scan). The sample for the plain table we can use immediately
and then destroy it. The sample for the partition on the second level of
the hierarchy: we can save a copy of the sample to disk for future use
(maybe, a repeated analyze), while the in-memory data is used to form a
reservoir, which has a limited size and can be destroyed immediately. At
the third level we can use the same logic.

So, at any one moment we only use as many samples as there are levels in
the hierarchy. IMO, that isn't a large number.

> the trouble is partitions may be detached, data may be deleted from
> some partitions, etc.

Because statistics don't have a strong relation to the data, we can use
two strategies: in the case of an explicit 'ANALYZE <table>' we can
recalculate all samples for all partitions, but in the autovacuum case,
or for an implicit analysis, we can use not-so-old versions of the
samples, and samples of detached (but not destroyed) partitions, on the
optimistic assumption that this doesn't change the statistics
drastically.

> So IMHO the samples need to be serialized, in some way.

Agreed.

> Well, a separate catalog is one of the options. But I don't see how
> that deals with large samples, etc.

I think we can design a fallback to the previous approach in the case of
very large tuples - like the switch from HashJoin to NestedLoop when we
estimate that we don't have enough memory.

--
regards,
Andrey Lepikhov
Postgres Professional
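A compilable toy of the "several reservoirs in one scan" scheme described
above - hypothetical code, with rows reduced to ints:

/*
 * While scanning a single leaf, offer every row to one reservoir per
 * level of the hierarchy. An ancestor's reservoir (and its row counter)
 * survives across all the leaves below it, so it ends up holding a
 * uniform sample of the whole subtree.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Reservoir
{
    int  *rows;   /* sampled rows (row ids, for brevity) */
    int   size;   /* capacity of the reservoir */
    long  seen;   /* rows offered so far - persists across leaves */
} Reservoir;

static void
reservoir_offer(Reservoir *r, int row)
{
    if (r->seen < r->size)
        r->rows[r->seen] = row;          /* still filling up */
    else
    {
        long j = rand() % (r->seen + 1); /* keep with prob size/(seen+1) */

        if (j < r->size)
            r->rows[j] = row;
    }
    r->seen++;
}

/* one pass over a leaf feeds every level of the hierarchy at once */
static void
scan_leaf(const int *leafrows, int n, Reservoir **levels, int nlevels)
{
    for (int i = 0; i < n; i++)
        for (int lv = 0; lv < nlevels; lv++)
            reservoir_offer(levels[lv], leafrows[i]);
}

int
main(void)
{
    int leaf1[500], leaf2[500];
    int buf_a[30], buf_b[30], buf_root[30];
    Reservoir leaf_a = {buf_a, 30, 0};
    Reservoir leaf_b = {buf_b, 30, 0};
    Reservoir root = {buf_root, 30, 0};

    for (int i = 0; i < 500; i++)
    {
        leaf1[i] = i;
        leaf2[i] = 1000 + i;
    }

    /* scanning leaf 1 feeds its own sample and the parent's */
    Reservoir *lv1[] = {&leaf_a, &root};
    scan_leaf(leaf1, 500, lv1, 2);

    /* leaf 2 reuses the parent reservoir, with its counter intact */
    Reservoir *lv2[] = {&leaf_b, &root};
    scan_leaf(leaf2, 500, lv2, 2);

    printf("parent sample mixes both leaves: %d %d %d ...\n",
           root.rows[0], root.rows[1], root.rows[2]);
    return 0;
}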
> 3) stadistinct - This is quite problematic. We only have the per-child
> estimates, and it's not clear if there's any overlap. For now I've just
> summed it up, because that's safer / similar to what we do for gather
> merge paths etc. Maybe we could improve this by estimating the overlap
> somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
> estimates based on a tiny sample of each child are any better. I
> suppose we could introduce a column option determining how to combine
> ndistinct (similar to how we can override n_distinct itself).
>
> 4) MCV - It's trivial to build a new "parent" MCV list, although it may
> be too large (in which case we cut it at the statistics target, and
> copy the remaining bits to the histogram).
I think there is one approach to solving the problem of calculating the
MCV and distinct statistics. To do this, you need to calculate the
density of the sample distribution and store it, for example, in some
slot. Then, when merging statistics, we would sum up the densities of all
partitions as functions and get a new density. From the new density, you
can find out which values are most common and how many distinct values
there are.
To calculate the partition densities, you can use "Kernel Density
Estimation" -
https://www.statsmodels.org/dev/examples/notebooks/generated/kernel_density.html
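A toy version of that idea - per-child Gaussian KDEs summed with weights
proportional to partition sizes, and the most common value read off the
peak of the merged density (hypothetical code, hand-picked bandwidth):

/*
 * Merged density f(x) = sum over partitions of w_p * KDE_p(x),
 * where w_p is the partition's share of the parent's rows.
 */
#include <math.h>
#include <stdio.h>

static double
kde(const double *sample, int n, double h, double x)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
    {
        double u = (x - sample[i]) / h;

        sum += exp(-0.5 * u * u);       /* Gaussian kernel */
    }
    return sum / (n * h * sqrt(2.0 * 3.141592653589793));
}

int
main(void)
{
    double s1[] = {1, 1, 1, 2, 5};      /* child 1 sample, 1000 rows */
    double s2[] = {5, 5, 6, 7, 7};      /* child 2 sample, 3000 rows */
    double w1 = 1000.0 / 4000.0, w2 = 3000.0 / 4000.0;
    double h = 0.5;                     /* bandwidth, hand-picked */
    double best_x = 0, best_f = -1;

    /* scan a grid for the peak of the merged density */
    for (double x = 0; x <= 8; x += 0.1)
    {
        double f = w1 * kde(s1, 5, h, x) + w2 * kde(s2, 5, h, x);

        if (f > best_f)
        {
            best_f = f;
            best_x = x;
        }
    }
    printf("densest value around %.1f\n", best_x);
    return 0;
}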
The approach may be very inaccurate and difficult to implement, but it
would solve the problem.
Regards,
Damir Belyalov
Postgres Professional