Re: Merging statistics from children instead of re-sampling everything - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: Merging statistics from children instead of re-sampling everything
Date |
Msg-id | 949f204b-44e2-4e41-67ff-f93643828f4a@enterprisedb.com
In response to | Re: Merging statistics from children instead of re-sampling everything (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses | Re: Merging statistics from children instead of re-sampling everything
List | pgsql-hackers
On 6/30/21 17:15, Tomas Vondra wrote:
> On 6/30/21 2:55 PM, Andrey Lepikhov wrote:
>> Sorry, I forgot to send CC into pgsql-hackers.
>>
>> On 29/6/21 13:23, Tomas Vondra wrote:
>>> Because sampling is fairly expensive, especially if you have to do it
>>> for a large number of child relations. And you'd have to do that every
>>> time *any* child triggers autovacuum, pretty much. Merging the stats
>>> is way cheaper.
>>>
>>> See the other thread linked from the first message.
>>
>> Maybe I couldn't describe my idea clearly. Partitioning is most
>> commonly used for large tables. I propose storing a sampling reservoir
>> for each partition, replacing it whenever the partition's statistics
>> are updated, and merging the reservoirs to build statistics for the
>> parent table. The reservoir could be spilled into a tuplestore on
>> disk, or stored in the parent table. In the case of complex
>> inheritance we could store sampling reservoirs only for the leaf
>> relations.
>>
>> You can consider this idea speculative, but the merging-statistics
>> approach has an extensibility problem with other types of statistics.
>
> Well, yeah - we might try that too, of course. This is simply exploring
> the "merge statistics" idea from [1], which is why it does not even
> attempt to do what you suggested. We may explore the approach of
> keeping per-partition samples, of course.
>
> You're right that maintaining per-partition samples and merging those
> might solve (or at least reduce) some of the problems, e.g. eliminating
> most of the I/O that'd be needed for sampling. And yeah, it's not
> entirely clear how to merge some of the statistics types (like
> ndistinct). But for a lot of the basic stats it works quite nicely, I
> think.
>
> I'm sure there'll be some complexity due to handling large / toasted
> values, etc. And we probably need to design this for large hierarchies
> (IMHO it should work with 10k partitions, not just 100), in which case
> it may still be quite a bit more expensive than merging the stats.
>
> So maybe we should really support both, and combine them somehow?

I've been thinking about this PoC patch regularly since I submitted it
a year ago, and I still think merging the statistics is an interesting
idea. It may be a huge win in various scenarios, like:

1) Multi-level partitioning hierarchies, where an analyze of each level
has to re-sample all the leaf relations, causing a lot of I/O.

2) Partitioning with foreign/remote partitions, where analyze has to
retrieve significant amounts of data from the remote node over the
network (so a different kind of I/O).

These issues will only get worse as the number of partitions used by
systems in the wild grows - we continuously reduce the per-partition
overhead, so people are likely to leverage that by using more of them.

But I don't have a very good idea what to do about statistics that we
can't really merge. For some types of statistics it's rather tricky to
reasonably merge the results - ndistinct is a simple example, although
we could work around that by building and merging hyperloglog counters.
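To illustrate why ndistinct is the tractable case, here is a minimal
sketch (in C, with made-up names and sizes - not actual PostgreSQL
code) of merging per-partition hyperloglog counters. The key property
is that the union of two counters is just the register-wise maximum, so
a parent-level estimate needs no access to the underlying data:

```c
/*
 * A minimal sketch, not PostgreSQL code: the counter layout, sizes and
 * function names are all hypothetical.  It shows the property that
 * makes ndistinct mergeable via hyperloglog: the union of two counters
 * is simply the register-wise maximum.
 */
#include <math.h>

#define HLL_REGISTERS 1024          /* 2^10 registers, ~3% relative error */

typedef struct HLLCounter
{
    unsigned char regs[HLL_REGISTERS];
} HLLCounter;

/* merge "src" into "dst"; the result represents the union of both sets */
static void
hll_merge(HLLCounter *dst, const HLLCounter *src)
{
    for (int i = 0; i < HLL_REGISTERS; i++)
        if (src->regs[i] > dst->regs[i])
            dst->regs[i] = src->regs[i];
}

/* raw HyperLogLog estimate (ignoring the small/large-range corrections) */
static double
hll_estimate(const HLLCounter *c)
{
    double      m = HLL_REGISTERS;
    double      alpha = 0.7213 / (1.0 + 1.079 / m);
    double      sum = 0.0;

    for (int i = 0; i < HLL_REGISTERS; i++)
        sum += pow(2.0, -(double) c->regs[i]);

    return alpha * m * m / sum;
}
```

So if analyze stored such a counter per partition, the parent's
ndistinct could be obtained by folding the child counters with
hll_merge() and calling hll_estimate() once, without re-sampling
anything.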
What's trickier are extended statistics - I can't quite imagine merging
functional dependencies, mvdistinct, etc. So if there are extended
stats we'd have to do the sampling anyway. (Some of the extended
statistics can also be defined only on the parent relation, in which
case we have nothing to merge. But that seems like a rare case.)

In any case, I don't have a very good plan for how to move this patch
forward, so unless people have some interesting ideas I'll mark it as
returned with feedback. It's been lurking in the CF for ~1 year ...

However, it'd be a mistake to also discard the approach proposed by
Andrey Lepikhov - storing samples for individual relations, and then
just using those while analyzing the parent relations. That is more
expensive, but it does not have the issues with merging some types of
statistics, and so on. (A rough sketch of what merging such stored
samples might look like follows below.)

It also seems interesting for the dynamic sampling PoC patch [1], which
does sampling as part of query planning. In that context the cost of
collecting the sample is clearly a major issue, and storing the sample
somewhere would help a lot. But the question is how/where to store it.
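For example, here is a minimal sketch (again in C, with hypothetical
types and names - not actual PostgreSQL code) of combining stored
per-partition samples into a sample for the parent. Each output row is
drawn from a child chosen with probability proportional to that child's
row count, which approximates a uniform sample of the whole hierarchy,
assuming each stored sample is itself uniform:

```c
/*
 * A minimal sketch, not PostgreSQL code: ChildSample, its fields and
 * merge_samples() are hypothetical.  Combines uniform per-partition
 * samples into an (approximately) uniform sample of the parent.
 */
#include <stdlib.h>

typedef struct ChildSample
{
    double      ntuples;        /* estimated total rows in the partition */
    int         nrows;          /* rows available in the stored sample */
    int         used;           /* rows already consumed */
    void      **rows;           /* the sampled rows (opaque here) */
} ChildSample;

static int
merge_samples(ChildSample *children, int nchildren,
              void **result, int targrows)
{
    double      total = 0.0;
    int         nresult = 0;

    for (int i = 0; i < nchildren; i++)
        total += children[i].ntuples;

    while (nresult < targrows)
    {
        double      r;
        int         i,
                    remaining = 0;

        /* stop once every child sample is exhausted */
        for (i = 0; i < nchildren; i++)
            remaining += children[i].nrows - children[i].used;
        if (remaining == 0)
            break;

        /* pick a child with probability proportional to its size */
        r = ((double) rand() / RAND_MAX) * total;
        for (i = 0; i < nchildren; i++)
        {
            if (r < children[i].ntuples)
                break;
            r -= children[i].ntuples;
        }
        if (i >= nchildren)
            i = nchildren - 1;  /* guard against rounding */

        /* retry if this child's sample happens to be exhausted */
        if (children[i].used >= children[i].nrows)
            continue;

        result[nresult++] = children[i].rows[children[i].used++];
    }

    return nresult;
}
```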
Joins are another issue, because we can't just build two random
samples. But let's discuss that in the other thread [1].

[1] https://commitfest.postgresql.org/36/3211/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company