Re: multivariate statistics (v19) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: multivariate statistics (v19) |
Date | |
Msg-id | 0924a7a9-a219-1366-d3bc-2ddd98bc9269@2ndquadrant.com Whole thread Raw |
In response to | Re: multivariate statistics v14 (Tatsuo Ishii <ishii@postgresql.org>) |
Responses |
Re: multivariate statistics (v19)
Re: multivariate statistics (v19) Re: multivariate statistics (v19) |
List | pgsql-hackers |
Hi, Attached is v19 of the "multivariate stats" patch series - essentially v18 rebased on top of current master. Aside from a few bug fixes, the main improvement is addition of SGML docs demonstrating the statistics in a way similar to the current "Row Estimation Examples" (and the docs are actually in the same section). I've tried to keep the right amount of technical detail (and pointing to the right README for additional details), but this may need improvements. I have not written docs explaining how statistics may be combined yet (more about this later). There are two general design questions that I'd like to get feedback on: 1) enriching the query tree with multivariate statistics info Right now all the stuff related to multivariate statistics estimation happens in clausesel.c - matching condition to statistics, selection of statistics to use (if there are multiple usable stats), etc. So pretty much all this info is internal to clausesel.c and does not get outside. I'm starting to think that some of the steps (matching quals to stats, selection of stats) should happen in a "preprocess" step before the actual estimation, storing the information (which stats to use, etc.) in a new type of node in the query tree - something like RestrictInfo. I believe this needs to happen sometime after deconstruct_jointree() as that builds RestrictInfos nodes, and looking at planmain.c, right after extract_restriction_or_clauses seems about right. Haven't tried, though. This would move all the "statistics selection" logic from clausesel.c, separating it from the "actual estimation" and simplifying the code. But more importantly, I think we'll need to show some of the data in EXPLAIN output. With per-column statistics it's fairly straightforward to determine which statistics are used and how. But with multivariate stats things are often more complicated - there may be multiple candidate statistics (e.g. histograms covering different subsets of the conditions), it's possible to apply them in different orders, etc. But EXPLAIN can't show the info if it's ephemeral and available only within clausesel.c (and thrown away after the estimation). 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). Assume there's a table "t" with 3 columns (a, b, c), and that we're estimating query: SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 but that we only have two statistics (a,b) and (b,c). The current patch does about this: P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) i.e. it estimates the first two conditions using (a,b), and then estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient, but it only works as long as the query contains conditions "connecting" the two statistics. So if we remove the "b=2" condition from the query, this stops working. But it's possible to do this differently, e.g. by doing this: P(a=1) * P(c=3|a=1) where P(c=3|a=1) is using (b,c), but uses (a,b) to restrict the set of buckets (if the statistics is a histogram) to consider. In pseudo-code, it might look like this: buckets = {} foreach bucket x in (b,c): foreach bucket y in (a,b): if y matches (a=1) and overlap(x,y): buckets := buckets + x which is the part of (b,c) matching (a=1), allowing us to compute the conditional probability. It may get more complicated, of course. In particular, there may be different types of statistics, and we need to be able to "match" them against each other. With just MCV lists and histograms that's probably easy enough, but if we add other types of statistics, it may get way more complicated. I still think this is a useful capability, but perhaps there are better ideas how to do that. In any case, it only affects the last part of the patch (0006). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: