Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: c2614611-fd99-124f-c9aa-0fef3859c78f@2ndquadrant.com
In response to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On 07/17/2018 11:09 AM, Dean Rasheed wrote:
> On 16 July 2018 at 21:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>
>> So, how would the proposed algorithm work? Let's start with "a=1":
>>
>> sel(a=1) = 0.1508
>>
>> I don't see much point in applying the two "b" clauses independently (or
>> how would it be done, as it's effectively a single clause):
>>
>> sel(b=1 or b=2) = 0.0673
>>
>> And we get
>>
>> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>>
>> From the multivariate MCV we get
>>
>> mcv_sel = 0.0399
>>
>> And finally
>>
>> total_sel = Max(total_sel, mcv_sel) = 0.0399
>>
>> Which is great, but I don't see how that actually helped anything? We
>> still only have the estimate for the ~7% covered by the MCV list, and
>> not the remaining non-MCV part.
>>
>
> Right. If these are the only stats available, and there are just 2
> top-level clauses like this, it either returns the MCV estimate, or
> the old univariate estimate (whichever is larger). It avoids
> over-estimating, but will almost certainly under-estimate when the MCV
> matches are incomplete.
>

Yeah :-( In my experience under-estimates tend to have much worse
consequences (say a nested loop chosen due to an under-estimate vs. a
hash join chosen due to an over-estimate). This certainly influenced
some of the choices I've made in this patch (the extrapolation to the
non-MCV part is an example of that), but I agree it's not a
particularly scientific approach and I'd very much like something
better.

>> I could do the same thing for the second query, but the problem there
>> is actually exactly the same - extrapolation from MCV to non-MCV part
>> roughly doubles the estimate.
>>
>> So unless I'm applying your algorithm incorrectly, this does not seem
>> like a very promising direction :-(
>>
>
> You could be right. Actually it's the order dependence with more than
> 2 top-level clauses that bothers me most about this algorithm. It's
> also not entirely obvious how to include histogram stats in this
> scheme.
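For clarity, the combination rule discussed above boils down to a few
lines of arithmetic. Here is a sketch in Python using the numbers from
the example; the function name is illustrative, not the actual
PostgreSQL implementation:

```python
def combine_selectivity(univariate_sels, mcv_sel):
    """Sketch of the discussed rule: multiply the per-clause
    univariate selectivities (assuming independence), then take the
    larger of that product and the multivariate MCV estimate."""
    total_sel = 1.0
    for s in univariate_sels:
        total_sel *= s
    return max(total_sel, mcv_sel)

sel_a = 0.1508     # sel(a=1)
sel_b = 0.0673     # sel(b=1 or b=2)
mcv_sel = 0.0399   # estimate from the multivariate MCV list

# product = 0.1508 * 0.0673 ~ 0.0101, so the MCV estimate wins
print(combine_selectivity([sel_a, sel_b], mcv_sel))  # 0.0399
```

As the example shows, whenever the MCV estimate exceeds the
independence-based product, the result is just the MCV estimate, which
only covers the MCV part of the data.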
I think for inequalities that's fairly simple - histograms work pretty
well for that, and I have a hunch that replacing the 0.5 estimate for
partially-matching buckets with convert_to_scalar-like logic and the
geometric mean (as you proposed) will work well enough.

For equalities it's going to be hard. The only thing I can think of at
the moment is checking if there are any matching buckets at all, and
using that to decide whether to extrapolate the MCV selectivity to the
non-MCV part or not (or perhaps to what part of the non-MCV part).

> A different approach that I have been thinking about is, in
> mcv_update_match_bitmap(), attempt to work out the probability of all
> the clauses matching and it not being an MCV value. For example, given
> a clause like a=1 whose univariate estimate we know (0.1508 in the
> above example), knowing that the MCV values account for 0.0222+0.0177
> of that, the remainder must be from non-MCV values. So we could say
> that the probability that a=1 and it not being an MCV is
> 0.1508-0.0222-0.0177 = 0.1109. So then the question is could we
> combine these non-MCV probabilities to give an estimate of how many
> non-MCV values we need to worry about? I've not fully thought that
> through, but it might be useful.

Could we use it to derive some upper boundaries on the non-MCV part?

> The problem is, it's still likely to over-estimate when the MCV
> matches fully cover all possibilities, and I still don't see a way to
> reliably detect that case.
>

I guess we can use a histogram to limit the over-estimates, as
explained above. It may not be 100% reliable as it depends on the
sample and how exactly the buckets are formed, but it might help.

But perhaps these over-estimates aren't a particularly serious issue?
When you already get matches in a MCV, the number of matching rows is
going to be pretty significant. If you over-estimate 2x, so what? IMHO
that's still a pretty accurate estimate.
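The per-clause non-MCV probability described above is just the
univariate estimate minus the matching MCV frequencies. A small sketch
using the numbers from the example (the function name and the clamping
at zero are my additions, to guard against sampling noise pushing the
difference negative):

```python
def non_mcv_probability(univariate_sel, matching_mcv_freqs):
    """Probability that a clause matches AND the value is not in the
    MCV list: subtract the frequencies of the matching MCV items from
    the clause's univariate selectivity estimate."""
    return max(0.0, univariate_sel - sum(matching_mcv_freqs))

# Example from the thread: sel(a=1) = 0.1508, and the MCV items
# matching a=1 account for 0.0222 + 0.0177 of that.
p = non_mcv_probability(0.1508, [0.0222, 0.0177])
print(round(p, 4))  # 0.1109
```

The open question in the thread is how to combine these per-clause
remainders across clauses into a usable bound on the non-MCV part.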
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services