Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: c2614611-fd99-124f-c9aa-0fef3859c78f@2ndquadrant.com
In response to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On 07/17/2018 11:09 AM, Dean Rasheed wrote:
> On 16 July 2018 at 21:55, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>
>> So, how would the proposed algorithm work? Let's start with "a=1":
>>
>> sel(a=1) = 0.1508
>>
>> I don't see much point in applying the two "b" clauses independently (or
>> how would it be done, as it's effectively a single clause):
>>
>> sel(b=1 or b=2) = 0.0673
>>
>> And we get
>>
>> total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101
>>
>> From the multivariate MCV we get
>>
>> mcv_sel = 0.0399
>>
>> And finally
>>
>> total_sel = Max(total_sel, mcv_sel) = 0.0399
>>
>> Which is great, but I don't see how that actually helped anything? We
>> still only have the estimate for the ~7% covered by the MCV list, and
>> not the remaining non-MCV part.
>>
>
> Right. If these are the only stats available, and there are just 2
> top-level clauses like this, it either returns the MCV estimate, or
> the old univariate estimate (whichever is larger). It avoids
> over-estimating, but will almost certainly under-estimate when the MCV
> matches are incomplete.
>

Yeah :-( In my experience under-estimates tend to have much worse
consequences (say a nested loop chosen due to an under-estimate vs. a
hash join chosen due to an over-estimate). This certainly influenced
some of the choices I've made in this patch (the extrapolation to the
non-MCV part is an example of that), but I agree it's not a
particularly scientific approach and I'd very much like something
better.

>> I could do the same thing for the second query, but the problem there
>> is actually exactly the same - extrapolation from MCV to non-MCV part
>> roughly doubles the estimate.
>>
>> So unless I'm applying your algorithm incorrectly, this does not seem
>> like a very promising direction :-(
>>
>
> You could be right. Actually it's the order dependence with more than
> 2 top-level clauses that bothers me most about this algorithm. It's
> also not entirely obvious how to include histogram stats in this
> scheme.
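For clarity, the combination rule discussed above boils down to a few
lines of arithmetic. Here is a sketch in Python using the numbers from
the example; the function name is illustrative, not the actual
PostgreSQL implementation:

```python
def combine_selectivity(univariate_sels, mcv_sel):
    """Sketch of the discussed rule: multiply the per-clause
    univariate selectivities (assuming independence), then take the
    larger of that product and the multivariate MCV estimate."""
    total_sel = 1.0
    for s in univariate_sels:
        total_sel *= s
    return max(total_sel, mcv_sel)

sel_a = 0.1508     # sel(a=1)
sel_b = 0.0673     # sel(b=1 or b=2)
mcv_sel = 0.0399   # estimate from the multivariate MCV list

# product = 0.1508 * 0.0673 ~ 0.0101, so the MCV estimate wins
print(combine_selectivity([sel_a, sel_b], mcv_sel))  # 0.0399
```

As the example shows, whenever the MCV estimate exceeds the
independence-based product, the result is just the MCV estimate, which
only covers the MCV part of the data.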
I think for inequalities that's fairly simple - histograms work pretty
well for that, and I have a hunch that replacing the 0.5 estimate for
partially-matching buckets with convert_to_scalar-like logic and the
geometric mean (as you proposed) will work well enough.

For equalities it's going to be hard. The only thing I can think of at
the moment is checking if there are any matching buckets at all, and
using that to decide whether to extrapolate the MCV selectivity to the
non-MCV part or not (or perhaps to what part of the non-MCV part).

> A different approach that I have been thinking about is, in
> mcv_update_match_bitmap(), attempt to work out the probability of all
> the clauses matching and it not being an MCV value. For example, given
> a clause like a=1 whose univariate estimate we know (0.1508 in the
> above example), knowing that the MCV values account for 0.0222+0.0177
> of that, the remainder must be from non-MCV values. So we could say
> that the probability that a=1 and it not being an MCV is
> 0.1508-0.0222-0.0177 = 0.1109. So then the question is could we
> combine these non-MCV probabilities to give an estimate of how many
> non-MCV values we need to worry about? I've not fully thought that
> through, but it might be useful.

Could we use it to derive some upper boundaries on the non-MCV part?

> The problem is, it's still likely to over-estimate when the MCV
> matches fully cover all possibilities, and I still don't see a way to
> reliably detect that case.
>

I guess we can use a histogram to limit the over-estimates, as
explained above. It may not be 100% reliable as it depends on the
sample and how exactly the buckets are formed, but it might help.

But perhaps these over-estimates aren't a particularly serious issue?
When you already get matches in a MCV, the number of matching rows is
going to be pretty significant. If you over-estimate 2x, so what? IMHO
that's still a pretty accurate estimate.
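The per-clause non-MCV probability described above is just the
univariate estimate minus the matching MCV frequencies. A small sketch
using the numbers from the example (the function name and the clamping
at zero are my additions, to guard against sampling noise pushing the
difference negative):

```python
def non_mcv_probability(univariate_sel, matching_mcv_freqs):
    """Probability that a clause matches AND the value is not in the
    MCV list: subtract the frequencies of the matching MCV items from
    the clause's univariate selectivity estimate."""
    return max(0.0, univariate_sel - sum(matching_mcv_freqs))

# Example from the thread: sel(a=1) = 0.1508, and the MCV items
# matching a=1 account for 0.0222 + 0.0177 of that.
p = non_mcv_probability(0.1508, [0.0222, 0.0177])
print(round(p, 4))  # 0.1109
```

The open question in the thread is how to combine these per-clause
remainders across clauses into a usable bound on the non-MCV part.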
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services