Re: Choosing values for multivariate MCV lists - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Choosing values for multivariate MCV lists
Date:
Msg-id: 20190620223548.p22wj7uejbrhkglf@development
In response to: Re: Choosing values for multivariate MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: Choosing values for multivariate MCV lists
List: pgsql-hackers
On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> The current implementation of multi-column MCV lists (added in this
>> cycle) uses a fairly simple algorithm to pick combinations to include in
>> the MCV list. We just compute a minimum number of occurrences, and then
>> include all entries sampled more often. See get_mincount_for_mcv_list().
>>
>> [snip]
>>
>> This however means that by choosing the MCV entries solely based on the
>> number of occurrences in the sample, we end up with MCV lists where the
>> vast majority of entries have a NULL street name.
>>
>> That's why we got such a poor estimate in the example query, despite the
>> fact that the city/street combination is the most frequent in the data
>> set (with non-NULL street name).
>>
>
>I think the fact that they're NULL is a bit of a red herring because
>we're treating NULL just like any other value. The same thing would
>happen if there were some other very common non-NULL value that
>dominated the dataset.
>

I wasn't really suggesting that NULL itself is the issue - sorry if that
wasn't clear. It might be any other value, as long as it's very common
(and roughly uniform) in all cities. So yes, I agree with you here.

>> The other weird thing is that frequency of NULL street names is fairly
>> uniform in the whole data set. In total about 50% of addresses match that,
>> and for individual cities it's generally between 25% and 100%, so the
>> estimate is less than 2x off in those cases.
>>
>> But for addresses with non-NULL street names, the situation is very
>> different. Some street names are unique to a single city, etc.
>>
>> Overall, this means we end up with an MCV list whose entries represent
>> the mostly-uniform part of the data set, instead of preferring the
>> entries that are truly skewed.
>>
>> So I'm wondering if we could/should rethink the algorithm to look at
>> the frequency and base_frequency, and maybe pick entries based on their
>> ratio, or something like that.
>>
>
>Hmm, interesting. I think I would like to see a more rigorous
>justification for changing the algorithm deciding which values to
>keep.
>

Sure, I'm not going to pretend my proposals were particularly rigorous;
it was more a collection of random ideas.

>If I've understood correctly, I think the problem is this: The
>mincount calculation is a good way of identifying MCV candidates to
>keep, because it ensures that we don't keep values that don't appear
>sufficiently often to produce accurate estimates, and ideally we'd
>keep everything with count >= mincount. However, in the case where
>there are more than stats_target items with count >= mincount, simply
>ordering by count and keeping the most commonly seen values isn't
>necessarily the best strategy in the case of multivariate statistics.
>

Yes.

>To identify what the best strategy might be, I think you need to
>examine the errors that would occur as a result of *not* keeping a
>value in the multivariate MCV list. Given a value that appears with
>count >= mincount, N*freq ought to be a reasonable estimate for the
>actual number of occurrences of that value in the table, and
>N*base_freq ought to be a reasonable estimate for the
>univariate-stats-based estimate that it would be given if it weren't
>kept in the multivariate MCV list. So the absolute error resulting
>from not keeping that value would be
>
>  N * Abs(freq - base_freq)
>

Yeah. Considering N is the same for all groups in the sample, this
would mean the same thing as Abs(freq - base_freq).
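
To make those quantities concrete, here's a toy Python sketch (the
city/street data is made up, and this is of course not the mcv.c code)
computing freq, base_freq and |freq - base_freq| for each group in a
sample:

    from collections import Counter

    # Made-up sample of (city, street) pairs; ANALYZE would take these
    # from the sampled rows instead.
    sample = ([("Praha", None)] * 50 + [("Brno", None)] * 30 +
              [("Praha", "Hlavni")] * 15 + [("Brno", "Nadrazni")] * 5)

    n_sample = len(sample)
    group_counts = Counter(sample)
    city_counts = Counter(city for city, street in sample)
    street_counts = Counter(street for city, street in sample)

    for (city, street), count in group_counts.items():
        freq = count / n_sample
        # base_freq = product of per-column frequencies, i.e. what the
        # independence assumption would predict for this group
        base_freq = (city_counts[city] / n_sample) * \
                    (street_counts[street] / n_sample)
        # |freq - base_freq| is proportional to N * |freq - base_freq|,
        # because the table size N is the same for every group
        print(f"{city}/{street}: freq={freq:.3f} "
              f"base_freq={base_freq:.3f} "
              f"error={abs(freq - base_freq):.3f}")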

>But then I think we ought to take into account how often we're likely
>to get that error. If we're simply picking values at random, the
>likelihood of getting that value is just its frequency, so the average
>absolute error would be
>
>  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
>
>which suggests that, if we wanted to reduce the average absolute error
>of the estimates, we should order by freq*Abs(freq-base_freq) and keep
>the top n in the MCV list.
>

Interesting idea. But I'm not sure it makes much sense to assume the rows
are picked randomly - OTOH we don't really know anything about how the
data will be queried, so we may just as well do that.

>On the other hand, if we wanted to reduce the average *relative* error
>of the estimates, we might instead order by Abs(freq-base_freq).
>

Hmmm, yeah. I don't know what the right choice is here, TBH.
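
Just to make the difference between the two orderings concrete, here's a
trivial sketch with made-up (freq, base_freq) values for items that
already passed the mincount filter:

    # Hypothetical (group, freq, base_freq) triples, all above mincount.
    items = [
        ("Praha/NULL",    0.50, 0.40),
        ("Brno/NULL",     0.30, 0.22),
        ("Praha/Hlavni",  0.15, 0.03),
        ("Brno/Nadrazni", 0.02, 0.001),
    ]

    # Ordering aimed at the average absolute error: weight each item's
    # error by how often a randomly picked value hits it.
    weighted = sorted(items, key=lambda i: i[1] * abs(i[1] - i[2]),
                      reverse=True)

    # Ordering aimed at the average relative error: no weighting.
    plain = sorted(items, key=lambda i: abs(i[1] - i[2]), reverse=True)

    print([g for g, f, b in weighted])  # Praha/NULL first
    print([g for g, f, b in plain])     # Praha/Hlavni first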

>> For example, we might sort the entries by
>>
>>     abs(freq - base_freq) / freq
>>
>
>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.
>

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

   ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)

and

   ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by six
orders of magnitude. I think the MCV algorithm should reflect this.
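
Spelling out the arithmetic behind that (the row counts are the ones from
the two plans above):

    # Two misestimates with the same absolute error but a wildly
    # different ratio between estimate and reality.
    cases = [(1, 1_000_001), (1_000_000, 2_000_000)]  # (estimated, actual)

    for est, actual in cases:
        abs_err = abs(actual - est)
        ratio = actual / est
        print(f"est={est:>9,} actual={actual:>9,} "
              f"abs_err={abs_err:,} off by {ratio:.0f}x")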


>> Of course, this is a challenging problem, for a couple of reasons.
>>
>> Firstly, picking simply the most frequent groups is very simple and it
>> gives us additional information about the largest group (which may be
>> useful elsewhere, e.g. the incremental sort patch).
>>
>
>Yes, you would have to keep in mind that changing the algorithm would
>mean that the MCV list no longer represented all the most common
>values. For example, it would no longer be valid to assume that no
>value appeared more often than the first value in the MCV list. I'm
>not sure that we currently do that though.
>

We don't, but I've proposed using some of this knowledge in the
incremental sort patch. But I think we might actually extend the MCV list
to store some extra information (e.g. the frequency of the largest group,
even if we don't store that group itself).

>> Secondly, if the correlated combinations are less frequent, how reliably
>> can we even estimate the frequency from a sample? The combination in the
>> example query was ~0.02% of the data, so how likely is it to be sampled?
>>
>
>I think that's OK as long as we keep the mincount filter in the new
>algorithm. I think some experimentation is definitely worthwhile here,
>but it looks plausible that a decent approach might be:
>
>1). Discard all values with count < mincount.
>2). Order by freq*Abs(freq-base_freq) (or possibly just
>Abs(freq-base_freq)) and keep the top n, where n is the stats target.
>3). Perhaps re-sort by count, so the final list is in frequency order
>again? Not sure if that's a desirable property to keep.
>

Will try. Thanks for the feedback.
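
Roughly, the experiment I have in mind is something like this toy sketch
(Python, just to play with the ranking on extracted sample data; the item
format and parameters are simplified stand-ins, not what mcv.c actually
uses):

    def choose_mcv_items(items, mincount, stats_target, sample_size):
        """Toy version of the proposed selection.

        items is a list of (group, count, base_freq) tuples computed
        from the sample.
        """
        # 1) Discard groups sampled fewer than mincount times -- their
        #    sample frequencies are not considered reliable.
        candidates = [it for it in items if it[1] >= mincount]

        # 2) Rank by freq * |freq - base_freq| (or |freq - base_freq|)
        #    and keep the top stats_target entries.
        def score(item):
            group, count, base_freq = item
            freq = count / sample_size
            return freq * abs(freq - base_freq)

        kept = sorted(candidates, key=score, reverse=True)[:stats_target]

        # 3) Re-sort by count, so the list is in frequency order again.
        return sorted(kept, key=lambda it: it[1], reverse=True)

Whether step 3 is actually needed depends on whether anything relies on
the ordering of the serialized list - that's one of the things to check.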


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



