Re: Choosing values for multivariate MCV lists - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Choosing values for multivariate MCV lists |
Date | |
Msg-id | 20190620223548.p22wj7uejbrhkglf@development |
In response to | Re: Choosing values for multivariate MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses | Re: Choosing values for multivariate MCV lists |
List | pgsql-hackers |
On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>On Tue, 18 Jun 2019 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> The current implementation of multi-column MCV lists (added in this
>> cycle) uses a fairly simple algorithm to pick combinations to include in
>> the MCV list. We just compute a minimum number of occurrences, and then
>> include all entries sampled more often. See get_mincount_for_mcv_list().
>>
>> [snip]
>>
>> This however means that by choosing the MCV entries solely based on the
>> number of occurrences in the sample, we end up with MCV lists where the
>> vast majority of entries has a NULL street name.
>>
>> That's why we got such a poor estimate in the example query, despite the
>> fact that the city/street combination is the most frequent in the data
>> set (with a non-NULL street name).
>>
>
>I think the fact that they're NULL is a bit of a red herring because
>we're treating NULL just like any other value. The same thing would
>happen if there were some other very common non-NULL value that
>dominated the dataset.
>

I wasn't really suggesting that NULL is an issue - sorry if that wasn't
clear. It might be any other value, as long as it's very common (and
roughly uniform) in all cities. So yes, I agree with you here.

>> The other weird thing is that the frequency of NULL street names is
>> fairly uniform in the whole data set. In total about 50% of addresses
>> match that, and for individual cities it's generally between 25% and
>> 100%, so the estimate is less than 2x off in those cases.
>>
>> But for addresses with non-NULL street names, the situation is very
>> different. Some street names are unique to a single city, etc.
>>
>> Overall, this means we end up with an MCV list with entries representing
>> the mostly-uniform part of the data set, instead of preferring the
>> entries that are truly skewed.
>>
>> So I'm wondering if we could/should rethink the algorithm, to look at
>> the frequency and base_frequency, and maybe pick entries based on their
>> ratio, or something like that.
>>
>
>Hmm, interesting. I think I would like to see a more rigorous
>justification for changing the algorithm deciding which values to
>keep.
>

Sure, I'm not going to pretend my proposals were particularly rigorous;
they were more a collection of random ideas.

>If I've understood correctly, I think the problem is this: The
>mincount calculation is a good way of identifying MCV candidates to
>keep, because it ensures that we don't keep values that don't appear
>sufficiently often to produce accurate estimates, and ideally we'd
>keep everything with count >= mincount. However, in the case where
>there are more than stats_target items with count >= mincount, simply
>ordering by count and keeping the most commonly seen values isn't
>necessarily the best strategy in the case of multivariate statistics.
>

Yes.

>To identify what the best strategy might be, I think you need to
>examine the errors that would occur as a result of *not* keeping a
>value in the multivariate MCV list. Given a value that appears with
>count >= mincount, N*freq ought to be a reasonable estimate for the
>actual number of occurrences of that value in the table, and
>N*base_freq ought to be a reasonable estimate for the
>univariate-stats-based estimate that it would be given if it weren't
>kept in the multivariate MCV list. So the absolute error resulting
>from not keeping that value would be
>
>    N * Abs(freq - base_freq)
>

Yeah. Considering N is the same for all groups in the sample, this would
mean the same thing as Abs(freq - base_freq).
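To make the quantities concrete, here is a tiny standalone sketch (all
numbers invented, not taken from the thread). It assumes base_freq is
roughly the product of the per-column frequencies, i.e. the estimate we'd
get under the independence assumption, and prints the absolute row-count
error N * Abs(freq - base_freq) from dropping the item:

```c
#include <stdio.h>
#include <math.h>

int
main(void)
{
	double		N = 1000000.0;		/* rows in the table (invented) */
	double		freq_city = 0.02;	/* per-column frequency of the city value */
	double		freq_street = 0.005;	/* per-column frequency of the street value */
	double		freq = 0.004;		/* actual frequency of the combination */

	/*
	 * base_freq: what we'd estimate without the multivariate MCV item,
	 * i.e. the product of the per-column frequencies (independence).
	 */
	double		base_freq = freq_city * freq_street;	/* = 0.0001 */

	printf("estimate with the MCV item   : %.0f rows\n", N * freq);			/* 4000 */
	printf("estimate without the item    : %.0f rows\n", N * base_freq);		/* 100 */
	printf("absolute error N*|freq-base| : %.0f rows\n",
		   N * fabs(freq - base_freq));										/* 3900 */

	return 0;
}
```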
>But then I think we ought to take into account how often we're likely
>to get that error. If we're simply picking values at random, the
>likelihood of getting that value is just its frequency, so the average
>absolute error would be
>
>    Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )
>
>which suggests that, if we wanted to reduce the average absolute error
>of the estimates, we should order by freq*Abs(freq-base_freq) and keep
>the top n in the MCV list.
>

Interesting idea. But I'm not sure it makes much sense to assume the rows
are picked randomly - OTOH we don't really know anything about how the
data will be queried, so we may just as well do that.

>On the other hand, if we wanted to reduce the average *relative* error
>of the estimates, we might instead order by Abs(freq-base_freq).
>

Hmmm, yeah. I don't know what the right choice is here, TBH.

>> For example, we might sort the entries by
>>
>>    abs(freq - base_freq) / freq
>>
>
>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>though, because that would seem likely to put too much weight on the
>least commonly occurring values.
>

But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

    ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)

and

    ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by
six orders of magnitude. I think the MCV algorithm should reflect this.

>> Of course, this is a challenging problem, for a couple of reasons.
>>
>> Firstly, picking simply the most frequent groups is very simple and it
>> gives us additional information about the largest group (which may be
>> useful elsewhere, e.g. the incremental sort patch).
>>
>
>Yes, you would have to keep in mind that changing the algorithm would
>mean that the MCV list no longer represented all the most common
>values. For example, it would no longer be valid to assume that no
>value appeared more often than the first value in the MCV list. I'm
>not sure that we currently do that though.
>

We don't, but I've proposed using some of this knowledge in the
incremental sort patch. But I think we might actually extend the MCV list
to store some extra information (e.g. the frequency of the largest group,
even if we don't store the group itself).

>> Secondly, if the correlated combinations are less frequent, how reliably
>> can we even estimate the frequency from a sample? The combination in the
>> example query was ~0.02% of the data, so how likely is it to be sampled?
>>
>
>I think that's OK as long as we keep the mincount filter in the new
>algorithm. I think some experimentation is definitely worthwhile here,
>but it looks plausible that a decent approach might be:
>
>1). Discard all values with count < mincount.
>2). Order by freq*Abs(freq-base_freq) (or possibly just
>Abs(freq-base_freq)) and keep the top n, where n is the stats target.
>3). Perhaps re-sort by count, so the final list is in frequency order
>again? Not sure if that's a desirable property to keep.
>

Will try. Thanks for the feedback.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
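For reference, a minimal self-contained sketch of the three-step selection
strategy discussed above - this is not the actual PostgreSQL
implementation, just invented toy data, invented mincount/stats_target
values and plain qsort(), to show how the filtering and ordering would
behave:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef struct Candidate
{
	const char *values;		/* the column-value combination (label only) */
	int			count;		/* occurrences in the sample */
	double		freq;		/* sampled frequency */
	double		base_freq;	/* product of per-column frequencies (independence) */
} Candidate;

/* order by freq * |freq - base_freq|, descending */
static int
cmp_by_error(const void *a, const void *b)
{
	const Candidate *ca = (const Candidate *) a;
	const Candidate *cb = (const Candidate *) b;
	double		ea = ca->freq * fabs(ca->freq - ca->base_freq);
	double		eb = cb->freq * fabs(cb->freq - cb->base_freq);

	if (ea < eb)
		return 1;
	if (ea > eb)
		return -1;
	return 0;
}

/* order by count, descending (step 3, optional) */
static int
cmp_by_count(const void *a, const void *b)
{
	return ((const Candidate *) b)->count - ((const Candidate *) a)->count;
}

int
main(void)
{
	/* toy candidates from a hypothetical 100000-row sample, all numbers invented */
	Candidate	items[] = {
		{"(Prague, NULL)", 5000, 0.05, 0.049},		/* common, but close to independence */
		{"(Brno, NULL)", 3000, 0.03, 0.0295},		/* ditto */
		{"(Prague, Main St)", 1000, 0.01, 0.0001},	/* less common, strongly correlated */
		{"(Brno, Main St)", 20, 0.0002, 0.00002},	/* too rare to keep */
	};
	int			nitems = sizeof(items) / sizeof(items[0]);
	int			mincount = 100;		/* stand-in for get_mincount_for_mcv_list() */
	int			stats_target = 2;	/* keep at most this many items */
	Candidate	kept[4];
	int			nkept = 0;

	/* 1) discard anything sampled fewer than mincount times */
	for (int i = 0; i < nitems; i++)
		if (items[i].count >= mincount)
			kept[nkept++] = items[i];

	/* 2) order by freq * |freq - base_freq| and keep the top stats_target */
	qsort(kept, nkept, sizeof(Candidate), cmp_by_error);
	if (nkept > stats_target)
		nkept = stats_target;

	/* 3) optionally re-sort by count so the final list is in frequency order */
	qsort(kept, nkept, sizeof(Candidate), cmp_by_count);

	/*
	 * Prints (Prague, NULL) and (Prague, Main St): ordering purely by count
	 * would instead have kept the two NULL items and dropped the correlated
	 * (Prague, Main St) combination.
	 */
	for (int i = 0; i < nkept; i++)
		printf("%-20s count=%d freq=%g base_freq=%g\n",
			   kept[i].values, kept[i].count, kept[i].freq, kept[i].base_freq);

	return 0;
}
```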