Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 11223173-bd90-4fa8-8cf1-49928b615405@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers

On 1/8/19 3:18 PM, Dean Rasheed wrote:
> On Mon, 7 Jan 2019 at 00:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> FWIW the main unsolved issue (at least on the MCV part) is how it
>> decides which items to keep in the list.
>>
>> As explained in [1], in the multivariate case we can't simply look at
>> the group frequency and compare it to the average frequency (of the
>> non-MCV items), which is what analyze_mcv_list() does in the
>> single-column case. In the multivariate case we also case about observed
>> vs. base frequency, i.e. we want the MCV list to include groups that are
>> present singificantly more/less than product of per-column stats.
>>
>> I've repeatedly tried to come up with a criteria that would address
>> that, but it seems rather difficult because we can't abandon the other
>> criteria either. So the MCV list should include groups that match both
>>
>> (a) items that are statistically more common than the non-MCV part (i.e.
>> the rule from per-column analyze_mcv_list)
>>
>> (b) items that are statistically more/less common than estimated from
>> per-column stats (i.e. the new rule)
> 
> Thinking about this some more, I think that it probably isn't
> appropriate to use analyze_mcv_list() directly because that's making
> specific assumptions about how items not in the list will be estimated
> that aren't actually true for groups of values in multivariate stats.
> If a group of values isn't in the MCV list, it gets estimated based on
> the product of the selectivities from the per-column stats (modulo the
> additional logic preventing the selectivity not exceeding the total
> non-MCV selectivity).
> 
> So actually, the estimate for a group of values will be either the MCV
> item's frequency (if the MCV item is kept), or (roughly) the MCV
> item's base_frequency (if the MCV item is not kept). That suggests
> that we should simply keep items that are significantly more or less
> common than the item's base frequency -- i.e., keep rule (b) and ditch
> rule (a).
> 

Hmmm, but won't that interfere with how we with how we extrapolate the
MCV estimate to the non-MCV part? Currently the patch does what you
proposed, i.e.

    other_sel = simple_sel - mcv_basesel;

I'm worried that if we only include the items that are significantly
more or less common than the base frequency, it may skew the other_sel
estimate.

>> Enforcing rule (a) seems reasonable because it ensures the MCV list
>> includes all items more frequent than the last one. Without it, it's
>> difficult to decide know whether the absent item is very common (but
>> close to base frequency) or very uncommon (so less frequent than the
>> last MCV item).
> 
> I'm not sure there's much we can do about that. Keeping the item will
> result in keeping a frequency that we know is close to the base
> frequency, and not keeping the item will result in per-column stats
> being used that we expect to also give an estimate close to the base
> frequency. So either way, the result is about the same, and it's
> probably better to discard it, leaving more room for other items about
> which we may have more information.
> 
> That said, there is a separate benefit to keeping items in the list
> even if their frequency is close to the base frequency -- the more
> items kept, the larger their total selectivity will be, giving a
> better cap on the non-MCV selectivities. So if, after keeping all
> items satisfying rule (b), there are free slots available, perhaps
> they should be used for the most common remaining values satisfying
> rule (a).
> 

Hmm, so essentially we'd use (b) first to bootstrap the MCV list, and
then we could do what analyze_mcv_list() does. That could work, I guess.

The question is how to define "significantly different from base freq"
though. Any ideas?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: [HACKERS] pgbench - allow to store select results intovariables
Next
From: Surafel Temesgen
Date:
Subject: Re: FETCH FIRST clause PERCENT option