Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: e7d989e1-4a06-d2ab-8950-eb711a14df9c@2ndquadrant.com
In response to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On 1/10/19 4:20 PM, Dean Rasheed wrote:
> On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 1/8/19 3:18 PM, Dean Rasheed wrote:
>>> So actually, the estimate for a group of values will be either the MCV
>>> item's frequency (if the MCV item is kept), or (roughly) the MCV
>>> item's base_frequency (if the MCV item is not kept). That suggests
>>> that we should simply keep items that are significantly more or less
>>> common than the item's base frequency -- i.e., keep rule (b) and ditch
>>> rule (a).
>>>
>>
>> Hmmm, but won't that interfere with how we extrapolate the MCV
>> estimate to the non-MCV part? Currently the patch does what you
>> proposed, i.e.
>>
>>     other_sel = simple_sel - mcv_basesel;
>>
>> I'm worried that if we only include the items that are significantly
>> more or less common than the base frequency, it may skew the other_sel
>> estimate.
>>
> 
> I don't see how that would skew other_sel. Items close to the base
> frequency would also tend to be close to simple_sel, making other_sel
> approximately zero, so excluding them should have little effect.

Oh, I see. You're right, those items should contribute very little to
other_sel; I should have realized that.
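
For anyone following the thread, here's a tiny numeric illustration of
that cancellation (the numbers are made up; this is just the
arithmetic, not code from the patch):

#include <stdio.h>

int
main(void)
{
    /* illustrative numbers only, not from the patch or real stats */
    double simple_sel = 0.0500;   /* per-column independence estimate */
    double item_freq  = 0.0102;   /* observed frequency of one MCV item */
    double item_base  = 0.0100;   /* its base frequency */

    /* total estimate with the item kept in the MCV list */
    double est_kept = item_freq + (simple_sel - item_base);

    /* total estimate with the item dropped (it falls into the
     * non-MCV part, which is estimated as simple_sel alone) */
    double est_dropped = simple_sel;

    /*
     * With item_freq close to item_base the two estimates nearly
     * coincide, so dropping such items barely moves other_sel.
     */
    printf("kept=%.4f dropped=%.4f diff=%.4f\n",
           est_kept, est_dropped, est_kept - est_dropped);
    return 0;
}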

> However...
> 
> Re-reading the thread where we enhanced the per-column MCV stats last
> year [1], it was actually the case that an algorithm based on just
> looking at the relative standard error worked pretty well for a very
> wide range of data distributions.
> 
> The final algorithm chosen in analyze_mcv_list() was only a marginal
> improvement on that, and was directly based upon the fact that, in the
> univariate statistics case, all the values not included in the MCV
> list are assigned the same selectivity. However, that's not the case
> for multivariate stats, because each group not included in the
> multivariate MCV list gets assigned a different selectivity based on
> its per-column stats.
> 
> So perhaps what we should do for multivariate stats is simply use the
> relative standard error approach (i.e., reuse the patch in [2] with a
> 20% RSE cutoff). That had a lot of testing at the time, against a wide
> range of data distributions, and proved to be very good, not to
> mention being very simple.
> 
> That approach would encompass both groups more and less common than
> the base frequency, because it relies entirely on the group appearing
> enough times in the sample to infer that any errors on the resulting
> estimates will be reasonably well controlled. It wouldn't actually
> look at the base frequency at all in deciding which items to keep.
> 
> Moreover, if the group appears sufficiently often in the sample to
> justify being kept, each of the individual column values must also
> appear at least that often as well, which means that the errors on the
> base frequency estimate are also well controlled. That was one of my
> concerns about other algorithms such as "keep items significantly more
> or less common than the base frequency" -- in the less common case,
> there's no lower bound on the number of occurrences seen, and so no
> guarantee that the errors are kept under control.
> 

Yep, that looks like a great approach. Simple and tested. I'll try
tweaking the patch accordingly over the weekend.
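
For the archives, a minimal sketch of how I read the RSE cutoff from
[2] (the 20% threshold; function and variable names here are mine,
not the patch's):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

#define RSE_CUTOFF 0.2

/*
 * Keep an MCV item when the relative standard error of its sample
 * frequency is below RSE_CUTOFF.  With p = count / samplerows the
 * frequency observed in a sample drawn from a table of totalrows
 * rows, the standard error with finite population correction is
 *
 *     se(p) = sqrt(p * (1 - p) / samplerows
 *                  * (totalrows - samplerows) / (totalrows - 1))
 *
 * and the item is kept iff se(p) / p < RSE_CUTOFF.
 */
static bool
keep_mcv_item(double count, double samplerows, double totalrows)
{
    double p = count / samplerows;
    double se;

    if (count <= 0)
        return false;

    se = sqrt(p * (1.0 - p) / samplerows *
              (totalrows - samplerows) / (totalrows - 1.0));

    return se < RSE_CUTOFF * p;
}

int
main(void)
{
    /* e.g. a group seen 30 times in a 30000-row sample of a 1M-row table */
    printf("keep: %d\n", keep_mcv_item(30, 30000, 1000000));
    return 0;
}

Note this never looks at the base frequency, which is exactly the
point in the error-control argument above: keeping an item requires
enough sample occurrences, and each per-column value then occurs at
least as often.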

Thanks!

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

