From: Dean Rasheed
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: CAEZATCVEU=ZZ31yqQY=nUg+p7dBLxqJg6iDN-9MoE81uYs-8dw@mail.gmail.com

On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/8/19 3:18 PM, Dean Rasheed wrote:
> > So actually, the estimate for a group of values will be either the MCV
> > item's frequency (if the MCV item is kept), or (roughly) the MCV
> > item's base_frequency (if the MCV item is not kept). That suggests
> > that we should simply keep items that are significantly more or less
> > common than the item's base frequency -- i.e., keep rule (b) and ditch
> > rule (a).
> >
>
> Hmmm, but won't that interfere with how we extrapolate the MCV
> estimate to the non-MCV part? Currently the patch does what you
> proposed, i.e.
>
>     other_sel = simple_sel - mcv_basesel;
>
> I'm worried that if we only include the items that are significantly
> more or less common than the base frequency, it may skew the other_sel
> estimate.
>

I don't see how that would skew other_sel. Items close to the base
frequency would also tend to be close to simple_sel, making other_sel
approximately zero, so excluding them should have little effect.
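
To make that concrete with some purely made-up numbers (and glossing
over clamping and any other clauses in the query): suppose a group has
sample frequency 0.021 and base frequency 0.020, and consider an
equality clause matching only that group, so that the per-column
product gives simple_sel ~= 0.020. Then, roughly:

    item kept:    mcv_sel = 0.021, mcv_basesel = 0.020,
                  other_sel = simple_sel - mcv_basesel ~= 0
                  => estimate ~= 0.021
    item dropped: mcv_sel = mcv_basesel = 0,
                  other_sel = simple_sel ~= 0.020
                  => estimate ~= 0.020

Either way the estimate only moves by the small gap between the item's
frequency and its base frequency.
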
However...

Re-reading the thread where we enhanced the per-column MCV stats last
year [1], it's clear that an algorithm based on just looking at the
relative standard error worked pretty well for a very wide range of
data distributions.

The final algorithm chosen in analyze_mcv_list() was only a marginal
improvement on that, and was directly based upon the fact that, in the
univariate statistics case, all the values not included in the MCV
list are assigned the same selectivity. However, that's not the case
for multivariate stats, because each group not included in the
multivariate MCV list gets assigned a different selectivity based on
its per-column stats.
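
For reference, the contrast is roughly the following (a sketch with
illustrative names, not the actual code, and ignoring nulls):

    /* univariate (roughly what var_eq_const() does): every value
     * outside the per-column MCV list gets the same selectivity */
    selec = (1.0 - sum_mcv_freqs) / (ndistinct - num_mcv);

    /* multivariate: a group outside the multivariate MCV list is
     * estimated from the per-column stats, i.e. roughly its base
     * frequency, so different missing groups get different estimates */
    selec = column_freq_a * column_freq_b;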

So perhaps what we should do for multivariate stats is simply use the
relative standard error approach (i.e., reuse the patch in [2] with a
20% RSE cutoff). That had a lot of testing at the time, against a wide
range of data distributions, and proved to be very good, not to
mention being very simple.
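
In code terms, the test from [2] amounts to something like this (a
sketch, not the exact patch code; count is the number of times the
group appears in a sample of samplerows rows drawn from totalrows
rows):

    /* estimated frequency of the group, and its standard error,
     * treating ANALYZE's sample as sampling without replacement */
    f  = (double) count / samplerows;
    se = sqrt(f * (1.0 - f) / samplerows *
              (totalrows - samplerows) / (totalrows - 1));

    /* keep the group if the relative standard error is below 20%;
     * for samplerows << totalrows that needs roughly count >= 25 */
    keep = (se / f < 0.2);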

That approach would encompass both groups more and less common than
the base frequency, because it relies entirely on the group appearing
enough times in the sample to infer that any errors on the resulting
estimates will be reasonably well controlled. It wouldn't actually
look at the base frequency at all in deciding which items to keep.

Moreover, if the group appears sufficiently often in the sample to
justify being kept, each of the individual column values must also
appear at least that often, which means that the errors on the base
frequency estimate are also well controlled. That was one of my
concerns about other algorithms such as "keep items significantly more
or less common than the base frequency" -- in the less common case,
there's no lower bound on the number of occurrences seen, and so no
guarantee that the errors are kept under control.
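
Again with made-up numbers: if a group (a=1, b=2) is seen 30 times in
the sample, then a=1 and b=2 are each seen at least 30 times
individually, so roughly

    RSE of the group's frequency        ~= 1/sqrt(30) ~= 18%
    RSE of freq(a=1), RSE of freq(b=2)  <= ~18% each
    => error on base_freq = freq(a=1) * freq(b=2) similarly bounded

whereas a group kept purely for being much less common than its base
frequency might have been seen only 2 or 3 times, where its sample
frequency is mostly noise.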

Regards,
Dean

[1]
https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com

