Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Dean Rasheed
---|---
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date |
Msg-id | CAEZATCVEU=ZZ31yqQY=nUg+p7dBLxqJg6iDN-9MoE81uYs-8dw@mail.gmail.com
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>); Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List | pgsql-hackers
On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/8/19 3:18 PM, Dean Rasheed wrote:
> > So actually, the estimate for a group of values will be either the MCV
> > item's frequency (if the MCV item is kept), or (roughly) the MCV
> > item's base_frequency (if the MCV item is not kept). That suggests
> > that we should simply keep items that are significantly more or less
> > common than the item's base frequency -- i.e., keep rule (b) and ditch
> > rule (a).
> >
>
> Hmmm, but won't that interfere with how we extrapolate the MCV estimate
> to the non-MCV part? Currently the patch does what you proposed, i.e.
>
>     other_sel = simple_sel - mcv_basesel;
>
> I'm worried that if we only include the items that are significantly
> more or less common than the base frequency, it may skew the other_sel
> estimate.
>

I don't see how that would skew other_sel. Items whose frequency is
close to their base frequency contribute approximately zero to
other_sel, because their terms in simple_sel and mcv_basesel nearly
cancel, so excluding them should have little effect.

However... Re-reading the thread where we enhanced the per-column MCV
stats last year [1], an algorithm based on just looking at the relative
standard error actually worked pretty well for a very wide range of
data distributions. The final algorithm chosen in analyze_mcv_list()
was only a marginal improvement on that, and was directly based on the
fact that, in the univariate statistics case, all the values not
included in the MCV list are assigned the same selectivity. That's not
the case for multivariate stats, because each group not included in the
multivariate MCV list gets assigned a different selectivity based on
its per-column stats.

So perhaps what we should do for multivariate stats is simply use the
relative standard error approach (i.e., reuse the patch in [2] with a
20% RSE cutoff). That had a lot of testing at the time, against a wide
range of data distributions, and proved to be very good, not to mention
very simple.

That approach would cover groups both more and less common than the
base frequency, because it relies entirely on the group appearing
enough times in the sample to infer that any errors on the resulting
estimates are reasonably well controlled. It wouldn't look at the base
frequency at all in deciding which items to keep.

Moreover, if a group appears often enough in the sample to justify
being kept, each of its individual column values must appear at least
that often too, which means that the errors on the base frequency
estimate are also well controlled. That was one of my concerns about
other algorithms such as "keep items significantly more or less common
than the base frequency" -- in the less-common case, there is no lower
bound on the number of occurrences seen, and so no guarantee that the
errors are kept under control.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com
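To make the cancellation argument concrete, here is a toy, self-contained
sketch. Everything in it (the MCVItem struct, the numbers, and the 0.01
"closeness" threshold) is invented for illustration; it is not code from
the patch, and it ignores clamping and the matching of items against a
specific clause.

    /* toy_other_sel.c -- illustrates why dropping items whose frequency
     * is close to their base frequency barely changes the overall
     * estimate.  All names and numbers are invented for this example. */
    #include <stdio.h>
    #include <math.h>

    typedef struct
    {
        double      freq;       /* observed frequency of the group */
        double      base_freq;  /* product of per-column frequencies */
    } MCVItem;

    int
    main(void)
    {
        /* hypothetical MCV items matching some clause */
        MCVItem     items[] = {
            {0.30, 0.10},       /* far more common than independence predicts */
            {0.05, 0.049},      /* freq ~= base_freq: candidate for dropping */
            {0.01, 0.20}        /* far less common than independence predicts */
        };
        int         nitems = sizeof(items) / sizeof(items[0]);
        double      simple_sel = 0.45;  /* per-column estimate for the clause */
        int         i,
                    drop;

        /* drop = 0: keep all items; drop = 1: drop items close to base_freq */
        for (drop = 0; drop <= 1; drop++)
        {
            double      mcv_sel = 0.0;
            double      mcv_basesel = 0.0;
            double      total;

            for (i = 0; i < nitems; i++)
            {
                if (drop && fabs(items[i].freq - items[i].base_freq) < 0.01)
                    continue;   /* item excluded from the MCV list */
                mcv_sel += items[i].freq;
                mcv_basesel += items[i].base_freq;
            }
            /* overall estimate: MCV part plus extrapolated non-MCV part */
            total = mcv_sel + (simple_sel - mcv_basesel);
            printf("%s: total estimate = %.3f\n",
                   drop ? "near-base items dropped" : "all items kept", total);
        }
        return 0;
    }

The two totals differ only by freq - base_freq of the dropped item,
which is small by construction: the dropped item's rows are simply
re-counted through other_sel at almost exactly the same selectivity.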
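And a minimal standalone sketch of the 20% RSE cutoff itself, assuming
the usual binomial sampling model with a finite-population correction;
the function name and the example sample/table sizes are illustrative
assumptions, not taken from the patch in [2].

    /* rse_cutoff.c -- sketch of a 20% relative-standard-error cutoff.
     * Illustrative only; not code from the patch. */
    #include <stdio.h>

    /*
     * Minimum number of occurrences a group must have in a sample of n
     * rows, drawn from a table of N rows, for its sample frequency to
     * have a relative standard error of at most 20%.
     *
     * For a group with true proportion p, the sample proportion has
     *     SE = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
     * (binomial variance with a finite-population correction).  Setting
     * SE / p <= 0.2, solving for p, and multiplying by n gives the
     * expression below.
     */
    static double
    min_count_for_rse_cutoff(double n, double N)
    {
        /* no sampling error if the whole table was read */
        if (n >= N)
            return 0.0;

        return n * (N - n) / ((N - n) + 0.04 * n * (N - 1));
    }

    int
    main(void)
    {
        double      n = 30000.0;    /* e.g. sample at statistics target 100 */
        double      N = 10000000.0; /* hypothetical 10M-row table */

        /* prints roughly 24.7 -- about 25 occurrences */
        printf("min count to keep a group: %.1f\n",
               min_count_for_rse_cutoff(n, N));
        return 0;
    }

For rare groups in a large table the cutoff tends towards 1 / 0.2^2 = 25
occurrences, which is what makes the rule work without looking at the
shape of the data distribution at all.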