Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id a0aa7043-26cd-5473-a9d4-ef3bd035f132@2ndquadrant.com
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 1/10/19 4:20 PM, Dean Rasheed wrote:
> ...
>
> So perhaps what we should do for multivariate stats is simply use the
> relative standard error approach (i.e., reuse the patch in [2] with a
> 20% RSE cutoff). That had a lot of testing at the time, against a wide
> range of data distributions, and proved to be very good, not to
> mention being very simple.
> 
> That approach would encompass both groups more and less common than
> the base frequency, because it relies entirely on the group appearing
> enough times in the sample to infer that any errors on the resulting
> estimates will be reasonably well controlled. It wouldn't actually
> look at the base frequency at all in deciding which items to keep.
> 

I've been looking at this approach today, and I'm a bit puzzled. That
patch essentially uses the RSE to compute mincount like this:

    mincount = n*(N-n) / (N-n+0.04*n*(N-1))

and then includes all items more common than this threshold. How could
that handle items significantly less common than the base frequency?
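
To make the threshold concrete, here is a small Python sketch of the
formula (illustrative only, not code from the patch). It assumes n is the
number of sampled rows and N is the total number of rows in the table; the
message does not define them, so treat that reading as an assumption. The
0.04 is the square of the 20% RSE cutoff. The function name and the guard
for n >= N are hypothetical additions.

```python
def mincount(n: int, N: int, rse_cutoff: float = 0.2) -> float:
    """Minimum number of sample occurrences needed to keep an item.

    Assumed meanings (not stated in the message):
      n = rows in the sample, N = rows in the table.
    """
    if n >= N:
        # Hypothetical guard: the whole table was sampled, so there is no
        # sampling error and the formula's numerator would be zero anyway.
        return 0.0
    k = rse_cutoff ** 2  # 0.2^2 = 0.04, the constant in the formula
    return n * (N - n) / ((N - n) + k * n * (N - 1))


if __name__ == "__main__":
    # Hypothetical sizes: a 30,000-row sample from a 1,000,000-row table.
    n, N = 30_000, 1_000_000
    m = mincount(n, N)
    print(f"mincount = {m:.2f} rows, i.e. sample frequency {m / n:.6f}")
```

With those hypothetical sizes the threshold comes out a bit over 24 sample
rows, and it approaches 1/0.04 = 25 as n and N grow. Only n and N enter the
computation, so it is a floor on how often a group appears in the sample.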

Or did you mean to use the RSE, but in some different way?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

