Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCWWLvp0V+VmAA02dx1jEcJUO=GOMh3ybHEsvSYmxT62sg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Fri, 11 Jan 2019, 21:18 Tomas Vondra <tomas.vondra@2ndquadrant.com wrote:

On 1/10/19 4:20 PM, Dean Rasheed wrote:
> ...
>
> So perhaps what we should do for multivariate stats is simply use the
> relative standard error approach (i.e., reuse the patch in [2] with a
> 20% RSE cutoff). That had a lot of testing at the time, against a wide
> range of data distributions, and proved to be very good, not to
> mention being very simple.
>
> That approach would encompass both groups more and less common than
> the base frequency, because it relies entirely on the group appearing
> enough times in the sample to infer that any errors on the resulting
> estimates will be reasonably well controlled. It wouldn't actually
> look at the base frequency at all in deciding which items to keep.
>

I've been looking at this approach today, and I'm a bit puzzled. That
patch essentially uses SRE to compute mincount like this:

    mincount = n*(N-n) / (N-n+0.04*n*(N-1))

and then includes all items more common than this threshold.

Right.

How could
that handle items significantly less common than the base frequency?

Well what I meant was that it will *allow* items significantly less common than the base frequency, because it's not even looking at the base frequency. For example, if the table size were N=100,000 and we sampled n=10,000 rows from that, mincount would work out as 22. So it's easy to construct allowed items more common than that and still significantly less common than their base frequency.

A possible refinement would be to say that if there are more than stats_target items more common than this mincount threshold, rather than excluding the least common ones to get the target number of items, exclude the ones closest to their base frequencies, on the grounds that those are the ones for which the MCV stats will make the least difference. That might complicate the code somewhat though -- I don't have it in front of me, so I can't remember if it even tracks more than stats_target items.

Regards,
Dean

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Early WIP/PoC for inlining CTEs
Next
From: David Rowley
Date:
Subject: Re: speeding up planning with partitions