Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCUYh_V3J0vCxNEcVAUZx9hU7B0d9L62OEG-9L+DGgKOKQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Sun, 13 Jan 2019 at 00:04, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/12/19 8:49 AM, Dean Rasheed wrote:
> > A possible refinement would be to say that if there are more than
> > stats_target items more common than this mincount threshold, rather than
> > excluding the least common ones to get the target number of items,
> > exclude the ones closest to their base frequencies, on the grounds that
> > those are the ones for which the MCV stats will make the least
> > difference. That might complicate the code somewhat though -- I don't
> > have it in front of me, so I can't remember if it even tracks more than
> > stats_target items.
>
> Yes, the patch does limit the number of items to stats_target (a maximum
> of per-attribute stattarget values, to be precise). IIRC that's a piece
> you've added sometime last year ;-)
>
> I've been experimenting with removing items closest to base frequencies
> today, and I came to the conclusion that it's rather tricky for a couple
> of reasons.
>
> 1) How exactly do you measure "closeness" to base frequency? I've tried
> computing the error in different ways, including:
>
>   * Max(freq/base, base/freq)
>   * abs(freq - base)
>
> but this does not seem to affect the behavior very much, TBH.
>
> 2) This necessarily reduces mcv_totalsel, i.e. it increases the part not
> covered by MCV. And estimates on this part are rather crude.
>
> 3) It does nothing for "impossible" items, i.e. combinations that do not
> exist at all. Clearly, those won't be part of the sample, and so can't
> be included in the MCV no matter which error definition we pick. And for
> very rare combinations it might lead to sudden changes, depending on
> whether the group gets sampled or not.
>
> So IMHO it's better to stick to the simple SRE approach for now.
>

OK, that makes sense.

Regards,
Dean


pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: [WIP] CREATE SUBSCRIPTION with FOR TABLES clause (table filter)
Next
From: Dean Rasheed
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists