Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Dean Rasheed
---|---
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date |
Msg-id | CAEZATCVEU=ZZ31yqQY=nUg+p7dBLxqJg6iDN-9MoE81uYs-8dw@mail.gmail.com
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>); Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List | pgsql-hackers
On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/8/19 3:18 PM, Dean Rasheed wrote:
> > So actually, the estimate for a group of values will be either the MCV
> > item's frequency (if the MCV item is kept), or (roughly) the MCV
> > item's base_frequency (if the MCV item is not kept). That suggests
> > that we should simply keep items that are significantly more or less
> > common than the item's base frequency -- i.e., keep rule (b) and ditch
> > rule (a).
> >
>
> Hmmm, but won't that interfere with how we extrapolate the MCV estimate
> to the non-MCV part? Currently the patch does what you proposed, i.e.
>
>     other_sel = simple_sel - mcv_basesel;
>
> I'm worried that if we only include the items that are significantly
> more or less common than the base frequency, it may skew the other_sel
> estimate.
>

I don't see how that would skew other_sel. Items whose frequency is
close to their base frequency contribute approximately zero to
other_sel, because their terms in simple_sel and mcv_basesel nearly
cancel, so excluding them should have little effect.

However... Re-reading the thread where we enhanced the per-column MCV
stats last year [1], an algorithm based on just looking at the relative
standard error actually worked pretty well for a very wide range of
data distributions. The final algorithm chosen in analyze_mcv_list()
was only a marginal improvement on that, and was directly based on the
fact that, in the univariate statistics case, all the values not
included in the MCV list are assigned the same selectivity. That's not
the case for multivariate stats, because each group not included in the
multivariate MCV list gets assigned a different selectivity based on
its per-column stats.

So perhaps what we should do for multivariate stats is simply use the
relative standard error approach (i.e., reuse the patch in [2] with a
20% RSE cutoff). That had a lot of testing at the time, against a wide
range of data distributions, and proved to be very good, not to mention
very simple.

That approach would cover groups both more and less common than the
base frequency, because it relies entirely on the group appearing
enough times in the sample to infer that any errors on the resulting
estimates are reasonably well controlled. It wouldn't look at the base
frequency at all in deciding which items to keep.

Moreover, if a group appears often enough in the sample to justify
being kept, each of its individual column values must appear at least
that often too, which means that the errors on the base frequency
estimate are also well controlled. That was one of my concerns about
other algorithms such as "keep items significantly more or less common
than the base frequency" -- in the less-common case, there is no lower
bound on the number of occurrences seen, and so no guarantee that the
errors are kept under control.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com
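To make the cancellation argument concrete, here is a toy, self-contained
sketch. Everything in it (the MCVItem struct, the numbers, and the 0.01
"closeness" threshold) is invented for illustration; it is not code from
the patch, and it ignores clamping and the matching of items against a
specific clause.

    /* toy_other_sel.c -- illustrates why dropping items whose frequency
     * is close to their base frequency barely changes the overall
     * estimate.  All names and numbers are invented for this example. */
    #include <stdio.h>
    #include <math.h>

    typedef struct
    {
        double      freq;       /* observed frequency of the group */
        double      base_freq;  /* product of per-column frequencies */
    } MCVItem;

    int
    main(void)
    {
        /* hypothetical MCV items matching some clause */
        MCVItem     items[] = {
            {0.30, 0.10},       /* far more common than independence predicts */
            {0.05, 0.049},      /* freq ~= base_freq: candidate for dropping */
            {0.01, 0.20}        /* far less common than independence predicts */
        };
        int         nitems = sizeof(items) / sizeof(items[0]);
        double      simple_sel = 0.45;  /* per-column estimate for the clause */
        int         i,
                    drop;

        /* drop = 0: keep all items; drop = 1: drop items close to base_freq */
        for (drop = 0; drop <= 1; drop++)
        {
            double      mcv_sel = 0.0;
            double      mcv_basesel = 0.0;
            double      total;

            for (i = 0; i < nitems; i++)
            {
                if (drop && fabs(items[i].freq - items[i].base_freq) < 0.01)
                    continue;   /* item excluded from the MCV list */
                mcv_sel += items[i].freq;
                mcv_basesel += items[i].base_freq;
            }
            /* overall estimate: MCV part plus extrapolated non-MCV part */
            total = mcv_sel + (simple_sel - mcv_basesel);
            printf("%s: total estimate = %.3f\n",
                   drop ? "near-base items dropped" : "all items kept", total);
        }
        return 0;
    }

The two totals differ only by freq - base_freq of the dropped item,
which is small by construction: the dropped item's rows are simply
re-counted through other_sel at almost exactly the same selectivity.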
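And a minimal standalone sketch of the 20% RSE cutoff itself, assuming
the usual binomial sampling model with a finite-population correction;
the function name and the example sample/table sizes are illustrative
assumptions, not taken from the patch in [2].

    /* rse_cutoff.c -- sketch of a 20% relative-standard-error cutoff.
     * Illustrative only; not code from the patch. */
    #include <stdio.h>

    /*
     * Minimum number of occurrences a group must have in a sample of n
     * rows, drawn from a table of N rows, for its sample frequency to
     * have a relative standard error of at most 20%.
     *
     * For a group with true proportion p, the sample proportion has
     *     SE = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
     * (binomial variance with a finite-population correction).  Setting
     * SE / p <= 0.2, solving for p, and multiplying by n gives the
     * expression below.
     */
    static double
    min_count_for_rse_cutoff(double n, double N)
    {
        /* no sampling error if the whole table was read */
        if (n >= N)
            return 0.0;

        return n * (N - n) / ((N - n) + 0.04 * n * (N - 1));
    }

    int
    main(void)
    {
        double      n = 30000.0;    /* e.g. sample at statistics target 100 */
        double      N = 10000000.0; /* hypothetical 10M-row table */

        /* prints roughly 24.7 -- about 25 occurrences */
        printf("min count to keep a group: %.1f\n",
               min_count_for_rse_cutoff(n, N));
        return 0;
    }

For rare groups in a large table the cutoff tends towards 1 / 0.2^2 = 25
occurrences, which is what makes the rule work without looking at the
shape of the data distribution at all.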