Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: WIP: multivariate statistics / proof of concept
Msg-id: 5489CC1F.70401@vmware.com
In response to: WIP: multivariate statistics / proof of concept (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: WIP: multivariate statistics / proof of concept (Tomas Vondra <tv@fuzzy.cz>)
           Re: WIP: multivariate statistics / proof of concept (Gavin Flower <GavinFlower@archidevsys.co.nz>)
List: pgsql-hackers
On 10/13/2014 01:00 AM, Tomas Vondra wrote:
> Hi,
>
> attached is a WIP patch implementing multivariate statistics.

Great! Really glad to see you working on this.

> + * FIXME This sample sizing is mostly OK when computing stats for
> + *       individual columns, but when computing multivariate stats
> + *       (histograms, mcv, ...) it's rather insufficient. For a small
> + *       number of dimensions it works, but for complex stats it'd be
> + *       nice to use a sample proportional to the table (say, 0.5% - 1%)
> + *       instead of a fixed size.

I don't think a fraction of the table is appropriate. As long as the
sample is random, its accuracy doesn't depend much on the size of the
population. For example, whether you sample 1,000 rows from a table with
100,000 rows, or 1,000 rows from a table with 100,000,000 rows, the
accuracy is pretty much the same. That doesn't change when you go from a
single variable to multiple variables.

You do need a bigger sample with multiple variables, however. My gut
feeling is that if you sample N rows for a single variable, with two
variables you need to sample N^2 rows to get the same accuracy. But it's
not proportional to the table size. (I have no proof of that, but I'm
sure there is literature on the subject.)

> + * Multivariate histograms
> + *
> + * Histograms are a collection of buckets, represented by n-dimensional
> + * rectangles. Each rectangle is delimited by an array of lower and
> + * upper boundaries, so that for the i-th attribute
> + *
> + *     min[i] <= value[i] <= max[i]
> + *
> + * Each bucket tracks frequency (fraction of tuples it contains),
> + * information about the inequalities, number of distinct values in
> + * each dimension (which is used when building the histogram), etc.
> + *
> + * The boundaries may be either inclusive or exclusive, or the whole
> + * dimension may be NULL.
> + *
> + * The buckets may overlap (assuming the build algorithm keeps the
> + * frequencies additive) or may not cover the whole space (i.e. allow
> + * gaps). This entirely depends on the algorithm used to build the
> + * histogram.

That sounds pretty exotic. These buckets are quite different from the
single-dimension buckets we currently have. The paper you reference in
the partition_bucket() function (M. Muralikrishna, David J. DeWitt:
Equi-Depth Histograms For Estimating Selectivity Factors For
Multi-Dimensional Queries. SIGMOD Conference 1988: 28-36) actually
doesn't mention overlapping buckets at all. I haven't read the code in
detail, but if it implements the algorithm from that paper, there will
be no overlap.

- Heikki
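[Editor's note: the claim above, that the accuracy of a fixed-size random sample barely depends on population size, can be checked with the standard error of a sampled proportion. The sketch below is illustrative only and not part of the patch; it uses the textbook formula with the finite population correction, not anything from the PostgreSQL code.]

```python
import math

def stderr_proportion(p, n, N):
    """Standard error of a proportion p estimated from a sample of n
    rows drawn without replacement from a population of N rows,
    including the finite population correction sqrt((N - n) / (N - 1))."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

# A 1,000-row sample of a 50/50 attribute, from two very different
# table sizes: the standard errors come out nearly identical
# (roughly 0.0157 vs 0.0158), so sampling a fixed fraction of the
# table buys almost nothing over a fixed-size sample.
for N in (100_000, 100_000_000):
    print(N, stderr_proportion(0.5, 1000, N))
```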
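[Editor's note: the n-dimensional buckets described in the quoted comment can be sketched as below. This is a hypothetical illustration, not the patch's actual structs: the field names are invented, and treating a NULL boundary as unbounded on that side is an assumption about the semantics.]

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Bucket:
    min_vals: Sequence[Optional[float]]  # lower bound per dimension (None assumed unbounded)
    max_vals: Sequence[Optional[float]]  # upper bound per dimension (None assumed unbounded)
    frequency: float                     # fraction of tuples falling in this bucket

def bucket_contains(bucket, point):
    """True if the point lies inside the bucket's n-dimensional
    rectangle, i.e. min[i] <= value[i] <= max[i] for every dimension i
    (inclusive bounds only, for simplicity)."""
    for lo, hi, v in zip(bucket.min_vals, bucket.max_vals, point):
        if lo is not None and v < lo:
            return False
        if hi is not None and v > hi:
            return False
    return True

def estimate_selectivity(buckets, point):
    """Sum the frequencies of all buckets containing the point; with
    overlapping buckets this only works if the build algorithm kept
    the frequencies additive, as the quoted comment notes."""
    return sum(b.frequency for b in buckets if bucket_contains(b, point))

# Two adjacent 2-D buckets covering [0,10] x [0,5]:
buckets = [Bucket([0, 0], [5, 5], 0.4), Bucket([5, 0], [10, 5], 0.6)]
print(estimate_selectivity(buckets, (3, 2)))  # falls only in the first bucket
```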