Thread: Multi-Dimensional Histograms
Folks,

For things like PostGIS, which will want to index in 4 dimensions
(x, y, z, t), we might want to have multi-dimensional selectivity
histograms and some way to use same.

Anybody here qualified to check out this paper on the subject, please
speak up :)

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.8475

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
> For things like PostGIS, which will want to index in 4 dimensions
> (x, y, z, t), we might want to have multi-dimensional selectivity
> histograms and some way to use same.

Another use case is cross-column statistics.

> Anybody here qualified to check out this paper on the subject, please
> speak up :)

Well, I don't know if I'm qualified per se, but here goes...

It seems like a very reasonable approach to multidimensional histograms.

For those who haven't had time to read the paper, it describes SGRID, a
greedy algorithm for partitioning an n-dimensional space into
non-overlapping n-dimensional boxes and 'outliers' (points that don't fit
into any box, but don't warrant boxes of their own). They then compare
their method to a regression technique, where the space is fit to a fixed
grid histogram and the grid densities are estimated via regression. They
also compare a plane-splitting algorithm, called MHIST, where the planes
are constrained to be orthogonal to the naive dimension vectors. They
dismiss singular value decomposition and the discrete wavelet transform as
being too parametric (which is silly, IMHO) and briefly mention a couple
of standard clustering techniques (which are probably not appropriate for
us, given their complexity). Unsurprisingly, their method does well on the
two test sets that they consider.

I think the general idea is fine, but it would certainly need some
modification to work for postgres. In particular:

Using the naive dimension vectors (i.e., north and east for geo data, or
column a and column b for cross-column stats) in combination with the box
constraint will probably lead to problems. Consider, for example, a river
that runs in a straight line north-east, and a table that stores camp
sites which are mostly along the river. Because SGRID can only create
boxes with edges running north and east, you would end up with a bunch of
tiny boxes along the river where one long, north-east-pointing box would
do.

We would probably want to replace the error term that SGRID uses to decide
when to create a new box with a statistical test (maybe homogeneity of
means?). The same holds for the outliers.

Finally, this builds the partition, but (AFAICT) the paper doesn't
describe a method for locating the histogram estimate given a point
(although that doesn't seem too difficult).

-Nathan
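To make the box-plus-outliers idea a bit more concrete, here is a toy
Python sketch of a greedy, grow-a-box-while-it-pays-off partitioner. This
is not the SGRID algorithm from the paper; the growth rule, thresholds,
and names are all invented for illustration.

# Toy sketch of greedy box partitioning in the spirit of SGRID (not the
# paper's algorithm): grow an axis-aligned box around a seed point while
# each expansion keeps absorbing enough points; leftovers are outliers.
import random

def grow_box(points, seed, step=1.0, min_gain=2):
    """Expand an axis-aligned box around `seed`, one face at a time,
    as long as each expansion captures at least `min_gain` new points."""
    dims = len(seed)
    lo, hi = list(seed), list(seed)

    def count_inside(lo, hi):
        return sum(all(lo[d] <= p[d] <= hi[d] for d in range(dims))
                   for p in points)

    covered = count_inside(lo, hi)
    improved = True
    while improved:
        improved = False
        for d in range(dims):
            for bound, delta in ((lo, -step), (hi, +step)):
                bound[d] += delta
                gained = count_inside(lo, hi) - covered
                if gained >= min_gain:
                    covered += gained
                    improved = True
                else:
                    bound[d] -= delta   # revert: expansion not worth it
    return tuple(lo), tuple(hi), covered

def partition(points, n_boxes=4):
    """Greedily peel off boxes; whatever is left over counts as outliers."""
    remaining = list(points)
    boxes = []
    for _ in range(n_boxes):
        if not remaining:
            break
        seed = random.choice(remaining)
        lo, hi, covered = grow_box(remaining, seed)
        boxes.append((lo, hi, covered))
        remaining = [p for p in remaining
                     if not all(lo[d] <= p[d] <= hi[d]
                                for d in range(len(lo)))]
    return boxes, remaining   # remaining == outliers

if __name__ == "__main__":
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
    boxes, outliers = partition(pts)
    for lo, hi, n in boxes:
        print("box", lo, hi, "holds", n, "points")
    print(len(outliers), "outliers")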
On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> > For things like PostGIS, which will want to index in 4 dimensions
> > (x, y, z, t), we might want to have multi-dimensional selectivity
> > histograms and some way to use same.
>
> Another use case is cross-column statistics.

Good to see there's more than one use case. I was hoping people would
chime in with more, and here one is, fast! :)

> > Anybody here qualified to check out this paper on the subject, please
> > speak up :)
>
> Well, I don't know if I'm qualified per se, but here goes...
>
> It seems like a very reasonable approach to multidimensional histograms.
>
> For those who haven't had time to read the paper, it describes SGRID, a
> greedy algorithm for partitioning an n-dimensional space into
> non-overlapping n-dimensional boxes and 'outliers' (points that don't
> fit into any box, but don't warrant boxes of their own). They then
> compare their method to a regression technique, where the space is fit
> to a fixed grid histogram and the grid densities are estimated via
> regression. They also compare a plane-splitting algorithm, called MHIST,
> where the planes are constrained to be orthogonal to the naive dimension
> vectors. They dismiss singular value decomposition and the discrete
> wavelet transform as being too parametric (which is silly, IMHO)

Should we have a separate discussion about eigenvalues?  Wavelets?

> and briefly mention a couple of standard clustering techniques (which
> are probably not appropriate for us, given their complexity).

Good to know.

> Unsurprisingly, their method does well on the two test sets that they
> consider.

Wait.  You mean they might have carefully chosen data to make their
point?!? ;)

> I think the general idea is fine, but it would certainly need some
> modification to work for postgres. In particular:
>
> Using the naive dimension vectors (i.e., north and east for geo data, or
> column a and column b for cross-column stats) in combination with the
> box constraint will probably lead to problems. Consider, for example, a
> river that runs in a straight line north-east, and a table that stores
> camp sites which are mostly along the river. Because SGRID can only
> create boxes with edges running north and east, you would end up with a
> bunch of tiny boxes along the river where one long, north-east-pointing
> box would do.
>
> We would probably want to replace the error term that SGRID uses to
> decide when to create a new box with a statistical test (maybe
> homogeneity of means?). The same holds for the outliers.
>
> Finally, this builds the partition, but (AFAICT) the paper doesn't
> describe a method for locating the histogram estimate given a point
> (although that doesn't seem too difficult).

Is that "not difficult" in terms of the math that needs doing, or "not
difficult" in terms of how well PostgreSQL is already set up to implement
it, or...?

Cheers,
David.
David Fetter <david@fetter.org> writes:
> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>> ... They dismiss singular value decomposition and the discrete wavelet
>> transform as being too parametric (which is silly, IMHO)

> Should we have a separate discussion about eigenvalues?  Wavelets?

I think it'd be a short discussion: what will you do with non-numeric
datatypes?  We probably don't really want to assume anything stronger
than that the datatype has a total ordering.

			regards, tom lane
On Mon, Jun 29, 2009 at 06:43:35PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> >> ... They dismiss singular value decomposition and the discrete
> >> wavelet transform as being too parametric (which is silly, IMHO)
>
> > Should we have a separate discussion about eigenvalues?  Wavelets?
>
> I think it'd be a short discussion: what will you do with non-numeric
> datatypes?  We probably don't really want to assume anything stronger
> than that the datatype has a total ordering.

That sounds about like the discussion we needed, unless we later decide
we need to do something special for approximations of R^n :)

Cheers,
David.
>> Finally, this builds the partition, but (AFAICT) the paper doesn't
>> describe a method for locating the histogram estimate given a point
>> (although that doesn't seem too difficult).

> Is that "not difficult" in terms of the math that needs doing, or "not
> difficult" in terms of how well PostgreSQL is already set up to
> implement it, or...?

I only meant that any implementation would need to address this, but I
can think of simple ways to do it (for instance, use the fixed-width grid
method and store, for each grid cell, a reference to all of the boxes
that intersect it). I am sure there are much better ways (nested
containment lists, perhaps?).

-Nathan
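One way to picture that fixed-width-grid lookup, sketched in Python. The
class name, cell size, and example boxes are made up; the point is just
that each coarse grid cell remembers which irregular boxes intersect it,
so a point query only has to test a handful of boxes.

# Sketch of the lookup idea: overlay a coarse fixed-width grid on top of
# the irregular histogram boxes, with each cell listing the boxes that
# intersect it.  A point query checks only the boxes listed in its cell.
from collections import defaultdict

class BoxIndex:
    def __init__(self, boxes, cell=10.0):
        # boxes: list of (lo, hi) corner tuples of equal dimension
        self.cell = cell
        self.boxes = boxes
        self.grid = defaultdict(list)
        for i, (lo, hi) in enumerate(boxes):
            for key in self._cells_overlapping(lo, hi):
                self.grid[key].append(i)

    def _cell_of(self, point):
        return tuple(int(c // self.cell) for c in point)

    def _cells_overlapping(self, lo, hi):
        # enumerate every grid cell the box touches (fine for coarse grids)
        lo_cell, hi_cell = self._cell_of(lo), self._cell_of(hi)
        def walk(d, prefix):
            if d == len(lo):
                yield tuple(prefix)
                return
            for c in range(lo_cell[d], hi_cell[d] + 1):
                yield from walk(d + 1, prefix + [c])
        yield from walk(0, [])

    def lookup(self, point):
        """Return the index of the box containing `point`, or None (outlier)."""
        for i in self.grid.get(self._cell_of(point), ()):
            lo, hi = self.boxes[i]
            if all(l <= x <= h for l, x, h in zip(lo, point, hi)):
                return i
        return None

# usage: idx = BoxIndex([((0, 0), (5, 8)), ((20, 20), (35, 30))])
#        idx.lookup((3, 4))   ->  0
#        idx.lookup((50, 50)) ->  None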
On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Fetter <david@fetter.org> writes:
>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>> ... They dismiss singular value decomposition and the discrete
>>> wavelet transform as being too parametric (which is silly, IMHO)
>
>> Should we have a separate discussion about eigenvalues?  Wavelets?
>
> I think it'd be a short discussion: what will you do with non-numeric
> datatypes?  We probably don't really want to assume anything stronger
> than that the datatype has a total ordering.

Well, in the general case, we could use their ranks.

At the end of the day, we can't do any dimension reduction unless the
ordering encodes some sort of useful information, and the data type being
in R^n is certainly no guarantee. Consider, for instance, the
cross-correlation of zip codes and area codes - you would really want to
order those by some geographic relation. I think that is why cross-column
stats is so hard in the general case.

That being said, for geographic data in particular, PCA or something
similar could work well.

-Nathan
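A sketch of the "use their ranks" idea: anything with a total ordering can
be mapped to integer ranks, and rank-based machinery (a Spearman-style
correlation here) then only ever sees numbers. The zip-code/area-code data
below is invented, and the caveat above still applies: ranks only help
when the sort order itself carries real information.

# Sketch: map totally-ordered values to ranks, then compute a
# rank correlation on the numbers.
def to_ranks(values):
    """Map each value to its rank under the type's sort order (ties share a rank)."""
    order = {v: r for r, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

def spearman(xs, ys):
    """Rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = to_ranks(xs), to_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy data: two text columns, compared via their ranks.
zips  = ["94102", "94103", "10001", "10002", "60601"]
areas = ["415",   "415",   "212",   "212",   "312"]
print(spearman(zips, areas))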
On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley <npboley@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Fetter <david@fetter.org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss singular value decomposition and the discrete
>>>> wavelet transform as being too parametric (which is silly, IMHO)
>>
>>> Should we have a separate discussion about eigenvalues?  Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes?  We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we can't do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross-correlation of zip codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or something
> similar could work well.

I'm finding myself unable to follow all the terminology on this thread.
What's dimension reduction?  What's PCA?

Based on my last few months of answering questions on -performance, and
my own experience, it seems like a lot of the cases that arise in practice
are those where there is a WHERE clause of the form:

    colA = constA and colB op constB

...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans. In many cases, it seems like colA is storing
some discrete type of thing, like a customer ID, so the distribution of
colB where colA = constA tells you nothing about the distribution of colB
where colA = constA + someSmallDeltaA.

It feels like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA. Unfortunately, in the
general case the set of values of colA for which you need these
statistics might be inconveniently large.

...Robert
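A small, invented illustration of the mis-estimate described above (this
is not PostgreSQL's actual estimator): when one customer's colB values
look nothing like the overall distribution, the usual independence-based
estimate for "colA = constA and colB < constB" is badly off.

# Invented data: customer 1's colB values are tiny, everyone else's are
# spread over (0, 1000).  Multiplying the two single-column selectivities
# (the independence assumption) underestimates the combined predicate.
import random

rows = []
for cust in range(1, 11):
    for _ in range(1000):
        b = random.uniform(0, 10) if cust == 1 else random.uniform(0, 1000)
        rows.append((cust, b))

n = len(rows)
sel_a = sum(1 for a, _ in rows if a == 1) / n            # selectivity of colA = 1
sel_b = sum(1 for _, b in rows if b < 50) / n            # selectivity of colB < 50
actual = sum(1 for a, b in rows if a == 1 and b < 50) / n

print("independence estimate:", sel_a * sel_b)   # ~0.1 * 0.145 = 0.0145
print("actual selectivity:   ", actual)          # ~0.1, about 7x higher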
On Mon, Jun 29, 2009 at 10:22:15PM -0400, Robert Haas wrote:
> I'm finding myself unable to follow all the terminology on this thread.
> What's dimension reduction?

For instance, ask a bunch of people a bunch of survey questions, in hopes
of predicting some value (for instance, whether or not these people will
die of a heart attack in the next three years). After waiting three years,
discover that some of the questions didn't really help you predict the
outcome. Remove them from the dataset. That's dimension reduction.

> What's PCA?

Principal Component Analysis:
http://en.wikipedia.org/wiki/Principal_components_analysis

--
Josh / eggyknap
End Point, Corp.
http://www.endpoint.com
> I'm finding myself unable to follow all the terminology on this thread.
> What's dimension reduction?  What's PCA?

(Sorry for the jargon - thanks, Josh.)

> It feels like what you might need is statistics for colB (MCVs and/or a
> histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would
store.

> Unfortunately, in the general case the set of values of colA for which
> you need these statistics might be inconveniently large.

Which is why, in the general case, we need some sort of dimension
reduction. If we could say that, for instance, the distribution of colB
is roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA at or above 100, then we would
have effectively reduced the dimension from
ndistinct(colB)*ndistinct(colA) to 2*ndistinct(colB).

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are "the same". PCA is a much different approach,
but it is still, broadly, a dimension reduction technique.

> At the end of the day, we can't do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross-correlation of zip codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above: if, as in Robert's
example, all of the values of colA below 100 behaved differently from all
of the values at or above 100 with respect to colB, then it would be easy
to represent the structure in something manageable. That is, we store two
histograms for colB: one for colA >= 100, one for colA < 100.

However, if there are the same two classes in colB (let's say C1 and C2),
but rather than C1 being associated with values of colA < 100 it is
associated with 100 completely random values of colA (e.g. 1, 22, 47, 89,
91, 92, 107, ...), there is not a whole lot we can do besides storing a
list of all of those values. We could maybe use the ratio of classes to
improve the average plan choice, but you would still get a lot of bad
plans.

Could anyone provide a concrete example (or better yet, a data set) where
this occurs?

-Nathan
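A sketch of the two-histogram reduction described above: keep one
equi-depth histogram of colB for rows with colA < 100 and one for the
rest, and pick between them at estimation time. The bin count, function
names, and row layout are invented, and the estimate is deliberately
crude (no within-bin interpolation).

# Sketch: two conditional colB histograms instead of one per distinct colA.
import bisect

def equi_depth_bounds(values, nbins=10):
    """Equi-depth histogram: boundaries at evenly spaced quantiles."""
    vs = sorted(values)
    return [vs[(i * (len(vs) - 1)) // nbins] for i in range(nbins + 1)]

def build_conditional_histograms(rows, split=100):
    """rows: iterable of (colA, colB) pairs, both classes assumed non-empty."""
    return {
        "split": split,
        "low":  equi_depth_bounds([b for a, b in rows if a <  split]),
        "high": equi_depth_bounds([b for a, b in rows if a >= split]),
    }

def est_b_lt_given_a(hist, a_value, b_const):
    """Rough P(colB < b_const | colA = a_value); multiply by a separate
    estimate of P(colA = a_value) to get the combined selectivity."""
    bounds = hist["low"] if a_value < hist["split"] else hist["high"]
    # crude bin counting: fraction of boundaries lying below b_const
    return bisect.bisect_left(bounds, b_const) / len(bounds)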
On Mon, Jun 29, 2009 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm finding myself unable to follow all the terminology on this thread.
> What's dimension reduction?  What's PCA?
[snip]

Imagine you have a dataset with two variables, say height in inches and
age in years. For the purpose of discussion let's pretend for a moment
that all the people in your sample have height the same as their age. You
could create a 2d histogram of your data:

     |0000000200000000
     |0000006000000000
    a|0000030000000000
    g|0000400000000000
    e|0003000000000000
     |0010000000000000
     |0100000000000000
     |----------------
           height

You could store this 2d histogram as-is and use it for all the things
you'd use histograms for... or you could notice the structure, apply a
rotation, and flatten the data into a 1d histogram

    [0113426200...]

which is far more compact.

Often data has significant correlation, so it's often possible to reduce
the dimensionality without greatly reducing the selectivity of the
histogram. This becomes tremendously important as the number of
dimensions goes up, because the volume of an N-dimensional space grows
incredibly fast as the number of dimensions increases.

PCA is one method of dimensionality reduction. In PCA you find a linear
transformation (scaling, rotation) of the data that aligns the axes with
the orientations in the data-space that have the greatest variance.

I have no clue how you would apply PCA to postgresql histograms, since to
build the PCA transform you need to do some non-trivial operations with
the data. Perhaps PCA could be done on a random sample of a table, then
that transformation could be stored and used to compute the histograms.
I'm sure there has been a lot of research on this.
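A sketch of that "PCA on a random sample" suggestion, using numpy: project
a correlated two-column sample onto its first principal component and
build an ordinary 1-D histogram along that axis. The sample data and bin
count are invented, and how this would plug into ANALYZE is left open.

# Sketch: rotate-and-flatten via PCA on a sample, then a 1-D histogram.
import numpy as np

rng = np.random.default_rng(0)
# toy sample where the two columns are almost perfectly correlated
age = rng.uniform(5, 80, size=5000)
height = age + rng.normal(0, 2, size=5000)
sample = np.column_stack([height, age])

# PCA on the sample: center, then take the top eigenvector of the covariance
centered = sample - sample.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
first_pc = eigvecs[:, np.argmax(eigvals)]        # direction of greatest variance

# "flatten": 1-D coordinates along the principal axis, then an ordinary histogram
projected = centered @ first_pc
counts, edges = np.histogram(projected, bins=16)

explained = eigvals.max() / eigvals.sum()
print(f"first component explains {explained:.1%} of the variance")
print(counts)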