On Mon, Jun 29, 2009 at 10:22 PM, Robert Haas<robertmhaas@gmail.com> wrote:
> I'm finding myself unable to follow all the terminology on this thead.
> What's dimension reduction? What's PCA?
[snip]
Imagine you have a dataset with two variables, say height in inches
and age in years. For tue purpose of discussion lets pretend for a
moment that all the people in your sample have height the same as
their age.
You could create a 2d histogram of your data:
|0000000200000000|0000006000000000
a|0000030000000000
g|0000400000000000
e|0003000000000000|0010000000000000|0100000000000000|---------------- height
You could store this 2d histogram as is and use it for all the things
you'd use histograms for.... or you could make an observation of the
structure and apply a rotation and flattening of the data and convert
it to a 1d histogram
[0113426200...] which is far more compact.
Often data has significant correlation, so it's often possible to
reduce the dimensionality without reducing the selectivity of the
histogram greatly.
This becomes tremendously important as the number of dimensions goes
up because the volume of a N dimensional space increases incredibly
fast as the number of dimensions increase.
PCA is used as one method of dimensionality reduction. In PCA you find
a linear transformation (scaling, rotation) of the data that aligns
the data so that the axis lines cut through the data-space in the
orientations with the greatest variance.
I have no clue how you would apply PCA to postgresql histograms, since
to build the PCA transform you need to do some non-trivial operations
with the data. Perhaps PCA could be done on a random sample of a
table, then that transformation could be stored and used to compute
the histograms. I'm sure there has been a lot of research on this.