On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<npboley@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> David Fetter <david@fetter.org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues? Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.
I'm finding myself unable to follow all the terminology on this thead.What's dimension reduction? What's PCA?
Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:
colA = constA and colB op constB
...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans. In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA. It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA. Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.
...Robert