Re: Multi-Dimensional Histograms - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Multi-Dimensional Histograms
Date
Msg-id 603c8f070906291922s34f786c4k683860696354a456@mail.gmail.com
Whole thread Raw
In response to Re: Multi-Dimensional Histograms  (Nathan Boley <npboley@gmail.com>)
Responses Re: Multi-Dimensional Histograms
Re: Multi-Dimensional Histograms
Re: Multi-Dimensional Histograms
List pgsql-hackers
On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<npboley@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> David Fetter <david@fetter.org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues?  Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.

I'm finding myself unable to follow all the terminology on this thead.What's dimension reduction?  What's PCA?

Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:

colA = constA and colB op constB

...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans.  In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA.  It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA.  Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.

...Robert


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: How to register my function into backend?
Next
From: KaiGai Kohei
Date:
Subject: Re: [PATCH] [v8.5] Security checks on largeobjects