Re: Multi-Dimensional Histograms - Mailing list pgsql-hackers

From Nathan Boley
Subject Re: Multi-Dimensional Histograms
Date
Msg-id 6fa3b6e20906301239w1bf3b1cer88545747d041c3c@mail.gmail.com
Whole thread Raw
In response to Re: Multi-Dimensional Histograms  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
> I'm finding myself unable to follow all the terminology on this thead.
>  What's dimension reduction?  What's PCA?

( Sorry for the jargon - thanks Josh )

> It feels like what you might need is statistics for colB (MCVs and/or a
> histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would store.

> Unfortunately, in
> the general case the set of values of colA for which you need these
> statistics might be inconveniently large.
>

Which is why, in the general case, we need some sort of dimension reduction.

If we could say that, for instance, the distribution of colB is
roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA >= 100, then we would have
effectively reduced the dimension from ndistinct(colB)*ndistinct(colA)
to 2*ndistinct(colB).

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are "the same". PCA is a much different
approach, but it is still, broadly, a dimension reduction technique.

> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above, if as in Robert's
example, all of the values in colA < 100 were different than all of
the values > 100 with respect to colB, then it is easy to represent
the structure in something manageable. Ie, we store two histograms for
colB: 1 for colA > 100, one for colA < 100. However, if there are the
same two classes in colB ( lets say C1 and C2 ) but rather than C1
being associated with values in colA < 100, it is associated with 100
completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ...
) there is not a whole lot we can do besides storing a list of all of
those values. We could maybe use the ratio of classes to improve the
average plan choice, but you would still get a lot of bad plans.

Could anyone provide a concrete example ( or better yet a data set )
where this occurs?

-Nathan


pgsql-hackers by date:

Previous
From: Richard Huxton
Date:
Subject: Re: 8.5 development schedule
Next
From: "David E. Wheeler"
Date:
Subject: Re: Inconsistent Errors on Row Comparisons