Re: Cross-column statistics revisited - Mailing list pgsql-hackers

From Nathan Boley
Subject Re: Cross-column statistics revisited
Msg-id 6fa3b6e20810171447o43c5d28ar3bb98e2cf5b47e5a@mail.gmail.com
In response to Re: Cross-column statistics revisited  ("Joshua Tolley" <eggyknap@gmail.com>)
Responses Re: Cross-column statistics revisited
List pgsql-hackers
>>> Right now our
>>> "histogram" values are really quantiles; the statistics_target T for a
>>> column determines a number of quantiles we'll keep track of, and we
>>> grab values into an ordered list L so that approximately 1/T of
>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>> thinking that multicolumn statistics would instead divide the range of
>>> each column up into T equally sized segments,
>>
>> Why would you not use the same histogram bin bounds derived for the
>> scalar stats (along each axis of the matrix, of course)?  This seems to
>> me to be arbitrarily replacing something proven to work with something
>> not proven.  Also, the above forces you to invent a concept of "equally
>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. I believe those are the same
> difficulties Gregory Stark mentioned having in his first post in this
> thread. But of course that's an excellent point, that what we do now
> is proven. I'm not sure which problem will be harder to solve -- the
> weird geometry or the "equally sized ranges" for data types where that
> makes no sense.
>

Look at copulas. They are a completely general method of describing
the dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension reduction problem in that form.
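
To make that concrete, here is a rough sketch (Python, standard library
only) of a discretized empirical copula built on top of per-column
quantile bounds. By Sklar's theorem a joint distribution H(x, y) can be
written as C(F(x), G(y)), where F and G are the marginals and C is the
copula, so the existing per-column quantiles can stay as they are and
only the T x T grid is new. The function names, the grid layout, and the
way the bounds are picked are my own assumptions for illustration, not
what ANALYZE or the planner actually does.

```python
from bisect import bisect_right


def quantile_bounds(values, t):
    """Return t-1 interior boundaries so roughly 1/t of the values land in each bin.

    Hypothetical helper: it only mimics the idea of quantile-based
    histogram bounds and needs nothing beyond an ordering on the values.
    """
    s = sorted(values)
    n = len(s)
    return [s[(k * n) // t] for k in range(1, t)]


def empirical_copula_grid(xs, ys, t):
    """Return a t x t grid of row fractions, indexed by per-column quantile bin.

    grid[i][j] estimates P(x in quantile bin i AND y in quantile bin j).
    The marginals are uniform by construction (each row and column sums
    to about 1/t), so the grid carries only the dependence between columns.
    """
    bx = quantile_bounds(xs, t)
    by = quantile_bounds(ys, t)
    grid = [[0.0] * t for _ in range(t)]
    weight = 1.0 / len(xs)
    for x, y in zip(xs, ys):
        i = bisect_right(bx, x)  # quantile bin of x within its own column
        j = bisect_right(by, y)  # quantile bin of y within its own column
        grid[i][j] += weight
    return grid
```

If the two columns are independent, every cell should come out near
1/T^2, which is the estimate we get today by multiplying per-column
selectivities. A cell that is far from 1/T^2 is where the independence
assumption goes wrong. Note that nothing here needs "equally sized"
ranges, only a sort order for each datatype.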

