>>> Right now our
>>> "histogram" values are really quantiles; the statistics_target T for a
>>> column determines a number of quantiles we'll keep track of, and we
>>> grab values into an ordered list L so that approximately 1/T of
>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>> thinking that multicolumn statistics would instead divide the range of
>>> each column up into T equally sized segments,
>>
>> Why would you not use the same histogram bin bounds derived for the
>> scalar stats (along each axis of the matrix, of course)? This seems to
>> me to be arbitrarily replacing something proven to work with something
>> not proven. Also, the above forces you to invent a concept of "equally
>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. I believe those are the same
> difficulties Gregory Stark mentioned having in his first post in this
> thread. But of course that's an excellent point, that what we do now
> is proven. I'm not sure which problem will be harder to solve -- the
> weird geometry or the "equally sized ranges" for data types where that
> makes no sense.
>
Look at copulas. They are a completely general method of describing
the dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension reduction problem in that form.
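
To make that a bit more concrete, here is a rough Python sketch of what
I have in mind. It is only an illustration, not a proposal for the
actual ANALYZE code: the function names (quantile_bounds, bucket,
copula_grid) are made up for this mail, and it assumes both columns
have a total order and fit in memory. Each column keeps the same
quantile bounds the scalar stats already use, and the joint information
is reduced to a T x T grid giving the fraction of rows that land in
each pair of marginal buckets, which is a coarse empirical copula.

```python
from bisect import bisect_right


def quantile_bounds(values, t):
    """Return t+1 bounds so that roughly 1/t of the values fall
    between neighbouring bounds (the existing per-column idea)."""
    s = sorted(values)
    n = len(s)
    return [s[min(n - 1, (i * n) // t)] for i in range(t + 1)]


def bucket(bounds, v):
    """Index of the marginal bucket holding v, clamped to 0..t-1."""
    return min(max(bisect_right(bounds, v) - 1, 0), len(bounds) - 2)


def copula_grid(xs, ys, t):
    """Fraction of rows in each (x bucket, y bucket) cell.

    Only comparisons are used, so no notion of "equally sized"
    ranges is needed for either datatype.
    """
    bx = quantile_bounds(xs, t)
    by = quantile_bounds(ys, t)
    grid = [[0.0] * t for _ in range(t)]
    weight = 1.0 / len(xs)
    for x, y in zip(xs, ys):
        grid[bucket(bx, x)][bucket(by, y)] += weight
    return bx, by, grid


if __name__ == "__main__":
    xs = list(range(100))
    ys = [99 - x for x in xs]  # perfectly anti-correlated columns
    _, _, grid = copula_grid(xs, ys, 4)
    for row in grid:
        print(["%.2f" % cell for cell in row])
```

If the columns were independent, every cell would sit near 1/T^2; any
departure from that is the dependence, and the marginals stay exactly
what we store today.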