Re: Cross-field statistics - Mailing list pgsql-hackers

From Decibel!
Subject Re: Cross-field statistics
Date
Msg-id F03A6C3F-A325-4BB7-AEA0-784DFE2F5F54@decibel.org
Whole thread Raw
In response to Re: Cross-field statistics  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>
>> For each field that isn't already in a set of field groupings
>>  * Sort sample rows on that field
>>  * Calculate correlation for all other fields
>>  * If there are other fields that have a correlation to this sort   
>> order over
>> some threshold, save them along with the field we  originally  
>> sorted on as a
>> new 'field grouping'
>>  * Else, there are no other fields that group with this field;  
>> it's  a "loner"
>
> I think this is going somewhere. But "correlation" isn't quite  
> right. It has
> the same problem our use of correlation for clusteredness has.  
> Consider the
> case of Zip code and City. They're nearly very non-independent  
> variables but
> there's basically no correlation.

If we have a limited number of values in one of the fields, it would  
be possible to build a histogram of values for other fields based on  
the values in the first field. But I can't see how we could possibly  
represent something like city, zip in a compact form. You would have  
to keep a range of zips that cover a city.

Hmm... but we only care about cities that have a substantial number  
of zip codes. This is what the equivalent of the most-common-values  
list would be for cross-platform stats: for the most_common_vals in  
column a, you store a range or histogram of the corresponding values  
in b, assuming that there is a good correspondence.

>> For each field grouping, at a minimum we'd need to store a  
>> histogram  for that
>> grouping.
>
> This is a problem. What does a histogram on a grouping mean? It's  
> not clear
> how to come up with a histogram which can help answer questions like
>   A between ? and ? and B between ? and ?
>
> You can do a histogram on <a,b> or <b,a> but neither are going to be
> especially useful. Heikki and I came up with a weird hybrid thing  
> which might
> be useful for avoiding overestimating selectivity like
>   WHERE city='BOS' AND areacode = '617'
>
> But it didn't help at all with the converse, ie:
>  WHERE city='BOS' AND areacode = '212'
>
> It's hard to see how we could possibly catch cases like that though.

If the two fields share the same correlation, then the histogram is  
just what we use right now. We could actually do this today, but only  
for fields with a high physical correlation. What I was describing  
allowed extending this to fields that have a high correlation to each  
other, even if they didn't have a high physical correlation. I know  
that this doesn't help us for things like city/area code or city/zip,  
but other than my idea above I'm rather at a loss on how to represent  
that in a compact fashion.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



pgsql-hackers by date:

Previous
From: Aidan Van Dyk
Date:
Subject: Re: Lessons from commit fest
Next
From: Tom Lane
Date:
Subject: Re: Lessons from commit fest