Re: Cross-column statistics revisited - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Re: Cross-column statistics revisited
Msg-id 20081017064140.GB1443@svana.org
In response to Re: Cross-column statistics revisited  (Greg Stark <greg.stark@enterprisedb.com>)
Responses Re: Cross-column statistics revisited
List pgsql-hackers
On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
> Correlation is the wrong tool. In fact zip codes and city have nearly
> zero correlation.  Zip codes near 00000 are no more likely to be in
> cities starting with A than Z.

I think we need to define our terms better. In terms of linear
correlation, you are correct. However, you can define invertible mappings
from zip codes and cities onto the integers which will then have an
almost perfect correlation.
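
A minimal sketch of that relabeling argument, on made-up data rather than
real zip codes. It assumes each city owns one contiguous block of zip codes
and that blocks are handed out in an order unrelated to the alphabetical
order of the city names. The names pearson, demo, n_cities and
zips_per_city are hypothetical, chosen only for this illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson (linear) correlation coefficient of two sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def demo(n_cities=200, zips_per_city=50, seed=0):
    rng = random.Random(seed)

    # Assumption: city c (its alphabetical rank) owns one contiguous block
    # of zip codes, and the blocks are assigned in random order.
    block_of_city = list(range(n_cities))
    rng.shuffle(block_of_city)
    rows = [(block_of_city[c] * zips_per_city + k, c)
            for c in range(n_cities)
            for k in range(zips_per_city)]

    zips = [z for z, _ in rows]
    alphabetical = [c for _, c in rows]

    # Invertible relabeling: number the cities by their smallest zip code.
    min_zip = {}
    for z, c in rows:
        min_zip[c] = min(z, min_zip.get(c, z))
    relabel = {c: i for i, c in enumerate(sorted(min_zip, key=min_zip.get))}
    remapped = [relabel[c] for c in alphabetical]

    return pearson(zips, alphabetical), pearson(zips, remapped)


if __name__ == "__main__":
    raw, remapped = demo()
    print(f"zip vs alphabetical city rank: {raw:+.3f}")
    print(f"zip vs relabeled city index:   {remapped:+.3f}")
```

On this synthetic data the first coefficient should come out near zero and
the second close to 1, even though both columns carry exactly the same
information about which city a zip code belongs to. Only the numbering of
the cities changed.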

According to a paper I found, this is related to the "principle of
maximum entropy". The fact that you can't determine such functions
easily in practice doesn't change the fact that zip codes and city
names are highly correlated.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
