Re: proposal : cross-column stats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: proposal : cross-column stats
Date
Msg-id 4D04F6F8.1040403@fuzzy.cz
Whole thread Raw
In response to Re: proposal : cross-column stats  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
> 
> Yeah, and that's actually analogous to the example I used in my
> presentation.
> 
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [COMMITTERS] pgsql: Use symbolic names not octal constants for file permission flags
Next
From: Andrew Dunstan
Date:
Subject: Re: function attributes