Home > mailing lists

Re: proposal : cross-column stats - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: proposal : cross-column stats
Date	December 12, 2010 15:23:18
Msg-id	4D04F6F8.1040403@fuzzy.cz Whole thread Raw
In response to	Re: proposal : cross-column stats (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List	pgsql-hackers

Tree view

Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
> 
> Yeah, and that's actually analogous to the example I used in my
> presentation.
> 
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas

pgsql-hackers by date:

From: Tom Lane
Date: 12 December 2010, 15:16:56
Subject: Re: [COMMITTERS] pgsql: Use symbolic names not octal constants for file permission flags

From: Andrew Dunstan
Date: 12 December 2010, 15:24:12
Subject: Re: function attributes

Re: proposal : cross-column stats - Mailing list pgsql-hackers

Previous

Next