Home > mailing lists

Re: proposal : cross-column stats - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: proposal : cross-column stats
Date	December 12, 2010 12:49:30
Msg-id	4D04FD1C.9000109@fuzzy.cz Whole thread Raw
In response to	Re: proposal : cross-column stats (Florian Pflug <fgp@phlo.org>)
List	pgsql-hackers

Tree view

Dne 12.12.2010 17:33, Florian Pflug napsal(a):
> On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
>> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
information.The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
selectivityof "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a
certainvalue for column B. Perhaps that could be measured by counting the number of distinct values of column B for
eachvalue of column A, or something like that. I don't know what the statisticians call that property, or if there's
someexisting theory on how to measure that from a sample.
 
> 
> The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the
assumptionor knowledge of B. The basic tool for working with conditional probabilities is bayes' theorem which states
that
> 
> P(A|B) = P(A and B) / P(B).
> 
> Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like
a=3)does not change under additional assumptions like b=4. Bayes' theorem thus becomes
 
> 
> P(A) = P(A and B) / P(B)    <=>
> P(A and B) = P(A)*P(B)
> 
> which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
> 
> I believe that measuring this by counting the number of distinct values of column B for each A is basically the right
idea.Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and
comparethat to the overall number of distinct values of "b"...
 

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCV as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency table weirdo, who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristics to estimate the
probability from MCV, or something like that? Could we use some
"average" probability or what?

Tomas

pgsql-hackers by date:

From: Andrew Dunstan
Date: 12 December 2010, 12:44:17
Subject: Re: function attributes

From: Tom Lane
Date: 12 December 2010, 12:58:19
Subject: Re: function attributes

Re: proposal : cross-column stats - Mailing list pgsql-hackers

Previous

Next