Re: proposal : cross-column stats - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: proposal : cross-column stats
Date:
Msg-id: AANLkTinpjTgLDv8R-AkeQu5vhzAx4EXpgQ1H01RBYFpv@mail.gmail.com
In response to: Re: proposal : cross-column stats (Florian Pflug <fgp@phlo.org>)
Responses: Re: proposal : cross-column stats
List: pgsql-hackers
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote:
> I tried to pick up Robert's idea of quantifying "implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close
> (A,B) is to representing a function A -> B.

Actually Heikki's idea...

> Observe that dist(A), dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
> estimates of dist(?) are consistent. From that you easily get
>
> dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
> dist(A,B)/dist(A) <= dist(B) <= dist(A,B)
>
> If dist(A) == dist(A,B), then there is a functional dependency
> A -> B, and conversely if dist(B) == dist(A,B) there is a functional
> dependency B -> A. Note that you can have both at the same time!
>
> On the other hand, if dist(B) == dist(A,B)/dist(A), then B has the
> smallest number of distinct values possible for a given combination
> of dist(A,B) and dist(A). This is the anti-function case.
>
> This motivates the definition
>
> F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1 ) ]
>
> (You can probably drop the "-1"; it doesn't make much of a difference
> for larger values of dist(B).)
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) == dist(A,B)/dist(B), while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A -> B or B -> A looks
> function-like enough to use the uniform Bayesian approach. Or you
> might even go further, and decide *which* Bayesian formula to use -
> the paper you cited always averages
>
> P(A=x|B=y)*P(B=y) and
> P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "we don't
> know which to pick".

Ideally you want to somehow make this a continuous transition between
the available formulas rather than a discrete one, I think. If
F(A,B) = 1, then the selectivity of A = x AND B = y is just P(A=x),
and if it's 0, then it's P(A=x)*P(B=y). But suppose F(A,B) = 0.5. Then
what? A naive approach would be to estimate

P(A=x && B=y) = P(A=x) * (1 - (1 - F(A,B))*(1 - P(B=y)))

so that if, say, P(A=x) = 0.1 and P(B=y) = 0.1, then when F(A,B) = 0
we estimate 0.01, when F(A,B) = 1 we estimate 0.1, and when
F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9) = 0.055 (the sketch below
checks these endpoints).

Of course I'm just hand-waving here, and this is without any
mathematical basis, being just the simplest formula I could think of
that gets the endpoints right and plots some sort of smooth curve
between them in the middle. A similar formula with a believable
argument to back it up seems like it would be a big step forward for
this method.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
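To make the arithmetic concrete, here is a minimal Python sketch (not code from the thread or from PostgreSQL; the function names are invented for illustration) that evaluates Florian's F(A,B) and the blended estimate Robert describes, checking the endpoint behaviour:

    # A minimal sketch, assuming only the two formulas quoted above.
    # Names (implicativeness, blended_selectivity) are made up here.

    def implicativeness(dist_a, dist_b, dist_ab):
        """Florian's F(A,B): 0 when dist(A) == dist(A,B)/dist(B) (the
        anti-function case), 1 when dist(A) == dist(A,B) (a functional
        dependency A -> B)."""
        return (dist_a * dist_b - dist_ab) / (dist_ab * (dist_b - 1))

    def blended_selectivity(p_a, p_b, f_ab):
        """Robert's hand-waved interpolation:
        P(A=x && B=y) ~ P(A=x) * (1 - (1 - F(A,B)) * (1 - P(B=y)))."""
        return p_a * (1 - (1 - f_ab) * (1 - p_b))

    # F(A,B) at its extremes: with dist(A) = 100 and dist(B) = 10, a
    # perfect function A -> B has dist(A,B) = dist(A) = 100, while the
    # anti-function case has dist(A,B) = dist(A)*dist(B) = 1000.
    print(implicativeness(100, 10, 100))    # 1.0  (functional dependency)
    print(implicativeness(100, 10, 1000))   # 0.0  (anti-function case)

    # The worked example from the mail: P(A=x) = P(B=y) = 0.1.
    for f in (0.0, 0.5, 1.0):
        print(f, blended_selectivity(0.1, 0.1, f))  # ~0.01, 0.055, 0.1

As claimed, the blend reduces to the independence estimate P(A=x)*P(B=y) at F(A,B) = 0 and to P(A=x) at F(A,B) = 1, with a smooth (linear in F) interpolation in between.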