Re: proposal : cross-column stats - Mailing list pgsql-hackers

From Robert Haas
Subject Re: proposal : cross-column stats
Date
Msg-id AANLkTinpjTgLDv8R-AkeQu5vhzAx4EXpgQ1H01RBYFpv@mail.gmail.com
Whole thread Raw
In response to Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Responses Re: proposal : cross-column stats
List pgsql-hackers
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote:
> I tried to pick up Robert's idea of quantifying "Implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close the
> (A,B) are to representing a function A -> B.

Actually Heikki's idea...

> Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
> estimates of dist(?) are consistent. From that you easily get
>
>  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
>  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)
>
> If dist(A) == dist(A,B), then there is a functional dependency
> A -> B, and conversely if dist(B) == dist(A,B) there is a functional
> dependency B -> A. Note that you can have both at the same time!
>
> On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
> smallest number of distinct values possible for a given combination
> of dist(A,B) and dist(A). This is the anti-function case.
>
> This motivates the definition
>
>  F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]
>
> (You can probably drop the "-1", it doesn't make much of a difference
> for larger values of dist(B).
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "Implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A->B or B->a looks function-like
> enough to use the uniform bayesian approach. Or you might even go further,
> and decide *with* bayesian formula to use - the paper you cited always
> averages
>
>  P(A=x|B=y)*P(B=y) and
>  P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "We don't know
> which to pick".

Ideally you want to somehow make this a continuous transaition between
the available formulas rather than a discrete transition, I think.  If
F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
what?  A naive approach would be to estimate P(A=x && B=y) = P(A=x) *
(1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
= 0.055.  Of course I'm just hand-waving here, and this is without any
mathematical basis, being just the simplest formula I could think of
that gets the endpoints right and plots some sort of smooth curve
between them in the middle.  A similar formula with a believable
argument to back it up seems like it would be a big step forward for
this method.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Shigeru HANADA
Date:
Subject: Re: SQL/MED - file_fdw
Next
From: Robert Haas
Date:
Subject: Re: bug in SignalSomeChildren