Re: proposal : cross-column stats - Mailing list pgsql-hackers

From tv@fuzzy.cz
Subject Re: proposal : cross-column stats
Msg-id 27f08ffc718c00488e4dc33975bcedc9.squirrel@sq.gransy.com
In response to Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Responses Re: proposal : cross-column stats
List pgsql-hackers
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
>> It seems to me you're missing one very important thing - this was not
>> meant as a new default way to do estimates. It was meant as an option
>> when the user (DBA, developer, ...) realizes the current solution gives
>> really bad estimates (due to correlation). In that case he could create
>> 'cross-column' statistics on those columns, and the optimizer would use
>> that info to do the estimates.
>
> I do understand that. I just have the nagging feeling that there is a
> way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
> to apply the uniform bayesian approach or to assume the columns are
> unrelated.

I doubt there is a way to make this decision with just the dist(A), dist(B)
and dist(A,B) values. Well, we could go with a rule

 if [dist(A) == dist(A,B)] then [A => B]

but that's very fragile. Think about estimates (we're not going to work
with exact values of dist(?)), and then about data errors (e.g. a city
matched to an incorrect ZIP code, or something like that).
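To make the fragility concrete, here is a minimal Python sketch (my own
illustration, not code from the thread) of the exact-equality test above; the
ZIP/city data and the single bad row are made up for the example:

```python
# Hypothetical data: column A = ZIP code, column B = city.
# B is functionally dependent on A, so dist(A) == dist(A,B) holds at first.
rows = [("10001", "New York"), ("10001", "New York"),
        ("94103", "San Francisco"), ("60601", "Chicago")]

dist_a = len({a for a, _ in rows})   # distinct values of A: 3
dist_ab = len(set(rows))             # distinct (A,B) pairs: 3 -> test says A => B

# A single data-entry error (a ZIP matched to the wrong city) breaks the test:
rows.append(("10001", "Newark"))
dist_a = len({a for a, _ in rows})   # still 3
dist_ab = len(set(rows))             # now 4 -> exact equality no longer holds
```

One bad row is enough to make the strict equality fail, even though the
columns are still almost perfectly correlated; with sampled (inexact)
dist(?) estimates the situation is worse still.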

So for a real-world dataset, the condition [dist(A)==dist(A,B)] will
almost never hold. And about the same holds for the "uniform correlation"
assumption which is the basis for the formula I posted.

So what we're actually looking for is a formula that gives reasonable
estimates and is robust even in cases where the correlation is not uniform
or the estimates are reasonably imprecise.

> This motivates the definition
>
> F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [dist(A,B) * ( dist(B) - 1)]
>
> (You can probably drop the "-1"; it doesn't make much of a difference
> for larger values of dist(B).)
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "Implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A->B or B->A looks function-like
> enough to use the uniform bayesian approach. Or you might even go further,
> and decide *which* bayesian formula to use - the paper you cited always
> averages
>
>   P(A=x|B=y)*P(B=y) and
>   P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "We don't know
> which to pick".

Well, the reason they chose the sum/2 approach is that they were unable to
infer which of the two values is 'better', and the sum/2 limits the error.

I haven't studied this thoroughly, but my impression is that you are going
in the same direction as they did, i.e. while they've done
  P(A,B) = (P(A|B)*P(B) + P(B|A)*P(A)) / 2

with P(A|B) = dist(A) / dist(A,B), you've chosen P(A|B) ~ F(B,A) or
something like that.

You'll probably object that you could compute both F(A,B) and F(B,A) and
then use only the direction corresponding to the larger value, but what if
F(A,B) and F(B,A) are about the same?

This is the reason they chose to always combine both values (with varying
weights).
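A small Python sketch of the F(A,B) measure as quoted above (the function
name and the ZIP/city distinct counts are my own, made up for illustration):

```python
def F(dist_a, dist_b, dist_ab):
    """F(A,B) = [dist(A)*dist(B) - dist(A,B)] / [dist(A,B) * (dist(B) - 1)].
    F = 0 when dist(A) = dist(A,B)/dist(B); F = 1 when dist(A) = dist(A,B),
    i.e. when the table (A,B) looks like a function A -> B."""
    return (dist_a * dist_b - dist_ab) / (dist_ab * (dist_b - 1))

# Hypothetical ZIP -> city counts: almost a function in one direction.
dist_zip, dist_city, dist_both = 1000, 50, 1005

f_ab = F(dist_zip, dist_city, dist_both)  # near 1: ZIP nearly determines city
f_ba = F(dist_city, dist_zip, dist_both)  # near 0: city does not determine ZIP

# When f_ab and f_ba come out about the same, neither direction dominates -
# which is exactly the case where picking one estimate is risky and a
# weighted combination of both (as in the paper) is the safer choice.
```

Note the asymmetry: for a near-functional dependency one direction scores
close to 1 and the other close to 0, so the comparison is only ambiguous in
the weakly-correlated middle ground discussed above.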

> I'd like to find a statistical explanation for that definition of
> F(A,B), but so far I couldn't come up with any. I created a Maple 14
> worksheet while playing around with this - if you happen to have a
> copy of Maple available I'd be happy to send it to you..

No, I don't have Maple. Have you tried Maxima
(http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage
even has an online notebook - that seems like a very convenient way to
exchange this kind of thing.

regards
Tomas


