Re: multivariate statistics (v25) - Mailing list pgsql-hackers

From	Sven R. Kunze
Subject	Re: multivariate statistics (v25)
Date	April 5, 2017 12:41:31
Msg-id	48c43f17-ecde-d582-6442-34516dd35a99@mail.de
In response to	Re: multivariate statistics (v25) (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List	pgsql-hackers

Thanks Tomas and David for hacking on this patch.

On 04.04.2017 20:19, Tomas Vondra wrote:
> I'm not sure we still need the min_group_size, when evaluating 
> dependencies. It was meant to deal with 'noisy' data, but I think that
> after switching to the 'degree' it might actually be a bad idea.
>
> Consider this:
>
>     create table t (a int, b int);
>     insert into t select 1, 1 from generate_series(1, 10000) s(i);
>     insert into t select i, i from generate_series(2, 20000) s(i);
>     create statistics s with (dependencies) on (a,b) from t;
>     analyze t;
>
>     select stadependencies from pg_statistic_ext ;
>                   stadependencies
>     --------------------------------------------
>      [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]
>     (1 row)
>
> So the degree of the dependency is just ~0.333 although it's obviously 
> a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The 
> reason is that we discard 2/3 of rows, because those groups are only a 
> single row each, except for the one large group (1/3 of rows).

Just so I can follow the comments better: is "dependency" here roughly
what statisticians mean by "conditional probability"?
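
To check my reading of the ~0.333 figure in the quoted example, here is a
rough Python sketch of how I understand the degree to be computed: rows are
grouped by 'a', groups smaller than min_group_size are discarded, and the
degree is the share of all rows that sit in the remaining groups with a
single 'b' value. This is only a model and not the patch's code: the function
name is made up, the threshold of 3 is an assumption (any value from 2 to
10000 gives the same result here), and it scans the whole table where
ANALYZE would work on a sample.

```python
from collections import defaultdict


def dependency_degree(rows, min_group_size=3):
    """Share of rows in sufficiently large groups of 'a' that map to one 'b'.

    Hypothetical model of the degree for a => b; min_group_size=3 is an
    assumed threshold, not a value taken from the patch.
    """
    groups = defaultdict(list)
    for a, b in rows:
        groups[a].append(b)

    supporting = 0
    for bs in groups.values():
        if len(bs) < min_group_size:
            continue  # small group: discarded, its rows never count as support
        if len(set(bs)) == 1:
            supporting += len(bs)  # 'a' determines 'b' within this group
    return supporting / len(rows)


# Same data as the two INSERTs in the quoted example (29999 rows in total).
rows = [(1, 1)] * 10000 + [(i, i) for i in range(2, 20001)]

print(round(dependency_degree(rows), 6))          # 0.333344
print(dependency_degree(rows, min_group_size=1))  # 1.0
```

With the threshold, only the 10000 rows of the large group count, so the
result is 10000/29999; without it, every single-row group counts as well and
the degree is 1.0.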

Sven
