Re: multivariate statistics (v25) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics (v25)
Date
Msg-id 56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] multivariate statistics (v25)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: multivariate statistics (v25)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Re: multivariate statistics (v25)  ("Sven R. Kunze" <srkunze@mail.de>)
List pgsql-hackers
On 04/04/2017 09:55 AM, David Rowley wrote:
> On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've attached an updated patch.
>
> I've made another pass at this and ended up removing the tryextstats
> variable. We now only try to use extended statistics when
> clauselist_selectivity() is given a valid RelOptInfo with rtekind ==
> RTE_RELATION, and of course, it must also have some extended stats
> defined too.
>
> I've also cleaned up a few more comments, many of which I managed to
> omit updating when I refactored how the selectivity estimates ties
> into clauselist_selectivity()
>
> I'm quite happy with all of this now, and would also be happy for
> other people to take a look and comment.
>
> As a reviewer, I'd be marking this ready for committer, but I've moved
> a little way from just reviewing this now, having spent two weeks
> hacking at it.
>
> The latest patch is attached.
>

Thanks David, I agree the reworked patch is much cleaner that the last 
version I posted. Thanks for spending your time on it.

Two minor comments:

1) DEPENDENCY_MIN_GROUP_SIZE

I'm not sure we still need the min_group_size, when evaluating 
dependencies. It was meant to deal with 'noisy' data, but I think it 
after switching to the 'degree' it might actually be a bad idea.

Consider this:
    create table t (a int, b int);    insert into t select 1, 1 from generate_series(1, 10000) s(i);    insert into t
selecti, i from generate_series(2, 20000) s(i);    create statistics s with (dependencies) on (a,b) from t;    analyze
t;
    select stadependencies from pg_statistic_ext ;                  stadependencies
--------------------------------------------    [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}]    (1 row)
 

So the degree of the dependency is just ~0.333 although it's obviously a 
perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason 
is that we discard 2/3 of rows, because those groups are only a single 
row each, except for the one large group (1/3 of rows).

Without the mininum group size limitation, the dependencies are:
    test=# select stadependencies from pg_statistic_ext ;                  stadependencies
--------------------------------------------    [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]    (1 row)
 

which seems way more reasonable, I think.


2) A minor detail is that instead of this
    if (estimatedclauses != NULL &&        bms_is_member(listidx, estimatedclauses))        continue;

perhaps we should do just this:
    if (bms_is_member(listidx, estimatedclauses))        continue;

bms_is_member does the same NULL check right at the beginning, so I 
don't think this might make a measurable difference.


kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: logical replication launcher crash on buildfarm
Next
From: Tomas Vondra
Date:
Subject: Re: Re: PATCH: pageinspect / add page_checksum andbt_page_items(bytea)