Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 20200309181915.5lxhuw2qxoihfoqo@development
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers
On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
>On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Speaking of which, would you take a look at [1]? I think supporting SAOP
>> is fine, but I wonder if you agree with my conclusion we can't really
>> support inclusion @> as explained in [2].
>>
>
>Hmm, I'm not sure. However, thinking about your example in [2] reminds
>me of a thought I had a while ago, but then forgot about --- there is
>a flaw in the formula used for computing probabilities with functional
>dependencies:
>
>  P(a,b) = P(a) * [f + (1-f)*P(b)]
>
>because it might return a value that is larger that P(b), which
>obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

   create table t (a int, b int);
   insert into t select mod(i,50), mod(i,25)
     from generate_series(1,5000) s(i);
   update t set a = 0 where a < 3;
   create statistics s (dependencies) on a,b from t;

which then does this:

   test=# explain select * from t where a = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..86.50 rows=300 width=8)
      Filter: (a = 0)
   (2 rows)

   test=# explain select * from t where b = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..86.50 rows=200 width=8)
      Filter: (b = 0)
   (2 rows)

   test=# explain select * from t where a = 0 and b = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..99.00 rows=283 width=8)
      Filter: ((a = 0) AND (b = 0))
   (2 rows)

Which I think is the issue you've described ...

>We should amend that formula to prevent a result larger than P(b). The
>obvious way to do that would be to use:
>
>  P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
>
>but actually I think it would be better and more principled to use:
>
>  P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
>
>I.e., for those rows believed to be functionally dependent, we use the
>minimum probability, and for the rows believed to be independent, we
>use the product.
>

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

>I think that would solve the problem with the example you gave at the
>end of [2], but I'm not sure if it helps with the general case.
>

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: range_agg
Next
From: Andres Freund
Date:
Subject: Re: logical replication empty transactions