Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | 20200309181915.5lxhuw2qxoihfoqo@development Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
|
List | pgsql-hackers |
On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote: >On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Speaking of which, would you take a look at [1]? I think supporting SAOP >> is fine, but I wonder if you agree with my conclusion we can't really >> support inclusion @> as explained in [2]. >> > >Hmm, I'm not sure. However, thinking about your example in [2] reminds >me of a thought I had a while ago, but then forgot about --- there is >a flaw in the formula used for computing probabilities with functional >dependencies: > > P(a,b) = P(a) * [f + (1-f)*P(b)] > >because it might return a value that is larger that P(b), which >obviously should not be possible. Hmmm, yeah. It took me a while to reproduce this - the trick is in "inverting" the dependency on a subset of data, e.g. like this: create table t (a int, b int); insert into t select mod(i,50), mod(i,25) from generate_series(1,5000) s(i); update t set a = 0 where a < 3; create statistics s (dependencies) on a,b from t; which then does this: test=# explain select * from t where a = 0; QUERY PLAN ---------------------------------------------------- Seq Scan on t (cost=0.00..86.50 rows=300 width=8) Filter: (a = 0) (2 rows) test=# explain select * from t where b = 0; QUERY PLAN ---------------------------------------------------- Seq Scan on t (cost=0.00..86.50 rows=200 width=8) Filter: (b = 0) (2 rows) test=# explain select * from t where a = 0 and b = 0; QUERY PLAN ---------------------------------------------------- Seq Scan on t (cost=0.00..99.00 rows=283 width=8) Filter: ((a = 0) AND (b = 0)) (2 rows) Which I think is the issue you've described ... >We should amend that formula to prevent a result larger than P(b). The >obvious way to do that would be to use: > > P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b)) > >but actually I think it would be better and more principled to use: > > P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b) > >I.e., for those rows believed to be functionally dependent, we use the >minimum probability, and for the rows believed to be independent, we >use the product. > Hmmm, yeah. The trouble is that we currently don't really have both selectivities in dependencies_clauselist_selectivity :-( We get both clauses, but we only compute selectivity of the "implied" clause, and we leave the other one as not estimated (possibly up to clauselist_selectivity). It's also not clear to me how would this work for more than two clauses, that are all functionally dependent. Like (a => b => c), for example. But I haven't thought about this very much yet. >I think that would solve the problem with the example you gave at the >end of [2], but I'm not sure if it helps with the general case. > I don't follow. I think the issue with [2] is that we can't really apply stats about the array values to queries on individual array elements. Can you explain how would the proposed changes deal with this? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: