Thread: Additional improvements to extended statistics

Additional improvements to extended statistics

From
Tomas Vondra
Date:
Hi,

Now that I've committed [1] which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.
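
For illustration, the sort of query this is about (a made-up table with an
MCV statistics object on the two columns, not an example from the patch):

    CREATE STATISTICS s (mcv) ON a, b FROM t;
    ANALYZE t;
    SELECT * FROM t WHERE a = 1 OR b = 1;

Until now the statistics would only be used if the two conditions were
connected by AND.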


2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

   - Var op Const
   - Var IS [NOT] NULL
   - [NOT] Var
   - ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.
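
For example, the clause shapes this adds (a made-up table; both Vars
reference the same relation):

    SELECT * FROM t WHERE a < b;
    SELECT * FROM t WHERE a = b AND c = 10;

Clauses comparing Vars from two different relations are join clauses, which
remain out of scope here.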


3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

    CREATE STATISTICS s ON a, b, c FROM t;

but not

    CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about the order of
the keys, so we don't need to bother storing 0 as a placeholder key for
each expression (the way indexes do) - we can simply assume the plain
keys come first, followed by the expressions.

And this is what the patch does now.
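
As a rough sketch of what that layout means (assuming plain columns and
expressions may be mixed; I'm deliberately not naming the new catalog field
here):

    CREATE STATISTICS s ON a, b, (a + b), (c + 1) FROM t;

    -- the plain columns a, b stay in the existing stxkeys field
    -- the two expressions go into the new expressions field, and are
    -- treated as if they followed the plain keys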

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?


regards

[1] https://commitfest.postgresql.org/26/2320/

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Pavel Stehule
Date:


út 14. 1. 2020 v 0:00 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
Hi,

Now that I've committed [1] which allows us to use multiple extended
statistics per table, I'd like to start a thread discussing a couple of
additional improvements for extended statistics. I've considered
starting a separate patch for each, but that would be messy as those
changes will touch roughly the same places. So I've organized it into a
single patch series, with the simpler parts at the beginning.

There are three main improvements:

1) improve estimates of OR clauses

Until now, OR clauses pretty much ignored extended statistics, based on
the experience that they're less vulnerable to misestimates. But it's a
bit weird that AND clauses are handled while OR clauses are not, so this
extends the logic to OR clauses.

Status: I think this is fairly OK.


2) support estimating clauses (Var op Var)

Currently, we only support clauses with a single Var, i.e. clauses like

   - Var op Const
   - Var IS [NOT] NULL
   - [NOT] Var
   - ...

and AND/OR clauses built from those simple ones. This patch adds support
for clauses of the form (Var op Var), of course assuming both Vars come
from the same relation.

Status: This works, but it feels a bit hackish. Needs more work.


3) support extended statistics on expressions

Currently we only allow simple references to columns in extended stats,
so we can do

    CREATE STATISTICS s ON a, b, c FROM t;

but not

    CREATE STATISTICS s ON (a+b), (c + 1) FROM t;

+1 for expression statistics - it can be a great feature.

Pavel


This patch aims to allow this. At the moment it's a WIP - it does most
of the catalog changes and stats building, but with some hacks/bugs. And
it does not even try to use those statistics during estimation.

The first question is how to extend the current pg_statistic_ext catalog
to support expressions. I've been planning to do it the way we support
expressions for indexes, i.e. have two catalog fields - one for keys,
one for expressions.

One difference is that for statistics we don't care about the order of
the keys, so we don't need to bother storing 0 as a placeholder key for
each expression (the way indexes do) - we can simply assume the plain
keys come first, followed by the expressions.

And this is what the patch does now.

I'm however wondering whether to keep this split - why not to just treat
everything as expressions, and be done with it? A key just represents a
Var expression, after all. And it would massively simplify a lot of code
that now has to care about both keys and expressions.

Of course, expressions are a bit more expensive, but I wonder how
noticeable that would be.

Opinions?


regards

[1] https://commitfest.postgresql.org/26/2320/

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
Hi,

Here is a rebased version of this patch series. I've polished the first
two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
added a bunch of regression tests to exercise this code. It's not quite
there yet, but I think it's feasible to get this committed for PG13.

The last part (extended stats on expressions) is far from complete, and
it's not feasible to get it into PG13. There's too much missing stuff.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Fri, Mar 06, 2020 at 01:15:56AM +0100, Tomas Vondra wrote:
>Hi,
>
>Here is a rebased version of this patch series. I've polished the first
>two parts a bit - estimation of OR clauses and (Var op Var) clauses, and
>added a bunch of regression tests to exercise this code. It's not quite
>there yet, but I think it's feasible to get this committed for PG13.
>
>The last part (extended stats on expressions) is far from complete, and
>it's not feasible to get it into PG13. There's too much missing stuff.
>

Meh, the last part with stats on expressions is not quite right and it
breaks the cputube tester, so here are the first two parts only. I don't
plan to pursue the 0003 part for PG13 anyway, as mentioned.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Here is a rebased version of this patch series. I've polished the first
> two parts a bit - estimation of OR clauses and (Var op Var) clauses.
>

Hi,

I've been looking over the first patch (OR list support). It mostly
looks reasonable to me, except there's a problem with the way
statext_mcv_clauselist_selectivity() combines multiple stat_sel values
into the final result -- in the OR case, it needs to start with sel =
0, and then apply the OR formula to factor in each new estimate. I.e.,
this isn't right for an OR list:

        /* Factor the estimate from this MCV to the oveall estimate. */
        sel *= stat_sel;

(Oh and there's a typo in that comment: s/oveall/overall/).

For example, with the regression test data, this isn't estimated well:

  SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;

Similarly, if no extended stats can be applied it needs to return 0
not 1, for example this query on the test data:

  SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;

It might also be worth adding a couple more regression test cases like these.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:
>On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Here is a rebased version of this patch series. I've polished the first
>> two parts a bit - estimation of OR clauses and (Var op Var) clauses.
>>
>
>Hi,
>
>I've been looking over the first patch (OR list support). It mostly
>looks reasonable to me, except there's a problem with the way
>statext_mcv_clauselist_selectivity() combines multiple stat_sel values
>into the final result -- in the OR case, it needs to start with sel =
>0, and then apply the OR formula to factor in each new estimate. I.e.,
>this isn't right for an OR list:
>
>        /* Factor the estimate from this MCV to the oveall estimate. */
>        sel *= stat_sel;
>
>(Oh and there's a typo in that comment: s/oveall/overall/).
>
>For example, with the regression test data, this isn't estimated well:
>
>  SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;
>
>Similarly, if no extended stats can be applied it needs to return 0
>not 1, for example this query on the test data:
>
>  SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;
>

Ah, right. Thanks for noticing this. Attached is an updated patch series
with parts 0002 and 0003 adding tests demonstrating the issue and then
fixing it (both shall be merged to 0001).

>It might also be worth adding a couple more regression test cases like these.

Agreed, 0002 adds a couple of relevant tests.

Incidentally, I've been working on improving test coverage for extended
stats over the past few days (it has ~80% of lines covered, which is
neither bad nor great). I haven't submitted that to hackers yet, because
it's mostly mechanical and it would interfere with the two existing threads
about extended stats ...

Speaking of which, would you take a look at [1]? I think supporting SAOP
is fine, but I wonder if you agree with my conclusion we can't really
support inclusion @> as explained in [2].
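
For concreteness, the two clause shapes in question look roughly like this
(an illustration on a made-up table, not taken from those threads):

    -- ScalarArrayOpExpr (SAOP): a scalar column compared to an array of constants
    SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3]);

    -- inclusion @> on an array column, i.e. a condition on individual elements
    SELECT * FROM t WHERE arr @> ARRAY[1];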

[1] https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy@pierred-pdoc
[2] https://www.postgresql.org/message-id/20200202184134.swoqkqlqorqolrqv%40development

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:
>On Sun, Mar 08, 2020 at 07:17:10PM +0000, Dean Rasheed wrote:
>>On Fri, 6 Mar 2020 at 12:58, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>Here is a rebased version of this patch series. I've polished the first
>>>two parts a bit - estimation of OR clauses and (Var op Var) clauses.
>>>
>>
>>Hi,
>>
>>I've been looking over the first patch (OR list support). It mostly
>>looks reasonable to me, except there's a problem with the way
>>statext_mcv_clauselist_selectivity() combines multiple stat_sel values
>>into the final result -- in the OR case, it needs to start with sel =
>>0, and then apply the OR formula to factor in each new estimate. I.e.,
>>this isn't right for an OR list:
>>
>>       /* Factor the estimate from this MCV to the oveall estimate. */
>>       sel *= stat_sel;
>>
>>(Oh and there's a typo in that comment: s/oveall/overall/).
>>
>>For example, with the regression test data, this isn't estimated well:
>>
>> SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0;
>>
>>Similarly, if no extended stats can be applied it needs to return 0
>>not 1, for example this query on the test data:
>>
>> SELECT * FROM mcv_lists WHERE a = 1 OR a = 2 OR d IS NOT NULL;
>>
>
>Ah, right. Thanks for noticing this. Attached is an updated patch series
>with parts 0002 and 0003 adding tests demonstrating the issue and then
>fixing it (both shall be merged to 0001).
>

One day I won't forget to actually attach the files ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Speaking of which, would you take a look at [1]? I think supporting SAOP
> is fine, but I wonder if you agree with my conclusion we can't really
> support inclusion @> as explained in [2].
>

Hmm, I'm not sure. However, thinking about your example in [2] reminds
me of a thought I had a while ago, but then forgot about --- there is
a flaw in the formula used for computing probabilities with functional
dependencies:

  P(a,b) = P(a) * [f + (1-f)*P(b)]

because it might return a value that is larger than P(b), which
obviously should not be possible. We should amend that formula to
prevent a result larger than P(b). The obvious way to do that would be
to use:

  P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))

but actually I think it would be better and more principled to use:

  P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)

I.e., for those rows believed to be functionally dependent, we use the
minimum probability, and for the rows believed to be independent, we
use the product.
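
To put some made-up numbers on that, say P(a) = 0.3, P(b) = 0.2 and f = 0.9:

  current:  0.3 * [0.9 + 0.1 * 0.2]               = 0.276  (exceeds P(b) = 0.2)
  clamped:  Min(0.276, 0.2)                       = 0.2
  proposed: 0.9 * Min(0.3, 0.2) + 0.1 * 0.3 * 0.2 = 0.186  (never exceeds P(b))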

I think that would solve the problem with the example you gave at the
end of [2], but I'm not sure if it helps with the general case.

Regards,
Dean


> [1] https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy@pierred-pdoc
> [2] https://www.postgresql.org/message-id/20200202184134.swoqkqlqorqolrqv%40development



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
>On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Speaking of which, would you take a look at [1]? I think supporting SAOP
>> is fine, but I wonder if you agree with my conclusion we can't really
>> support inclusion @> as explained in [2].
>>
>
>Hmm, I'm not sure. However, thinking about your example in [2] reminds
>me of a thought I had a while ago, but then forgot about --- there is
>a flaw in the formula used for computing probabilities with functional
>dependencies:
>
>  P(a,b) = P(a) * [f + (1-f)*P(b)]
>
>because it might return a value that is larger than P(b), which
>obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

   create table t (a int, b int);
   insert into t select mod(i,50), mod(i,25)
     from generate_series(1,5000) s(i);
   update t set a = 0 where a < 3;
   create statistics s (dependencies) on a,b from t;

which then does this:

   test=# explain select * from t where a = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..86.50 rows=300 width=8)
      Filter: (a = 0)
   (2 rows)

   test=# explain select * from t where b = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..86.50 rows=200 width=8)
      Filter: (b = 0)
   (2 rows)

   test=# explain select * from t where a = 0 and b = 0;
                        QUERY PLAN
   ----------------------------------------------------
    Seq Scan on t  (cost=0.00..99.00 rows=283 width=8)
      Filter: ((a = 0) AND (b = 0))
   (2 rows)

Which I think is the issue you've described - the estimate for the AND
clause (283 rows) ends up higher than the estimate for (b = 0) alone
(200 rows) ...

>We should amend that formula to prevent a result larger than P(b). The
>obvious way to do that would be to use:
>
>  P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
>
>but actually I think it would be better and more principled to use:
>
>  P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
>
>I.e., for those rows believed to be functionally dependent, we use the
>minimum probability, and for the rows believed to be independent, we
>use the product.
>

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how this would work for more than two clauses
that are all functionally dependent - like (a => b => c), for example.
But I haven't thought about this very much yet.

>I think that would solve the problem with the example you gave at the
>end of [2], but I'm not sure if it helps with the general case.
>

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how the proposed changes would deal with this?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Mon, 9 Mar 2020 at 18:19, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
> >
> >  P(a,b) = P(a) * [f + (1-f)*P(b)]
> >
> >because it might return a value that is larger than P(b), which
> >obviously should not be possible.
>
> Hmmm, yeah. It took me a while to reproduce this - the trick is in
> "inverting" the dependency on a subset of data, e.g. like this:
>
>    create table t (a int, b int);
>    insert into t select mod(i,50), mod(i,25)
>      from generate_series(1,5000) s(i);
>    update t set a = 0 where a < 3;
>    create statistics s (dependencies) on a,b from t;
>
> which then does this:
>
>    test=# explain select * from t where a = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..86.50 rows=300 width=8)
>       Filter: (a = 0)
>    (2 rows)
>
>    test=# explain select * from t where b = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..86.50 rows=200 width=8)
>       Filter: (b = 0)
>    (2 rows)
>
>    test=# explain select * from t where a = 0 and b = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..99.00 rows=283 width=8)
>       Filter: ((a = 0) AND (b = 0))
>    (2 rows)
>
> Which I think is the issue you've described ...
>

I think this is also related to the problem that functional dependency
stats don't take into account the fact that the user clauses may not
be compatible with one another. For example:

CREATE TABLE t (a int, b int);
INSERT INTO t
  SELECT x/10,x/10 FROM (SELECT generate_series(1,x)
                     FROM generate_series(1,100) g(x)) AS t(x);
CREATE STATISTICS s (dependencies) ON a,b FROM t;
ANALYSE t;

EXPLAIN SELECT * FROM t WHERE a = 10;

                    QUERY PLAN
--------------------------------------------------
 Seq Scan on t  (cost=0.00..86.12 rows=1 width=8)
   Filter: (a = 10)
(2 rows)

EXPLAIN SELECT * FROM t WHERE b = 1;

                     QUERY PLAN
----------------------------------------------------
 Seq Scan on t  (cost=0.00..86.12 rows=865 width=8)
   Filter: (b = 1)
(2 rows)

EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 1;

                     QUERY PLAN
----------------------------------------------------
 Seq Scan on t  (cost=0.00..98.75 rows=865 width=8)
   Filter: ((a = 10) AND (b = 1))
(2 rows)

whereas without stats it would estimate 1 row. That kind of
over-estimate could get very bad, so it would be good to find a way to
fix it.


> >We should amend that formula to prevent a result larger than P(b). The
> >obvious way to do that would be to use:
> >
> >  P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
> >
> >but actually I think it would be better and more principled to use:
> >
> >  P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
> >
> >I.e., for those rows believed to be functionally dependent, we use the
> >minimum probability, and for the rows believed to be independent, we
> >use the product.
> >
>
> Hmmm, yeah. The trouble is that we currently don't really have both
> selectivities in dependencies_clauselist_selectivity :-(
>

I hacked on this a bit, and I think it's possible to apply dependency
stats in a more general way (not necessarily assuming equality
clauses), something like the attached very rough patch.

This approach guarantees that the result of combining a pair of
selectivities with a functional dependency between them gives a
combined selectivity that is never greater than either individual
selectivity.

One regression test fails, but looking at it, that's to be expected --
the test alters the type of a column, causing its univariate stats to
be dropped, so the single-column estimate is reduced, and the new code
refuses to give a higher estimate than the single clause's new
estimate.


> It's also not clear to me how would this work for more than two clauses,
> that are all functionally dependent. Like (a => b => c), for example.
> But I haven't thought about this very much yet.
>

I attempted to solve that by computing a chain of conditional
probabilities. The maths needs checking over (as I said, this is a
very rough patch). In particular, I think it's wrong for cases like (
a->b, a->c ), but I think it's along the right lines.


> >I think that would solve the problem with the example you gave at the
> >end of [2], but I'm not sure if it helps with the general case.
> >
>
> I don't follow. I think the issue with [2] is that we can't really apply
> stats about the array values to queries on individual array elements.
> Can you explain how would the proposed changes deal with this?
>

With this patch, the original estimate of ~900 rows in that example is
restored with functional dependencies, because of the way it utilises
the minimum selectivity of the 2 clauses.

I've not fully thought this through, but I think it might allow
functional dependencies to be applied to a wider range of operators.

Regards,
Dean

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:
> >
> >Attached is an updated patch series
> >with parts 0002 and 0003 adding tests demonstrating the issue and then
> >fixing it (both shall be merged to 0001).
> >
>
> One day I won't forget to actually attach the files ...
>

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Fri, Mar 13, 2020 at 04:54:51PM +0000, Dean Rasheed wrote:
>On Mon, 9 Mar 2020 at 00:06, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:
>> >
>> >Attached is an updated patch series
>> >with parts 0002 and 0003 adding tests demonstrating the issue and then
>> >fixing it (both shall be merged to 0001).
>> >
>>
>> One day I won't forget to actually attach the files ...
>>
>
>0001-0003 look reasonable to me.
>
>One minor point -- there are now 2 code blocks that are basically the
>same, looping over a list of clauses, calling clause_selectivity() and
>then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
>be combined into a new function (clauselist_selectivity_simple_or(),
>say). I guess it would need to be passed the initial starting
>selectivity s1, but it ought to help reduce code duplication.
>

Attached is a patch series rebased on top of the current master, after
committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
to get rid of the code duplication, and barring objections I'll get it
committed shortly together with the two parts improving test coverage.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
>
> ...
>
>Attached is a patch series rebased on top of the current master, after
>committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
>to get rid of the code duplication, and barring objections I'll get it
>committed shortly together with the two parts improving test coverage.
>

I've pushed the two patches improving test coverage for functional
dependencies and MCV lists, which seem mostly non-controversial. I'll
wait a bit longer with the two patches that actually change behavior
(rebased version attached, to keep cputube happy).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Thomas Munro
Date:
On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
> >Attached is a patch series rebased on top of the current master, after
> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
> >to get rid of the code duplication, and barring objections I'll get it
> >committed shortly together with the two parts improving test coverage.
> >
>
> I've pushed the two patches improving test coverage for functional
> dependencies and MCV lists, which seems mostly non-controversial. I'll
> wait a bit more with the two patches actually changing behavior (rebased
> version attached, to keep cputube happy).

Some comment fixes:

-               /* Check if the expression the right shape (one Var, one Const) */
-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
+               /*
+                * Check if the expression the right shape (one Var and one Const,
+                * or two Vars).
+                */

Check if the expression "has" or "is of" the right shape.

- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
- * possibly with a RelabelType on top. When the expression matches this form,
- * returns true, otherwise returns false.
+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
+ * or (Var op Var), possibly with a RelabelType on top. When the expression
+ * matches this form, returns true, otherwise returns false.

... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...

+               /*
+                * Both variables have to be for the same relation (otherwise it's
+                * a join clause, and we don't deal with those yet.
+                */

Missing close parenthesis.

Stimulated by some bad plans involving JSON, I found my way to your
WIP stats-on-expressions patch in this thread.  Do I understand
correctly that it will eventually also support single expressions,
like CREATE STATISTICS t_distinct_abc (ndistinct) ON
(my_jsonb_column->>'abc') FROM t?  It looks like that would solve
problems that otherwise require a generated column or an expression
index just to get ndistinct.
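
For reference, the workaround I have in mind today is something like this
(index name made up; the index exists purely so that ANALYZE collects
per-expression statistics such as ndistinct):

    CREATE INDEX t_abc_idx ON t ((my_jsonb_column->>'abc'));
    ANALYZE t;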



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Sun, Mar 15, 2020 at 02:48:02PM +1300, Thomas Munro wrote:
>On Sun, Mar 15, 2020 at 1:08 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
>> >Attached is a patch series rebased on top of the current master, after
>> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
>> >to get rid of the code duplication, and barring objections I'll get it
>> >committed shortly together with the two parts improving test coverage.
>> >
>>
>> I've pushed the two patches improving test coverage for functional
>> dependencies and MCV lists, which seems mostly non-controversial. I'll
>> wait a bit more with the two patches actually changing behavior (rebased
>> version attached, to keep cputube happy).
>
>Some comment fixes:
>
>-               /* Check if the expression the right shape (one Var, one Const) */
>-               if (!examine_clause_args(expr->args, &var, NULL, NULL))
>+               /*
>+                * Check if the expression the right shape (one Var and one Const,
>+                * or two Vars).
>+                */
>
>Check if the expression "has" or "is of" the right shape.
>
>- * Attempts to match the arguments to either (Var op Const) or (Const op Var),
>- * possibly with a RelabelType on top. When the expression matches this form,
>- * returns true, otherwise returns false.
>+ * Attempts to match the arguments to either (Var op Const) or (Const op Var)
>+ * or (Var op Var), possibly with a RelabelType on top. When the expression
>+ * matches this form, returns true, otherwise returns false.
>
>... match the arguments to (Var op Const), (Const op Var) or (Var op Var), ...
>
>+               /*
>+                * Both variables have to be for the same relation (otherwise it's
>+                * a join clause, and we don't deal with those yet.
>+                */
>
>Missing close parenthesis.
>

Thanks, I'll get this fixed.

>Stimulated by some bad plans involving JSON, I found my way to your
>WIP stats-on-expressions patch in this thread.  Do I understand
>correctly that it will eventually also support single expressions,
>like CREATE STATISTICS t_distinct_abc (ndistinct) ON
>(my_jsonb_column->>'abc') FROM t?  It looks like that would solve
>problems that otherwise require a generated column or an expression
>index just to get ndistinct.

Yes, I think that's generally the plan. I was also thinking about
inventing some sort of special JSON statistics (e.g. extracting paths
from the JSONB and computing frequencies, or something like that). But
stats on expressions are one of the things I'd like to do in PG14.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
> >
> >Attached is a patch series rebased on top of the current master, after
> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
> >to get rid of the code duplication, and barring objections I'll get it
> >committed shortly together with the two parts improving test coverage.
>
> I've pushed the two patches improving test coverage for functional
> dependencies and MCV lists, which seems mostly non-controversial. I'll
> wait a bit more with the two patches actually changing behavior (rebased
> version attached, to keep cputube happy).
>

Patch 0001 looks to be mostly ready. Just a couple of final comments:

+       if (is_or)
+           simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
+                                                         jointype, sjinfo, NULL, 1.0);
+       else

Surely that should be passing 0.0 as the final argument, otherwise it
will always just return simple_sel = 1.0.


+        *
+        * XXX We can't multiply with current value, because for OR clauses
+        * we start with 0.0, so we simply assign to s1 directly.
+        */
+       s = statext_clauselist_selectivity(root, clauses, varRelid,
+                                          jointype, sjinfo, rel,
+                                          &estimatedclauses, true);

That final part of the comment is no longer relevant (variable s1 no
longer exists). Probably it could now just be deleted, since I think
there are sufficient comments elsewhere to explain what's going on.

Otherwise it looks good, and I think this will lead to some very
worthwhile improvements.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Sun, Mar 15, 2020 at 12:37:37PM +0000, Dean Rasheed wrote:
>On Sun, 15 Mar 2020 at 00:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sat, Mar 14, 2020 at 05:56:10PM +0100, Tomas Vondra wrote:
>> >
>> >Attached is a patch series rebased on top of the current master, after
>> >committing the ScalarArrayOpExpr enhancements. I've updated the OR patch
>> >to get rid of the code duplication, and barring objections I'll get it
>> >committed shortly together with the two parts improving test coverage.
>>
>> I've pushed the two patches improving test coverage for functional
>> dependencies and MCV lists, which seems mostly non-controversial. I'll
>> wait a bit more with the two patches actually changing behavior (rebased
>> version attached, to keep cputube happy).
>>
>
>Patch 0001 looks to be mostly ready. Just a couple of final comments:
>
>+       if (is_or)
>+           simple_sel = clauselist_selectivity_simple_or(root, stat_clauses, varRelid,
>+                                                         jointype, sjinfo, NULL, 1.0);
>+       else
>
>Surely that should be passing 0.0 as the final argument, otherwise it
>will always just return simple_sel = 1.0.
>
>
>+        *
>+        * XXX We can't multiply with current value, because for OR clauses
>+        * we start with 0.0, so we simply assign to s1 directly.
>+        */
>+       s = statext_clauselist_selectivity(root, clauses, varRelid,
>+                                          jointype, sjinfo, rel,
>+                                          &estimatedclauses, true);
>
>That final part of the comment is no longer relevant (variable s1 no
>longer exists). Probably it could now just be deleted, since I think
>there are sufficient comments elsewhere to explain what's going on.
>
>Otherwise it looks good, and I think this will lead to some very
>worthwhile improvements.
>

Attached is a rebased patch series, addressing both those issues.

I've been wondering why none of the regression tests failed because of
the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
make the tests stable, all the MCV lists we use are "perfect", i.e. they
represent 100% of the data. But this selectivity is only used to compute
the selectivity of the part not represented by the MCV list, so it's not
really used. I suppose we could add a test where the MCV list covers only
part of the data, but I'm afraid that'd be inherently unstable :-(

Another thing I was thinking about is the changes to the API. We need to
pass information about whether the clauses are connected by AND or OR to
a number of places, and 0001 does that in two ways. For some functions it
adds a new parameter (called is_or), and for other functions it creates
a new copy of the function. So for example

   - statext_mcv_clauselist_selectivity
   - statext_clauselist_selectivity

got the new flag, while e.g. clauselist_selectivity gets a new "copy"
sibling called clauselist_selectivity_or.

There were two reasons for not using a flag. First, clauselist_selectivity
and similar functions have to do very different stuff for these two
cases, so it'd be just one huge if/else block. Second, minimizing
breakage of third-party code - pretty much all the extensions I've seen
only work with AND clauses, and call clauselist_selectivity. Adding a
flag would break that code. (Also, there's a bit of laziness, because
this was the simplest thing to do during development.)

But I wonder if that's sufficient reason - maybe we should just add the
flag in all cases. It might break some code, but the fix is trivial (add
a false there).

Opinions?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Attached is a rebased patch series, addressing both those issues.
>
> I've been wondering why none of the regression tests failed because of
> the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
> make the tests stable, all the MCV lists we use are "perfect" i.e. it
> represents 100% of the data. But this selectivity is used to compute
> selectivity only for the part not represented by the MCV list, i.e. it's
> not really used. I suppose we could add a test that would use larger
> MCV item, but I'm afraid that'd be inherently unstable :-(
>

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.
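
Something along these lines, perhaps (untested, just to sketch the shape of
the data - a handful of common values that should make it into the MCV list,
plus a tail of values that appear only once):

    CREATE TABLE mcv_partial (a int, b int);
    -- common values, frequent enough to end up in the MCV list
    INSERT INTO mcv_partial SELECT mod(i, 10), mod(i, 10)
      FROM generate_series(1, 1000) s(i);
    -- values occurring only once, so the MCV list does not cover everything
    INSERT INTO mcv_partial SELECT i, i
      FROM generate_series(10000, 10099) s(i);
    CREATE STATISTICS s_partial (mcv) ON a, b FROM mcv_partial;
    ANALYZE mcv_partial;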

> Another thing I was thinking about is the changes to the API. We need to
> pass information whether the clauses are connected by AND or OR to a
> number of places, and 0001 does that in two ways. For some functions it
> adds a new parameter (called is_or), and for other functions it creates
> a new copy of a function. So for example
>
>    - statext_mcv_clauselist_selectivity
>    - statext_clauselist_selectivity
>
> got the new flag, while e.g. clauselist_selectivity gets a new "copy"
> sibling called clauselist_selectivity_or.
>
> There were two reasons for not using flag. First, clauselist_selectivity
> and similar functions have to do very different stuff for these two
> cases, so it'd be just one huge if/else block. Second, minimizing
> breakage of third-party code - pretty much all the extensions I've seen
> only work with AND clauses, and call clauselist_selectivity. Adding a
> flag would break that code. (Also, there's a bit of laziness, because
> this was the simplest thing to do during development.)
>
> But I wonder if that's sufficient reason - maybe we should just add the
> flag in all cases. It might break some code, but the fix is trivial (add
> a false there).
>
> Opinions?
>

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of the public API tend to have
either an implicitly AND'ed list of clauses or a single OR clause.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Thu, Mar 19, 2020 at 07:08:07PM +0000, Dean Rasheed wrote:
>On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Attached is a rebased patch series, addressing both those issues.
>>
>> I've been wondering why none of the regression tests failed because of
>> the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
>> make the tests stable, all the MCV lists we use are "perfect" i.e. it
>> represents 100% of the data. But this selectivity is used to compute
>> selectivity only for the part not represented by the MCV list, i.e. it's
>> not really used. I suppose we could add a test that would use larger
>> MCV item, but I'm afraid that'd be inherently unstable :-(
>>
>
>I think it ought to be possible to write stable tests for this code
>branch -- I think all you need is for the number of rows to remain
>small, so that the stats sample every row and are predictable, while
>the MCVs don't cover all values, which can be achieved with a mix of
>some common values, and some that only occur once.
>
>I haven't tried it, but it seems like it would be possible in principle.
>

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.


>> Another thing I was thinking about is the changes to the API. We need to
>> pass information whether the clauses are connected by AND or OR to a
>> number of places, and 0001 does that in two ways. For some functions it
>> adds a new parameter (called is_or), and for other functions it creates
>> a new copy of a function. So for example
>>
>>    - statext_mcv_clauselist_selectivity
>>    - statext_clauselist_selectivity
>>
>> got the new flag, while e.g. clauselist_selectivity gets a new "copy"
>> sibling called clauselist_selectivity_or.
>>
>> There were two reasons for not using flag. First, clauselist_selectivity
>> and similar functions have to do very different stuff for these two
>> cases, so it'd be just one huge if/else block. Second, minimizing
>> breakage of third-party code - pretty much all the extensions I've seen
>> only work with AND clauses, and call clauselist_selectivity. Adding a
>> flag would break that code. (Also, there's a bit of laziness, because
>> this was the simplest thing to do during development.)
>>
>> But I wonder if that's sufficient reason - maybe we should just add the
>> flag in all cases. It might break some code, but the fix is trivial (add
>> a false there).
>>
>> Opinions?
>>
>
>-1
>
>I think of clause_selectivity() and clauselist_selectivity() as the
>public API that everyone is using, whilst the functions that support
>lists of clauses to be combined using OR are internal (to the planner)
>implementation details. I think callers of public API tend to either
>have implicitly AND'ed list of clauses, or a single OR clause.
>

OK, thanks. That was mostly my reasoning too - not wanting to cause
unnecessary breakage. And yes, I agree most people just call
clauselist_selectivity with a list of clauses combined using AND.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Ah, right. Yeah, I think that should work. I thought there would be some
> volatility due to groups randomly not making it into the MCV list, but
> you're right it's possible to construct the data in a way to make it
> perfectly deterministic. So that's what I've done in the attached patch.
>

Looking through those new tests, another issue occurred to me -- under
some circumstances this patch can lead to extended stats being applied
twice to the same clause, which is not great, because it involves
quite a lot of extra work, and also because it can lead to
overestimates because of the way that MCV stats are applied as a delta
correction to simple_sel.

The way this comes about is as follows -- if we start with an OR
clause, that will be passed as a single-item implicitly AND'ed list to
clauselist_selectivity(), and from there to
statext_clauselist_selectivity() and then to
statext_mcv_clauselist_selectivity(). This will call
clauselist_selectivity_simple() to get simple_sel, before calling
mcv_clauselist_selectivity(), which will recursively compute all the
MCV corrections. However, the call to clauselist_selectivity_simple()
will call clause_selectivity() for each clause (just a single OR
clause in this example), which will now call
clauselist_selectivity_or(), which will go back into
statext_clauselist_selectivity() with "is_or = true", which will apply
the MCV corrections to the same set of clauses that the outer call was
about to process.

I'm not sure what's the best way to resolve that. Perhaps
statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
should ignore OR clauses from an AND-list, on the basis that they will
get processed recursively later. Or perhaps estimatedclauses can
somehow be used to prevent this, though I'm not sure exactly how that
would work.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Mon, Mar 23, 2020 at 08:21:42AM +0000, Dean Rasheed wrote:
>On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Ah, right. Yeah, I think that should work. I thought there would be some
>> volatility due to groups randomly not making it into the MCV list, but
>> you're right it's possible to construct the data in a way to make it
>> perfectly deterministic. So that's what I've done in the attached patch.
>>
>
>Looking through those new tests, another issue occurred to me -- under
>some circumstances this patch can lead to extended stats being applied
>twice to the same clause, which is not great, because it involves
>quite a lot of extra work, and also because it can lead to
>overestimates because of the way that MCV stats are applied as a delta
>correction to simple_sel.
>
>The way this comes about is as follows -- if we start with an OR
>clause, that will be passed as a single-item implicitly AND'ed list to
>clauselist_selectivity(), and from there to
>statext_clauselist_selectivity() and then to
>statext_mcv_clauselist_selectivity(). This will call
>clauselist_selectivity_simple() to get simple_sel, before calling
>mcv_clauselist_selectivity(), which will recursively compute all the
>MCV corrections. However, the call to clauselist_selectivity_simple()
>will call clause_selectivity() for each clause (just a single OR
>clause in this example), which will now call
>clauselist_selectivity_or(), which will go back into
>statext_clauselist_selectivity() with "is_or = true", which will apply
>the MCV corrections to the same set of clauses that the outer call was
>about to process.
>

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by a single statistics object. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t SELECT mod(i, 10), mod(i, 10)
     FROM generate_series(1,100000);
   CREATE STATISTICS s (mcv) ON a,b FROM t;

   SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly, but the calls graph looks like this:

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity  <---
         clauselist_selectivity_simple
           clause_selectivity
             clauselist_selectivity_or
               statext_clauselist_selectivity
                 statext_mcv_clauselist_selectivity  <---
                   clauselist_selectivity_simple_or
                     clause_selectivity
                     clause_selectivity
                   mcv_clauselist_selectivity
               clauselist_selectivity_simple_or
         mcv_clauselist_selectivity
       clauselist_selectivity_simple
         (already estimated)

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.


>I'm not sure what's the best way to resolve that. Perhaps
>statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>should ignore OR clauses from an AND-list, on the basis that they will
>get processed recursively later. Or perhaps estimatedclauses can
>somehow be used to prevent this, though I'm not sure exactly how that
>would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
clauselist_selectivity_simple_or. So the calls would look like this:

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity
         clauselist_selectivity_simple  <--- disable extended stats
           clause_selectivity
             clauselist_selectivity_simple_or
               clause_selectivity
               clause_selectivity
         mcv_clauselist_selectivity
     clauselist_selectivity_simple
       already estimated

I've only quickly hacked clause_selectivity, but it does not seem very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

BTW do you have an example where this would actually cause an issue?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hmmm. So let's consider a simple OR clause with two arguments, both
> covered by single statistics object. Something like this:
>
>    CREATE TABLE t (a int, b int);
>    INSERT INTO t SELECT mod(i, 10), mod(i, 10)
>      FROM generate_series(1,100000);
>    CREATE STATISTICS s (mcv) ON a,b FROM t;
>
>    SELECT * FROM t WHERE a = 0 OR b = 0;
>
> Which is estimated correctly...
>

Hmm, the reason that is estimated correctly is that the MCV values
cover all values in the table, so mcv_totalsel is 1 (or pretty close
to 1), and other_sel is capped to nearly 0, and so the result is
basically just mcv_sel -- i.e., in this case the MCV estimates are
returned more-or-less as-is, rather than being applied as a
correction, and so the result is independent of how many times
extended stats are applied.

The more interesting (and probably more realistic) case is where the
MCV values do not cover all the values in the table, as in the new
mcv_lists_partial examples in the regression tests, for example this
test case, which produces a significant overestimate:

  SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0

although actually, I think there's another reason for that (in
addition to the extended stats correction being applied twice) -- the
way the MCV code makes use of base selectivity doesn't seem really
appropriate for OR clauses, because the way base_frequency is computed
is really based on the assumption that every column would be matched.
I'm not sure that there's any easy answer to that though. It feels like
what's needed when computing the match bitmaps in mcv.c is to produce
a bitmap (would it fit in 1 byte?) per value, to indicate which
columns match, and base_frequency values per column. That would be
significantly more work though, so almost certainly isn't feasible for
PG13.

> IIUC the problem you have in mind is that we end up calling
> statext_mcv_clauselist_selectivity twice, once for the top-level AND
> clause with a single argument, and then recursively for the expanded OR
> clause. Indeed, that seems to be applying the correction twice.
>
>
> >I'm not sure what's the best way to resolve that. Perhaps
> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
> >should ignore OR clauses from an AND-list, on the basis that they will
> >get processed recursively later. Or perhaps estimatedclauses can
> >somehow be used to prevent this, though I'm not sure exactly how that
> >would work.
>
> I don't know. I feel uneasy about just ignoring some of the clauses,
> because what happens for complex clauses, where the OR is not directly
> in the AND clause, but is negated or something like that?
>
> Isn't it the case that clauselist_selectivity_simple (and the OR
> variant) should ignore extended stats entirely? That is, we'd need to
> add a flag (or _simple variant) to clause_selectivity, so that it calls
> clauselist_selectivity_simple_or. So the calls would look like this:
>
>    clauselist_selectivity
>      statext_clauselist_selectivity
>        statext_mcv_clauselist_selectivity
>          clauselist_selectivity_simple  <--- disable extended stats
>            clause_selectivity
>              clauselist_selectivity_simple_or
>                clause_selectivity
>                clause_selectivity
>          mcv_clauselist_selectivity
>      clauselist_selectivity_simple
>        already estimated
>
> I've only quickly hacked clause_selectivity, but it does not seem very
> invasive (of course, it means disruption to clause_selectivity callers,
> but I suppose most call clauselist_selectivity).
>

Sounds like a reasonable approach, but I think it would be better to
preserve the current public API by having clauselist_selectivity()
become a thin wrapper around  a new function that optionally applies
extended stats.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:
>On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hmmm. So let's consider a simple OR clause with two arguments, both
>> covered by single statistics object. Something like this:
>>
>>    CREATE TABLE t (a int, b int);
>>    INSERT INTO t SELECT mod(i, 10), mod(i, 10)
>>      FROM generate_series(1,100000);
>>    CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>>    SELECT * FROM t WHERE a = 0 OR b = 0;
>>
>> Which is estimated correctly...
>>
>
>Hmm, the reason that is estimated correctly is that the MCV values
>cover all values in the table, so mcv_totalsel is 1 (or pretty close
>to 1), and other_sel is capped to nearly 0, and so the result is
>basically just mcv_sel -- i.e., in this case the MCV estimates are
>returned more-or-less as-is, rather than being applied as a
>correction, and so the result is independent of how many times
>extended stats are applied.
>
>The more interesting (and probably more realistic) case is where the
>MCV values do not cover all the values in the table, as in the new
>mcv_lists_partial examples in the regression tests, for example this
>test case, which produces a significant overestimate:
>
>  SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0
>
>although actually, I think there's another reason for that (in
>addition to the extended stats correction being applied twice) -- the
>way the MCV code makes use of base selectivity doesn't seem really
>appropriate for OR clauses, because the way base_frequency is computed
>is really based on the assumption that every column would be matched.
>I'm not sure that there's any easy answer to that though. It feels like
>what's needed when computing the match bitmaps in mcv.c is to produce
>a bitmap (would it fit in 1 byte?) per value, to indicate which
>columns match, and base_frequency values per column. That would be
>significantly more work though, so almost certainly isn't feasible for
>PG13.
>

Good point. I haven't thought about the base frequencies. I think 1 byte
should be enough, as we limit the number of columns to 8.

>> IIUC the problem you have in mind is that we end up calling
>> statext_mcv_clauselist_selectivity twice, once for the top-level AND
>> clause with a single argument, and then recursively for the expanded OR
>> clause. Indeed, that seems to be applying the correction twice.
>>
>>
>> >I'm not sure what's the best way to resolve that. Perhaps
>> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>> >should ignore OR clauses from an AND-list, on the basis that they will
>> >get processed recursively later. Or perhaps estimatedclauses can
>> >somehow be used to prevent this, though I'm not sure exactly how that
>> >would work.
>>
>> I don't know. I feel uneasy about just ignoring some of the clauses,
>> because what happens for complex clauses, where the OR is not directly
>> in the AND clause, but is negated or something like that?
>>
>> Isn't it the case that clauselist_selectivity_simple (and the OR
>> variant) should ignore extended stats entirely? That is, we'd need to
>> add a flag (or _simple variant) to clause_selectivity, so that it calls
>> clauselist_selectivity_simple_or. So the calls would look like this:
>>
>>    clauselist_selectivity
>>      statext_clauselist_selectivity
>>        statext_mcv_clauselist_selectivity
>>          clauselist_selectivity_simple  <--- disable extended stats
>>            clause_selectivity
>>              clauselist_selectivity_simple_or
>>                clause_selectivity
>>                clause_selectivity
>>          mcv_clauselist_selectivity
>>      clauselist_selectivity_simple
>>        already estimated
>>
>> I've only quickly hacked clause_selectivity, but it does not seems very
>> invasive (of course, it means disruption to clause_selectivity callers,
>> but I suppose most call clauselist_selectivity).
>>
>
>Sounds like a reasonable approach, but I think it would be better to
>preserve the current public API by having clauselist_selectivity()
>become a thin wrapper around  a new function that optionally applies
>extended stats.
>

OK, makes sense. I'll take a stab at it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Additional improvements to extended statistics

From
Thomas Munro
Date:
On Sun, Mar 15, 2020 at 3:23 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sun, Mar 15, 2020 at 02:48:02PM +1300, Thomas Munro wrote:
> >Stimulated by some bad plans involving JSON, I found my way to your
> >WIP stats-on-expressions patch in this thread.  Do I understand
> >correctly that it will eventually also support single expressions,
> >like CREATE STATISTICS t_distinct_abc (ndistinct) ON
> >(my_jsonb_column->>'abc') FROM t?  It looks like that would solve
> >problems that otherwise require a generated column or an expression
> >index just to get ndistinct.
>
> Yes, I think that's generally the plan. I was also thinking about
> inventing some sort of special JSON statistics (e.g. extracting paths
> from the JSONB and computing frequencies, or something like that). But
> stats on expressions are one of the things I'd like to do in PG14.

Interesting idea.  If you had simple single-expression statistics, I
suppose a cave-person version of this would be to write a
script/stored procedure that extracts the distinct set of JSON paths
and does CREATE STATISTICS for expressions to access each path.  That
said, I suspect that in many cases there's a small set of paths and
a human designer would know what to do.  I didn't manage to try your
WIP stats-on-expressions patch due to bitrot and unfinished parts, but
I am hoping it just needs to remove the "if (numcols < 2)
ereport(ERROR ...)" check to get a very very useful thing.



Re: Additional improvements to extended statistics

From
Daniel Gustafsson
Date:
> On 24 Mar 2020, at 15:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:

>> Sounds like a reasonable approach, but I think it would be better to
>> preserve the current public API by having clauselist_selectivity()
>> become a thin wrapper around  a new function that optionally applies
>> extended stats.
>>
>
> OK, makes sense. I'll take a stab at it.

Have you had time to hack on this?  The proposed patch no longer applies, so
I've marked the entry Waiting on Author.

cheers ./daniel


Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On Wed, Jul 01, 2020 at 01:19:40PM +0200, Daniel Gustafsson wrote:
>> On 24 Mar 2020, at 15:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:
>
>>> Sounds like a reasonable approach, but I think it would be better to
>>> preserve the current public API by having clauselist_selectivity()
>>> become a thin wrapper around  a new function that optionally applies
>>> extended stats.
>>>
>>
>> OK, makes sense. I'll take a stab at it.
>
>Have you had time to hack on this?  The proposed patch no longer applies, so
>I've marked the entry Waiting on Author.

Yep, here's a rebased patch. This does not include the changes we've
discussed with Dean in March, but I plan to address that soon.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
Hi,

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

(Note: The WIP patch is expected to fail regression tests. A couple of
stats_ext regression tests fail due to changed estimates - I've left it
like that to make the changes more obvious for now.)

Earlier in this thread I used this example:


   CREATE TABLE t (a int, b int);
   INSERT INTO t SELECT mod(i, 10), mod(i, 10)
     FROM generate_series(1,100000) s(i);
   CREATE STATISTICS s (mcv) ON a,b FROM t;
   ANALYZE t;

   EXPLAIN SELECT * FROM t WHERE a = 0 OR b = 0;

which had this call graph with two statext_mcv_clauselist_selectivity
calls (which was kinda the issue):

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity  <--- (1)
         clauselist_selectivity_simple
           clause_selectivity
             clauselist_selectivity_or
               statext_clauselist_selectivity
                 statext_mcv_clauselist_selectivity  <--- (2)
                   clauselist_selectivity_simple_or
                     clause_selectivity
                     clause_selectivity
                   mcv_clauselist_selectivity
               clauselist_selectivity_simple_or
         mcv_clauselist_selectivity
       clauselist_selectivity_simple
         (already estimated)

with the patches applied, the call looks like this:

   clauselist_selectivity_internal (use_extended_stats=1)
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity (is_or=0)
         clauselist_selectivity_simple
           clause_selectivity_internal (use_extended_stats=0)
             clauselist_selectivity_or (use_extended_stats=0)
               clauselist_selectivity_simple_or
                 clause_selectivity_internal (use_extended_stats=0)
                 clause_selectivity_internal (use_extended_stats=0)
         mcv_clauselist_selectivity (is_or=0)
       clauselist_selectivity_simple

The nested call is removed, which I think addresses the issue. As for
the effect on estimates, there are a couple of regression tests where
the estimates change - not by much, though. An example:

 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial
WHERE a = 0 OR b = 0 OR c = 10');
  estimated | actual
 -----------+--------
-       412 |    104
+       308 |    104
 (1 row)

This is on top of 0001, though. Interestingly enough, this ends up with
the same estimate as current master, but I consider that a coincidence.


As for the patches:

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions


I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

> Isn't it the case that clauselist_selectivity_simple (and the OR 
> variant) should ignore extended stats entirely? That is, we'd need
> to add a flag (or _simple variant) to clause_selectivity, so that it
> calls clauselist_selectivity_simple_or.

But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

Dean, does this address the issue you had in mind? Can you come up with
an example of that issue in the form of a regression test or something?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Here is an improved WIP version of the patch series, modified to address
> the issue with repeatedly applying the extended statistics, as discussed
> with Dean in this thread. It's a bit rough and not committable, but I
> need some feedback so I'm posting it in this state.
>

Cool. I haven't forgotten that I promised to look at this.

> Dean, does this address the issue you had in mind? Can you come up with
> an example of that issue in the form of a regression test or something?
>

I'm quite busy with my day job at the moment, but I expect to have
time to review this next week.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Here is an improved WIP version of the patch series, modified to address
> the issue with repeatedly applying the extended statistics, as discussed
> with Dean in this thread. It's a bit rough and not committable, but I
> need some feedback so I'm posting it in this state.

As it stands, it doesn't compile if 0003 is applied, because it missed
one of the callers of clauselist_selectivity_simple(), but that's
easily fixed.

> 0001 is the original patch improving estimates of OR clauses
>
> 0002 adds thin wrappers for clause[list]_selectivity, with "internal"
> functions allowing to specify whether to keep considering extended stats
>
> 0003 does the same for the "simple" functions
>
>
> I've kept it like this to demonstrate that 0002 is not sufficient. In my
> response from March 24 I wrote this:
>
> > Isn't it the case that clauselist_selectivity_simple (and the OR
> > variant) should ignore extended stats entirely? That is, we'd need
> > to add a flag (or _simple variant) to clause_selectivity, so that it
> > calls clauselist_selectivity_simple_or.
> But that's actually wrong, as 0002 shows (as it breaks a couple of
> regression tests), because of the way we handle OR clauses. At the top
> level, an OR-clause is actually just a single clause and it may get
> passed to clauselist_selectivity_simple. So entirely disabling extended
> stats for the "simple" functions would also mean disabling extended
> stats for a large number of OR clauses. Which is clearly wrong.
>
> So 0003 addresses that, by adding a flag to the two "simple" functions.
> Ultimately, this should probably do the same thing as 0002 and add thin
> wrappers, because the existing functions are part of the public API.

I agree that, taken together, these patches fix the
multiple-extended-stats-evaluation issue. However:

I think this has ended up with too many variants of these functions,
since we now have "_internal" and "_simple" variants, and you're
proposing adding more. The original purpose of the "_simple" variants
was to compute selectivities without looking at extended stats, and
now the "_internal" variants compute selectivities with an additional
"use_extended_stats" flag to control whether or not to look at
extended stats. Thus they're basically the same, and could be rolled
together.

Additionally, it's worth noting that the "_simple" variants expose the
"estimatedclauses" bitmap as an argument, which IMO is a bit messy as
an API. All callers of the "_simple" functions outside of clausesel.c
actually pass in estimatedclauses=NULL, so it's possible to refactor
and get rid of that, turning estimatedclauses into a purely internal
variable.

Also, it's quite messy that clauselist_selectivity_simple_or() needs
to be passed a Selectivity input (the final argument) that is the
selectivity of any already-estimated clauses, or the value to return
if no not-already-estimated clauses are found, and must be 0.0 when
called from the extended stats code.

Attached is the kind of thing I had in mind (as a single patch, since
I don't think it's worth splitting up). This replaces the "_simple"
and "_internal" variants of these functions with "_opt_ext_stats"
variants whose signatures match the originals except for having the
single extra "use_extended_stats" boolean parameter. Additionally, the
"_simple" functions are merged into the originals (making them more
like they were in PG11) so that the "estimatedclauses" bitmap and
partial-OR-list Selectivity become internal details, no longer exposed
in the API.
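
To illustrate the shape of it, this is roughly what the wrapper ends
up looking like (just a sketch of the signatures, with the planner
types stubbed out so it stands alone - the real code is in the
attached patch):

    #include <stdbool.h>

    typedef double Selectivity;

    /* stand-ins for the real planner types, so the sketch compiles */
    typedef struct PlannerInfo PlannerInfo;
    typedef struct List List;
    typedef struct SpecialJoinInfo SpecialJoinInfo;
    typedef int JoinType;

    /* does the real work; consults extended stats only if requested */
    Selectivity
    clauselist_selectivity_opt_ext_stats(PlannerInfo *root, List *clauses,
                                         int varRelid, JoinType jointype,
                                         SpecialJoinInfo *sjinfo,
                                         bool use_extended_stats);

    /* the existing public entry point keeps its signature */
    Selectivity
    clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
                           JoinType jointype, SpecialJoinInfo *sjinfo)
    {
        return clauselist_selectivity_opt_ext_stats(root, clauses,
                                                    varRelid, jointype,
                                                    sjinfo, true);
    }

The extended-stats code can then call back in with use_extended_stats
= false, so the MCV correction can't end up being applied twice.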

Regards,
Dean

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:

On 11/17/20 4:35 PM, Dean Rasheed wrote:
> On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Here is an improved WIP version of the patch series, modified to address
>> the issue with repeatedly applying the extended statistics, as discussed
>> with Dean in this thread. It's a bit rough and not committable, but I
>> need some feedback so I'm posting it in this state.
> 
> As it stands, it doesn't compile if 0003 is applied, because it missed
> one of the callers of clauselist_selectivity_simple(), but that's
> easily fixed.
> 
>> 0001 is the original patch improving estimates of OR clauses
>>
>> 0002 adds thin wrappers for clause[list]_selectivity, with "internal"
>> functions allowing to specify whether to keep considering extended stats
>>
>> 0003 does the same for the "simple" functions
>>
>>
>> I've kept it like this to demonstrate that 0002 is not sufficient. In my
>> response from March 24 I wrote this:
>>
>>> Isn't it the case that clauselist_selectivity_simple (and the OR
>>> variant) should ignore extended stats entirely? That is, we'd need
>>> to add a flag (or _simple variant) to clause_selectivity, so that it
>>> calls clauselist_selectivity_simple_or.
>> But that's actually wrong, as 0002 shows (as it breaks a couple of
>> regression tests), because of the way we handle OR clauses. At the top
>> level, an OR-clause is actually just a single clause and it may get
>> passed to clauselist_selectivity_simple. So entirely disabling extended
>> stats for the "simple" functions would also mean disabling extended
>> stats for a large number of OR clauses. Which is clearly wrong.
>>
>> So 0003 addresses that, by adding a flag to the two "simple" functions.
>> Ultimately, this should probably do the same thing as 0002 and add thin
>> wrappers, because the existing functions are part of the public API.
> 
> I agree that, taken together, these patches fix the
> multiple-extended-stats-evaluation issue. However:
> 
> I think this has ended up with too many variants of these functions,
> since we now have "_internal" and "_simple" variants, and you're
> proposing adding more. The original purpose of the "_simple" variants
> was to compute selectivities without looking at extended stats, and
> now the "_internal" variants compute selectivities with an additional
> "use_extended_stats" flag to control whether or not to look at
> extended stats. Thus they're basically the same, and could be rolled
> together.
> 

Yeah, I agree there were far too many functions. Your patch looks much
cleaner / saner than the one I shared last week.

> Additionally, it's worth noting that the "_simple" variants expose the
> "estimatedclauses" bitmap as an argument, which IMO is a bit messy as
> an API. All callers of the "_simple" functions outside of clausesel.c
> actually pass in estimatedclauses=NULL, so it's possible to refactor
> and get rid of that, turning estimatedclauses into a purely internal
> variable.
> 

Hmmm. I think there were two reasons for exposing the estimatedclauses
bitmap like that: (a) we used the function internally and (b) we wanted
to allow cases where the user code might do something with the bitmap.
The first item is not an issue - we can hide that. As for the second
item, my guess is it was unnecessary future-proofing - we don't know
about any use case that might need this, so +1 to get rid of it.

> Also, it's quite messy that clauselist_selectivity_simple_or() needs
> to be passed a Selectivity input (the final argument) that is the
> selectivity of any already-estimated clauses, or the value to return
> if no not-already-estimated clauses are found, and must be 0.0 when
> called from the extended stats code.
> 

True.

> Attached is the kind of thing I had in mind (as a single patch, since
> I don't think it's worth splitting up). This replaces the "_simple"
> and "_internal" variants of these functions with "_opt_ext_stats"
> variants whose signatures match the originals except for having the
> single extra "use_extended_stats" boolean parameter. Additionally, the
> "_simple" functions are merged into the originals (making them more
> like they were in PG11) so that the "estimatedclauses" bitmap and
> partial-OR-list Selectivity become internal details, no longer exposed
> in the API.
> 

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
> AFAICS we use "_internal" for similar functions.
>

There's precedent for using "_opt_xxx" for function variants that add
an option to existing functions, but I agree that in this case it's a
bit of a mouthful. I don't think "_internal" is appropriate though,
since the clauselist function isn't internal. Perhaps using just
"_ext" would be OK.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
> > AFAICS we use "_internal" for similar functions.
> >

I have been thinking about this some more. The one part of this that I
still wasn't happy with was the way that base frequencies were used to
compute the selectivity correction to apply. As noted in [1], using
base frequencies in this way isn't really appropriate for clauses
combined using "OR". The reason is that an item's base frequency is
computed as the product of the per-column selectivities, so that (freq
- base_freq) is the right correction to apply for a set of clauses
combined with "AND", but it doesn't really work properly for clauses
combined with "OR". This is why a number of the estimates in the
regression tests end up being significant over-estimates.

I speculated in [1] that we might fix that by tracking which columns
of the match bitmap actually matched the clauses being estimated, and
then only use those base frequencies. Unfortunately that would also
mean changing the format of the stats that we store, and so would be a
rather invasive change.

It occurred to me though, that there is another, much more
straightforward way to do it. We can rewrite the "OR" clauses, and
turn them into "AND" clauses using the fact that

  P(A OR B) = P(A) + P(B) - P(A AND B)

and then use the multivariate stats to estimate the P(A AND B) part in
the usual way.
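
To make the arithmetic concrete, here's a toy standalone example (not
the patch code; or_selectivity() is made up, and the numbers mimic the
earlier mod(i, 10) example where a and b are perfectly correlated):

    #include <stdio.h>

    typedef double Selectivity;

    /* inclusion-exclusion: P(A OR B) = P(A) + P(B) - P(A AND B) */
    static Selectivity
    or_selectivity(Selectivity pa, Selectivity pb, Selectivity pab)
    {
        Selectivity s = pa + pb - pab;

        if (s < 0.0)
            s = 0.0;
        else if (s > 1.0)
            s = 1.0;

        return s;
    }

    int
    main(void)
    {
        Selectivity pa = 0.1;
        Selectivity pb = 0.1;

        /* independence assumption for the AND part: prints 0.1900 */
        printf("%.4f\n", or_selectivity(pa, pb, pa * pb));

        /* MCV-based estimate of the AND part (correlated): 0.1000 */
        printf("%.4f\n", or_selectivity(pa, pb, 0.1));

        return 0;
    }

That way all of the multivariate information goes into the P(A AND B)
term, which is the combination the base-frequency correction actually
works for.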

Attached is the resulting patch doing it that way. The main change is
in the way that statext_mcv_clauselist_selectivity() works, combined
with a new function mcv_clause_selectivity_or() that does the
necessary MCV bitmap manipulations.

Doing it this way also means that clausesel.c doesn't need to export
clauselist_selectivity_or(), and the new set of exported functions
seem a bit neater now.

A handful of regression test results change, and in all cases except
one the new estimates are much better. One estimate is made worse, but
in that case we only have 2 sets of partial stats:

  SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0

with stats on (a,b) and (c,d) so it's not surprising that combining (a
= 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
old MV stats estimate was more down to chance in this case.

Regards,
Dean

[1] https://www.postgresql.org/message-id/CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg%40mail.gmail.com

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On 11/29/20 3:57 PM, Dean Rasheed wrote:
>> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
>>> AFAICS we use "_internal" for similar functions.
>>>
> 
> I have been thinking about this some more. The one part of this that I
> still wasn't happy with was the way that base frequencies were used to
> compute the selectivity correction to apply. As noted in [1], using
> base frequencies in this way isn't really appropriate for clauses
> combined using "OR". The reason is that an item's base frequency is
> computed as the product of the per-column selectivities, so that (freq
> - base_freq) is the right correction to apply for a set of clauses
> combined with "AND", but it doesn't really work properly for clauses
> combined with "OR". This is why a number of the estimates in the
> regression tests end up being significant over-estimates.
> 
> I speculated in [1] that we might fix that by tracking which columns
> of the match bitmap actually matched the clauses being estimated, and
> then only use those base frequencies. Unfortunately that would also
> mean changing the format of the stats that we store, and so would be a
> rather invasive change.
> 
> It occurred to me though, that there is another, much more
> straightforward way to do it. We can rewrite the "OR" clauses, and
> turn them into "AND" clauses using the fact that
> 
>   P(A OR B) = P(A) + P(B) - P(A AND B)
> 
> and then use the multivariate stats to estimate the P(A AND B) part in
> the usual way.
> 

OK, that seems quite reasonable.

> Attached is the resulting patch doing it that way. The main change is
> in the way that statext_mcv_clauselist_selectivity() works, combined
> with a new function mcv_clause_selectivity_or() that does the
> necessary MCV bitmap manipulations.
> 
> Doing it this way also means that clausesel.c doesn't need to export
> clauselist_selectivity_or(), and the new set of exported functions
> seem a bit neater now.
> 

Nice. I agree this looks way better than the version I hacked together.

I wonder how much of the comment before clauselist_selectivity should
move to clauselist_selectivity_ext - it does talk about range clauses
and so on, but clauselist_selectivity does not really deal with that.
But maybe that's just an implementation detail and it's better to keep
the comment the way it is.

I noticed this outdated comment:

  /* Always compute the selectivity using clause_selectivity */
  s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Also, the comment at clauselist_selectivity_or seems to not follow the
usual pattern, which I think is

/*
 * function name
 *    short one-sentence description
 *
 * ... longer description ...
 */

Those are fairly minor issues. I don't have any deeper objections, and
it seems committable. Do you plan to do that sometime soon?

> A handful of regression test results change, and in all cases except
> one the new estimates are much better. One estimate is made worse, but
> in that case we only have 2 sets of partial stats:
> 
>   SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0
> 
> with stats on (a,b) and (c,d) so it's not surprising that combining (a
> = 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
> old MV stats estimate was more down to chance in this case.
> 

Yeah, that's quite possible - we're multiplying two estimates, but
there's a clear correlation. So it was mostly luck that we had
over-estimated the clauses before, which gave us a higher product and
thus an accidentally better overall estimate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Those are fairly minor issues. I don't have any deeper objections, and
> it seems committable. Do you plan to do that sometime soon?
>

OK, I've updated the patch status in the CF app, and I should be able
to push it in the next day or so.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On 12/1/20 9:15 AM, Dean Rasheed wrote:
> On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Those are fairly minor issues. I don't have any deeper objections, and
>> it seems committable. Do you plan to do that sometime soon?
>>
> 
> OK, I've updated the patch status in the CF app, and I should be able
> to push it in the next day or so.
> 

Cool, thanks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I wonder how much of the comment before clauselist_selectivity should
> move to clauselist_selectivity_ext - it does talk about range clauses
> and so on, but clauselist_selectivity does not really deal with that.
> But maybe that's just an implementation detail and it's better to keep
> the comment the way it is.

I think it's better to keep it the way it is, so that the entirety of
what clauselist_selectivity() does (via clauselist_selectivity_ext())
can be read in one place, but I have added separate comments for the
new "_ext" functions to explain how they differ. That matches similar
examples elsewhere.


> I noticed this outdated comment:
>
>   /* Always compute the selectivity using clause_selectivity */
>   s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,

Updated.


> Also, the comment at clauselist_selectivity_or seems to not follow the
> usual pattern, which I think is
>
> /*
>  * function name
>  *      short one-sentence description
>  *
>  * ... longer description ...
>  */

Hmm, it seems OK to me. The first part is basically copied from
clauselist_selectivity(). The "longer description" doesn't really have
much more to say because it's much simpler than
clauselist_selectivity(), but it seems neater to keep the two roughly
in sync.


I've been hacking on this a bit more and attached is an updated
(hopefully final) version with some other comment improvements and
also a couple of other tweaks:

The previous version had duplicated code blocks that implemented the
same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
other_sel and total_sel, which was quite messy. So I refactored that
into a new function mcv_combine_selectivities(). About half the
comments from statext_mcv_clauselist_selectivity() then move over to
mcv_combine_selectivities(). I also updated the comments for
mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
explain how their outputs are expected to be used by
mcv_combine_selectivities(). That hopefully makes for a clean
separation of concerns, and makes it easier to tweak the way MCV stats
are applied on top of simple stats, if someone thinks of a better
approach in the future.

In the previous version, for an ORed list of clauses, the MCV
correction was only applied to the overlaps between clauses. That's OK
as long as each clause only refers to a single column, since the
per-column statistics ought to be the best way to estimate each
individual clause in that case. However, if the individual clauses
refer to more than one column, I think the MCV correction should be
applied to each individual clause as well as to the overlaps. That
turns out to be pretty straightforward, since we're already doing all
the hard work of computing the match bitmap for each clause. The sort
of queries I had in mind were things like this:

  WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)

I added a new test case along those lines and the new estimates are
much better than they are without this patch, but not for the reason I
thought --- the old code consistently over-estimated queries like that
because it actually applied the MCV correction twice (once while
processing each AND list, via clause_selectivity(), called from
clauselist_selectivity_simple(), and once for the top-level OR clause,
contained in a single-element implicitly-ANDed list). The way the new
code is structured avoids any kind of double application of extended
stats, producing a much better estimate, which is good.

However, the new code doesn't apply the extended stats directly using
clauselist_selectivity_or() for this kind of query because there are
no RestrictInfos for the nested AND clauses, so
find_single_rel_for_clauses() (and similarly
statext_is_compatible_clause()) regards those clauses as not
compatible with extended stats. So what ends up happening is that
extended stats are used only when we descend down to the two AND
clauses, and their results are combined using the original "s1 + s2 -
s1 * s2" formula. That actually works OK in this case, because there
is no overlap between the two AND clauses, but it wouldn't work so
well if there was.

I'm pretty sure that can be fixed by teaching
find_single_rel_for_clauses() and statext_is_compatible_clause() to
handle BoolExpr clauses, looking for RestrictInfos underneath them,
but I think that should be left for a follow-up patch. I have left a
regression test in place, whose estimates ought to be improved by such
a fix.

The upshot of all that is that the new code that applies the MCV
correction to the individual clauses in an ORed list doesn't help with
queries like the one above at the moment, and it's not obvious whether
it is currently reachable, but I think it's worth leaving in because
it seems more principled, and makes that code more future-proof. I
also think it's neater because now the signature of
mcv_clause_selectivity_or() is more natural --- it's primary return
value is now the clause's MCV selectivity, as suggested by the
function's name, rather than the overlap selectivity that the previous
version was returning. Also, after your "Var Op Var" patch is applied,
I think it would be possible to construct queries that would benefit
from this, so it would be good to get that committed too.

Barring any further comments, I'll push this sometime soon.

Regards,
Dean

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On 12/2/20 4:51 PM, Dean Rasheed wrote:
> On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I wonder how much of the comment before clauselist_selectivity should
>> move to clauselist_selectivity_ext - it does talk about range clauses
>> and so on, but clauselist_selectivity does not really deal with that.
>> But maybe that's just an implementation detail and it's better to keep
>> the comment the way it is.
> 
> I think it's better to keep it the way it is, so that the entirety of
> what clauselist_selectivity() does (via clauselist_selectivity_ext())
> can be read in one place, but I have added separate comments for the
> new "_ext" functions to explain how they differ. That matches similar
> examples elsewhere.
> 

+1

> 
>> I noticed this outdated comment:
>>
>>   /* Always compute the selectivity using clause_selectivity */
>>   s2 = clause_selectivity_ext(root, clause, varRelid, jointype, sjinfo,
> 
> Updated.
> 
> 
>> Also, the comment at clauselist_selectivity_or seems to not follow the
>> usual pattern, which I think is
>>
>> /*
>>  * function name
>>  *      short one-sentence description
>>  *
>>  * ... longer description ...
>>  */
> 
> Hmm, it seems OK to me. The first part is basically copied from
> clauselist_selectivity(). The "longer description" doesn't really have
> much more to say because it's much simpler than
> clauselist_selectivity(), but it seems neater to keep the two roughly
> in sync.
> 

I see. In that case it's OK, I guess.

> 
> I've been hacking on this a bit more and attached is an updated
> (hopefully final) version with some other comment improvements and
> also a couple of other tweaks:
> 
> The previous version had duplicated code blocks that implemented the
> same MCV-correction algorithm using simple_sel, mcv_sel, base_sel,
> other_sel and total_sel, which was quite messy. So I refactored that
> into a new function mcv_combine_selectivities(). About half the
> comments from statext_mcv_clauselist_selectivity() then move over to
> mcv_combine_selectivities(). I also updated the comments for
> mcv_clauselist_selectivity() and mcv_clause_selectivity_or() to
> explain how their outputs are expected to be used by
> mcv_combine_selectivities(). That hopefully makes for a clean
> separation of concerns, and makes it easier to tweak the way MCV stats
> are applied on top of simple stats, if someone thinks of a better
> approach in the future.
> 
> In the previous version, for an ORed list of clauses, the MCV
> correction was only applied to the overlaps between clauses. That's OK
> as long as each clause only refers to a single column, since the
> per-column statistics ought to be the best way to estimate each
> individual clause in that case. However, if the individual clauses
> refer to more than one column, I think the MCV correction should be
> applied to each individual clause as well as to the overlaps. That
> turns out to be pretty straightforward, since we're already doing all
> the hard work of computing the match bitmap for each clause. The sort
> of queries I had in mind were things like this:
> 
>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
> 
> I added a new test case along those lines and the new estimates are
> much better than they are without this patch, but not for the reason I
> thought --- the old code consistently over-estimated queries like that
> because it actually applied the MCV correction twice (once while
> processing each AND list, via clause_selectivity(), called from
> clauselist_selectivity_simple(), and once for the top-level OR clause,
> contained in a single-element implicitly-ANDed list). The way the new
> code is structured avoids any kind of double application of extended
> stats, producing a much better estimate, which is good.
> 

Nice.

> However, the new code doesn't apply the extended stats directly using
> clauselist_selectivity_or() for this kind of query because there are
> no RestrictInfos for the nested AND clauses, so
> find_single_rel_for_clauses() (and similarly
> statext_is_compatible_clause()) regards those clauses as not
> compatible with extended stats. So what ends up happening is that
> extended stats are used only when we descend down to the two AND
> clauses, and their results are combined using the original "s1 + s2 -
> s1 * s2" formula. That actually works OK in this case, because there
> is no overlap between the two AND clauses, but it wouldn't work so
> well if there was.
> 
> I'm pretty sure that can be fixed by teaching
> find_single_rel_for_clauses() and statext_is_compatible_clause() to
> handle BoolExpr clauses, looking for RestrictInfos underneath them,
> but I think that should be left for a follow-up patch. I have left a
> regression test in place, whose estimates ought to be improved by such
> a fix.
> 

Yeah, I agree with leaving this for a separate patch. We can't do
everything at the same time.

> The upshot of all that is that the new code that applies the MCV
> correction to the individual clauses in an ORed list doesn't help with
> queries like the one above at the moment, and it's not obvious whether
> it is currently reachable, but I think it's worth leaving in because
> it seems more principled, and makes that code more future-proof. I
> also think it's neater because now the signature of
> mcv_clause_selectivity_or() is more natural --- its primary return
> value is now the clause's MCV selectivity, as suggested by the
> function's name, rather than the overlap selectivity that the previous
> version was returning. Also, after your "Var Op Var" patch is applied,
> I think it would be possible to construct queries that would benefit
> from this, so it would be good to get that committed too.
> 
> Barring any further comments, I'll push this sometime soon.
> 

+1


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Wed, 2 Dec 2020 at 16:34, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/2/20 4:51 PM, Dean Rasheed wrote:
> >
> > Barring any further comments, I'll push this sometime soon.
>
> +1
>

Pushed.

Regards,
Dean



Re: Additional improvements to extended statistics

From
Dean Rasheed
Date:
On Wed, 2 Dec 2020 at 15:51, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> The sort of queries I had in mind were things like this:
>
>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
>
> However, the new code doesn't apply the extended stats directly using
> clauselist_selectivity_or() for this kind of query because there are
> no RestrictInfos for the nested AND clauses, so
> find_single_rel_for_clauses() (and similarly
> statext_is_compatible_clause()) regards those clauses as not
> compatible with extended stats. So what ends up happening is that
> extended stats are used only when we descend down to the two AND
> clauses, and their results are combined using the original "s1 + s2 -
> s1 * s2" formula. That actually works OK in this case, because there
> is no overlap between the two AND clauses, but it wouldn't work so
> well if there was.
>
> I'm pretty sure that can be fixed by teaching
> find_single_rel_for_clauses() and statext_is_compatible_clause() to
> handle BoolExpr clauses, looking for RestrictInfos underneath them,
> but I think that should be left for a follow-up patch.

Attached is a patch doing that, which improves a couple of the
estimates for queries with AND clauses underneath OR clauses, as
expected.

This also revealed a minor bug in the way that the estimates for
multiple statistics objects were combined while processing an OR
clause -- the estimates for the overlaps between clauses only apply
for the current statistics object, so we really have to combine the
estimates for each set of clauses for each statistics object as if
they were independent of one another.
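
In other words, once each statistics object has produced an estimate
for the set of OR-ed clauses it covers, those per-object estimates are
combined as if independent. A toy sketch of that combination (not the
patch code):

    typedef double Selectivity;

    /*
     * Fold s = s1 + s2 - s1 * s2 over the per-statistics-object
     * estimates of the OR clause list, treating them as independent.
     */
    static Selectivity
    combine_or_estimates(const Selectivity *stat_sel, int nstats)
    {
        Selectivity s = 0.0;

        for (int i = 0; i < nstats; i++)
            s = s + stat_sel[i] - s * stat_sel[i];

        return s;
    }

That way the overlap corrections computed from one statistics object's
MCV list never leak into another object's estimate.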

0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
improves the estimates for sub-AND clauses underneath OR clauses.

These are both quite small patches that hopefully won't interfere
with any of the other extended stats patches.

Regards,
Dean

Attachment

Re: Additional improvements to extended statistics

From
Tomas Vondra
Date:
On 12/7/20 5:15 PM, Dean Rasheed wrote:
> On Wed, 2 Dec 2020 at 15:51, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> The sort of queries I had in mind were things like this:
>>
>>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
>>
>> However, the new code doesn't apply the extended stats directly using
>> clauselist_selectivity_or() for this kind of query because there are
>> no RestrictInfos for the nested AND clauses, so
>> find_single_rel_for_clauses() (and similarly
>> statext_is_compatible_clause()) regards those clauses as not
>> compatible with extended stats. So what ends up happening is that
>> extended stats are used only when we descend down to the two AND
>> clauses, and their results are combined using the original "s1 + s2 -
>> s1 * s2" formula. That actually works OK in this case, because there
>> is no overlap between the two AND clauses, but it wouldn't work so
>> well if there was.
>>
>> I'm pretty sure that can be fixed by teaching
>> find_single_rel_for_clauses() and statext_is_compatible_clause() to
>> handle BoolExpr clauses, looking for RestrictInfos underneath them,
>> but I think that should be left for a follow-up patch.
> 
> Attached is a patch doing that, which improves a couple of the
> estimates for queries with AND clauses underneath OR clauses, as
> expected.
> 
> This also revealed a minor bug in the way that the estimates for
> multiple statistics objects were combined while processing an OR
> clause -- the estimates for the overlaps between clauses only apply
> for the current statistics object, so we really have to combine the
> estimates for each set of clauses for each statistics object as if
> they were independent of one another.
> 
> 0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
> improves the estimates for sub-AND clauses underneath OR clauses.
> 

Cool! Thanks for taking the time to investigate and fix those. Both
patches seem fine to me.

> These are both quite small patches, that hopefully won't interfere
> with any of the other extended stats patches.
> 

I haven't tried, but they shouldn't interfere with them too much.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company