Thread: PATCH: add support for IN and @> in functional-dependency statistics use

PATCH: add support for IN and @> in functional-dependency statistics use

From

Pierre Ducroquet

Date:

01 February 2020, 07:51:04

Hello

At my current job, we have a lot of multi-tenant databases, thus with tables 
containing a tenant_id column. Such a column introduces a severe bias in 
statistics estimation since any other FK in the next columns is very likely to 
have a functional dependency on the tenant id. We found several queries where 
this functional dependency messed up the estimations so much the optimizer 
chose wrong plans.
When we tried to use extended statistics with CREATE STATISTIC on tenant_id, 
other_id, we noticed that the current implementation for detecting functional 
dependency lacks two features (at least in our use case):
- support for IN clauses
- support for the array contains operator (that could be considered as a 
special case of IN)

After digging in the source code, I think the lack of support for IN clauses 
is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr 
instead of OpExpr. The attached patch fixes this by simply copying the code-
path for OpExpr and changing the type name. It compiles and the results are 
correct, with a dependency being correctly taken into consideration when 
estimating rows. If you think such a copy paste is bad and should be factored 
in another static bool function, please say so and I will happily provide an 
updated patch.
The lack of support for @> operator, on the other hand, seems to be a decision 
taken when writing the initial code, but I can not find any mathematical nor 
technical reason for it. The current code restricts dependency calculation to 
the = operator, obviously because inequality operators are not going to 
work... but array contains is just several = operators grouped, thus the same 
for the dependency calculation. The second patch refactors the operator check 
in order to also include array contains.

I tested the patches on current HEAD, but I can test and provide back-ported 
versions of the patch for other versions if needed (this code path hardly 
changed since its introduction in 10).

Best regards

 Pierre Ducroquet

Attachment

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

01 February 2020, 14:24:46

On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:
>Hello
>
>At my current job, we have a lot of multi-tenant databases, thus with tables
>containing a tenant_id column. Such a column introduces a severe bias in
>statistics estimation since any other FK in the next columns is very likely to
>have a functional dependency on the tenant id. We found several queries where
>this functional dependency messed up the estimations so much the optimizer
>chose wrong plans.
>When we tried to use extended statistics with CREATE STATISTIC on tenant_id,
>other_id, we noticed that the current implementation for detecting functional
>dependency lacks two features (at least in our use case):
>- support for IN clauses
>- support for the array contains operator (that could be considered as a
>special case of IN)
>

Thanks for the patch. I don't think the lack of support for these clause
types is an oversight - we haven't done them because we were not quite
sure the functional dependencies can actually apply to them. But maybe
we can support them, I'm not against that in principle.

>After digging in the source code, I think the lack of support for IN clauses
>is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr
>instead of OpExpr. The attached patch fixes this by simply copying the code-
>path for OpExpr and changing the type name. It compiles and the results are
>correct, with a dependency being correctly taken into consideration when
>estimating rows. If you think such a copy paste is bad and should be factored
>in another static bool function, please say so and I will happily provide an
>updated patch.

Hmmm. Consider a query like this:

   ... WHERE tenant_id = 1 AND another_column IN (2,3)

which kinda contradicts the idea of functional dependencies that knowing
a value in one column, tells us something about a value in a second
column. But that assumes a single value, which is not quite true here.

The above query is essentially the same thing as

   ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))

and also

   ... WHERE (tenant_id=1 AND another_column=2)
          OR (tenant_id=1 AND another_column=3)

at wchich point we could apply functional dependencies - but we'd do it
once for each AND-clause, and then combine the results to compute
selectivity for the OR clause.

But this means that if we apply functional dependencies directly to the
original clause, it'll be inconsistent. Which seems a bit unfortunate.

Or do I get this wrong?

BTW the code added in the 0001 patch is the same as for is_opclause, so
maybe we can simply do

     if (is_opclause(rinfo->clause) ||
         IsA(rinfo->clause, ScalarArrayOpExpr))
     {
         ...
     }

instead of just duplicating the code. We also need some at least some
regression tests, testing functional dependencies with this clause type.

>The lack of support for @> operator, on the other hand, seems to be a decision
>taken when writing the initial code, but I can not find any mathematical nor
>technical reason for it. The current code restricts dependency calculation to
>the = operator, obviously because inequality operators are not going to
>work... but array contains is just several = operators grouped, thus the same
>for the dependency calculation. The second patch refactors the operator check
>in order to also include array contains.
>

No concrete opinion on this yet. I think my concerns are pretty similar
to the IN clause, although I'm not sure what you mean by "this could be
considered as special case of IN".

>I tested the patches on current HEAD, but I can test and provide back-ported
>versions of the patch for other versions if needed (this code path hardly
>changed since its introduction in 10).

I think the chance of this getting backpatched is zero, because it might
easily break existing apps.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependency statistics use

From

Pierre Ducroquet

Date:

02 February 2020, 09:59:32

On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:
> On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:
> >Hello
> >
> >At my current job, we have a lot of multi-tenant databases, thus with
> >tables containing a tenant_id column. Such a column introduces a severe
> >bias in statistics estimation since any other FK in the next columns is
> >very likely to have a functional dependency on the tenant id. We found
> >several queries where this functional dependency messed up the estimations
> >so much the optimizer chose wrong plans.
> >When we tried to use extended statistics with CREATE STATISTIC on
> >tenant_id, other_id, we noticed that the current implementation for
> >detecting functional dependency lacks two features (at least in our use
> >case):
> >- support for IN clauses
> >- support for the array contains operator (that could be considered as a
> >special case of IN)
> 
> Thanks for the patch. I don't think the lack of support for these clause
> types is an oversight - we haven't done them because we were not quite
> sure the functional dependencies can actually apply to them. But maybe
> we can support them, I'm not against that in principle.
> 
> >After digging in the source code, I think the lack of support for IN
> >clauses is an oversight and due to the fact that IN clauses are
> >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
> >simply copying the code- path for OpExpr and changing the type name. It
> >compiles and the results are correct, with a dependency being correctly
> >taken into consideration when estimating rows. If you think such a copy
> >paste is bad and should be factored in another static bool function,
> >please say so and I will happily provide an updated patch.
> 
> Hmmm. Consider a query like this:
> 
>    ... WHERE tenant_id = 1 AND another_column IN (2,3)
> 
> which kinda contradicts the idea of functional dependencies that knowing
> a value in one column, tells us something about a value in a second
> column. But that assumes a single value, which is not quite true here.
> 
> The above query is essentially the same thing as
> 
>    ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))
> 
> and also
> 
>    ... WHERE (tenant_id=1 AND another_column=2)
>           OR (tenant_id=1 AND another_column=3)
> 
> at wchich point we could apply functional dependencies - but we'd do it
> once for each AND-clause, and then combine the results to compute
> selectivity for the OR clause.
> 
> But this means that if we apply functional dependencies directly to the
> original clause, it'll be inconsistent. Which seems a bit unfortunate.
> 
> Or do I get this wrong?

In my tests, I've got a table with two columns a and b, generated this way:
    CREATE TABLE test (a INT, b INT)
        AS SELECT i, i/10 FROM 
                  generate_series(1, 100000) s(i),
                  generate_series(1, 5) x;

With statistics defined on the a, b columns

Here are the estimated selectivity results without any patch:

SELECT * FROM test WHERE a = 1 AND b = 1 : 5
SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1

With the patch, the estimated rows of the last query goes back to 5, which is 
more logical. The other ones don't change.

> BTW the code added in the 0001 patch is the same as for is_opclause, so
> maybe we can simply do
> 
>      if (is_opclause(rinfo->clause) ||
>          IsA(rinfo->clause, ScalarArrayOpExpr))
>      {
>          ...
>      }
> 
> instead of just duplicating the code.

I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary 
compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is 
at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work 
around with this kind of code, but I don't like it much:
List *args;
Oid opno;
if (IsA(rinfo->clause, OpExpr))
{
    /* If it's an opclause, we will check for Var = Const or Const = Var. */
    OpExpr       *expr = (OpExpr *) rinfo->clause;
    args = expr->args;
    opno = expr->opno;
}
else if (IsA(rinfo->clause, ScalarArrayOpExpr))
{
    /* If it's a ScalarArrayOpExpr, check for Var IN Const. */
    ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
    args = expr->args;
    opno = expr->opno;
}

Or I can rewrite it in C++ to play with templates... :)

> We also need some at least some
> regression tests, testing functional dependencies with this clause type.

Agreed

> >The lack of support for @> operator, on the other hand, seems to be a
> >decision taken when writing the initial code, but I can not find any
> >mathematical nor technical reason for it. The current code restricts
> >dependency calculation to the = operator, obviously because inequality
> >operators are not going to work... but array contains is just several =
> >operators grouped, thus the same for the dependency calculation. The
> >second patch refactors the operator check in order to also include array
> >contains.
> 
> No concrete opinion on this yet. I think my concerns are pretty similar
> to the IN clause, although I'm not sure what you mean by "this could be
> considered as special case of IN".

I meant from a mathematical point of view.

> >I tested the patches on current HEAD, but I can test and provide
> >back-ported versions of the patch for other versions if needed (this code
> >path hardly changed since its introduction in 10).
> 
> I think the chance of this getting backpatched is zero, because it might
> easily break existing apps.

I understand

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

02 February 2020, 18:41:34

On Sun, Feb 02, 2020 at 10:59:32AM +0100, Pierre Ducroquet wrote:
>On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:
>> On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:
>> >Hello
>> >
>> >At my current job, we have a lot of multi-tenant databases, thus with
>> >tables containing a tenant_id column. Such a column introduces a severe
>> >bias in statistics estimation since any other FK in the next columns is
>> >very likely to have a functional dependency on the tenant id. We found
>> >several queries where this functional dependency messed up the estimations
>> >so much the optimizer chose wrong plans.
>> >When we tried to use extended statistics with CREATE STATISTIC on
>> >tenant_id, other_id, we noticed that the current implementation for
>> >detecting functional dependency lacks two features (at least in our use
>> >case):
>> >- support for IN clauses
>> >- support for the array contains operator (that could be considered as a
>> >special case of IN)
>>
>> Thanks for the patch. I don't think the lack of support for these clause
>> types is an oversight - we haven't done them because we were not quite
>> sure the functional dependencies can actually apply to them. But maybe
>> we can support them, I'm not against that in principle.
>>
>> >After digging in the source code, I think the lack of support for IN
>> >clauses is an oversight and due to the fact that IN clauses are
>> >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
>> >simply copying the code- path for OpExpr and changing the type name. It
>> >compiles and the results are correct, with a dependency being correctly
>> >taken into consideration when estimating rows. If you think such a copy
>> >paste is bad and should be factored in another static bool function,
>> >please say so and I will happily provide an updated patch.
>>
>> Hmmm. Consider a query like this:
>>
>>    ... WHERE tenant_id = 1 AND another_column IN (2,3)
>>
>> which kinda contradicts the idea of functional dependencies that knowing
>> a value in one column, tells us something about a value in a second
>> column. But that assumes a single value, which is not quite true here.
>>
>> The above query is essentially the same thing as
>>
>>    ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))
>>
>> and also
>>
>>    ... WHERE (tenant_id=1 AND another_column=2)
>>           OR (tenant_id=1 AND another_column=3)
>>
>> at wchich point we could apply functional dependencies - but we'd do it
>> once for each AND-clause, and then combine the results to compute
>> selectivity for the OR clause.
>>
>> But this means that if we apply functional dependencies directly to the
>> original clause, it'll be inconsistent. Which seems a bit unfortunate.
>>
>> Or do I get this wrong?
>
>In my tests, I've got a table with two columns a and b, generated this way:
>    CREATE TABLE test (a INT, b INT)
>        AS SELECT i, i/10 FROM
>                  generate_series(1, 100000) s(i),
>                  generate_series(1, 5) x;
>
>With statistics defined on the a, b columns
>
>Here are the estimated selectivity results without any patch:
>
>SELECT * FROM test WHERE a = 1 AND b = 1 : 5
>SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
>SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
>SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1
>
>With the patch, the estimated rows of the last query goes back to 5, which is
>more logical. The other ones don't change.
>

Yes, I think you're right. I've been playing with this a bit more, and I
think you're right we can support it the way you propose.

I'm still a bit annoyed by the inconsistency this might/does introduce.
Consider for example these clauses:

a) ... WHERE a = 10 AND b IN (100, 200)
b) ... WHERE a = 10 AND (b = 100 OR b = 200)
c) ... WHERE (a = 10 AND b = 100) OR (a = 10 AND b = 200)

All three cases are logically equivalent and do return the same set of
rows. But we estimate them differently, arriving at different estimates.

Case (a) is the one you improve in your patch. Case (c) is actually not
possible in practice, because we rewrite it as (b) during planning.

But (b) is estimated very differently, because we don't recognize the OR
clause as supported by functional dependencies. On the one hand I'm sure
it's not the first case where we already estimate equivalent clauses
differently. On the other hand I wonder how difficult would it be to
support this specific type of OR clause (with all expressions having the
form "Var = Const" and all Vars referencing the same rel).

I'm not going to block the patch because of this, of course. Similarly,
it'd be nice to add support for ScalarArrayOpExpr to MCV stats, not just
functional dependencies ...

>> BTW the code added in the 0001 patch is the same as for is_opclause, so
>> maybe we can simply do
>>
>>      if (is_opclause(rinfo->clause) ||
>>          IsA(rinfo->clause, ScalarArrayOpExpr))
>>      {
>>          ...
>>      }
>>
>> instead of just duplicating the code.
>
>I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary
>compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is
>at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work
>around with this kind of code, but I don't like it much:
>List *args;
>Oid opno;
>if (IsA(rinfo->clause, OpExpr))
>{
>    /* If it's an opclause, we will check for Var = Const or Const = Var. */
>    OpExpr       *expr = (OpExpr *) rinfo->clause;
>    args = expr->args;
>    opno = expr->opno;
>}
>else if (IsA(rinfo->clause, ScalarArrayOpExpr))
>{
>    /* If it's a ScalarArrayOpExpr, check for Var IN Const. */
>    ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
>    args = expr->args;
>    opno = expr->opno;
>}
>

Oh, right. I'm dumb, I missed this obvious detail. I blame belgian beer.

>Or I can rewrite it in C++ to play with templates... :)
>

Please don't ;-)

>> We also need some at least some
>> regression tests, testing functional dependencies with this clause type.
>
>Agreed
>
>> >The lack of support for @> operator, on the other hand, seems to be a
>> >decision taken when writing the initial code, but I can not find any
>> >mathematical nor technical reason for it. The current code restricts
>> >dependency calculation to the = operator, obviously because inequality
>> >operators are not going to work... but array contains is just several =
>> >operators grouped, thus the same for the dependency calculation. The
>> >second patch refactors the operator check in order to also include array
>> >contains.
>>
>> No concrete opinion on this yet. I think my concerns are pretty similar
>> to the IN clause, although I'm not sure what you mean by "this could be
>> considered as special case of IN".
>
>I meant from a mathematical point of view.
>

Can you elaborate a bit? I still don't understand how it's just "several
equality operators grouped".

I think the challenge here is in applying the functional dependency
computed for the whole array to individual elements. I'm not sure we can
do that.

For example, with a table like this:

   CREATE TABLE t (a int, b int[]);
   CREATE STATISTICS s (dependencies) ON a, b FROM t;

Let's say the functional dependency is "perfect" i.e. has strength 1.0.
But that only tells us dependency for complete array values, we don't
know how much information we gain by knowledge of subset of the values.

For example, all the arrays may contain {1, 2, 3} as subset, and then
some "unique" element, like this:

   INSERT INTO t SELECT i/1000, ARRAY[1,2,3, 4 + i/100]
   FROM generate_series(1,1000000) s(i);

and then do a query like this:

   select * from t where a = 10 and b @> ARRAY[1,2];

Without extended stats, it's estimated like this:

                               QUERY PLAN
---------------------------------------------------------------
  Seq Scan on t  (cost=0.00..24346.00 rows=997 width=41)
                 (actual time=1.391..140.261 rows=1000 loops=1)
    Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
    Rows Removed by Filter: 999000
  Planning Time: 0.052 ms
  Execution Time: 140.707 ms
(5 rows)

but the moment you create functional stats, you get this:

                                QUERY PLAN
---------------------------------------------------------------
  Seq Scan on t  (cost=0.00..24346.00 rows=1000000 width=41)
                 (actual time=1.432..143.047 rows=1000 loops=1)
    Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
    Rows Removed by Filter: 999000
  Planning Time: 0.099 ms
  Execution Time: 143.527 ms
(5 rows)

That doesn't seem very useful :-(

>> >I tested the patches on current HEAD, but I can test and provide
>> >back-ported versions of the patch for other versions if needed (this code
>> >path hardly changed since its introduction in 10).
>>
>> I think the chance of this getting backpatched is zero, because it might
>> easily break existing apps.
>
>I understand
>


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

05 March 2020, 02:34:14

Hi Pierre,

I've looked at this patch series, hoping to get it close to committable.
Here is a somewhat improved version of the patch series, split into 5
pieces. The first 4 parts are about applying functional dependencies to
ScalarArrayOpExpr clauses. The last part is about doing the same thing
for MCV lists, so it seemed fine to post it here.

0001 is the patch you posted back in October

0002 simplifies the handling logic a bit, because ScalarArrayOpExpr can
only have form (Var op Const) but not (Const op Var).

0003 fixes what I think is a bug - ScalarArrayOpExpr can represent three
different cases:

  * Var op ANY ()
  * Var IN () -- special case of ANY
  * Var op ALL ()

I don't think functional dependencies can handle the ALL case, we need
to reject it by checking the useOr flag.

0004 adds queries to the stats_ext test suite, to test all of this (some
of the cases illustrate the need for 0003, I think)

0005 allows estimation of ScalarArrayOpExpr by MCV lists, including
regression tests etc.

Will you have time to look at this, particularly 0001-0004, but maybe
even the 0005 part?

As for the second part of your patch (the one allowing estimation of
array containment queries), I still think that's not something we can
reasonably do without also building statistics on elements (which is
what we have in pg_stats but not pg_stats_ext).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

12 March 2020, 10:25:41

[ For the sake of the archives, some of the discussion on the other
thread [1-3] should really have been on this thread. ]

On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I think the challenge here is in applying the functional dependency
> computed for the whole array to individual elements. I'm not sure we can
> do that.
>
> For example, with a table like this:
>
>    CREATE TABLE t (a int, b int[]);
>    CREATE STATISTICS s (dependencies) ON a, b FROM t;
>
> Let's say the functional dependency is "perfect" i.e. has strength 1.0.
> But that only tells us dependency for complete array values, we don't
> know how much information we gain by knowledge of subset of the values.
>

The more I think about this example, the more I think this is really
just a special case of the more general problem of compatibility of
clauses. Once you add support for IN (...) clauses, any query of the
form

  SELECT ... WHERE (any clauses on col a) AND (any clauses on col b)

can be recast as

  SELECT ... WHERE a IN (...) AND b IN (...)

so any counter-example with bad estimates produced with a query in the
first form can also be written in the second form.

I think we should really be thinking in terms of making a strong
functional dependency (a => b) applicable generally to queries in the
first form, which will work well if the clauses on b are compatible
with those on b, but not if they're incompatible. However, that's not
so very different from the current state without extended stats, which
assumes independence, and will return poor estimates if the
columns/clauses aren't independent.

So I'd be tempted to apply a tidied up version of the patch from [3],
and then lift all restrictions from dependency_is_compatible_clause(),
other than the requirement that the clause refer to a single variable.

Regards,
Dean

[1] https://www.postgresql.org/message-id/CAEZATCXaNFZyOhR4XXAfkvj1tibRBEjje6ZbXwqWUB_tqbH%3Drw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20200309181915.5lxhuw2qxoihfoqo%40development
[3] https://www.postgresql.org/message-id/CAEZATCUic8PwhTnexC%2BUx-Z_e5MhWD-8jk%3DJ1MtnVW8TJD%2BVHw%40mail.gmail.com

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

12 March 2020, 17:30:47

On Thu, Mar 12, 2020 at 10:25:41AM +0000, Dean Rasheed wrote:
>[ For the sake of the archives, some of the discussion on the other
>thread [1-3] should really have been on this thread. ]
>
>On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> I think the challenge here is in applying the functional dependency
>> computed for the whole array to individual elements. I'm not sure we can
>> do that.
>>
>> For example, with a table like this:
>>
>>    CREATE TABLE t (a int, b int[]);
>>    CREATE STATISTICS s (dependencies) ON a, b FROM t;
>>
>> Let's say the functional dependency is "perfect" i.e. has strength 1.0.
>> But that only tells us dependency for complete array values, we don't
>> know how much information we gain by knowledge of subset of the values.
>>
>
>The more I think about this example, the more I think this is really
>just a special case of the more general problem of compatibility of
>clauses. Once you add support for IN (...) clauses, any query of the
>form
>
>  SELECT ... WHERE (any clauses on col a) AND (any clauses on col b)
>
>can be recast as
>
>  SELECT ... WHERE a IN (...) AND b IN (...)
>
>so any counter-example with bad estimates produced with a query in the
>first form can also be written in the second form.
>
>I think we should really be thinking in terms of making a strong
>functional dependency (a => b) applicable generally to queries in the
>first form, which will work well if the clauses on b are compatible
>with those on b, but not if they're incompatible. However, that's not
>so very different from the current state without extended stats, which
>assumes independence, and will return poor estimates if the
>columns/clauses aren't independent.
>

I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

     WHERE a IN (...) AND b IN (...)

but I don't see how we could do that for

     WHERE a @> (...) AND b @> (...)

I don't think the dependency degree gives us any reliable insight into
statistical dependency of elements of the values.

Or maybe we're just talking about different things? You seem to be
talking abotu IN clauses (which I think is doable), but my question was
about using functional dependencies to estimate array containment
clauses (which I think is not really doable).

>So I'd be tempted to apply a tidied up version of the patch from [3],
>and then lift all restrictions from dependency_is_compatible_clause(),
>other than the requirement that the clause refer to a single variable.
>

I haven't looked at the patch from [3] closely yet, but you're right

   P(A & B) <= Min(P(A), P(B))

and the approach you proposed seems reasonable. I don't think how we
can just remove all the restriction on clause type - the restriction
that dependencies only handle equality-like clauses seems pretty much
baked into the dependencies.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

13 March 2020, 08:42:49

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I'm sorry, but I don't see how we could do this for arbitrary clauses. I
> think we could do that for clauses that have equality semantics and
> reference column values as a whole. So I think it's possible to do this
> for IN clauses (which is what the first part of the patch does), but I
> don't think we can do it for the containment operator.
>
> I.e. we can do that for
>
>      WHERE a IN (...) AND b IN (...)
>

Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

  a | b
 -------
  1 | 1
  2 | 2
  3 | 3
  : | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.

Regards,
Dean

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Bruce Momjian

Date:

13 March 2020, 14:19:21

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:
> On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >
> > I'm sorry, but I don't see how we could do this for arbitrary clauses. I
> > think we could do that for clauses that have equality semantics and
> > reference column values as a whole. So I think it's possible to do this
> > for IN clauses (which is what the first part of the patch does), but I
> > don't think we can do it for the containment operator.
> >
> > I.e. we can do that for
> >
> >      WHERE a IN (...) AND b IN (...)
> >
> 
> Hmm, the difficulty always comes back to the compatibility of the
> clauses though. It's easy to come up with artificial examples for
> which functional dependencies come up with bad estimates, even with
> just = and IN (...) operators. For example, given a perfect
> correlation like
> 
>   a | b
>  -------
>   1 | 1
>   2 | 2
>   3 | 3
>   : | :
> 
> you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
> IN (2,4,6,8,...)" to get a very bad estimate from functional
> dependencies.

Wow, that is a very good example --- the arrays do not tie elements in
one array to elements in another array;  good point.  I get it now!

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

13 March 2020, 16:09:44

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:
>On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> I'm sorry, but I don't see how we could do this for arbitrary clauses. I
>> think we could do that for clauses that have equality semantics and
>> reference column values as a whole. So I think it's possible to do this
>> for IN clauses (which is what the first part of the patch does), but I
>> don't think we can do it for the containment operator.
>>
>> I.e. we can do that for
>>
>>      WHERE a IN (...) AND b IN (...)
>>
>
>Hmm, the difficulty always comes back to the compatibility of the
>clauses though. It's easy to come up with artificial examples for
>which functional dependencies come up with bad estimates, even with
>just = and IN (...) operators. For example, given a perfect
>correlation like
>
>  a | b
> -------
>  1 | 1
>  2 | 2
>  3 | 3
>  : | :
>
>you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
>IN (2,4,6,8,...)" to get a very bad estimate from functional
>dependencies.
>
>However, I don't think such artificial examples are that useful. I
>think you have to think in terms of real data distributions together
>with real queries expected to go with them. For example:
>
>Using the OP's original example of a multi-tenant system, you might
>well have a table with columns (product_type, tenant_id) and a
>functional dependency product_type => tenant_id. In that case, it
>could well be very useful in optimising queries like "WHERE
>product_type IN (X,Y,Z) AND tenant_id = 123".
>
>But this isn't necessarily limited to = and IN (...). For example,
>consider a table with UK-based geographic data with columns (location
>point, postcode text). Then there would be a strong functional
>dependency location => postcode (and possibly also the other way
>round, depending on how dense the points were). That dependency could
>be used to estimate much more general queries like "WHERE location <@
>some_area AND postcode ~ '^CB.*'", where there may be no useful stats
>on location, but a histogram on postcode might give a reasonable
>estimate.
>
>This also extends to inequalities. For example a table with columns
>(weight, category) might have a strong functional dependency weight =>
>category. Then a query like "WHERE weight > 10 AND weight < 20 AND
>category = 'large'" could get a decent estimate from a histogram on
>the weight column, and then use the functional dependency to note that
>that implies the category. Note that such an example would work with
>my patch from the other thread, because it groups clauses by column,
>and uses clauselist_selectivity_simple() on them. So in this case, the
>clauses "weight > 10 AND weight < 20" would be estimated together, and
>would be able to make use of the RangeQueryClause code.
>
>Of course, it's equally easy to come up with counter-example queries
>for any of those examples, where using the functional dependency would
>produce a poor estimate. Ultimately, it's up to the user to decide
>whether or not to build functional dependency statistics, and that
>decision needs to be based not just on the data distribution, but also
>on the types of queries expected.
>

Well, yeah. I'm sure we can produce countless examples where applying
the functional dependencies to additional types of clauses helps a lot.
I'm somewhat hesitant to just drop any restrictions, though, because
it's equally simple to produce examples with poor results.

The main issue I have with just applying dependencies to arbitrary
clauses is it uses "degree" computed for the value as a whole, and
just uses it to estimate dependency between pieces of the values.

The IN() clause does not have this problem, the other cases like @> or
pattern matching do.

>Given the timing though, perhaps it is best to limit this to IN (..)
>clauses for PG13, and we can consider other possibilities later.
>

Yeah, I was gonna propose the same thing. I'll get the IN bit committed
shortly (both for dependencies and MCV), improved handling of OR clauses
and some additional regression tests to increase the coverage.

Then we can discuss these improvement in more detail for PG14.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

14 March 2020, 15:21:59

Hi,

I've pushed the first part of the patch, adding ScalarArrayOpExpr as
supported clause for functional dependencies, and then also doing the
same for MCV lists.

As discussed, I'm not going to do anything about the array containment
clauses for PG13, that needs more discussion.

I have a bunch of additional improvements for extended stats, discussed
in [1]. I'll wait a bit for buildfarm and maybe some feedback before
pushing those.


[1] https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

14 March 2020, 18:45:35

Hi,

I realized there's one more thing that probably needs discussing.
Essentially, these two clause types are the same:

   a IN (1, 2, 3)

   (a = 1 OR a = 2 OR a = 3)

but with 8f321bd1 we only recognize the first one as compatible with
functional dependencies. It was always the case that we estimated those
two clauses a bit differently, but the differences were usually small.
But now that we recognize IN as compatible with dependencies, the
difference may be much larger, which bugs me a bit ...

So I wonder if we should recognize the special form of an OR clause,
with all Vars referencing to the same attribute etc. and treat this as
supported by functional dependencies - the attached patch does that.
MCV lists there's already no difference because OR clauses are
supported.

The question is whether we want to do this, and whether we should also
teach the per-column estimates to recognize this special case of IN
clause. That would allow producing exactly the same estimates even with
functional dependencies - recognizing the OR clause as supported gets us
only half-way there, because we still use estimates for each clause (and
those will be slightly different).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

estimate-OR-similar-to-IN-clauses.patch

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

17 March 2020, 12:42:52

On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I realized there's one more thing that probably needs discussing.
> Essentially, these two clause types are the same:
>
>    a IN (1, 2, 3)
>
>    (a = 1 OR a = 2 OR a = 3)
>
> but with 8f321bd1 we only recognize the first one as compatible with
> functional dependencies. It was always the case that we estimated those
> two clauses a bit differently, but the differences were usually small.
> But now that we recognize IN as compatible with dependencies, the
> difference may be much larger, which bugs me a bit ...
>
> So I wonder if we should recognize the special form of an OR clause,
> with all Vars referencing to the same attribute etc. and treat this as
> supported by functional dependencies - the attached patch does that.
> MCV lists there's already no difference because OR clauses are
> supported.
>

Makes sense, and the patch looks straightforward enough.

> The question is whether we want to do this, and whether we should also
> teach the per-column estimates to recognize this special case of IN
> clause.

I'm not convinced about that second part though. I'd say that
recognising the OR clause for functional dependencies is sufficient to
prevent the large differences in estimates relative to the equivalent
IN clauses. The small differences between the way that OR and IN
clauses are handled have always been there, and I think that changing
that is out of scope for this work.

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Regards,
Dean

Attachment

dependencies_clauselist_selectivity-v2.patch

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

17 March 2020, 15:37:06

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:
>On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> I realized there's one more thing that probably needs discussing.
>> Essentially, these two clause types are the same:
>>
>>    a IN (1, 2, 3)
>>
>>    (a = 1 OR a = 2 OR a = 3)
>>
>> but with 8f321bd1 we only recognize the first one as compatible with
>> functional dependencies. It was always the case that we estimated those
>> two clauses a bit differently, but the differences were usually small.
>> But now that we recognize IN as compatible with dependencies, the
>> difference may be much larger, which bugs me a bit ...
>>
>> So I wonder if we should recognize the special form of an OR clause,
>> with all Vars referencing to the same attribute etc. and treat this as
>> supported by functional dependencies - the attached patch does that.
>> MCV lists there's already no difference because OR clauses are
>> supported.
>>
>
>Makes sense, and the patch looks straightforward enough.
>
>> The question is whether we want to do this, and whether we should also
>> teach the per-column estimates to recognize this special case of IN
>> clause.
>
>I'm not convinced about that second part though. I'd say that
>recognising the OR clause for functional dependencies is sufficient to
>prevent the large differences in estimates relative to the equivalent
>IN clauses. The small differences between the way that OR and IN
>clauses are handled have always been there, and I think that changing
>that is out of scope for this work.
>

Not sure. I think the inconsistency between plan and extended stats may
be a bit surprising, but I agree that issue may be negligible.

>The other thing that I'm still concerned about is the possibility of
>returning estimates with P(a,b) > P(a) or P(b). I think that such a
>thing becomes much more likely with the new types of clause supported
>here, because they now allow multiple values from each column, where
>before we only allowed one. I took another look at the patch I posted
>on the other thread, and I've convinced myself that it's correct.
>Attached is an updated version, with some cosmetic tidying up and now
>with some additional regression tests.
>

Yeah, I agree that's something we need to fix. Do you plan to push the
fix, or do you want me to do it?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

17 March 2020, 16:14:26

On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:
>
> >The other thing that I'm still concerned about is the possibility of
> >returning estimates with P(a,b) > P(a) or P(b). I think that such a
> >thing becomes much more likely with the new types of clause supported
> >here, because they now allow multiple values from each column, where
> >before we only allowed one. I took another look at the patch I posted
> >on the other thread, and I've convinced myself that it's correct.
> >Attached is an updated version, with some cosmetic tidying up and now
> >with some additional regression tests.
>
> Yeah, I agree that's something we need to fix. Do you plan to push the
> fix, or do you want me to do it?
>

I can push it. Have you had a chance to review it?

Regards,
Dean

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

17 March 2020, 17:05:17

On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote:
>On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:
>>
>> >The other thing that I'm still concerned about is the possibility of
>> >returning estimates with P(a,b) > P(a) or P(b). I think that such a
>> >thing becomes much more likely with the new types of clause supported
>> >here, because they now allow multiple values from each column, where
>> >before we only allowed one. I took another look at the patch I posted
>> >on the other thread, and I've convinced myself that it's correct.
>> >Attached is an updated version, with some cosmetic tidying up and now
>> >with some additional regression tests.
>>
>> Yeah, I agree that's something we need to fix. Do you plan to push the
>> fix, or do you want me to do it?
>>
>
>I can push it. Have you had a chance to review it?
>

Not yet, I'll take a look today.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

18 March 2020, 00:29:46

On Tue, Mar 17, 2020 at 06:05:17PM +0100, Tomas Vondra wrote:
>On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote:
>>On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:
>>>
>>>>The other thing that I'm still concerned about is the possibility of
>>>>returning estimates with P(a,b) > P(a) or P(b). I think that such a
>>>>thing becomes much more likely with the new types of clause supported
>>>>here, because they now allow multiple values from each column, where
>>>>before we only allowed one. I took another look at the patch I posted
>>>>on the other thread, and I've convinced myself that it's correct.
>>>>Attached is an updated version, with some cosmetic tidying up and now
>>>>with some additional regression tests.
>>>
>>>Yeah, I agree that's something we need to fix. Do you plan to push the
>>>fix, or do you want me to do it?
>>>
>>
>>I can push it. Have you had a chance to review it?
>>
>
>Not yet, I'll take a look today.
>

OK, I took a look. I think from the correctness POV the patch is OK, but
I think the dependencies_clauselist_selectivity() function now got a bit
too complex. I've been able to parse it now, but I'm sure I'll have
trouble in the future :-(

Can we refactor / split it somehow and move bits of the logic to smaller
functions, or something like that?

Another thing I'd like to suggest is keeping the "old" formula, and
instead of just replacing it with

    P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

but explaining how the old formula may produce nonsensical selectivity,
and how the new formula addresses that issue.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

18 March 2020, 15:55:43

On Tue, Mar 17, 2020 at 04:37:06PM +0100, Tomas Vondra wrote:
>On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:
>>On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>I realized there's one more thing that probably needs discussing.
>>>Essentially, these two clause types are the same:
>>>
>>>   a IN (1, 2, 3)
>>>
>>>   (a = 1 OR a = 2 OR a = 3)
>>>
>>>but with 8f321bd1 we only recognize the first one as compatible with
>>>functional dependencies. It was always the case that we estimated those
>>>two clauses a bit differently, but the differences were usually small.
>>>But now that we recognize IN as compatible with dependencies, the
>>>difference may be much larger, which bugs me a bit ...
>>>
>>>So I wonder if we should recognize the special form of an OR clause,
>>>with all Vars referencing to the same attribute etc. and treat this as
>>>supported by functional dependencies - the attached patch does that.
>>>MCV lists there's already no difference because OR clauses are
>>>supported.
>>>
>>
>>Makes sense, and the patch looks straightforward enough.
>>
>>>The question is whether we want to do this, and whether we should also
>>>teach the per-column estimates to recognize this special case of IN
>>>clause.
>>
>>I'm not convinced about that second part though. I'd say that
>>recognising the OR clause for functional dependencies is sufficient to
>>prevent the large differences in estimates relative to the equivalent
>>IN clauses. The small differences between the way that OR and IN
>>clauses are handled have always been there, and I think that changing
>>that is out of scope for this work.
>>
>
>Not sure. I think the inconsistency between plan and extended stats may
>be a bit surprising, but I agree that issue may be negligible.
>

OK, I've pushed the change recognizing the special case of OR clauses as
supported by functional dependencies. I've left the estimation of the
clause itself as it's, we can address that in the future if needed.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

19 March 2020, 19:53:39

On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> OK, I took a look. I think from the correctness POV the patch is OK, but
> I think the dependencies_clauselist_selectivity() function now got a bit
> too complex. I've been able to parse it now, but I'm sure I'll have
> trouble in the future :-(
>
> Can we refactor / split it somehow and move bits of the logic to smaller
> functions, or something like that?
>

Yeah, it has gotten a bit long. It's somewhat tricky splitting it up,
because of the number of shared variables used throughout the
function, but here's an updated patch splitting it into what seemed
like the 2 most logical pieces. The first piece (still in
dependencies_clauselist_selectivity()) works out what dependencies
can/should be applied, and the second piece in a new function does the
actual work of applying the list of functional dependencies to the
clause list.

I think that has made it easier to follow, and it has also reduced the
complexity of the final "no applicable stats" branch.

> Another thing I'd like to suggest is keeping the "old" formula, and
> instead of just replacing it with
>
>     P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
>
> but explaining how the old formula may produce nonsensical selectivity,
> and how the new formula addresses that issue.
>

I think this is purely a comment issue? I've added some more extensive
comments attempting to justify the formulae.

Regards,
Dean

Attachment

dependencies_clauselist_selectivity-v3.patch

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

25 March 2020, 00:28:01

On Thu, Mar 19, 2020 at 07:53:39PM +0000, Dean Rasheed wrote:
>On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> OK, I took a look. I think from the correctness POV the patch is OK, but
>> I think the dependencies_clauselist_selectivity() function now got a bit
>> too complex. I've been able to parse it now, but I'm sure I'll have
>> trouble in the future :-(
>>
>> Can we refactor / split it somehow and move bits of the logic to smaller
>> functions, or something like that?
>>
>
>Yeah, it has gotten a bit long. It's somewhat tricky splitting it up,
>because of the number of shared variables used throughout the
>function, but here's an updated patch splitting it into what seemed
>like the 2 most logical pieces. The first piece (still in
>dependencies_clauselist_selectivity()) works out what dependencies
>can/should be applied, and the second piece in a new function does the
>actual work of applying the list of functional dependencies to the
>clause list.
>
>I think that has made it easier to follow, and it has also reduced the
>complexity of the final "no applicable stats" branch.
>

Seems OK to me.

I'd perhaps name deps_clauselist_selectivity differently, it's a bit too
similar to dependencies_clauselist_selectivity. Perhaps something like
clauselist_apply_dependencies? But that's a minor detail.

>> Another thing I'd like to suggest is keeping the "old" formula, and
>> instead of just replacing it with
>>
>>     P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
>>
>> but explaining how the old formula may produce nonsensical selectivity,
>> and how the new formula addresses that issue.
>>
>
>I think this is purely a comment issue? I've added some more extensive
>comments attempting to justify the formulae.
>

Yes, it was purely a comment issue. Seems fine now.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

28 March 2020, 13:18:17

On Wed, 25 Mar 2020 at 00:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Seems OK to me.
>
> I'd perhaps name deps_clauselist_selectivity differently, it's a bit too
> similar to dependencies_clauselist_selectivity. Perhaps something like
> clauselist_apply_dependencies? But that's a minor detail.
>

OK, I've pushed that with your recommendation for that function name.

Regards,
Dean

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Dean Rasheed

Date:

29 March 2020, 09:22:25

On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> OK, I've pushed that with your recommendation for that function name.
>

Does this now complete everything that you wanted to do for functional
dependency stats for PG13? Re-reading the thread, I couldn't see
anything else that needed looking at. If that's the case, the CF entry
can be closed.

Regards,
Dean

Re: PATCH: add support for IN and @> in functional-dependencystatistics use

From

Tomas Vondra

Date:

29 March 2020, 15:27:36

On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote:
>On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> OK, I've pushed that with your recommendation for that function name.
>>
>
>Does this now complete everything that you wanted to do for functional
>dependency stats for PG13? Re-reading the thread, I couldn't see
>anything else that needed looking at. If that's the case, the CF entry
>can be closed.
>

Yes. There were two improvements proposed, we've committed one of them
(the IN/ANY operator handling) and the other (containment) needs more
discussion. So I think it's OK to mark this either as committed or maybe
returned with feedback.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependency statistics use

From

Daniel Gustafsson

Date:

01 July 2020, 08:47:46

> On 29 Mar 2020, at 17:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote:
>> On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

>>> OK, I've pushed that with your recommendation for that function name.
>>
>> Does this now complete everything that you wanted to do for functional
>> dependency stats for PG13? Re-reading the thread, I couldn't see
>> anything else that needed looking at. If that's the case, the CF entry
>> can be closed.
>
> Yes. There were two improvements proposed, we've committed one of them
> (the IN/ANY operator handling) and the other (containment) needs more
> discussion. So I think it's OK to mark this either as committed or maybe
> returned with feedback.

Since there hasn't been more discussion on the second item I've closed this
item as committed. The containment part can be opened as a new CF entry.

cheers ./daniel