Re: PATCH: add support for IN and @> in functional-dependency statistics use - Mailing list pgsql-hackers

From Pierre Ducroquet
Subject Re: PATCH: add support for IN and @> in functional-dependency statistics use
Date
Msg-id 3937607.jCdSCb7NGl@pierred-pdoc
Whole thread Raw
In response to Re: PATCH: add support for IN and @> in functional-dependencystatistics use  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: PATCH: add support for IN and @> in functional-dependencystatistics use
List pgsql-hackers
On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:
> On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:
> >Hello
> >
> >At my current job, we have a lot of multi-tenant databases, thus with
> >tables containing a tenant_id column. Such a column introduces a severe
> >bias in statistics estimation since any other FK in the next columns is
> >very likely to have a functional dependency on the tenant id. We found
> >several queries where this functional dependency messed up the estimations
> >so much the optimizer chose wrong plans.
> >When we tried to use extended statistics with CREATE STATISTIC on
> >tenant_id, other_id, we noticed that the current implementation for
> >detecting functional dependency lacks two features (at least in our use
> >case):
> >- support for IN clauses
> >- support for the array contains operator (that could be considered as a
> >special case of IN)
> 
> Thanks for the patch. I don't think the lack of support for these clause
> types is an oversight - we haven't done them because we were not quite
> sure the functional dependencies can actually apply to them. But maybe
> we can support them, I'm not against that in principle.
> 
> >After digging in the source code, I think the lack of support for IN
> >clauses is an oversight and due to the fact that IN clauses are
> >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
> >simply copying the code- path for OpExpr and changing the type name. It
> >compiles and the results are correct, with a dependency being correctly
> >taken into consideration when estimating rows. If you think such a copy
> >paste is bad and should be factored in another static bool function,
> >please say so and I will happily provide an updated patch.
> 
> Hmmm. Consider a query like this:
> 
>    ... WHERE tenant_id = 1 AND another_column IN (2,3)
> 
> which kinda contradicts the idea of functional dependencies that knowing
> a value in one column, tells us something about a value in a second
> column. But that assumes a single value, which is not quite true here.
> 
> The above query is essentially the same thing as
> 
>    ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))
> 
> and also
> 
>    ... WHERE (tenant_id=1 AND another_column=2)
>           OR (tenant_id=1 AND another_column=3)
> 
> at wchich point we could apply functional dependencies - but we'd do it
> once for each AND-clause, and then combine the results to compute
> selectivity for the OR clause.
> 
> But this means that if we apply functional dependencies directly to the
> original clause, it'll be inconsistent. Which seems a bit unfortunate.
> 
> Or do I get this wrong?

In my tests, I've got a table with two columns a and b, generated this way:
    CREATE TABLE test (a INT, b INT)
        AS SELECT i, i/10 FROM 
                  generate_series(1, 100000) s(i),
                  generate_series(1, 5) x;

With statistics defined on the a, b columns

Here are the estimated selectivity results without any patch:

SELECT * FROM test WHERE a = 1 AND b = 1 : 5
SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1

With the patch, the estimated rows of the last query goes back to 5, which is 
more logical. The other ones don't change.

> BTW the code added in the 0001 patch is the same as for is_opclause, so
> maybe we can simply do
> 
>      if (is_opclause(rinfo->clause) ||
>          IsA(rinfo->clause, ScalarArrayOpExpr))
>      {
>          ...
>      }
> 
> instead of just duplicating the code.

I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary 
compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is 
at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work 
around with this kind of code, but I don't like it much:
List *args;
Oid opno;
if (IsA(rinfo->clause, OpExpr))
{
    /* If it's an opclause, we will check for Var = Const or Const = Var. */
    OpExpr       *expr = (OpExpr *) rinfo->clause;
    args = expr->args;
    opno = expr->opno;
}
else if (IsA(rinfo->clause, ScalarArrayOpExpr))
{
    /* If it's a ScalarArrayOpExpr, check for Var IN Const. */
    ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
    args = expr->args;
    opno = expr->opno;
}

Or I can rewrite it in C++ to play with templates... :)

> We also need some at least some
> regression tests, testing functional dependencies with this clause type.

Agreed

> >The lack of support for @> operator, on the other hand, seems to be a
> >decision taken when writing the initial code, but I can not find any
> >mathematical nor technical reason for it. The current code restricts
> >dependency calculation to the = operator, obviously because inequality
> >operators are not going to work... but array contains is just several =
> >operators grouped, thus the same for the dependency calculation. The
> >second patch refactors the operator check in order to also include array
> >contains.
> 
> No concrete opinion on this yet. I think my concerns are pretty similar
> to the IN clause, although I'm not sure what you mean by "this could be
> considered as special case of IN".

I meant from a mathematical point of view.

> >I tested the patches on current HEAD, but I can test and provide
> >back-ported versions of the patch for other versions if needed (this code
> >path hardly changed since its introduction in 10).
> 
> I think the chance of this getting backpatched is zero, because it might
> easily break existing apps.

I understand





pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: psql - add SHOW_ALL_RESULTS option
Next
From: Daniel Gustafsson
Date:
Subject: Re: BUG #16171: Potential malformed JSON in explain output