Thread: PATCH: add support for IN and @> in functional-dependency statistics use
PATCH: add support for IN and @> in functional-dependency statistics use
From
Pierre Ducroquet
Date:
Hello At my current job, we have a lot of multi-tenant databases, thus with tables containing a tenant_id column. Such a column introduces a severe bias in statistics estimation since any other FK in the next columns is very likely to have a functional dependency on the tenant id. We found several queries where this functional dependency messed up the estimations so much the optimizer chose wrong plans. When we tried to use extended statistics with CREATE STATISTIC on tenant_id, other_id, we noticed that the current implementation for detecting functional dependency lacks two features (at least in our use case): - support for IN clauses - support for the array contains operator (that could be considered as a special case of IN) After digging in the source code, I think the lack of support for IN clauses is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by simply copying the code- path for OpExpr and changing the type name. It compiles and the results are correct, with a dependency being correctly taken into consideration when estimating rows. If you think such a copy paste is bad and should be factored in another static bool function, please say so and I will happily provide an updated patch. The lack of support for @> operator, on the other hand, seems to be a decision taken when writing the initial code, but I can not find any mathematical nor technical reason for it. The current code restricts dependency calculation to the = operator, obviously because inequality operators are not going to work... but array contains is just several = operators grouped, thus the same for the dependency calculation. The second patch refactors the operator check in order to also include array contains. I tested the patches on current HEAD, but I can test and provide back-ported versions of the patch for other versions if needed (this code path hardly changed since its introduction in 10). Best regards Pierre Ducroquet
Attachment
On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote: >Hello > >At my current job, we have a lot of multi-tenant databases, thus with tables >containing a tenant_id column. Such a column introduces a severe bias in >statistics estimation since any other FK in the next columns is very likely to >have a functional dependency on the tenant id. We found several queries where >this functional dependency messed up the estimations so much the optimizer >chose wrong plans. >When we tried to use extended statistics with CREATE STATISTIC on tenant_id, >other_id, we noticed that the current implementation for detecting functional >dependency lacks two features (at least in our use case): >- support for IN clauses >- support for the array contains operator (that could be considered as a >special case of IN) > Thanks for the patch. I don't think the lack of support for these clause types is an oversight - we haven't done them because we were not quite sure the functional dependencies can actually apply to them. But maybe we can support them, I'm not against that in principle. >After digging in the source code, I think the lack of support for IN clauses >is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr >instead of OpExpr. The attached patch fixes this by simply copying the code- >path for OpExpr and changing the type name. It compiles and the results are >correct, with a dependency being correctly taken into consideration when >estimating rows. If you think such a copy paste is bad and should be factored >in another static bool function, please say so and I will happily provide an >updated patch. Hmmm. Consider a query like this: ... WHERE tenant_id = 1 AND another_column IN (2,3) which kinda contradicts the idea of functional dependencies that knowing a value in one column, tells us something about a value in a second column. But that assumes a single value, which is not quite true here. The above query is essentially the same thing as ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3)) and also ... WHERE (tenant_id=1 AND another_column=2) OR (tenant_id=1 AND another_column=3) at wchich point we could apply functional dependencies - but we'd do it once for each AND-clause, and then combine the results to compute selectivity for the OR clause. But this means that if we apply functional dependencies directly to the original clause, it'll be inconsistent. Which seems a bit unfortunate. Or do I get this wrong? BTW the code added in the 0001 patch is the same as for is_opclause, so maybe we can simply do if (is_opclause(rinfo->clause) || IsA(rinfo->clause, ScalarArrayOpExpr)) { ... } instead of just duplicating the code. We also need some at least some regression tests, testing functional dependencies with this clause type. >The lack of support for @> operator, on the other hand, seems to be a decision >taken when writing the initial code, but I can not find any mathematical nor >technical reason for it. The current code restricts dependency calculation to >the = operator, obviously because inequality operators are not going to >work... but array contains is just several = operators grouped, thus the same >for the dependency calculation. The second patch refactors the operator check >in order to also include array contains. > No concrete opinion on this yet. I think my concerns are pretty similar to the IN clause, although I'm not sure what you mean by "this could be considered as special case of IN". >I tested the patches on current HEAD, but I can test and provide back-ported >versions of the patch for other versions if needed (this code path hardly >changed since its introduction in 10). I think the chance of this getting backpatched is zero, because it might easily break existing apps. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: add support for IN and @> in functional-dependency statistics use
From
Pierre Ducroquet
Date:
On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote: > On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote: > >Hello > > > >At my current job, we have a lot of multi-tenant databases, thus with > >tables containing a tenant_id column. Such a column introduces a severe > >bias in statistics estimation since any other FK in the next columns is > >very likely to have a functional dependency on the tenant id. We found > >several queries where this functional dependency messed up the estimations > >so much the optimizer chose wrong plans. > >When we tried to use extended statistics with CREATE STATISTIC on > >tenant_id, other_id, we noticed that the current implementation for > >detecting functional dependency lacks two features (at least in our use > >case): > >- support for IN clauses > >- support for the array contains operator (that could be considered as a > >special case of IN) > > Thanks for the patch. I don't think the lack of support for these clause > types is an oversight - we haven't done them because we were not quite > sure the functional dependencies can actually apply to them. But maybe > we can support them, I'm not against that in principle. > > >After digging in the source code, I think the lack of support for IN > >clauses is an oversight and due to the fact that IN clauses are > >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by > >simply copying the code- path for OpExpr and changing the type name. It > >compiles and the results are correct, with a dependency being correctly > >taken into consideration when estimating rows. If you think such a copy > >paste is bad and should be factored in another static bool function, > >please say so and I will happily provide an updated patch. > > Hmmm. Consider a query like this: > > ... WHERE tenant_id = 1 AND another_column IN (2,3) > > which kinda contradicts the idea of functional dependencies that knowing > a value in one column, tells us something about a value in a second > column. But that assumes a single value, which is not quite true here. > > The above query is essentially the same thing as > > ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3)) > > and also > > ... WHERE (tenant_id=1 AND another_column=2) > OR (tenant_id=1 AND another_column=3) > > at wchich point we could apply functional dependencies - but we'd do it > once for each AND-clause, and then combine the results to compute > selectivity for the OR clause. > > But this means that if we apply functional dependencies directly to the > original clause, it'll be inconsistent. Which seems a bit unfortunate. > > Or do I get this wrong? In my tests, I've got a table with two columns a and b, generated this way: CREATE TABLE test (a INT, b INT) AS SELECT i, i/10 FROM generate_series(1, 100000) s(i), generate_series(1, 5) x; With statistics defined on the a, b columns Here are the estimated selectivity results without any patch: SELECT * FROM test WHERE a = 1 AND b = 1 : 5 SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1 SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1 SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1 With the patch, the estimated rows of the last query goes back to 5, which is more logical. The other ones don't change. > BTW the code added in the 0001 patch is the same as for is_opclause, so > maybe we can simply do > > if (is_opclause(rinfo->clause) || > IsA(rinfo->clause, ScalarArrayOpExpr)) > { > ... > } > > instead of just duplicating the code. I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work around with this kind of code, but I don't like it much: List *args; Oid opno; if (IsA(rinfo->clause, OpExpr)) { /* If it's an opclause, we will check for Var = Const or Const = Var. */ OpExpr *expr = (OpExpr *) rinfo->clause; args = expr->args; opno = expr->opno; } else if (IsA(rinfo->clause, ScalarArrayOpExpr)) { /* If it's a ScalarArrayOpExpr, check for Var IN Const. */ ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause; args = expr->args; opno = expr->opno; } Or I can rewrite it in C++ to play with templates... :) > We also need some at least some > regression tests, testing functional dependencies with this clause type. Agreed > >The lack of support for @> operator, on the other hand, seems to be a > >decision taken when writing the initial code, but I can not find any > >mathematical nor technical reason for it. The current code restricts > >dependency calculation to the = operator, obviously because inequality > >operators are not going to work... but array contains is just several = > >operators grouped, thus the same for the dependency calculation. The > >second patch refactors the operator check in order to also include array > >contains. > > No concrete opinion on this yet. I think my concerns are pretty similar > to the IN clause, although I'm not sure what you mean by "this could be > considered as special case of IN". I meant from a mathematical point of view. > >I tested the patches on current HEAD, but I can test and provide > >back-ported versions of the patch for other versions if needed (this code > >path hardly changed since its introduction in 10). > > I think the chance of this getting backpatched is zero, because it might > easily break existing apps. I understand
On Sun, Feb 02, 2020 at 10:59:32AM +0100, Pierre Ducroquet wrote: >On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote: >> On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote: >> >Hello >> > >> >At my current job, we have a lot of multi-tenant databases, thus with >> >tables containing a tenant_id column. Such a column introduces a severe >> >bias in statistics estimation since any other FK in the next columns is >> >very likely to have a functional dependency on the tenant id. We found >> >several queries where this functional dependency messed up the estimations >> >so much the optimizer chose wrong plans. >> >When we tried to use extended statistics with CREATE STATISTIC on >> >tenant_id, other_id, we noticed that the current implementation for >> >detecting functional dependency lacks two features (at least in our use >> >case): >> >- support for IN clauses >> >- support for the array contains operator (that could be considered as a >> >special case of IN) >> >> Thanks for the patch. I don't think the lack of support for these clause >> types is an oversight - we haven't done them because we were not quite >> sure the functional dependencies can actually apply to them. But maybe >> we can support them, I'm not against that in principle. >> >> >After digging in the source code, I think the lack of support for IN >> >clauses is an oversight and due to the fact that IN clauses are >> >ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by >> >simply copying the code- path for OpExpr and changing the type name. It >> >compiles and the results are correct, with a dependency being correctly >> >taken into consideration when estimating rows. If you think such a copy >> >paste is bad and should be factored in another static bool function, >> >please say so and I will happily provide an updated patch. >> >> Hmmm. Consider a query like this: >> >> ... WHERE tenant_id = 1 AND another_column IN (2,3) >> >> which kinda contradicts the idea of functional dependencies that knowing >> a value in one column, tells us something about a value in a second >> column. But that assumes a single value, which is not quite true here. >> >> The above query is essentially the same thing as >> >> ... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3)) >> >> and also >> >> ... WHERE (tenant_id=1 AND another_column=2) >> OR (tenant_id=1 AND another_column=3) >> >> at wchich point we could apply functional dependencies - but we'd do it >> once for each AND-clause, and then combine the results to compute >> selectivity for the OR clause. >> >> But this means that if we apply functional dependencies directly to the >> original clause, it'll be inconsistent. Which seems a bit unfortunate. >> >> Or do I get this wrong? > >In my tests, I've got a table with two columns a and b, generated this way: > CREATE TABLE test (a INT, b INT) > AS SELECT i, i/10 FROM > generate_series(1, 100000) s(i), > generate_series(1, 5) x; > >With statistics defined on the a, b columns > >Here are the estimated selectivity results without any patch: > >SELECT * FROM test WHERE a = 1 AND b = 1 : 5 >SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1 >SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1 >SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1 > >With the patch, the estimated rows of the last query goes back to 5, which is >more logical. The other ones don't change. > Yes, I think you're right. I've been playing with this a bit more, and I think you're right we can support it the way you propose. I'm still a bit annoyed by the inconsistency this might/does introduce. Consider for example these clauses: a) ... WHERE a = 10 AND b IN (100, 200) b) ... WHERE a = 10 AND (b = 100 OR b = 200) c) ... WHERE (a = 10 AND b = 100) OR (a = 10 AND b = 200) All three cases are logically equivalent and do return the same set of rows. But we estimate them differently, arriving at different estimates. Case (a) is the one you improve in your patch. Case (c) is actually not possible in practice, because we rewrite it as (b) during planning. But (b) is estimated very differently, because we don't recognize the OR clause as supported by functional dependencies. On the one hand I'm sure it's not the first case where we already estimate equivalent clauses differently. On the other hand I wonder how difficult would it be to support this specific type of OR clause (with all expressions having the form "Var = Const" and all Vars referencing the same rel). I'm not going to block the patch because of this, of course. Similarly, it'd be nice to add support for ScalarArrayOpExpr to MCV stats, not just functional dependencies ... >> BTW the code added in the 0001 patch is the same as for is_opclause, so >> maybe we can simply do >> >> if (is_opclause(rinfo->clause) || >> IsA(rinfo->clause, ScalarArrayOpExpr)) >> { >> ... >> } >> >> instead of just duplicating the code. > >I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary >compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is >at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work >around with this kind of code, but I don't like it much: >List *args; >Oid opno; >if (IsA(rinfo->clause, OpExpr)) >{ > /* If it's an opclause, we will check for Var = Const or Const = Var. */ > OpExpr *expr = (OpExpr *) rinfo->clause; > args = expr->args; > opno = expr->opno; >} >else if (IsA(rinfo->clause, ScalarArrayOpExpr)) >{ > /* If it's a ScalarArrayOpExpr, check for Var IN Const. */ > ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause; > args = expr->args; > opno = expr->opno; >} > Oh, right. I'm dumb, I missed this obvious detail. I blame belgian beer. >Or I can rewrite it in C++ to play with templates... :) > Please don't ;-) >> We also need some at least some >> regression tests, testing functional dependencies with this clause type. > >Agreed > >> >The lack of support for @> operator, on the other hand, seems to be a >> >decision taken when writing the initial code, but I can not find any >> >mathematical nor technical reason for it. The current code restricts >> >dependency calculation to the = operator, obviously because inequality >> >operators are not going to work... but array contains is just several = >> >operators grouped, thus the same for the dependency calculation. The >> >second patch refactors the operator check in order to also include array >> >contains. >> >> No concrete opinion on this yet. I think my concerns are pretty similar >> to the IN clause, although I'm not sure what you mean by "this could be >> considered as special case of IN". > >I meant from a mathematical point of view. > Can you elaborate a bit? I still don't understand how it's just "several equality operators grouped". I think the challenge here is in applying the functional dependency computed for the whole array to individual elements. I'm not sure we can do that. For example, with a table like this: CREATE TABLE t (a int, b int[]); CREATE STATISTICS s (dependencies) ON a, b FROM t; Let's say the functional dependency is "perfect" i.e. has strength 1.0. But that only tells us dependency for complete array values, we don't know how much information we gain by knowledge of subset of the values. For example, all the arrays may contain {1, 2, 3} as subset, and then some "unique" element, like this: INSERT INTO t SELECT i/1000, ARRAY[1,2,3, 4 + i/100] FROM generate_series(1,1000000) s(i); and then do a query like this: select * from t where a = 10 and b @> ARRAY[1,2]; Without extended stats, it's estimated like this: QUERY PLAN --------------------------------------------------------------- Seq Scan on t (cost=0.00..24346.00 rows=997 width=41) (actual time=1.391..140.261 rows=1000 loops=1) Filter: ((b @> '{1,2}'::integer[]) AND (a = 10)) Rows Removed by Filter: 999000 Planning Time: 0.052 ms Execution Time: 140.707 ms (5 rows) but the moment you create functional stats, you get this: QUERY PLAN --------------------------------------------------------------- Seq Scan on t (cost=0.00..24346.00 rows=1000000 width=41) (actual time=1.432..143.047 rows=1000 loops=1) Filter: ((b @> '{1,2}'::integer[]) AND (a = 10)) Rows Removed by Filter: 999000 Planning Time: 0.099 ms Execution Time: 143.527 ms (5 rows) That doesn't seem very useful :-( >> >I tested the patches on current HEAD, but I can test and provide >> >back-ported versions of the patch for other versions if needed (this code >> >path hardly changed since its introduction in 10). >> >> I think the chance of this getting backpatched is zero, because it might >> easily break existing apps. > >I understand > regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Pierre, I've looked at this patch series, hoping to get it close to committable. Here is a somewhat improved version of the patch series, split into 5 pieces. The first 4 parts are about applying functional dependencies to ScalarArrayOpExpr clauses. The last part is about doing the same thing for MCV lists, so it seemed fine to post it here. 0001 is the patch you posted back in October 0002 simplifies the handling logic a bit, because ScalarArrayOpExpr can only have form (Var op Const) but not (Const op Var). 0003 fixes what I think is a bug - ScalarArrayOpExpr can represent three different cases: * Var op ANY () * Var IN () -- special case of ANY * Var op ALL () I don't think functional dependencies can handle the ALL case, we need to reject it by checking the useOr flag. 0004 adds queries to the stats_ext test suite, to test all of this (some of the cases illustrate the need for 0003, I think) 0005 allows estimation of ScalarArrayOpExpr by MCV lists, including regression tests etc. Will you have time to look at this, particularly 0001-0004, but maybe even the 0005 part? As for the second part of your patch (the one allowing estimation of array containment queries), I still think that's not something we can reasonably do without also building statistics on elements (which is what we have in pg_stats but not pg_stats_ext). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
[ For the sake of the archives, some of the discussion on the other thread [1-3] should really have been on this thread. ] On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I think the challenge here is in applying the functional dependency > computed for the whole array to individual elements. I'm not sure we can > do that. > > For example, with a table like this: > > CREATE TABLE t (a int, b int[]); > CREATE STATISTICS s (dependencies) ON a, b FROM t; > > Let's say the functional dependency is "perfect" i.e. has strength 1.0. > But that only tells us dependency for complete array values, we don't > know how much information we gain by knowledge of subset of the values. > The more I think about this example, the more I think this is really just a special case of the more general problem of compatibility of clauses. Once you add support for IN (...) clauses, any query of the form SELECT ... WHERE (any clauses on col a) AND (any clauses on col b) can be recast as SELECT ... WHERE a IN (...) AND b IN (...) so any counter-example with bad estimates produced with a query in the first form can also be written in the second form. I think we should really be thinking in terms of making a strong functional dependency (a => b) applicable generally to queries in the first form, which will work well if the clauses on b are compatible with those on b, but not if they're incompatible. However, that's not so very different from the current state without extended stats, which assumes independence, and will return poor estimates if the columns/clauses aren't independent. So I'd be tempted to apply a tidied up version of the patch from [3], and then lift all restrictions from dependency_is_compatible_clause(), other than the requirement that the clause refer to a single variable. Regards, Dean [1] https://www.postgresql.org/message-id/CAEZATCXaNFZyOhR4XXAfkvj1tibRBEjje6ZbXwqWUB_tqbH%3Drw%40mail.gmail.com [2] https://www.postgresql.org/message-id/20200309181915.5lxhuw2qxoihfoqo%40development [3] https://www.postgresql.org/message-id/CAEZATCUic8PwhTnexC%2BUx-Z_e5MhWD-8jk%3DJ1MtnVW8TJD%2BVHw%40mail.gmail.com
On Thu, Mar 12, 2020 at 10:25:41AM +0000, Dean Rasheed wrote: >[ For the sake of the archives, some of the discussion on the other >thread [1-3] should really have been on this thread. ] > >On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> I think the challenge here is in applying the functional dependency >> computed for the whole array to individual elements. I'm not sure we can >> do that. >> >> For example, with a table like this: >> >> CREATE TABLE t (a int, b int[]); >> CREATE STATISTICS s (dependencies) ON a, b FROM t; >> >> Let's say the functional dependency is "perfect" i.e. has strength 1.0. >> But that only tells us dependency for complete array values, we don't >> know how much information we gain by knowledge of subset of the values. >> > >The more I think about this example, the more I think this is really >just a special case of the more general problem of compatibility of >clauses. Once you add support for IN (...) clauses, any query of the >form > > SELECT ... WHERE (any clauses on col a) AND (any clauses on col b) > >can be recast as > > SELECT ... WHERE a IN (...) AND b IN (...) > >so any counter-example with bad estimates produced with a query in the >first form can also be written in the second form. > >I think we should really be thinking in terms of making a strong >functional dependency (a => b) applicable generally to queries in the >first form, which will work well if the clauses on b are compatible >with those on b, but not if they're incompatible. However, that's not >so very different from the current state without extended stats, which >assumes independence, and will return poor estimates if the >columns/clauses aren't independent. > I'm sorry, but I don't see how we could do this for arbitrary clauses. I think we could do that for clauses that have equality semantics and reference column values as a whole. So I think it's possible to do this for IN clauses (which is what the first part of the patch does), but I don't think we can do it for the containment operator. I.e. we can do that for WHERE a IN (...) AND b IN (...) but I don't see how we could do that for WHERE a @> (...) AND b @> (...) I don't think the dependency degree gives us any reliable insight into statistical dependency of elements of the values. Or maybe we're just talking about different things? You seem to be talking abotu IN clauses (which I think is doable), but my question was about using functional dependencies to estimate array containment clauses (which I think is not really doable). >So I'd be tempted to apply a tidied up version of the patch from [3], >and then lift all restrictions from dependency_is_compatible_clause(), >other than the requirement that the clause refer to a single variable. > I haven't looked at the patch from [3] closely yet, but you're right P(A & B) <= Min(P(A), P(B)) and the approach you proposed seems reasonable. I don't think how we can just remove all the restriction on clause type - the restriction that dependencies only handle equality-like clauses seems pretty much baked into the dependencies. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I'm sorry, but I don't see how we could do this for arbitrary clauses. I > think we could do that for clauses that have equality semantics and > reference column values as a whole. So I think it's possible to do this > for IN clauses (which is what the first part of the patch does), but I > don't think we can do it for the containment operator. > > I.e. we can do that for > > WHERE a IN (...) AND b IN (...) > Hmm, the difficulty always comes back to the compatibility of the clauses though. It's easy to come up with artificial examples for which functional dependencies come up with bad estimates, even with just = and IN (...) operators. For example, given a perfect correlation like a | b ------- 1 | 1 2 | 2 3 | 3 : | : you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b IN (2,4,6,8,...)" to get a very bad estimate from functional dependencies. However, I don't think such artificial examples are that useful. I think you have to think in terms of real data distributions together with real queries expected to go with them. For example: Using the OP's original example of a multi-tenant system, you might well have a table with columns (product_type, tenant_id) and a functional dependency product_type => tenant_id. In that case, it could well be very useful in optimising queries like "WHERE product_type IN (X,Y,Z) AND tenant_id = 123". But this isn't necessarily limited to = and IN (...). For example, consider a table with UK-based geographic data with columns (location point, postcode text). Then there would be a strong functional dependency location => postcode (and possibly also the other way round, depending on how dense the points were). That dependency could be used to estimate much more general queries like "WHERE location <@ some_area AND postcode ~ '^CB.*'", where there may be no useful stats on location, but a histogram on postcode might give a reasonable estimate. This also extends to inequalities. For example a table with columns (weight, category) might have a strong functional dependency weight => category. Then a query like "WHERE weight > 10 AND weight < 20 AND category = 'large'" could get a decent estimate from a histogram on the weight column, and then use the functional dependency to note that that implies the category. Note that such an example would work with my patch from the other thread, because it groups clauses by column, and uses clauselist_selectivity_simple() on them. So in this case, the clauses "weight > 10 AND weight < 20" would be estimated together, and would be able to make use of the RangeQueryClause code. Of course, it's equally easy to come up with counter-example queries for any of those examples, where using the functional dependency would produce a poor estimate. Ultimately, it's up to the user to decide whether or not to build functional dependency statistics, and that decision needs to be based not just on the data distribution, but also on the types of queries expected. Given the timing though, perhaps it is best to limit this to IN (..) clauses for PG13, and we can consider other possibilities later. Regards, Dean
Re: PATCH: add support for IN and @> in functional-dependencystatistics use
From
Bruce Momjian
Date:
On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote: > On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > > > I'm sorry, but I don't see how we could do this for arbitrary clauses. I > > think we could do that for clauses that have equality semantics and > > reference column values as a whole. So I think it's possible to do this > > for IN clauses (which is what the first part of the patch does), but I > > don't think we can do it for the containment operator. > > > > I.e. we can do that for > > > > WHERE a IN (...) AND b IN (...) > > > > Hmm, the difficulty always comes back to the compatibility of the > clauses though. It's easy to come up with artificial examples for > which functional dependencies come up with bad estimates, even with > just = and IN (...) operators. For example, given a perfect > correlation like > > a | b > ------- > 1 | 1 > 2 | 2 > 3 | 3 > : | : > > you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b > IN (2,4,6,8,...)" to get a very bad estimate from functional > dependencies. Wow, that is a very good example --- the arrays do not tie elements in one array to elements in another array; good point. I get it now! -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote: >On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> I'm sorry, but I don't see how we could do this for arbitrary clauses. I >> think we could do that for clauses that have equality semantics and >> reference column values as a whole. So I think it's possible to do this >> for IN clauses (which is what the first part of the patch does), but I >> don't think we can do it for the containment operator. >> >> I.e. we can do that for >> >> WHERE a IN (...) AND b IN (...) >> > >Hmm, the difficulty always comes back to the compatibility of the >clauses though. It's easy to come up with artificial examples for >which functional dependencies come up with bad estimates, even with >just = and IN (...) operators. For example, given a perfect >correlation like > > a | b > ------- > 1 | 1 > 2 | 2 > 3 | 3 > : | : > >you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b >IN (2,4,6,8,...)" to get a very bad estimate from functional >dependencies. > >However, I don't think such artificial examples are that useful. I >think you have to think in terms of real data distributions together >with real queries expected to go with them. For example: > >Using the OP's original example of a multi-tenant system, you might >well have a table with columns (product_type, tenant_id) and a >functional dependency product_type => tenant_id. In that case, it >could well be very useful in optimising queries like "WHERE >product_type IN (X,Y,Z) AND tenant_id = 123". > >But this isn't necessarily limited to = and IN (...). For example, >consider a table with UK-based geographic data with columns (location >point, postcode text). Then there would be a strong functional >dependency location => postcode (and possibly also the other way >round, depending on how dense the points were). That dependency could >be used to estimate much more general queries like "WHERE location <@ >some_area AND postcode ~ '^CB.*'", where there may be no useful stats >on location, but a histogram on postcode might give a reasonable >estimate. > >This also extends to inequalities. For example a table with columns >(weight, category) might have a strong functional dependency weight => >category. Then a query like "WHERE weight > 10 AND weight < 20 AND >category = 'large'" could get a decent estimate from a histogram on >the weight column, and then use the functional dependency to note that >that implies the category. Note that such an example would work with >my patch from the other thread, because it groups clauses by column, >and uses clauselist_selectivity_simple() on them. So in this case, the >clauses "weight > 10 AND weight < 20" would be estimated together, and >would be able to make use of the RangeQueryClause code. > >Of course, it's equally easy to come up with counter-example queries >for any of those examples, where using the functional dependency would >produce a poor estimate. Ultimately, it's up to the user to decide >whether or not to build functional dependency statistics, and that >decision needs to be based not just on the data distribution, but also >on the types of queries expected. > Well, yeah. I'm sure we can produce countless examples where applying the functional dependencies to additional types of clauses helps a lot. I'm somewhat hesitant to just drop any restrictions, though, because it's equally simple to produce examples with poor results. The main issue I have with just applying dependencies to arbitrary clauses is it uses "degree" computed for the value as a whole, and just uses it to estimate dependency between pieces of the values. The IN() clause does not have this problem, the other cases like @> or pattern matching do. >Given the timing though, perhaps it is best to limit this to IN (..) >clauses for PG13, and we can consider other possibilities later. > Yeah, I was gonna propose the same thing. I'll get the IN bit committed shortly (both for dependencies and MCV), improved handling of OR clauses and some additional regression tests to increase the coverage. Then we can discuss these improvement in more detail for PG14. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, I've pushed the first part of the patch, adding ScalarArrayOpExpr as supported clause for functional dependencies, and then also doing the same for MCV lists. As discussed, I'm not going to do anything about the array containment clauses for PG13, that needs more discussion. I have a bunch of additional improvements for extended stats, discussed in [1]. I'll wait a bit for buildfarm and maybe some feedback before pushing those. [1] https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, I realized there's one more thing that probably needs discussing. Essentially, these two clause types are the same: a IN (1, 2, 3) (a = 1 OR a = 2 OR a = 3) but with 8f321bd1 we only recognize the first one as compatible with functional dependencies. It was always the case that we estimated those two clauses a bit differently, but the differences were usually small. But now that we recognize IN as compatible with dependencies, the difference may be much larger, which bugs me a bit ... So I wonder if we should recognize the special form of an OR clause, with all Vars referencing to the same attribute etc. and treat this as supported by functional dependencies - the attached patch does that. MCV lists there's already no difference because OR clauses are supported. The question is whether we want to do this, and whether we should also teach the per-column estimates to recognize this special case of IN clause. That would allow producing exactly the same estimates even with functional dependencies - recognizing the OR clause as supported gets us only half-way there, because we still use estimates for each clause (and those will be slightly different). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I realized there's one more thing that probably needs discussing. > Essentially, these two clause types are the same: > > a IN (1, 2, 3) > > (a = 1 OR a = 2 OR a = 3) > > but with 8f321bd1 we only recognize the first one as compatible with > functional dependencies. It was always the case that we estimated those > two clauses a bit differently, but the differences were usually small. > But now that we recognize IN as compatible with dependencies, the > difference may be much larger, which bugs me a bit ... > > So I wonder if we should recognize the special form of an OR clause, > with all Vars referencing to the same attribute etc. and treat this as > supported by functional dependencies - the attached patch does that. > MCV lists there's already no difference because OR clauses are > supported. > Makes sense, and the patch looks straightforward enough. > The question is whether we want to do this, and whether we should also > teach the per-column estimates to recognize this special case of IN > clause. I'm not convinced about that second part though. I'd say that recognising the OR clause for functional dependencies is sufficient to prevent the large differences in estimates relative to the equivalent IN clauses. The small differences between the way that OR and IN clauses are handled have always been there, and I think that changing that is out of scope for this work. The other thing that I'm still concerned about is the possibility of returning estimates with P(a,b) > P(a) or P(b). I think that such a thing becomes much more likely with the new types of clause supported here, because they now allow multiple values from each column, where before we only allowed one. I took another look at the patch I posted on the other thread, and I've convinced myself that it's correct. Attached is an updated version, with some cosmetic tidying up and now with some additional regression tests. Regards, Dean
Attachment
On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote: >On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> I realized there's one more thing that probably needs discussing. >> Essentially, these two clause types are the same: >> >> a IN (1, 2, 3) >> >> (a = 1 OR a = 2 OR a = 3) >> >> but with 8f321bd1 we only recognize the first one as compatible with >> functional dependencies. It was always the case that we estimated those >> two clauses a bit differently, but the differences were usually small. >> But now that we recognize IN as compatible with dependencies, the >> difference may be much larger, which bugs me a bit ... >> >> So I wonder if we should recognize the special form of an OR clause, >> with all Vars referencing to the same attribute etc. and treat this as >> supported by functional dependencies - the attached patch does that. >> MCV lists there's already no difference because OR clauses are >> supported. >> > >Makes sense, and the patch looks straightforward enough. > >> The question is whether we want to do this, and whether we should also >> teach the per-column estimates to recognize this special case of IN >> clause. > >I'm not convinced about that second part though. I'd say that >recognising the OR clause for functional dependencies is sufficient to >prevent the large differences in estimates relative to the equivalent >IN clauses. The small differences between the way that OR and IN >clauses are handled have always been there, and I think that changing >that is out of scope for this work. > Not sure. I think the inconsistency between plan and extended stats may be a bit surprising, but I agree that issue may be negligible. >The other thing that I'm still concerned about is the possibility of >returning estimates with P(a,b) > P(a) or P(b). I think that such a >thing becomes much more likely with the new types of clause supported >here, because they now allow multiple values from each column, where >before we only allowed one. I took another look at the patch I posted >on the other thread, and I've convinced myself that it's correct. >Attached is an updated version, with some cosmetic tidying up and now >with some additional regression tests. > Yeah, I agree that's something we need to fix. Do you plan to push the fix, or do you want me to do it? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote: > > >The other thing that I'm still concerned about is the possibility of > >returning estimates with P(a,b) > P(a) or P(b). I think that such a > >thing becomes much more likely with the new types of clause supported > >here, because they now allow multiple values from each column, where > >before we only allowed one. I took another look at the patch I posted > >on the other thread, and I've convinced myself that it's correct. > >Attached is an updated version, with some cosmetic tidying up and now > >with some additional regression tests. > > Yeah, I agree that's something we need to fix. Do you plan to push the > fix, or do you want me to do it? > I can push it. Have you had a chance to review it? Regards, Dean
On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote: >On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote: >> >> >The other thing that I'm still concerned about is the possibility of >> >returning estimates with P(a,b) > P(a) or P(b). I think that such a >> >thing becomes much more likely with the new types of clause supported >> >here, because they now allow multiple values from each column, where >> >before we only allowed one. I took another look at the patch I posted >> >on the other thread, and I've convinced myself that it's correct. >> >Attached is an updated version, with some cosmetic tidying up and now >> >with some additional regression tests. >> >> Yeah, I agree that's something we need to fix. Do you plan to push the >> fix, or do you want me to do it? >> > >I can push it. Have you had a chance to review it? > Not yet, I'll take a look today. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 17, 2020 at 06:05:17PM +0100, Tomas Vondra wrote: >On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote: >>On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >>> >>>On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote: >>> >>>>The other thing that I'm still concerned about is the possibility of >>>>returning estimates with P(a,b) > P(a) or P(b). I think that such a >>>>thing becomes much more likely with the new types of clause supported >>>>here, because they now allow multiple values from each column, where >>>>before we only allowed one. I took another look at the patch I posted >>>>on the other thread, and I've convinced myself that it's correct. >>>>Attached is an updated version, with some cosmetic tidying up and now >>>>with some additional regression tests. >>> >>>Yeah, I agree that's something we need to fix. Do you plan to push the >>>fix, or do you want me to do it? >>> >> >>I can push it. Have you had a chance to review it? >> > >Not yet, I'll take a look today. > OK, I took a look. I think from the correctness POV the patch is OK, but I think the dependencies_clauselist_selectivity() function now got a bit too complex. I've been able to parse it now, but I'm sure I'll have trouble in the future :-( Can we refactor / split it somehow and move bits of the logic to smaller functions, or something like that? Another thing I'd like to suggest is keeping the "old" formula, and instead of just replacing it with P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b) but explaining how the old formula may produce nonsensical selectivity, and how the new formula addresses that issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 17, 2020 at 04:37:06PM +0100, Tomas Vondra wrote: >On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote: >>On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >>> >>>I realized there's one more thing that probably needs discussing. >>>Essentially, these two clause types are the same: >>> >>> a IN (1, 2, 3) >>> >>> (a = 1 OR a = 2 OR a = 3) >>> >>>but with 8f321bd1 we only recognize the first one as compatible with >>>functional dependencies. It was always the case that we estimated those >>>two clauses a bit differently, but the differences were usually small. >>>But now that we recognize IN as compatible with dependencies, the >>>difference may be much larger, which bugs me a bit ... >>> >>>So I wonder if we should recognize the special form of an OR clause, >>>with all Vars referencing to the same attribute etc. and treat this as >>>supported by functional dependencies - the attached patch does that. >>>MCV lists there's already no difference because OR clauses are >>>supported. >>> >> >>Makes sense, and the patch looks straightforward enough. >> >>>The question is whether we want to do this, and whether we should also >>>teach the per-column estimates to recognize this special case of IN >>>clause. >> >>I'm not convinced about that second part though. I'd say that >>recognising the OR clause for functional dependencies is sufficient to >>prevent the large differences in estimates relative to the equivalent >>IN clauses. The small differences between the way that OR and IN >>clauses are handled have always been there, and I think that changing >>that is out of scope for this work. >> > >Not sure. I think the inconsistency between plan and extended stats may >be a bit surprising, but I agree that issue may be negligible. > OK, I've pushed the change recognizing the special case of OR clauses as supported by functional dependencies. I've left the estimation of the clause itself as it's, we can address that in the future if needed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > OK, I took a look. I think from the correctness POV the patch is OK, but > I think the dependencies_clauselist_selectivity() function now got a bit > too complex. I've been able to parse it now, but I'm sure I'll have > trouble in the future :-( > > Can we refactor / split it somehow and move bits of the logic to smaller > functions, or something like that? > Yeah, it has gotten a bit long. It's somewhat tricky splitting it up, because of the number of shared variables used throughout the function, but here's an updated patch splitting it into what seemed like the 2 most logical pieces. The first piece (still in dependencies_clauselist_selectivity()) works out what dependencies can/should be applied, and the second piece in a new function does the actual work of applying the list of functional dependencies to the clause list. I think that has made it easier to follow, and it has also reduced the complexity of the final "no applicable stats" branch. > Another thing I'd like to suggest is keeping the "old" formula, and > instead of just replacing it with > > P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b) > > but explaining how the old formula may produce nonsensical selectivity, > and how the new formula addresses that issue. > I think this is purely a comment issue? I've added some more extensive comments attempting to justify the formulae. Regards, Dean
Attachment
On Thu, Mar 19, 2020 at 07:53:39PM +0000, Dean Rasheed wrote: >On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> OK, I took a look. I think from the correctness POV the patch is OK, but >> I think the dependencies_clauselist_selectivity() function now got a bit >> too complex. I've been able to parse it now, but I'm sure I'll have >> trouble in the future :-( >> >> Can we refactor / split it somehow and move bits of the logic to smaller >> functions, or something like that? >> > >Yeah, it has gotten a bit long. It's somewhat tricky splitting it up, >because of the number of shared variables used throughout the >function, but here's an updated patch splitting it into what seemed >like the 2 most logical pieces. The first piece (still in >dependencies_clauselist_selectivity()) works out what dependencies >can/should be applied, and the second piece in a new function does the >actual work of applying the list of functional dependencies to the >clause list. > >I think that has made it easier to follow, and it has also reduced the >complexity of the final "no applicable stats" branch. > Seems OK to me. I'd perhaps name deps_clauselist_selectivity differently, it's a bit too similar to dependencies_clauselist_selectivity. Perhaps something like clauselist_apply_dependencies? But that's a minor detail. >> Another thing I'd like to suggest is keeping the "old" formula, and >> instead of just replacing it with >> >> P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b) >> >> but explaining how the old formula may produce nonsensical selectivity, >> and how the new formula addresses that issue. >> > >I think this is purely a comment issue? I've added some more extensive >comments attempting to justify the formulae. > Yes, it was purely a comment issue. Seems fine now. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 25 Mar 2020 at 00:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Seems OK to me. > > I'd perhaps name deps_clauselist_selectivity differently, it's a bit too > similar to dependencies_clauselist_selectivity. Perhaps something like > clauselist_apply_dependencies? But that's a minor detail. > OK, I've pushed that with your recommendation for that function name. Regards, Dean
On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > OK, I've pushed that with your recommendation for that function name. > Does this now complete everything that you wanted to do for functional dependency stats for PG13? Re-reading the thread, I couldn't see anything else that needed looking at. If that's the case, the CF entry can be closed. Regards, Dean
On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote: >On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >> >> OK, I've pushed that with your recommendation for that function name. >> > >Does this now complete everything that you wanted to do for functional >dependency stats for PG13? Re-reading the thread, I couldn't see >anything else that needed looking at. If that's the case, the CF entry >can be closed. > Yes. There were two improvements proposed, we've committed one of them (the IN/ANY operator handling) and the other (containment) needs more discussion. So I think it's OK to mark this either as committed or maybe returned with feedback. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: add support for IN and @> in functional-dependency statistics use
From
Daniel Gustafsson
Date:
> On 29 Mar 2020, at 17:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote: >> On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >>> OK, I've pushed that with your recommendation for that function name. >> >> Does this now complete everything that you wanted to do for functional >> dependency stats for PG13? Re-reading the thread, I couldn't see >> anything else that needed looking at. If that's the case, the CF entry >> can be closed. > > Yes. There were two improvements proposed, we've committed one of them > (the IN/ANY operator handling) and the other (containment) needs more > discussion. So I think it's OK to mark this either as committed or maybe > returned with feedback. Since there hasn't been more discussion on the second item I've closed this item as committed. The containment part can be opened as a new CF entry. cheers ./daniel