Thread: Use extended statistics to estimate (Var op Var) clauses
Hi,

Attached is a patch to allow estimation of (Var op Var) clauses using extended statistics. Currently we only use extended stats to estimate (Var op Const) clauses, which is sufficient for most cases, but it's not very hard to support this second type of clause.

This is not an entirely new patch - I originally included it in the patch series in [1], but it's probably better to discuss it separately, so that it does not get buried in that discussion.

[1] https://www.postgresql.org/message-id/flat/20200113230008.g67iyk4cs3xbnjju@development

To illustrate the purpose of this patch, consider this:

  db=# create table t (a int, b int);
  CREATE TABLE
  db=# insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,100000) s(i);
  INSERT 0 100000
  db=# analyze t;
  ANALYZE
  db=# explain select * from t where a < b;
                          QUERY PLAN
  --------------------------------------------------------
   Seq Scan on t  (cost=0.00..1693.00 rows=33333 width=8)
     Filter: (a < b)
  (2 rows)

  db=# explain select * from t where a > b;
                          QUERY PLAN
  --------------------------------------------------------
   Seq Scan on t  (cost=0.00..1693.00 rows=33333 width=8)
     Filter: (a > b)
  (2 rows)

  db=# create statistics s (mcv) on a,b from t;
  CREATE STATISTICS
  db=# analyze t;
  ANALYZE
  db=# explain select * from t where a < b;
                           QUERY PLAN
  ---------------------------------------------------------
   Seq Scan on t  (cost=0.00..1693.00 rows=100000 width=8)
     Filter: (a < b)
  (2 rows)

  db=# explain select * from t where a > b;
                        QUERY PLAN
  ----------------------------------------------------
   Seq Scan on t  (cost=0.00..1693.00 rows=1 width=8)
     Filter: (a > b)
  (2 rows)

I'm not entirely convinced this patch (on its own) is very useful, for a couple of reasons:

(a) Clauses of this form are not particularly common, at least compared to the (Var op Const) clauses. (I don't recall slow-query reports from any of our mailing lists that might be attributed to such clauses.)

(b) For known cases of such queries (e.g. several TPC-H queries do use clauses like "l_commitdate < l_receiptdate" etc.) this is somewhat hindered by extended statistics only supporting MCV lists, which may not work particularly well for high-cardinality columns like dates etc.

But despite that it seems like a useful feature / building block, and those limitations may be addressed in some other way (e.g. we may add multi-dimensional histograms to address the second one).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
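The example above can be sketched in miniature (illustrative Python, not PostgreSQL code): a multi-column MCV list stores joint (a, b) frequencies, so summing the frequencies of items satisfying the clause - roughly what the MCV matching does - captures the perfect correlation that per-column stats cannot see:

```python
# Illustrative sketch of MCV-based estimation for a (Var op Var) clause.
# With a = mod(i,10) and b = mod(i,10)+1, "a < b" is always true and
# "a > b" never is - visible only in the joint distribution.
from collections import Counter

rows = [(i % 10, i % 10 + 1) for i in range(1, 100001)]

mcv = Counter(rows)   # joint MCV list: (a, b) -> count
total = len(rows)

def mcv_selectivity(op):
    """Sum the frequencies of MCV items whose values satisfy (a op b)."""
    return sum(n for (a, b), n in mcv.items() if op(a, b)) / total

sel_lt = mcv_selectivity(lambda a, b: a < b)   # 1.0 -> rows=100000
sel_gt = mcv_selectivity(lambda a, b: a > b)   # 0.0 -> clamped to rows=1
```

Per-column stats see two near-identical uniform distributions and have no way to recover this, hence the 33% default in the no-stats plans.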
Attachment
Hi,

Here is a slightly updated version of the patch - rebased to current master and fixing some minor issues to handle expressions (and not just the Var nodes as before).

The changes needed to support (Expr op Expr) are mostly mechanical, though I'm sure the code needs some cleanup. The main issue I ran into is the special case in clauselist_selectivity, which does

  if (list_length(clauses) == 1)
      return clause_selectivity_ext(...);

which applies to cases like "WHERE a < b" which can now be handled by extended statistics, thanks to this patch. But clause_selectivity_ext only used to call restriction_selectivity for these clauses, which does not use extended statistics, of course.

I considered getting rid of the special case, passing everything through extended stats, including cases with a single clause. But that ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a bit, which seems like a better approach.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sun, Jun 13, 2021 at 1:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
Here is a slightly updated version of the patch - rebased to current
master and fixing some minor issues to handle expressions (and not just
the Var nodes as before).
The changes needed to support (Expr op Expr) are mostly mechanical,
though I'm sure the code needs some cleanup. The main issue I ran into
is the special case clauselist_selectivity, which does
if (list_length(clauses) == 1)
return clause_selectivity_ext(...);
which applies to cases like "WHERE a < b" which can now be handled by
extended statistics, thanks to this patch. But clause_selectivity_ext
only used to call restriction_selectivity for these clauses, which does
not use extended statistics, of course.
I considered either getting rid of the special case, passing everything
through extended stats, including cases with a single clause. But that
ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
bit, which seems like a better approach.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
+ if (cst) /* Expr op Const */
It seems the Const op Expr is also covered by this if branch. Hence the comment should include this case.
Cheers
> On Jun 13, 2021, at 1:28 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Here is a slightly updated version of the patch

Thanks for taking this up again!

Applying the new test cases from your patch, multiple estimates have gotten better. That looks good. I wrote a few extra test cases and saw no change, which is fine. I was looking for regressions where the estimates are now worse than before. Do you expect there to be any such cases?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/14/21 5:36 PM, Mark Dilger wrote:
>
>> On Jun 13, 2021, at 1:28 PM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Here is a slightly updated version of the patch
>
> Thanks for taking this up again!
>
> Applying the new test cases from your patch, multiple estimates have
> gotten better. That looks good. I wrote a few extra test cases and
> saw no change, which is fine. I was looking for regressions where
> the estimates are now worse than before. Do you expect there to be
> any such cases?
>

Not really. These clauses could not be estimated before, so we generally used default estimates for them. So WHERE a = b would use 0.5%, while WHERE a < b would use 33%, and so on.

OTOH it depends on the accuracy of the extended statistics - particularly the MCV list (what fraction of the data it covers, etc.). So it's possible the default estimate is very accurate by chance, while the MCV list represents only a tiny fraction of the data. Then the new estimate could be worse.

Consider for example this:

  create table t (a int, b int);
  insert into t select 100, 100 from generate_series(1,5000) s(i);
  insert into t select i, i+1 from generate_series(1,995000) s(i);

This has exactly 0.5% of rows with (a=b). Without extended stats the estimate is perfect:

  explain analyze select * from t where a = b;

  Seq Scan on t  (cost=0.00..16925.00 rows=5000 width=8) (actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

  create statistics s (mcv) on a, b from t;
  analyze t;

  Seq Scan on t  (cost=0.00..16925.00 rows=9810 width=8) (actual time=0.059..160.467 rows=5000 loops=1)

It's not terrible, although we could construct worse examples. But the same issue applies to other clauses, not just to these new ones. And it relies on the regular estimation producing a better estimate by chance.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
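The "accurate by chance" point is just arithmetic over the planner's fallback constants (DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL, defined in src/include/utils/selfuncs.h); a small sketch:

```python
# Fallback selectivities the planner uses when no usable statistics
# apply to a clause (constants from src/include/utils/selfuncs.h).
DEFAULT_EQ_SEL = 0.005        # "a = b" with unknown stats -> 0.5%
DEFAULT_INEQ_SEL = 1.0 / 3.0  # "a < b" / "a > b" with unknown stats -> 33%

total_rows = 1_000_000
rows_with_a_eq_b = 5_000      # the table above was built to hold exactly 0.5%

# The default estimate lands exactly on the true row count, purely
# because the data was constructed to match the 0.5% default:
default_estimate = round(total_rows * DEFAULT_EQ_SEL)
```

Any data set where the true fraction differs from 0.5% would make the default estimate correspondingly worse, which is why the comparison above favors the no-stats path only by construction.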
On Sun, 13 Jun 2021 at 21:28, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Here is a slightly updated version of the patch
>
> The main issue I ran into
> is the special case clauselist_selectivity, which does
>
> if (list_length(clauses) == 1)
>     return clause_selectivity_ext(...);
>
> which applies to cases like "WHERE a < b" which can now be handled by
> extended statistics, thanks to this patch. But clause_selectivity_ext
> only used to call restriction_selectivity for these clauses, which does
> not use extended statistics, of course.
>
> I considered either getting rid of the special case, passing everything
> through extended stats, including cases with a single clause. But that
> ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a
> bit, which seems like a better approach.

Yeah, I looked at this a few months ago, while looking at the other extended stats stuff, and I came to the same conclusion that solving this issue by tweaking clause_selectivity_ext() is the best approach.

I haven't looked at the patch in much detail yet, but I think the basic approach looks OK. There are a few comments that need updating, e.g.:

- In statext_is_compatible_clause_internal(), before the "if (is_opclause(clause))" test.

- The description of the arguments for examine_opclause_args().

I wonder if "expronleftp" for examine_opclause_args() should be "constonrightp", or some such -- as it stands it's being set to false for an (Expr op Expr) clause, which doesn't seem right because there *is* an expression on the left.

Regards,
Dean
On Sun, 13 Jun 2021 at 21:28, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Here is a slightly updated version of the patch
>

Hi,

I have looked at this in some more detail, and it all looks pretty good, other than some mostly cosmetic stuff.

The new code in statext_is_compatible_clause_internal() is a little hard to follow because some of the comments aren't right (e.g. when checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr) as the comment says). Rather than trying to comment on each conditional branch, it might be simpler to just have a single catch-all comment at the top, and also remove the "return true" in the middle, to make it something like:

    /*
     * Check Vars appearing on either side by recursing, and make a note of
     * any expressions.
     */
    if (IsA(clause_expr, Var))
    {
        if (!statext_is_compatible_clause_internal(...))
            return false;
    }
    else
        *exprs = lappend(*exprs, clause_expr);

    if (clause_expr2)
    {
        if (IsA(clause_expr2, Var))
        {
            if (!statext_is_compatible_clause_internal(...))
                return false;
        }
        else
            *exprs = lappend(*exprs, clause_expr2);
    }

    return true;

Is the FIXME comment in examine_opclause_args() necessary? The check for a single relation has already been done in clause[list]_selectivity_ext(), and I'm not sure what examine_opclause_args() would do differently.

In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks first in each loop.

Also in mcv_get_match_bitmap(), the 2 "First check whether the constant is below the lower boundary ..." comments don't make any sense to me. Were those perhaps copied and pasted from somewhere else? They should perhaps say "Otherwise, compare the MCVItem with the constant" and "Otherwise compare the values from the MCVItem using the clause operator", or something like that.

But other than such cosmetic things, I think the patch is good, and gives some nice estimate improvements.

Regards,
Dean
Hi,

On 7/5/21 2:46 PM, Dean Rasheed wrote:
> On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Here is a slightly updated version of the patch
>>
>
> Hi,
>
> I have looked at this in some more detail, and it all looks pretty
> good, other than some mostly cosmetic stuff.
>

Thanks for the review!

> The new code in statext_is_compatible_clause_internal() is a little
> hard to follow because some of the comments aren't right (e.g. when
> checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr)
> as the comment says). Rather than trying to comment on each
> conditional branch, it might be simpler to just have a single
> catch-all comment at the top, and also remove the "return true" in the
> middle, to make it something like:
>
> /*
>  * Check Vars appearing on either side by recursing, and make a note of
>  * any expressions.
>  */
> if (IsA(clause_expr, Var))
> {
>     if (!statext_is_compatible_clause_internal(...))
>         return false;
> }
> else
>     *exprs = lappend(*exprs, clause_expr);
>
> if (clause_expr2)
> {
>     if (IsA(clause_expr2, Var))
>     {
>         if (!statext_is_compatible_clause_internal(...))
>             return false;
>     }
>     else
>         *exprs = lappend(*exprs, clause_expr2);
> }
>
> return true;
>

I ended up doing something slightly different - examine_opclause_args now "returns" a list of expressions, instead of explicitly setting two parameters. That means we can do a simple foreach() here, which seems cleaner. It means we have to extract the expressions from the list in a couple of places, but that seems acceptable. Do you agree?

I also went through the comments and updated those that seemed wrong.

> Is the FIXME comment in examine_opclause_args() necessary? The check
> for a single relation has already been done in
> clause[list]_selectivity_ext(), and I'm not sure what
> examine_opclause_args() would do differently.
>

Yeah, I came to the same conclusion.
> In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
> first in each loop.
>

This is how master already does it, and I wonder if that order is intentional. It's not clear to me that doing it the other way would be faster.

> Also in mcv_get_match_bitmap(), the 2 "First check whether the
> constant is below the lower boundary ..." comments don't make any
> sense to me. Were those perhaps copied and pasted from somewhere else?
> They should perhaps say "Otherwise, compare the MCVItem with the
> constant" and "Otherwise compare the values from the MCVItem using the
> clause operator", or something like that.
>

Yeah, that's another bit that comes from current master - the patch just makes a new copy of the comment. I agree it's bogus; it seems like a leftover from the original code, which did various "smart" things we removed over time. Will fix.

> But other than such cosmetic things, I think the patch is good, and
> gives some nice estimate improvements.
>

Thanks, sounds good. I guess the last thing is maybe mentioning this in the docs, adding an example etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, 20 Jul 2021 at 19:28, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> > The new code in statext_is_compatible_clause_internal() is a little
> > hard to follow because some of the comments aren't right
>
> I ended up doing something slightly different - examine_opclause_args
> now "returns" a list of expressions, instead of explicitly setting two
> parameters. That means we can do a simple foreach() here, which seems
> cleaner. It means we have to extract the expressions from the list in a
> couple places, but that seems acceptable. Do you agree?

Yes, that looks much neater.

> > In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
> > first in each loop.
>
> This is how master already does that now, and I wonder if it's done in
> this order intentionally. It's not clear to me doing it in the other way
> would be faster?

Ah OK, it just felt more natural to do it the other way round. I suppose, though, that for the first clause the is-final check isn't going to catch anything, whereas the is-null checks might. For the remaining clauses, it will depend on the data as to which way is faster, but it probably isn't going to make any noticeable difference either way. So, although it initially seems a bit counter-intuitive, it's probably better the way it is.

> I guess the last thing is maybe mentioning this in
> the docs, adding an example etc.

Yeah, good idea.

Regards,
Dean
> On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> <0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>

Hi Tomas,

I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I found some.

The tests are too large to attach, but the scripts that generate them are not. To perform the tests:

  git checkout master
  perl ./gentest.pl > src/test/regress/sql/gentest.sql
  cat /dev/null > src/test/regress/expected/gentest.out
  echo "test: gentest" >> src/test/regress/parallel_schedule
  ./configure && make && make check
  cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
  patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
  make check
  cat src/test/regress/regression.diffs | perl ./check.pl

This shows patterns of conditions that get worse, such as:

  better:0, worse:80: A < B and A <> A or not A < A
  better:0, worse:80: A < B and not A <= A or A <= A
  better:0, worse:80: A < B or A = A
  better:0, worse:80: A < B or A = A or not A >= A
  better:0, worse:80: A < B or A >= A
  better:0, worse:80: A < B or A >= A and not A <> A
  better:0, worse:80: A < B or not A < A
  better:0, worse:80: A < B or not A <> A
  better:0, worse:80: A < B or not A <> A or A <= A
  better:0, worse:80: A < B or not A >= A or not A < A

It seems things get worse when the conditions contain a column compared against itself. I suspect that case is being handled incorrectly.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 8/9/21 9:19 PM, Mark Dilger wrote:
>
>> On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>> <0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>
>
> Hi Tomas,
>
> I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I found some.
>
> The tests are too large to attach, but the scripts that generate them are not. To perform the tests:
>
> git checkout master
> perl ./gentest.pl > src/test/regress/sql/gentest.sql
> cat /dev/null > src/test/regress/expected/gentest.out
> echo "test: gentest" >> src/test/regress/parallel_schedule
> ./configure && make && make check
> cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
> patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
> make check
> cat src/test/regress/regression.diffs | perl ./check.pl
>
> This shows patterns of conditions that get worse, such as:
>
> better:0, worse:80: A < B and A <> A or not A < A
> better:0, worse:80: A < B and not A <= A or A <= A
> better:0, worse:80: A < B or A = A
> better:0, worse:80: A < B or A = A or not A >= A
> better:0, worse:80: A < B or A >= A
> better:0, worse:80: A < B or A >= A and not A <> A
> better:0, worse:80: A < B or not A < A
> better:0, worse:80: A < B or not A <> A
> better:0, worse:80: A < B or not A <> A or A <= A
> better:0, worse:80: A < B or not A >= A or not A < A
>
> It seems things get worse when the conditions contain a column compared against itself. I suspect that case is being handled incorrectly.
>

Thanks for this testing! I took a quick look, and I think this is mostly due to luck in how the (default) range estimates combine without and with extended statistics.
Consider for example this simple case:

  create table t (a int, b int);
  insert into t select mod(i,10), mod(i,20) from generate_series(1,1000000) s(i);

Without stats, the first of those clauses is estimated like this:

  explain (timing off, analyze)
  select * from t where (A < B and A <> A) or not A < A;

                                      QUERY PLAN
  ----------------------------------------------------------
   Seq Scan on t  (cost=0.00..21925.00 rows=554444 width=8) (actual rows=1000000 loops=1)
     Filter: (((a < b) AND (a <> a)) OR (a >= a))
   Planning Time: 0.054 ms
   Execution Time: 80.485 ms
  (4 rows)

and with MCV on (a,b) it gets estimates like this:

                                      QUERY PLAN
  ----------------------------------------------------------
   Seq Scan on t  (cost=0.00..21925.00 rows=333333 width=8) (actual rows=1000000 loops=1)
     Filter: (((a < b) AND (a <> a)) OR (a >= a))
   Planning Time: 0.152 ms
   Execution Time: 79.917 ms
  (4 rows)

So with the statistics, the estimate gets a bit worse. The reason is fairly simple - if you look at the two parts of the OR clause, we get this:

  clause                 actual   no stats   with stats
  -----------------------------------------------------
  (A < B and A <> A)          0     331667            1
  not (A < A)           1000000     333333       333333

This clearly shows that the first clause is improved, while (A < A) is estimated the same way, because the clause has a single Var, so it's considered "simple" and we ignore the MCV selectivity and just use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual row count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how statext_mcv_clauselist_selectivity treats this clause - the definition of "simple" clauses as "has one attnum" was appropriate when only clauses (Var op Const) were supported. But with (Var op Var) that's probably not correct anymore.
And indeed, commenting out the if condition on line 1933 (so ignoring simple_sel) does improve the estimates for this query. But perhaps I'm missing something; this needs more thought.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
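To see where the 333333 with-stats estimate comes from, here is a back-of-the-envelope sketch (an assumed simplification: OR selectivities combine as s1 + s2 - s1*s2, and the "simple" (a >= a) branch falls back to the default 1/3 range-clause estimate):

```python
# Back-of-the-envelope reconstruction of the with-stats estimate for
# "(a < b and a <> a) or not (a < a)" over 1M rows.
DEFAULT_INEQ_SEL = 1.0 / 3.0

total_rows = 1_000_000

# MCV-based estimate for (a < b AND a <> a): essentially never true,
# clamped to one row.
s1 = 1.0 / total_rows

# "not (a < a)" is negated to (a >= a); treated as a "simple" clause it
# gets the default range-clause estimate, and the MCV list is ignored.
s2 = DEFAULT_INEQ_SEL

# OR clauses combine assuming independence: s1 + s2 - s1*s2.
s_or = s1 + s2 - s1 * s2

estimate = round(total_rows * s_or)   # ~333334, vs. 1000000 actual rows
```

The default 33% for the self-comparison branch dominates the combined result, which is why ignoring simple_sel (or special-casing (A op A) clauses) changes the outcome.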
On Wed, 11 Aug 2021 at 00:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> So with the statistics, the estimate gets a bit worse. The reason is
> fairly simple - if you look at the two parts of the OR clause, we get this:
>
> clause                 actual   no stats   with stats
> -----------------------------------------------------
> (A < B and A <> A)          0     331667            1
> not (A < A)           1000000     333333       333333
>
> This clearly shows that the first clause is clearly improved, while the
> (A < A) is estimated the same way, because the clause has a single Var
> so it's considered to be "simple" so we ignore the MCV selectivity and
> just use the simple_sel calculated by clause_selectivity_ext.
>
> And the 333333 and 331667 just happen to be closer to the actual row
> count. But that's mostly by luck, clearly.
>
> But now that I think about it, maybe the problem really is in how
> statext_mcv_clauselist_selectivity treats this clause - the definition
> of "simple" clauses as "has one attnum" was appropriate when only
> clauses (Var op Const) were supported. But with (Var op Var) that's
> probably not correct anymore.
>

Hmm, interesting. Clearly the fact that the combined estimate without extended stats was better was just luck, based on its large overestimate of the first clause. But it's also true that a (Var op Var) clause should not be treated as simple, because "simple" in this context is meant to be for clauses that are likely to be better estimated with regular stats, whereas in this case, extended stats would almost certainly do better on the second clause.

Perhaps the easiest way to identify simple clauses would be in statext_is_compatible_clause(), rather than the way it's done now, because it has the relevant information at hand, so it could be made to return an extra flag.

This feels like rather an artificial example though. Is there any real use for this sort of clause?

Regards,
Dean
> On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?

The test generated random combinations of clauses and then checked if any had consistently worse performance. These came up. I don't know that they represent anything real.

What was not random in the tests was the data in the tables. I've gotten curious whether these types of clauses (with columns compared against themselves) would still be bad for random rather than orderly data sets. I'll go check....

testing....

Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing to this patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random() data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger-numbered columns will be smaller, and the mcv list will be irrelevant, since values should not tend to repeat.

Over all queries, 47791 have better estimates after the patch, but 34802 had worse estimates after the patch (with the remaining 17407 queries having roughly equal quality).
The worst estimates are still ones that have a column compared to itself:

  better:0, worse:33: A <= B or A <= A or A <= A
  better:0, worse:33: A <= B or A = A or not A <> A
  better:0, worse:33: A <= B or A >= A or not A <> A
  better:0, worse:33: A <> B or A <= A
  better:0, worse:33: A <> B or A <= A or A <> A
  better:0, worse:33: A <> B or A <= A or A >= A
  better:0, worse:33: A <> B or A <= A or not A = A
  better:0, worse:33: A <> B or A > A or not A < A
  better:0, worse:33: A <> B or A >= A
  better:0, worse:33: A <> B or A >= A and A <= A
  better:0, worse:33: A = B or not A > A or not A > A
  better:0, worse:33: A >= B or not A <> A or A = A
  better:0, worse:39: B <= A or B <= B or B <= B
  better:0, worse:39: B <= A or B = B or not B <> B
  better:0, worse:39: B <= A or B >= B or not B <> B
  better:0, worse:39: B <> A or B <= B
  better:0, worse:39: B <> A or B <= B or B <> B
  better:0, worse:39: B <> A or B <= B or B >= B
  better:0, worse:39: B <> A or B <= B or not B = B
  better:0, worse:39: B <> A or B > B or not B < B
  better:0, worse:39: B <> A or B >= B
  better:0, worse:39: B <> A or B >= B and B <= B
  better:0, worse:39: B = A or not B > B or not B > B
  better:0, worse:39: B >= A or not B <> B or B = B

But there are plenty that got worse without that, such as the following examples:

  better:25, worse:39: A < B and A < B or B > A
  better:10, worse:48: A < B and A < C
  better:10, worse:54: A < B and A < C or C > A

I'll go test random data designed to have mcv lists of significance....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/11/21 2:08 PM, Dean Rasheed wrote:
> On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> So with the statistics, the estimate gets a bit worse. The reason is
>> fairly simple - if you look at the two parts of the OR clause, we get this:
>>
>> clause                 actual   no stats   with stats
>> -----------------------------------------------------
>> (A < B and A <> A)          0     331667            1
>> not (A < A)           1000000     333333       333333
>>
>> This clearly shows that the first clause is clearly improved, while the
>> (A < A) is estimated the same way, because the clause has a single Var
>> so it's considered to be "simple" so we ignore the MCV selectivity and
>> just use the simple_sel calculated by clause_selectivity_ext.
>>
>> And the 333333 and 331667 just happen to be closer to the actual row
>> count. But that's mostly by luck, clearly.
>>
>> But now that I think about it, maybe the problem really is in how
>> statext_mcv_clauselist_selectivity treats this clause - the definition
>> of "simple" clauses as "has one attnum" was appropriate when only
>> clauses (Var op Const) were supported. But with (Var op Var) that's
>> probably not correct anymore.
>>
>
> Hmm, interesting. Clearly the fact that the combined estimate without
> extended stats was better was just luck, based on its large
> overestimate of the first clause. But it's also true that a (Var op
> Var) clause should not be treated as simple, because "simple" in this
> context is meant to be for clauses that are likely to be better
> estimated with regular stats, whereas in this case, extended stats
> would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I mean, if you have (A < A) then extended stats pretty much "collapse" into per-column stats. We could get almost the same estimate from a single-column MCV list, etc.
The reason why that does not happen is that we just treat it as a range clause, and assign it a default 33% estimate. But we could make that a bit smarter, and assign better estimates to those clauses:

  (A < A)  => 0.0
  (A = A)  => 1.0
  (A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth it, because (A op A) clauses are probably very rare; OTOH it's cheap.

> Perhaps the easiest way to identify simple clauses would be in
> statext_is_compatible_clause(), rather than the way it's done now,
> because it has the relevant information at hand, so it could be made
> to return an extra flag.
>

Agreed, that seems like a better place to fix this.

> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?
>

True. It seems a bit artificial, which is understandable as it came from a synthetic test generating all possible clauses. OTOH, fixing it seems fairly cheap ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
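The proposed special case could be little more than a lookup table; a hypothetical sketch (names invented here, not from the patch), ignoring NULLs, which a real version would have to handle via the column's null_frac:

```python
# Hypothetical constant selectivities for self-comparison (A op A)
# clauses, per the proposal above. NULLs are ignored for simplicity -
# since NULL op NULL yields NULL (false in WHERE), a real implementation
# would scale these by (1 - null_frac).
SELF_COMPARISON_SEL = {
    "<": 0.0, ">": 0.0, "<>": 0.0,  # a value never differs from itself
    "=": 1.0, "<=": 1.0, ">=": 1.0, # and always equals itself
}

def self_comparison_selectivity(op: str) -> float:
    """Selectivity of (A op A), assuming no NULLs in column A."""
    return SELF_COMPARISON_SEL[op]
```

With these in place, "not (A < A)" would estimate as 1.0 instead of the default 33%, removing the class of regressions seen in the synthetic tests.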
> On Aug 11, 2021, at 7:51 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I'll go test random data designed to have mcv lists of significance....

Done. The data for column_i is set to floor(random()^i*20). column_1 therefore is evenly distributed between 0..19, with successive columns weighted more towards smaller values.

This still gives (marginally) worse results than the original test I posted, but better than the completely random data from the last post. After the patch, 72294 estimates got better and 30654 got worse. The biggest losers from this data set are:

  better:0, worse:31: A >= B or A = A or not A = A
  better:0, worse:31: A >= B or A = A
  better:0, worse:31: A >= B or not A <> A
  better:0, worse:31: A >= A or A = B or not B = A
  better:0, worse:31: A >= B and not A < A or A = A
  better:0, worse:31: A = A or not A > B or B <> A
  better:0, worse:31: A >= B or not A <> A or not A >= A
  better:0, worse:32: B < A and B > C and not C < B     <----
  better:1, worse:65: A <> C and A <= B                 <----
  better:0, worse:33: B <> A or B >= B
  better:0, worse:33: B <> A or B <= B
  better:0, worse:33: B <= A or B = B or not B > B
  better:0, worse:33: B <> A or not B >= B or not B < B
  better:0, worse:33: B = A or not B > B or B = B
  better:0, worse:44: A = B or not A > A or A = A
  better:0, worse:44: A <> B or A <= A
  better:0, worse:44: A <> B or not A >= A or not A < A
  better:0, worse:44: A <= B or A = A or not A > A
  better:0, worse:44: A <> B or A >= A

Of which, a few do not contain columns compared against themselves, marked with <---- above.

I don't really know what to make of these results. It doesn't bother me that any particular estimate gets worse after the patch. That's just the nature of estimating. But it does bother me a bit that some types of estimates consistently get worse. We should either show that my analysis is wrong about that, or find a way to address it to avoid performance regressions. If I'm right that there are whole classes of estimates that are made consistently worse, then it stands to reason some users will have those data distributions and queries, and could easily notice.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/11/21 4:51 PM, Mark Dilger wrote: > > >> On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: >> >> This feels like rather an artificial example though. Is there any real >> use for this sort of clause? > > The test generated random combinations of clauses and then checked if any had consistently worse performance. These cameup. I don't know that they represent anything real. > > What was not random in the tests was the data in the tables. I've gotten curious if these types of clauses (with columnscompared against themselves) would still be bad for random rather than orderly data sets. I'll go check.... > > testing.... > > Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing tothis patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random()data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger numbered columns willbe smaller, and the mcv list will be irrelevant, since values should not tend to repeat. > > Over all queries, 47791 have better estimates after the patch, but 34802 had worse estimates after the patch (with theremaining 17407 queries having roughly equal quality). 
> > The worst estimates are still ones that have a column compared to itself: > > better:0, worse:33: A <= B or A <= A or A <= A > better:0, worse:33: A <= B or A = A or not A <> A > better:0, worse:33: A <= B or A >= A or not A <> A > better:0, worse:33: A <> B or A <= A > better:0, worse:33: A <> B or A <= A or A <> A > better:0, worse:33: A <> B or A <= A or A >= A > better:0, worse:33: A <> B or A <= A or not A = A > better:0, worse:33: A <> B or A > A or not A < A > better:0, worse:33: A <> B or A >= A > better:0, worse:33: A <> B or A >= A and A <= A > better:0, worse:33: A = B or not A > A or not A > A > better:0, worse:33: A >= B or not A <> A or A = A > better:0, worse:39: B <= A or B <= B or B <= B > better:0, worse:39: B <= A or B = B or not B <> B > better:0, worse:39: B <= A or B >= B or not B <> B > better:0, worse:39: B <> A or B <= B > better:0, worse:39: B <> A or B <= B or B <> B > better:0, worse:39: B <> A or B <= B or B >= B > better:0, worse:39: B <> A or B <= B or not B = B > better:0, worse:39: B <> A or B > B or not B < B > better:0, worse:39: B <> A or B >= B > better:0, worse:39: B <> A or B >= B and B <= B > better:0, worse:39: B = A or not B > B or not B > B > better:0, worse:39: B >= A or not B <> B or B = B > The other interesting thing all those clauses have in common is that they're OR clauses. And we handle that a bit differently. But I think the "strange" clauses with the same Var on both sides is the main issue, and not detecting them as "simple" clauses should fix that. > But there are plenty that got worse without that, such as the following examples: > > better:25, worse:39: A < B and A < B or B > A > better:10, worse:48: A < B and A < C > better:10, worse:54: A < B and A < C or C > A > > I'll go test random data designed to have mcv lists of significance.... 
Hard to say without having a look at the data set, but there'll always be cases where the extended stats perform a bit worse, due to (a) luck and (b) the stats covering only a small fraction of the table. But of course, it's worth investigating the suspicious cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/11/21 4:51 PM, Mark Dilger wrote:
>
>> On Aug 11, 2021, at 5:08 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> This feels like rather an artificial example though. Is there any real
>> use for this sort of clause?
>
> The test generated random combinations of clauses and then checked if
> any had consistently worse performance. These came up. I don't
> know that they represent anything real.
>
> What was not random in the tests was the data in the tables. I've
> gotten curious if these types of clauses (with columns compared
> against themselves) would still be bad for random rather than orderly
> data sets. I'll go check....
>
> testing....
>
> Wow. Randomizing the data makes the problems even more extreme. It seems my original test set was actually playing to this patch's strengths, not its weaknesses. I've changed the columns to double precision and filled the columns with random() data, where column1 gets random()^1, column2 gets random()^2, etc. So on average the larger numbered columns will be smaller, and the mcv list will be irrelevant, since values should not tend to repeat.
>

I tried using the same randomized data set, i.e. essentially

  insert into t select random(), pow(random(), 2), pow(random(), 3), pow(random(), 4)
    from generate_series(1,1000000) s(i);
  create statistics s (mcv) on a, b, c from t;

But I don't see any difference compared to the estimates without extended statistics, which is not surprising because there should be no MCV list built. So I'm a bit puzzled about the claim that random data make the problems more extreme. Can you explain?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/11/21 5:17 PM, Mark Dilger wrote: > > >> On Aug 11, 2021, at 7:51 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> >> I'll go test random data designed to have mcv lists of significance.... > > Done. The data for column_i is set to floor(random()^i*20). > column_1 therefore is evenly distributed between 0..19, with > successive columns weighted more towards smaller values. > > This still gives (marginally) worse results than the original test I > posted, but better than the completely random data from the last post. > After the patch, 72294 estimates got better and 30654 got worse. The > biggest losers from this data set are: > > better:0, worse:31: A >= B or A = A or not A = A > better:0, worse:31: A >= B or A = A > better:0, worse:31: A >= B or not A <> A > better:0, worse:31: A >= A or A = B or not B = A > better:0, worse:31: A >= B and not A < A or A = A > better:0, worse:31: A = A or not A > B or B <> A > better:0, worse:31: A >= B or not A <> A or not A >= A > better:0, worse:32: B < A and B > C and not C < B <---- > better:1, worse:65: A <> C and A <= B <---- > better:0, worse:33: B <> A or B >= B > better:0, worse:33: B <> A or B <= B > better:0, worse:33: B <= A or B = B or not B > B > better:0, worse:33: B <> A or not B >= B or not B < B > better:0, worse:33: B = A or not B > B or B = B > better:0, worse:44: A = B or not A > A or A = A > better:0, worse:44: A <> B or A <= A > better:0, worse:44: A <> B or not A >= A or not A < A > better:0, worse:44: A <= B or A = A or not A > A > better:0, worse:44: A <> B or A >= A > > Of which, a few do not contain columns compared against themselves, > marked with <---- above. > > I don't really know what to make of these results. It doesn't > bother me that any particular estimate gets worse after the patch. > That's just the nature of estimating. But it does bother me a bit > that some types of estimates consistently get worse. 
We should > either show that my analysis is wrong about that, or find a way to > address it to avoid performance regressions. If I'm right that there > are whole classes of estimates that are made consistently worse, then > it stands to reason some users will have those data distributions and > queries, and could easily notice. I'm not quite sure that's really a problem. Extended statistics are meant for correlated columns, and it's mostly expected the estimates may be a bit worse for random / independent data. The idea is mostly that statistics will be created only for correlated columns, in which case it should improve the estimates. I'd be way more concerned if you observed consistently worse estimates on such data set. Of course, there may be errors - the incorrect handling of (A op A) is an example of such issue, probably. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 11, 2021, at 10:38 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> So I'm a bit puzzled about the claim that random data make the problems more extreme. Can you explain?

Hmm... you appear to be right. I changed the gentest.pl script to fill the tables with randomized data, but the random data is being regenerated each test run (since the calls to random() are in the gentest.sql file). Adding an explicit setseed() call in the test to make sure the data is the same before and after applying your patch eliminates the differences.

So there are three tests here. The first tests deterministic orderly data. The second tests deterministic random data without repeats and hence without a meaningful MCV list. The third tests deterministic random data with rounding into twenty buckets skewed towards lower numbered buckets and hence with both repeats and a meaningful MCV list.

The original test set:

  TOTAL: better: 77827 worse: 12317

The random test set, with setseed() calls to make it deterministic:

  TOTAL: better: 49708 worse: 19393

The random test set, with setseed() calls to make it deterministic plus rounding into buckets:

  TOTAL: better: 81764 worse: 19594

Once the data is made deterministic, the third set looks slightly better than the first, rather than slightly worse. But almost 20% of the query types still look worse after applying the patch. I'm going to dig deeper into those to see if that conclusion survives bumping up the size of the dataset. It will take quite some time to run the tests with a huge dataset, but I don't see how else to investigate this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
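As an aside for anyone reproducing this kind of comparison: the better/worse tallies above can be computed by comparing each query's estimation error before and after the patch. A small Python sketch of one way to do that (the function name and tolerance are made up for illustration; this is not Mark's actual test harness):

```python
import math

def classify(actual, est_before, est_after, tolerance=1.05):
    """Compare the log-scale estimation error of two row estimates
    against the actual row count returned by EXPLAIN ANALYZE."""
    err_before = abs(math.log(est_before / actual))
    err_after = abs(math.log(est_after / actual))
    if err_after < err_before / tolerance:
        return "better"
    if err_after > err_before * tolerance:
        return "worse"
    return "equal"

# e.g. actual = 100000 rows, old estimate 33333, patched estimate 95000
print(classify(100000, 33333, 95000))
```

Using a log-scale error treats a 10x over-estimate and a 10x under-estimate as equally bad, which matches how planner misestimates tend to hurt.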
On 8/12/21 12:02 AM, Mark Dilger wrote: > > ... > > Once the data is made deterministic, the third set looks slightly > better than the first, rather than slightly worse. But almost 20% of > the query types still look worse after applying the patch. I'm going to > dig deeper into those to see if that conclusion survives bumping up the > size of the dataset. It will take quite some time to run the tests with > a huge dataset, but I don't see how else to investigate this. > As I said in my last reply, I'm not sure it's particularly useful to look at overall results from data sets with independent columns. That's not what extended statistics are for, and people should not create them in those cases ... Maybe it'd be better to focus on cases with the largest difference in estimates, and investigate those more closely. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 11, 2021, at 3:45 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> As I said in my last reply, I'm not sure it's particularly useful to look at overall results from data sets with independent columns. That's not what extended statistics are for, and people should not create them in those cases ...

We sent our last emails more or less simultaneously. I'm not ignoring your email; I just hadn't seen it yet when I sent mine.

> Maybe it'd be better to focus on cases with the largest difference in estimates, and investigate those more closely.

Yeah, I'm working on a correlated stats test as I write this. I'll get back to you when I have results.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/12/21 12:48 AM, Mark Dilger wrote: > > >> On Aug 11, 2021, at 3:45 PM, Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >> >> As I said in my last reply, I'm not sure it's particularly useful >> to look at overall results from data sets with independent columns. >> That's not what extended statistics are for, and people should not >> create them in those cases ... > > We sent our last emails more or less simultaneously. I'm not > ignoring your email; I just hadn't seen it yet when I sent mine. > Apologies, I didn't mean to imply you're ignoring my messages. >> Maybe it'd be better to focus on cases with the largest difference >> in estimates, and investigate those more closely. > > Yeah, I'm working on a correlated stats test as I write this. I'll > get back to you when I have results. > Cool, thanks! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 11, 2021, at 3:48 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I'm working on a correlated stats test as I write this. I'll get back to you when I have results.

Ok, the tests showed no statistically significant regressions. All tests included the same sorts of where-clause expressions as used in the tests from yesterday's email.

The first test created loosely correlated data and found no significant row estimate improvements or regressions.

The second test of more tightly correlated data showed a row estimate improvement overall, with no class of where-clause showing an estimate regression.

I think the apparent regressions from yesterday were just statistical noise.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Mark, This thread inspired me to do something fairly similar - a generator that generates queries of varying complexity, executes them on table with and without extended statistics. I've been thinking about that before, but this finally pushed me to do that, and some of the results are fairly interesting ... I've pushed everything (generator and results) to this github repo: https://github.com/tvondra/stats-test with a summary of all results here: https://github.com/tvondra/stats-test/blob/master/results.md Some basic facts about the generator.py (see query_generator): * It's using a fixed seed to make it deterministic. * A small fraction of generated queries is sampled and executed (5%). * Thanks to a fixed seed we generate/sample the same set of queries for different runs, which allows us to compare runs easily. * The queries use 2 - 5 clauses, either (Var op Const) or (Var op Var). * The operators are the usual equality/inequality ones. * The clauses are combined using AND/OR (also randomly picked). * There's a random set of parens added, to vary the operator precedence (otherwise it'd be driven entirely by AND/OR). * There are two datasets - a random and correlated one, with different number of distinct values in each column (10, 100, 1000, 10000). * The statistics target is set to 10, 100, 1000, 10000. It's a bit hacky, with various bits hard-coded at the moment. But it could be extended to do other stuff fairly easily, I think. Anyway, the repository contains results for three cases: 1) master 2) patched: master with the (Var op Var) patch 3) fixed: patched, with a fix for "simple" clauses (a crude patch) And for each case we have three row counts: * actual (from explain analyze) * estimate without extended stats * estimate with extended stats And then we can calculate "estimation error" as estimate / actual both with and without statistics. Results for two cases can be plotted as a scatter plot, with the two estimation errors as (x,y) values. 
The idea is that this shows how a patch affects estimates - a point (100,10) means that it was 100x over-estimated, and with the patch it's just 10x, and similarly for other points. This is what the charts at

https://github.com/tvondra/stats-test/blob/master/results.md

do, for each combination of parameters (dataset, statistics target and number of distinct values). There's one chart without extended stats, one with extended stats.

An "ideal" chart would look like a single point (1,1) which means "accurate estimates without/with patch", or (?,1) which means "poor estimates before, accurate estimates now". Diagonal means "no change".

In principle, we expect the charts to look like this:

* random: diagonal charts, because there should be no extended stats built, hence no impact on estimates is expected

* correlated: getting closer to 1.0, which looks like a horizontal line in the chart

Consider for example this:

https://github.com/tvondra/stats-test/raw/master/correlated-1000-10.png

which clearly shows that the first patch is almost exactly the same as master, while with the fix the estimates improve significantly (and are almost perfect), at least with the statistics. Without stats there's a bunch of queries that suddenly go from "perfect" to much worse (looks like a vertical line on the left chart).

But there are other "strange" cases with "interesting patterns", like for example

* https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-100-100.png
* https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-1000-100.png
* https://raw.githubusercontent.com/tvondra/stats-test/master/random-10000-10.png

This likely shows the patches are a significant improvement for some queries (either getting better than master, or even making the estimates pretty accurate). But it's probably worth looking into the queries that got worse, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
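To make the generator description above concrete, here is a heavily simplified Python sketch of the same idea (hypothetical; the real generator.py in the stats-test repo hard-codes more details, such as random parenthesization and the two datasets):

```python
import random

OPS = ["=", "<>", "<", "<=", ">", ">="]
COLS = ["a", "b", "c"]

def make_clause(rng):
    # Either (Var op Const) or (Var op Var), picked at random.
    left = rng.choice(COLS)
    op = rng.choice(OPS)
    right = rng.choice(COLS + [str(rng.randint(0, 99))])
    return f"({left} {op} {right})"

def make_query(rng):
    # 2-5 clauses combined with randomly chosen AND/OR.
    n = rng.randint(2, 5)
    clauses = [make_clause(rng) for _ in range(n)]
    query = clauses[0]
    for c in clauses[1:]:
        query += f" {rng.choice(['AND', 'OR'])} {c}"
    return f"SELECT * FROM t WHERE {query}"

# A fixed seed makes the generated query set reproducible across runs,
# so the same sample of queries can be executed against different builds.
rng = random.Random(42)
queries = [make_query(rng) for _ in range(100)]
sample = [q for q in queries if rng.random() < 0.05]
```

The fixed seed is the key design choice: it lets two runs (master vs. patched) be compared query-by-query rather than only in aggregate.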
> On Aug 18, 2021, at 3:43 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I've pushed everything (generator and results) to this github repo

Thanks for the link. I took a very brief look. Perhaps we can combine efforts. I need to make progress on several other patches first, but hope to get back to this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/18/21 3:16 PM, Mark Dilger wrote: > > >> On Aug 18, 2021, at 3:43 AM, Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >> >> I've pushed everything (generator and results) to this github repo > > Thanks for the link. I took a very brief look. Perhaps we can > combine efforts. I need to make progress on several other patches > first, but hope to get back to this. > Sure - it'd be great to combine efforts. That's why I posted my scripts & results. I understand there's plenty other work for both of us, so take your time - no rush. After looking at this for a while, it's clear the main issue is handling of clauses referencing the same Var twice, like for example (a = a) or (a < a). But it's not clear to me if this is something worth fixing, or if extended statistics is the right place to do it. If those clauses are worth the effort, why not to handle them better even without extended statistics? We can easily evaluate these clauses on per-column MCV, because they only reference a single Var. It'd be rather strange if for example select * from t where (a < a) is mis-estimated simply because it can't use extended statistics (there's just a single Var, so we won't consider extended stats), while select * from t where (a < a) and b = 1 suddenly gets much better thanks to extended stats on (a,b), even when (a,b) are perfectly independent. So I think we better make eqsel/ineqsel smarter about estimating those clauses, assuming we consider them important enough. I think we can either reject the patch, which would mean we don't consider (Var op Var) clauses to be common/important enough. Or we need to improve the existing selectivity functions (even those without extended statistics) to handle those clauses in a smarter way. Otherwise there'd be strange/surprising inconsistencies. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 20, 2021, at 11:20 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I think we can either reject the patch, which would mean we don't consider (Var op Var) clauses to be common/important enough. Or we need to improve the existing selectivity functions (even those without extended statistics) to handle those clauses in a smarter way. Otherwise there'd be strange/surprising inconsistencies.

For datatypes with very few distinct values (bool, some enums, etc.) keeping an mcv list of (a,b) pairs seems helpful. The patch may be worth keeping for such cases. In other cases, I don't much see the point. It seems that sampling the fraction of rows where (A op B) is true for any given op would be more helpful.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 20, 2021 at 2:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > After looking at this for a while, it's clear the main issue is handling > of clauses referencing the same Var twice, like for example (a = a) or > (a < a). But it's not clear to me if this is something worth fixing, or > if extended statistics is the right place to do it. > > If those clauses are worth the effort, why not to handle them better > even without extended statistics? We can easily evaluate these clauses > on per-column MCV, because they only reference a single Var. +1. It seems to me that what we ought to do is make "a < a", "a > a", and "a != 0" all have an estimate of zero, and make "a <= a", "a >= a", and "a = a" estimate 1-nullfrac. The extended statistics mechanism can just ignore the first three types of clauses; the zero estimate has to be 100% correct. It can't necessarily ignore the second three cases, though. If the query says "WHERE a = a AND b = 1", "b = 1" may be more or less likely given that a is known to be not null, and extended statistics can tell us that. -- Robert Haas EDB: http://www.enterprisedb.com
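The rule proposed above is cheap because the answer depends only on the operator and the column's null fraction. A Python sketch of the logic (illustration only; the real change would live in the C selectivity functions such as those in selfuncs.c):

```python
def same_var_selectivity(op, nullfrac):
    """Selectivity of the clause (a op a) for a plain Var 'a'.

    The clause is false whenever 'a' is NULL; otherwise its truth
    depends only on the operator, not on the data:
      a < a, a > a, a <> a  -> always false          -> 0.0
      a <= a, a >= a, a = a -> true for all non-NULLs -> 1 - nullfrac
    """
    if op in ("<", ">", "<>"):
        return 0.0
    if op in ("<=", ">=", "="):
        return 1.0 - nullfrac
    raise ValueError(f"unsupported operator: {op}")

# With a 10% NULL fraction, "a = a" keeps 90% of rows, "a < a" keeps none.
```

As noted above, the zero estimates are exactly correct, while the 1-nullfrac cases still interact with other clauses (knowing "a" is not NULL may shift the selectivity of the rest of the WHERE clause).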
On 8/18/21 12:43 PM, Tomas Vondra wrote:
> Hi Mark,
>
> This thread inspired me to do something fairly similar - a generator
> that generates queries of varying complexity, executes them on table
> with and without extended statistics. I've been thinking about that
> before, but this finally pushed me to do that, and some of the results
> are fairly interesting ...
>
> I've pushed everything (generator and results) to this github repo:
>
> https://github.com/tvondra/stats-test
>
> with a summary of all results here:
>
> https://github.com/tvondra/stats-test/blob/master/results.md
>

FWIW I've pushed slightly reworked scripts and results - there are results from two machines - xeon and i5. Xeon is mostly the same as before, with some minor fixes, while i5 does not allow clauses referencing the same column twice (per discussion in this thread).

I think there was a bug in the original plot script, combining incorrect data series in some cases, causing (at least) some of the strange patterns mentioned. I've also made the charts easier to read by splitting the cases into separate plots and using transparency. I've also added png version back, because plotting the .svg is quite slow.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/20/21 8:56 PM, Robert Haas wrote: > On Fri, Aug 20, 2021 at 2:21 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> After looking at this for a while, it's clear the main issue is handling >> of clauses referencing the same Var twice, like for example (a = a) or >> (a < a). But it's not clear to me if this is something worth fixing, or >> if extended statistics is the right place to do it. >> >> If those clauses are worth the effort, why not to handle them better >> even without extended statistics? We can easily evaluate these clauses >> on per-column MCV, because they only reference a single Var. > > +1. > > It seems to me that what we ought to do is make "a < a", "a > a", and > "a != 0" all have an estimate of zero, and make "a <= a", "a >= a", > and "a = a" estimate 1-nullfrac. The extended statistics mechanism can > just ignore the first three types of clauses; the zero estimate has to > be 100% correct. It can't necessarily ignore the second three cases, > though. If the query says "WHERE a = a AND b = 1", "b = 1" may be more > or less likely given that a is known to be not null, and extended > statistics can tell us that. > Yeah, I agree this seems like the right approach (except I guess you meant "a != a" and not "a != 0"). Assuming we want to do something about these clauses at all - I'm still wondering if those clauses are common in practice or just synthetic. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 20, 2021 at 3:32 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Yeah, I agree this seems like the right approach (except I guess you > meant "a != a" and not "a != 0"). Err, yes. > Assuming we want to do something about > these clauses at all - I'm still wondering if those clauses are common > in practice or just synthetic. Well, they are certainly less common than some things, but query generators do a lot of wonky things. Also, as a practical matter, it might be cheaper to do something about them than to not do something about them. I don't really understand the mechanism of operation of the patch, but I guess if somebody writes "WHERE a = b", one thing you could do would be check whether any of the MCVs for a are also MCVs for b, and if so you could estimate something on that basis. If you happened to have extended statistics for (a, b) then I guess you could do even better using, uh, math, or something. But all of that sounds like hard work, and checking whether "a" happens to be the same as "b" sounds super-cheap by comparison. If, as normally will be the case, the two sides are not the same, you haven't really lost anything, because the expenditure of cycles to test varno and varattno for equality must be utterly trivial in comparison with fetching stats data and looping over MCV lists and things. But if on occasion you find out that they are the same, then you win! You can give a more accurate estimate with less computational work. -- Robert Haas EDB: http://www.enterprisedb.com
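The MCV-overlap idea for genuinely distinct columns could be sketched like this (a hypothetical Python illustration that assumes the columns are independent and ignores the non-MCV remainder; it is not what PostgreSQL currently does for restriction clauses):

```python
def mcv_overlap_selectivity(mcv_a, mcv_b):
    """Rough estimate of P(a = b) from two per-column MCV lists.

    mcv_a and mcv_b map values to their frequencies. Assuming a and b
    are independent, each value common to both MCV lists contributes
    freq_a * freq_b to the match probability. Ignoring the non-MCV
    remainders makes this a lower bound.
    """
    return sum(f * mcv_b[v] for v, f in mcv_a.items() if v in mcv_b)

mcv_a = {1: 0.5, 2: 0.3, 3: 0.2}
mcv_b = {2: 0.4, 3: 0.4, 4: 0.2}
# Shared values: 2 contributes 0.3*0.4, 3 contributes 0.2*0.4,
# so the estimate is about 0.2.
sel = mcv_overlap_selectivity(mcv_a, mcv_b)
```

This is essentially the shape of what eqjoinsel does for join clauses; the point in the discussion above is that doing it for single-table (Var = Var) restrictions would be much more work than the trivial same-varno/varattno check.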
Hi, The attached patch series is modified to improve estimates for these special clauses (Var op Var with the same var on both sides) without extended statistics. This is done in 0001, and it seems fairly simple and cheap. The 0002 part is still the same patch as on 2021/07/20. Part 0003 fixes handling of those clauses so that we don't treat them as simple, but it does that by tweaking statext_is_compatible_clause(), as suggested by Dean. It does work, although it's a bit more invasive than simply checking the shape of clause in statext_mcv_clauselist_selectivity. I do have results for the randomly generated queries, and this does improve the situation a lot - pretty much all the queries with (a=a) or (a<a) clauses had terrible estimates, and this fixes that. That being said, I'm still not sure if this is an issue in real-world applications, or whether we're solving something because of synthetic queries generated by the randomized generator. But the checks seem fairly cheap, so maybe it doesn't matter too much. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, Aug 28, 2021 at 6:53 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Hi,
>
> The attached patch series is modified to improve estimates for these
> special clauses (Var op Var with the same var on both sides) without
> extended statistics. This is done in 0001, and it seems fairly simple
> and cheap.
>
> The 0002 part is still the same patch as on 2021/07/20. Part 0003 fixes
> handling of those clauses so that we don't treat them as simple, but it
> does that by tweaking statext_is_compatible_clause(), as suggested by
> Dean. It does work, although it's a bit more invasive than simply
> checking the shape of clause in statext_mcv_clauselist_selectivity.
>
> I do have results for the randomly generated queries, and this does
> improve the situation a lot - pretty much all the queries with (a=a) or
> (a<a) clauses had terrible estimates, and this fixes that.
>
> That being said, I'm still not sure if this is an issue in real-world
> applications, or whether we're solving something because of synthetic
> queries generated by the randomized generator. But the checks seem
> fairly cheap, so maybe it doesn't matter too much.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Hi,
For 0001-Improve-estimates-for-Var-op-Var-with-the-s-20210828.patch :
+ * form (variable op variable) with the save variable on both sides.
typo: save -> same
Cheers
> On Aug 28, 2021, at 6:52 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Part 0003 fixes handling of those clauses so that we don't treat them as simple, but it does that by tweaking statext_is_compatible_clause(), as suggested by Dean.

Function examine_opclause_args() doesn't set issimple to anything in the IsA(rightop, Const) case, but assigns *issimplep = issimple at the bottom. The compiler is not complaining about using a possibly uninitialized variable, but if I change the "return true" on the very next line to "return issimple", the compiler complains quite loudly.

Some functions define bool *issimple, others bool *issimplep and bool issimple. You might want to standardize the naming.

It's difficult to know what "simple" means in extended_stats.c. There is no file-global comment explaining the concept, and functions like compare_scalars_simple don't have correlates named compare_scalars_complex or such, so the reader cannot infer by comparison what the difference might be between a "simple" case and some non-"simple" case. The functions' issimple (or issimplep) arguments are undocumented.

There is a comment:

  /*
   * statext_mcv_clauselist_selectivity
   *      Estimate clauses using the best multi-column statistics.
  ....
   *
   * - simple selectivity: Computed without extended statistics, i.e. as if the
   * columns/clauses were independent.
   *
  ....
   */

but it takes a while to find if you search for "issimple".

In both scalarineqsel_wrapper() and eqsel_internal(), the call to matching_restriction_variables() should usually return false, since comparing a variable to itself is an unusual case. The next call is to get_restriction_variable(), which repeats the work of examining the left and right variables.
So in almost all cases, after throwing away the results of:

  examine_variable(root, left, varRelid, &ldata);
  examine_variable(root, right, varRelid, &rdata);

performed in matching_restriction_variables(), we'll do exactly the same work again (with one variable named differently) in get_restriction_variable():

  examine_variable(root, left, varRelid, vardata);
  examine_variable(root, right, varRelid, &rdata);

That'd be fine if examine_variable() were a cheap function, but it appears not to be. Do you think you could save the results rather than recomputing them? It's a little messy, since these are the only two functions out of about ten which follow this pattern, so you'd have to pass NULLs into get_restriction_variable() from the other eight callers, but it still looks like that would be a win.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 28, 2021 at 9:30 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>> On Aug 28, 2021, at 6:52 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Part 0003 fixes handling of those clauses so that we don't treat them as simple, but it does that by tweaking statext_is_compatible_clause(), as suggested by Dean.
>
> Function examine_opclause_args() doesn't set issimple to anything in the IsA(rightop, Const) case, but assigns *issimplep = issimple at the bottom. The compiler is not complaining about using a possibly uninitialized variable, but if I change the "return true" on the very next line to "return issimple", the compiler complains quite loudly.
>
> Some functions define bool *issimple, others bool *issimplep and bool issimple. You might want to standardize the naming.
>
> It's difficult to know what "simple" means in extended_stats.c. There is no file-global comment explaining the concept, and functions like compare_scalars_simple don't have correlates named compare_scalars_complex or such, so the reader cannot infer by comparison what the difference might be between a "simple" case and some non-"simple" case. The functions' issimple (or issimplep) argument are undocumented.
>
> There is a comment:
>
>   /*
>    * statext_mcv_clauselist_selectivity
>    *      Estimate clauses using the best multi-column statistics.
>   ....
>    *
>    * - simple selectivity: Computed without extended statistics, i.e. as if the
>    * columns/clauses were independent.
>    *
>   ....
>    */
>
> but it takes a while to find if you search for "issimple".
>
> In both scalarineqsel_wrapper() and eqsel_internal(), the call to matching_restriction_variables() should usually return false, since comparing a variable to itself is an unusual case. The next call is to get_restriction_variable(), which repeats the work of examining the left and right variables. So in almost all cases, after throwing away the results of:
>
>   examine_variable(root, left, varRelid, &ldata);
>   examine_variable(root, right, varRelid, &rdata);
>
> performed in matching_restriction_variables(), we'll do exactly the same work again (with one variable named differently) in get_restriction_variable():
>
>   examine_variable(root, left, varRelid, vardata);
>   examine_variable(root, right, varRelid, &rdata);
>
> That'd be fine if examine_variable() were a cheap function, but it appears not to be. Do you think you could save the results rather than recomputing them? It's a little messy, since these are the only two functions out of about ten which follow this pattern, so you'd have to pass NULLs into get_restriction_variable() from the other eight callers, but it still looks like that would be a win.
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Hi,
I wonder whether queries with (a=a) or (a<a) clauses are worth this additional complexity to address.
Has anyone seen such clauses in production queries?
I would think the randomly generated queries should be narrowed a bit to exclude cases like these, where the result of the clause is known regardless of the underlying data.
Cheers
> On Aug 28, 2021, at 10:18 AM, Zhihong Yu <zyu@yugabyte.com> wrote:
>
> I wonder if the queries with (a=a) or (a<a) clauses are worth this additional complexity to address.
> Has anyone seen such clause in production queries ?

You might expect clauses like WHERE FALSE to be unusual, but that phrase gets added quite a lot by query generators. Somebody could add "WHERE a != a" in a misguided attempt to achieve the same thing.

I wouldn't be terribly surprised if query generators might generate queries with a variable number of tables joined together with comparisons between the joined tables, and in the degenerate case of only one table end up with a table column compared against itself.

You could argue that those people need to fix their queries/generators to not do this sort of thing, but the end user affected by such queries may have little ability to fix it.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/28/21 6:30 PM, Mark Dilger wrote:
>
>> On Aug 28, 2021, at 6:52 AM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Part 0003 fixes handling of those clauses so that we don't treat
>> them as simple, but it does that by tweaking
>> statext_is_compatible_clause(), as suggested by Dean.
>
> Function examine_opclause_args() doesn't set issimple to anything in
> the IsA(rightop, Const) case, but assigns *issimplep = issimple at
> the bottom. The compiler is not complaining about using a possibly
> uninitialized variable, but if I change the "return true" on the very
> next line to "return issimple", the compiler complains quite loudly.

Yeah, true. Thanks for noticing this was a bug - I forgot to set the
issimple variable in the first branch.

> Some functions define bool *issimple, others bool *issimplep and bool
> issimple. You might want to standardize the naming.

I think the naming is standard with respect to the surrounding code. If
the other parameters use "p" to mark "pointer" then issimplep is used,
but in other places it's just "issimple". IMHO this is appropriate.

> It's difficult to know what "simple" means in extended_stats.c.
> There is no file-global comment explaining the concept, and functions
> like compare_scalars_simple don't have correlates named
> compare_scalars_complex or such, so the reader cannot infer by
> comparison what the difference might be between a "simple" case and
> some non-"simple" case. The functions' issimple (or issimplep)
> arguments are undocumented.
>
> There is a comment:
>
> /*
>  * statext_mcv_clauselist_selectivity
>  *    Estimate clauses using the best multi-column statistics.
> ....
>  *
>  * - simple selectivity: Computed without extended statistics, i.e. as
>  *   if the columns/clauses were independent.
> ....
>  */
>
> but it takes a while to find if you search for "issimple".

Yeah, true. This was added a while ago when Dean reworked the estimation
(based on MCV), and it seemed clear back then.
But now a comment explaining this concept (and how it affects the
estimation) would be helpful. I'll try digging in the archives for the
details.

> In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> matching_restriction_variables() should usually return false, since
> comparing a variable to itself is an unusual case. The next call is
> to get_restriction_variable(), which repeats the work of examining
> the left and right variables. So in almost all cases, after throwing
> away the results of:
>
>     examine_variable(root, left, varRelid, &ldata);
>     examine_variable(root, right, varRelid, &rdata);
>
> performed in matching_restriction_variables(), we'll do exactly the
> same work again (with one variable named differently) in
> get_restriction_variable():
>
>     examine_variable(root, left, varRelid, vardata);
>     examine_variable(root, right, varRelid, &rdata);
>
> That'd be fine if examine_variable() were a cheap function, but it
> appears not to be. Do you think you could save the results rather
> than recomputing them? It's a little messy, since these are the only
> two functions out of about ten which follow this pattern, so you'd
> have to pass NULLs into get_restriction_variable() from the other
> eight callers, but it still looks like that would be a win.

I had similar concerns, although I don't think those functions are very
expensive compared to the rest of the estimation code. I haven't done
any measurements yet, though.

But I don't think saving the results is the way to go - in a way, we
already store the stats (which seems like the most expensive bit) in
syscache. It seems better to just simplify examine_variable() so that it
does not look up the statistics, which we don't need here at all.

The attached version of the patches fixes the other bugs reported here
so far - most importantly it reworks how we set issimple while examining
the clauses, so that it never skips the initialization.
Hopefully the added comments also explain it a bit more clearly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Mon, Aug 30, 2021 at 9:00 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 8/28/21 6:30 PM, Mark Dilger wrote:
>
>
>> On Aug 28, 2021, at 6:52 AM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Part 0003 fixes handling of those clauses so that we don't treat
>> them as simple, but it does that by tweaking
>> statext_is_compatible_clause(), as suggested by Dean.
>
> Function examine_opclause_args() doesn't set issimple to anything in
> the IsA(rightop, Const) case, but assigns *issimplep = issimple at
> the bottom. The compiler is not complaining about using a possibly
> uninitialized variable, but if I change the "return true" on the very
> next line to "return issimple", the compiler complains quite loudly.
>
Yeah, true. Thanks for noticing this was a bug - I forgot to set the
issimple variable in the first branch.
>
> Some functions define bool *issimple, others bool *issimplep and bool
> issimple. You might want to standardize the naming.
>
I think the naming is standard with respect to the surrounding code. If
the other parameters use "p" to mark "pointer" then issimplep is used,
but in other places it's just "issimple". IMHO this is appropriate.
> It's difficult to know what "simple" means in extended_stats.c.
> There is no file-global comment explaining the concept, and functions
> like compare_scalars_simple don't have correlates named
> compare_scalars_complex or such, so the reader cannot infer by
> comparison what the difference might be between a "simple" case and
> some non-"simple" case. The functions' issimple (or issimplep)
> argument are undocumented.
>
> There is a comment:
>
> /* * statext_mcv_clauselist_selectivity * Estimate clauses using
> the best multi-column statistics. .... * * - simple selectivity:
> Computed without extended statistics, i.e. as if the *
> columns/clauses were independent. * .... */
>
> but it takes a while to find if you search for "issimple".
>
Yeah, true. This was added a while ago when Dean reworked the estimation
(based on MCV), and it seemed clear back then. But now a comment
explaining this concept (and how it affects the estimation) would be
helpful. I'll try digging in the archives for the details.
>
> In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> matching_restriction_variables() should usually return false, since
> comparing a variable to itself is an unusual case. The next call is
> to get_restriction_variable(), which repeats the work of examining
> the left and right variables. So in almost all cases, after throwing
> away the results of:
>
> examine_variable(root, left, varRelid, &ldata);
> examine_variable(root, right, varRelid, &rdata);
>
> performed in matching_restriction_variables(), we'll do exactly the
> same work again (with one variable named differently) in
> get_restriction_variable():
>
> examine_variable(root, left, varRelid, vardata);
> examine_variable(root, right, varRelid, &rdata);
>
> That'd be fine if examine_variable() were a cheap function, but it
> appears not to be. Do you think you could save the results rather
> than recomputing them? It's a little messy, since these are the only
> two functions out of about ten which follow this pattern, so you'd
> have to pass NULLs into get_restriction_variable() from the other
> eight callers, but it still looks like that would be a win.
>
I had similar concerns, although I don't think those functions are very
expensive compared to the rest of the estimation code. I haven't done
any measurements yet, though.
But I don't think saving the results is the way to go - in a way, we
already store the stats (which seems like the most expensive bit) in
syscache. It seems better to just simplify examine_variable() so that it
does not lookup the statistics, which we don't need here at all.
The attached version of the patches fixes the other bugs reported here
so far - most importantly it reworks how we set issimple while examining
the clauses, so that it never skips the initialization. Hopefully the
added comments also explain it a bit more clearly.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
For patch 0002,
+ jointype, sjinfo, rel,
+ &estimatedclauses, false);
+
+ estimated = (bms_num_members(estimatedclauses) == 1);
I took a look at clauselist_apply_dependencies() (called by statext_clauselist_selectivity) where estimatedclauses is modified.
Since the caller would not use the returned Selectivity if the number of elements in estimatedclauses is greater than 1, I wonder
whether a parameter could be added to clauselist_apply_dependencies() that indicates an early return once a second element is added to estimatedclauses.
Cheers
On 8/31/21 00:14, Zhihong Yu wrote:
> Hi,
> For patch 0002,
>
> +        s1 = statext_clauselist_selectivity(root, clauses, varRelid,
> +                                            jointype, sjinfo, rel,
> +                                            &estimatedclauses, false);
> +
> +        estimated = (bms_num_members(estimatedclauses) == 1);
>
> I took a look at clauselist_apply_dependencies() (called by
> statext_clauselist_selectivity) where estimatedclauses is modified.
> Since the caller would not use the returned Selectivity if number of
> elements in estimatedclauses is greater than 1, I wonder
> if a parameter can be added to clauselist_apply_dependencies() which
> indicates early return if the second element is added to estimatedclauses.

Hmmm, I'm not sure I understand your point. Are you suggesting there's a
bug in not updating the bitmap, or would this be an optimization? Can
you give an example of a query affected by this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I finally got around to this patch again, focusing mostly on the first
part that simply returns either 1.0 or 0.0 for Var op Var conditions
(i.e. the part not really using extended statistics).

I have been unhappy about using examine_variable, which does various
expensive things like searching for statistics (which only got worse
because now we're also looking for expression stats). But we don't
really need the stats - we just need to check the Vars match (same
relation, same attribute). So 0002 fixes this.

Which got me thinking that maybe we don't need to restrict this to Var
nodes only. We can just as easily compare arbitrary expressions,
provided it's for the same relation and there are no volatile functions.
So 0003 does this. Conditions with the same complex expression on each
side of an operator are probably fairly rare, but it's cheap, so why not.

The 0004 and 0005 parts are unchanged.

The next step is adding some tests to the first parts, and extending the
tests in the main patch (to also use more complex expressions, if 0003
gets included).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sun, Dec 12, 2021 at 6:04 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 8/31/21 00:14, Zhihong Yu wrote:
> Hi,
> For patch 0002,
>
> + s1 = statext_clauselist_selectivity(root, clauses,
> varRelid,
> + jointype,
> sjinfo, rel,
> +
> &estimatedclauses, false);
> +
> + estimated = (bms_num_members(estimatedclauses) == 1);
>
> I took a look at clauselist_apply_dependencies() (called by
> statext_clauselist_selectivity) where estimatedclauses is modified.
> Since the caller would not use the returned Selectivity if number of
> elements in estimatedclauses is greater than 1, I wonder
> if a parameter can be added to clauselist_apply_dependencies() which
> indicates early return if the second element is added to estimatedclauses.
>
Hmmm, I'm not sure I understand your point. Are you suggesting there's a
bug in not updating the bitmap, or would this be an optimization? Can
you give an example of a query affected by this?
Hi,
My previous comment was from 3 months ago - let me see if I can come up with an example.
Cheers
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 13 Dec 2021 at 02:21, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> I finally got around to this patch again, focusing mostly on the first
> part that simply returns either 1.0 or 0.0 for Var op Var conditions
> (i.e. the part not really using extended statistics).

Just starting to look at this again, starting with 0001 ...

This needs to account for nullfrac, since x = x is only true if x is
not null.

I don't like how matching_restriction_variables() is adding a
non-trivial amount of overhead to each of these selectivity functions,
by calling examine_variable() for each side, duplicating what
get_restriction_variable() does later on.

I think it would probably be better to make the changes in
var_eq_non_const(), where all of this information is available, and get
rid of matching_restriction_variables() (it doesn't look like the new
scalarineqsel_wrapper() code really needs
matching_restriction_variables() - it can just use what
get_restriction_variable() already returns).

Regards,
Dean
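[An editorial sketch of the nullfrac point raised above, not code from the patch: a self-comparison "x op x" can only be true for non-NULL x, so the constant 1.0/0.0 estimates need to be scaled by the column's NULL fraction. The function and operator spelling here are invented for illustration.]

```python
# Sketch (not PostgreSQL's actual code) of the selectivity of a
# column compared against itself, accounting for NULLs:
# "x op x" is NULL (hence not true) whenever x is NULL.

def self_comparison_selectivity(op, nullfrac):
    """Selectivity of (x op x), given the column's NULL fraction."""
    if op in ("=", "<=", ">="):
        return 1.0 - nullfrac   # true for every non-NULL row
    if op in ("<", ">", "<>"):
        return 0.0              # never true, NULL or not
    raise ValueError(f"unsupported operator: {op}")

print(self_comparison_selectivity("=", 0.25))   # 0.75
print(self_comparison_selectivity("<", 0.25))   # 0.0
```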
On 7/9/22 14:04, Dean Rasheed wrote:
> On Mon, 13 Dec 2021 at 02:21, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I finally got around to this patch again, focusing mostly on the first
>> part that simply returns either 1.0 or 0.0 for Var op Var conditions
>> (i.e. the part not really using extended statistics).
>
> Just starting to look at this again, starting with 0001 ...
>
> This needs to account for nullfrac, since x = x is only true if x is
> not null.

Right, I forgot to account for nullfrac.

> I don't like how matching_restriction_variables() is adding a
> non-trivial amount of overhead to each of these selectivity functions,
> by calling examine_variable() for each side, duplicating what
> get_restriction_variable() does later on.

But matching_restriction_variables() does not use examine_variable()
anymore. It did, originally, but 0002 removed all of that. Now it does
just pull_varnos() and that's it. I admit leaving those bits unsquashed
was a bit confusing, but the first part was really 0001+0002+0003.

> I think it would probably be better to make the changes in
> var_eq_non_const(), where all of this information is available, and get
> rid of matching_restriction_variables() (it doesn't look like the new
> scalarineqsel_wrapper() code really needs
> matching_restriction_variables() - it can just use what
> get_restriction_variable() already returns).

I'm not sure how that could help. var_eq_non_const only really applies
to eqsel_internal(), so we'd still need to deal with various other
places for inequality operators.

Attached is a rebased and somewhat cleaned up version of the patch
series, addressing the review comments so far and squashing the bits I
previously kept separate to showcase the changes.

I've also added a bunch of regression tests - queries with (Var op Var)
clauses of varying complexity, to demonstrate the effect of each patch.
I added them as 0001, so it's clear how the individual patches affect
the results.

I've also written a single generator that generates both data and
queries with (Var op Var) clauses, and then runs them on multiple
connections to compare the estimates. It requires some effort to run
(setting up the clusters, ...) but shouldn't be too hard to get it
working.

The results are pretty massive (thousands of queries), but a simple
summary showing percentiles (0.25, 0.5, 0.75, 0.9) of the estimation
error, calculated as (+1 to deal with 0 actual rows)

    abs(estimated - actual) / (actual + 1)

looks like this:

 clauses | stats |          master           |        patched
---------+-------+---------------------------+------------------------
       1 | no    | {0.39, 0.67, 4.17, 10000} | {0.29, 0.67, 1, 10000}
       2 | no    | {0.38, 0.79, 50, 9950}    | {0.26, 0.73, 1, 3333}
       3 | no    | {0.3, 0.84, 50, 3317}     | {0.22, 0.78, 1, 1111}
       4 | no    | {0.24, 0.84, 25, 1852}    | {0.14, 0.78, 1, 370}
       5 | no    | {0.2, 0.85, 17, 1100}     | {0.11, 0.78, 1, 50}
       1 | yes   | {0.39, 0.67, 4.17, 10000} | {0, 0.14, 1, 1}
       2 | yes   | {0.38, 0.79, 50, 9950}    | {0, 0.15, 1, 1}
       3 | yes   | {0.3, 0.84, 50, 3317}     | {0, 0.15, 1, 1}
       4 | yes   | {0.24, 0.84, 25, 1852}    | {0, 0.17, 1, 1}
       5 | yes   | {0.2, 0.85, 17, 1100}     | {0, 0.14, 1, 1}

This seems pretty good, IMO. Without extended stats on the columns,
there's only so much we can do, but even then the errors got much
smaller. With the stats it's clearly way better.

Of course, take this with a grain of salt - those are randomly generated
synthetic queries, with all queries being considered equally "likely".
But clearly some queries are more likely to appear in applications, and
those are more important to estimate well. However, the point of this
was to see if there are classes of queries that would get much worse,
and I haven't found anything like that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
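[An editorial sketch of the error metric used in the summary above. The metric itself - abs(estimated - actual) / (actual + 1) - is from the message; the sample (estimated, actual) pairs and the nearest-rank percentile helper are invented for illustration.]

```python
# Sketch of the estimation-error summary described above:
# error = abs(estimated - actual) / (actual + 1), with percentiles
# computed over a batch of (estimated, actual) row-count pairs.

def estimation_error(estimated, actual):
    # +1 in the denominator avoids division by zero when actual == 0
    return abs(estimated - actual) / (actual + 1)

def percentiles(errors, quantiles=(0.25, 0.5, 0.75, 0.9)):
    # Nearest-rank style percentiles; good enough for a summary.
    s = sorted(errors)
    return [s[min(int(q * len(s)), len(s) - 1)] for q in quantiles]

# Invented sample: (estimated rows, actual rows) for five queries.
pairs = [(33333, 0), (1, 0), (100, 90), (50, 200), (10, 10)]
errs = [estimation_error(e, a) for e, a in pairs]
print(percentiles(errs))
```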
Attachment
On Thu, 21 Jul 2022 at 12:42, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> > This needs to account for nullfrac, since x = x is only true if x is
> > not null.
>
> Right, I forgot to account for nullfrac.

Ditto variable <= variable.

> > I don't like how matching_restriction_variables() is adding a
> > non-trivial amount of overhead to each of these selectivity functions,
> > by calling examine_variable() for each side, duplicating what
> > get_restriction_variable() does later on.
>
> But matching_restriction_variables() does not use examine_variable()
> anymore. It did, originally, but 0002 removed all of that. Now it does
> just pull_varnos() and that's it. I admit leaving those bits unsquashed
> was a bit confusing, but the first part was really 0001+0002+0003.

Ah, right, I only looked at 0001, because I thought that was the main
part that I hadn't previously looked at.

So my previous concern with matching_restriction_variables() was how
many extra cycles it was adding to test for what should be a pretty
uncommon case. Looking at the new version, I think it's a lot better,
but perhaps it would be more efficient to call equal() as soon as you've
extracted "left" and "right", so it can bail out earlier and faster when
they're not equal. I think a call to equal() should be very fast
compared to pull_varnos() and contain_volatile_functions().

> Attached is a rebased and somewhat cleaned up version of the patch
> series, addressing the review comments so far and squashing the bits I
> previously kept separate to showcase the changes.
>
> I've also added a bunch of regression tests - queries with (Var op Var)
> clauses of varying complexity, to demonstrate the effect of each patch.
> I added them as 0001, so it's clear how the individual patches affect
> the results.

Cool. That's much clearer, and the results look quite good.

The main thing that jumps out at me now is the whole "issimple"
processing stuff.
Those changes turned out to be a lot more invasive than I thought. I
don't think this part is correct:

    /*
     * All AND/OR clauses are considered complex, even if all arguments
     * are simple clauses. For NOT clauses we need to check the argument
     * and then we can update the flag.
     *
     * XXX Maybe for AND/OR we should check if all arguments reference
     * the same attnum, and consider them complex only when there are
     * multiple attnum values (i.e. different Vars)?
     */

I think the XXX part of that comment is right, and that's what the
original code did.

I had to go remind myself what "simple" was intended for, so apologies
if this is really pedestrian: the basic idea of a "simple" clause was
meant to mean any clause that's likely to be better estimated by
standard statistics, rather than extended statistics (and "issimple"
only applies when such clauses are combined using "OR"). So, for
example, given a query with

    WHERE a = 1 OR (b > 0 AND b < 10)

both "a = 1" and "b > 0 AND b < 10" should be considered "simple", since
the standard statistics code is likely to estimate them quite well. For
the second clause, it might use histograms and the RangeQueryClause
machinery, which ought to work well, whereas applying a multivariate MCV
list to "b > 0 AND b < 10" in isolation would probably make its estimate
worse.

So then, in this example, what the "OR" handling in
statext_mcv_clauselist_selectivity() would end up doing is:

    P(a = 1 OR (b > 0 AND b < 10))
      = P(a = 1)
      + P(b > 0 AND b < 10)
      - P(a = 1 AND b > 0 AND b < 10)    # overlap

and only use extended statistics for the overlap term, since the other
two clauses are "simple", and best estimated without extended stats.

The patch changes that, which ends up making things worse in some cases.
Is it the case that the only reason for changing the "issimple" handling
was because the standard statistics code didn't work well for things
like "a < a", and so we wanted to treat that as not-simple?
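[An editorial sketch of the OR handling described above, not code from the tree: the two "simple" selectivities come from standard statistics, and only the overlap term comes from the multivariate MCV. The numbers follow the example data used later in the thread (10,000 rows, 9 rows matching each clause, and both clauses selecting the same rows) and are illustrative.]

```python
# Sketch of combining "simple" per-clause selectivities with an
# extended-statistics estimate for the overlap term only:
# P(A OR B) = P(A) + P(B) - P(A AND B)

def or_selectivity(s1, s2, s_overlap):
    return s1 + s2 - s_overlap

# "Simple" estimates from standard (per-column) statistics:
s_a = 9 / 10000        # a = 1
s_b = 9 / 10000        # b > 0 AND b < 10 (histogram/range machinery)

# Overlap P(a = 1 AND b > 0 AND b < 10) from the multivariate MCV;
# in this data the two clauses happen to select the same 9 rows.
s_both = 9 / 10000

rows = round(or_selectivity(s_a, s_b, s_both) * 10000)
print(rows)  # 9
```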
If so, given the selfuncs improvements, perhaps that's no longer
necessary, and the original definition of "simple" is OK. IOW, is 0004
necessary, given 0002?

I notice that 0004 didn't change any test results, so if there are cases
where it improves things, they aren't tested. Here's a simple example
using the WHERE clause above. Without the patch, extended stats improved
the estimate, but with the patch they don't anymore:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYSE foo;

EXPLAIN ANALYSE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);

Estimated: 18
Actual: 9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYSE foo;

EXPLAIN ANALYSE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);

Estimated: 9 without the patch, 18 with the patch
Actual: 9

It's a worry that none of the existing regression tests picked up on
that. Perhaps a similar test could be added using the existing test
data. Otherwise, I think it'd be worth adding a new test with similar
data to the above.

Regards,
Dean
On Fri, Jul 22, 2022 at 02:17:47PM +0100, Dean Rasheed wrote:
> It's a worry that none of the existing regression tests picked up on
> that. Perhaps a similar test could be added using the existing test
> data. Otherwise, I think it'd be worth adding a new test with similar
> data to the above.

This feedback has not been answered for two months, so I have marked
this entry as Returned with Feedback (RwF) for now.

--
Michael