Thread: Add support for (Var op Var) clause in extended MCV statistics
Hi hackers,
I'd like to submit a patch that improves the estimated rows for queries containing (Var op Var) clauses by applying extended MCV statistics.
New functions:
- mcv_clauselist_selectivity_var_op_var() - calculates the selectivity for (Var op Var) clauses.
- is_opclause_var_op_var() - Checks whether a clause is of the (Var op Var) form.
Implementation Details:
- A new 'if' statement was added to the 'clause_selectivity_ext()' function to handle (Var op Var) clauses. This allows the process to locate matching MCV extended statistics and calculate selectivity using the newly introduced function.
- Additionally, I added 'if' statement in statext_is_compatible_clause_internal() function to determine which columns are included in the clause, find matching extended statistics, and then calculate selectivity through the new function. I did the same in mcv_get_match_bitmap() to check what values are true for (Var op Var).
- To support this, I created a new enum type to differentiate between OR/AND and (Var op Var) clauses.
Examples:
create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
Estimated: 33333
Actual: 100000
explain select * from t where a > b;
`
Estimated: 33333
Actual: 100000
create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
Estimated without patch: 33333
Estimated with patch: 100000
Actual: 100000
explain select * from t where a > b;
`
Estimated without patch: 33333
Estimated with patch: 100000
Actual: 100000
If you want to see more examples, see regress tests in the patch.
Previous thread:
This feature was originally developed two years ago in [1], and at that time, the approach was almost the same. My implementation uses dedicated functions and 'if' statements directly for better readability and maintainability. Additionally, there was a bug in the previous approach that has been resolved with my patch. Here’s an example of the bug and its fix:
CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
Estimated: 18
Actual: 9
CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
Estimated previous patch: 18
Estimated current patch: 9
Actual: 9
[1]: https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com
I look forward to any feedback or suggestions from the community.
Best regars,
Ilia Evdokimov
Tantor Labs LLC.
Attachment
Another issue mentioned in [1] involves cases where the clause is in the form (A op A). In my view, this isn't related to the current patch, as it can be addressed by rewriting the clause, similar to transforming A = A into A IS NOT NULL. This adjustment would result in more accurate estimation. [1]: https://www.postgresql.org/message-id/7C0F91B5-8A43-428B-8D31-556458720305%40enterprisedb.com Ilia Evdokimov, Tantor Labs LLC.
Hi! I think your work is important) I started reviewing it and want to suggest some changes to better code: I think we should consider the case where the expression is not neither an OpExpr and VarOpVar expression. Have you tested this code with any benchmarks? -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 8/12/24 13:44, Alena Rybakina wrote: > Hi! I think your work is important) > I agree, and I'm grateful someone picked up the original patch. I'll try to help to keep it moving forward. If the thread gets stuck, feel free to ping me to take a look. > I started reviewing it and want to suggest some changes to better code: > I think we should consider the case where the expression is not neither > an OpExpr and VarOpVar expression. > Do you have some specific type of clauses in mind? Most of the extended statistics only really handles this type of clauses, so I'm not sure it's feasible to extend that - at least not in this patch. > Have you tested this code with any benchmarks? > FWIW I think we need to test two things - that it (a) improves the estimates and (b) does not have significant overhead. regards -- Tomas Vondra
On 12.8.24 14:53, Tomas Vondra wrote: > I agree, and I'm grateful someone picked up the original patch. I'll try > to help to keep it moving forward. If the thread gets stuck, feel free > to ping me to take a look. Good. Thank you! >> I started reviewing it and want to suggest some changes to better code: >> I think we should consider the case where the expression is not neither >> an OpExpr and VarOpVar expression. >> > Do you have some specific type of clauses in mind? Most of the extended > statistics only really handles this type of clauses, so I'm not sure > it's feasible to extend that - at least not in this patch. I agree with Alena that we need to consider the following clauses: (Expr op Var), (Var op Expr) and (Expr op Expr). And we need to return false in these cases because we did it before my patch in /* Check if the expression has the right shape */ if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL)) return false; In is_opclause_var_op_var() function it is really useless local Node *expr_left, *expr_right variables. However, we can't assign them NULL at the begin because if I passed not-null pointers I have to return the values. Otherwise remain them NULL. Nevertheless, thank you for review, Alena. >> Have you tested this code with any benchmarks? >> > FWIW I think we need to test two things - that it (a) improves the > estimates and (b) does not have significant overhead. Yes, but only TPC-B. And the performance did not drop. In general, it'd be better to do more tests and those listed by Tomas with new attached patch.
Attachment
On 8/12/24 17:57, Ilia Evdokimov wrote: > On 12.8.24 14:53, Tomas Vondra wrote: > >> I agree, and I'm grateful someone picked up the original patch. I'll try >> to help to keep it moving forward. If the thread gets stuck, feel free >> to ping me to take a look. > Good. Thank you! >>> I started reviewing it and want to suggest some changes to better code: >>> I think we should consider the case where the expression is not neither >>> an OpExpr and VarOpVar expression. >>> >> Do you have some specific type of clauses in mind? Most of the extended >> statistics only really handles this type of clauses, so I'm not sure >> it's feasible to extend that - at least not in this patch. > > I agree with Alena that we need to consider the following clauses: (Expr > op Var), (Var op Expr) and (Expr op Expr). And we need to return false > in these cases because we did it before my patch in > > /* Check if the expression has the right shape */ > if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL)) > return false; > > In is_opclause_var_op_var() function it is really useless local Node > *expr_left, *expr_right variables. However, we can't assign them NULL at > the begin because if I passed not-null pointers I have to return the > values. Otherwise remain them NULL. > > Nevertheless, thank you for review, Alena. > Ah, right. I agree we should handle clauses with expressions. I don't recall why I wrote is_opclause_var_op_var() like this, but I believe this was before we allowed extended statistics on expressions (which was added in 2021, the patch is from 2020). I don't see why it could not return expressions, but I haven't tried. >>> Have you tested this code with any benchmarks? >>> >> FWIW I think we need to test two things - that it (a) improves the >> estimates and (b) does not have significant overhead. > Yes, but only TPC-B. And the performance did not drop. In general, it'd > be better to do more tests and those listed by Tomas with new attached > patch. Is TPC-B really interesting/useful for this patch? The queries are super simple, with only a single clause (so it may not even get to the code handling extended statistics). Did you create any extended stats? I think you'll need to construct a custom test, with queries that have multiple (var op var) clauses, extended stats created, etc. And benchmark that. FWIW I don't think it makes sense to benchmark the query execution - if the estimate improves, it's possible to get arbitrary speedup, but that's expected and mostly mostly irrelevant I think. What I'd focus on is benchmarking just the query planning - we need the overhead to be negligible (or at least small) so that it does not hurt people with already good plans. BTW can you elaborate why you are interested in this patch? Do you just think it's interesting/useful, or do you have a workload where it would actually help? I'm asking because me being uncertain how beneficial this is in practice (not just being nice in theory) was one of the reasons why I didn't do more work on this in 2021. regards -- Tomas Vondra