Hi hackers,
I'd like to submit a patch that improves the estimated rows for queries containing (Var op Var) clauses by applying extended MCV statistics.
New functions:
- mcv_clauselist_selectivity_var_op_var() - calculates the selectivity for (Var op Var) clauses.
- is_opclause_var_op_var() - Checks whether a clause is of the (Var op Var) form.
Implementation Details:
- A new 'if' statement was added to the 'clause_selectivity_ext()' function to handle (Var op Var) clauses. This allows the process to locate matching MCV extended statistics and calculate selectivity using the newly introduced function.
- Additionally, I added 'if' statement in statext_is_compatible_clause_internal() function to determine which columns are included in the clause, find matching extended statistics, and then calculate selectivity through the new function. I did the same in mcv_get_match_bitmap() to check what values are true for (Var op Var).
- To support this, I created a new enum type to differentiate between OR/AND and (Var op Var) clauses.
Examples:
create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
Estimated: 33333
Actual: 100000
explain select * from t where a > b;
`
Estimated: 33333
Actual: 100000
create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
Estimated without patch: 33333
Estimated with patch: 100000
Actual: 100000
explain select * from t where a > b;
`
Estimated without patch: 33333
Estimated with patch: 100000
Actual: 100000
If you want to see more examples, see regress tests in the patch.
Previous thread:
This feature was originally developed two years ago in [1], and at that time, the approach was almost the same. My implementation uses dedicated functions and 'if' statements directly for better readability and maintainability. Additionally, there was a bug in the previous approach that has been resolved with my patch. Here’s an example of the bug and its fix:
CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
Estimated: 18
Actual: 9
CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
Estimated previous patch: 18
Estimated current patch: 9
Actual: 9
[1]: https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com
I look forward to any feedback or suggestions from the community.
Best regars,
Ilia Evdokimov
Tantor Labs LLC.