Add support for (Var op Var) clause in extended MCV statistics - Mailing list pgsql-hackers

From Ilia Evdokimov
Subject Add support for (Var op Var) clause in extended MCV statistics
Date
Msg-id 46d09d66-1692-4a12-a106-06981a47cb28@tantorlabs.com
Whole thread Raw
Responses Re: Add support for (Var op Var) clause in extended MCV statistics
Re: Add support for (Var op Var) clause in extended MCV statistics
List pgsql-hackers

Hi hackers,

I'd like to submit a patch that improves the estimated rows for queries containing (Var op Var) clauses by applying extended MCV statistics.

New functions:

  • mcv_clauselist_selectivity_var_op_var() - calculates the selectivity for (Var op Var) clauses.
  • is_opclause_var_op_var() - Checks whether a clause is of the (Var op Var) form.

Implementation Details:

  • A new 'if' statement was added to the 'clause_selectivity_ext()' function to handle (Var op Var) clauses. This allows the process to locate matching MCV extended statistics and calculate selectivity using the newly introduced function.
  • Additionally, I added 'if' statement in statext_is_compatible_clause_internal() function to determine which columns are included in the clause, find matching extended statistics, and then calculate selectivity through the new function. I did the same in mcv_get_match_bitmap() to check what values are true for (Var op Var).
  • To support this, I created a new enum type to differentiate between OR/AND and (Var op Var) clauses.

Examples:

create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
    Estimated:   33333
    Actual:       100000

explain select * from t where a > b;
`
    Estimated:   33333
    Actual:       100000

create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000  
    Actual:                             100000

explain select * from t where a > b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000  
    Actual:                             100000


If you want to see more examples, see regress tests in the patch.

Previous thread:

This feature was originally developed two years ago in [1], and at that time, the approach was almost the same. My implementation uses dedicated functions and 'if' statements directly for better readability and maintainability. Additionally, there was a bug in the previous approach that has been resolved with my patch. Here’s an example of the bug and its fix:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated:   18
    Actual:           9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated previous patch:  18
    Estimated current patch:      9
    Actual:                                  9


[1]: https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com

I look forward to any feedback or suggestions from the community.

Best regars,
Ilia Evdokimov
Tantor Labs LLC.

Attachment

pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state
Next
From: John Naylor
Date:
Subject: Re: ECPG cleanup and fix for clang compile-time problem