Thread: Add support for (Var op Var) clause in extended MCV statistics

Add support for (Var op Var) clause in extended MCV statistics

From
Ilia Evdokimov
Date:

Hi hackers,

I'd like to submit a patch that improves the estimated rows for queries containing (Var op Var) clauses by applying extended MCV statistics.

New functions:

  • mcv_clauselist_selectivity_var_op_var() - calculates the selectivity for (Var op Var) clauses.
  • is_opclause_var_op_var() - Checks whether a clause is of the (Var op Var) form.

Implementation Details:

  • A new 'if' statement was added to the 'clause_selectivity_ext()' function to handle (Var op Var) clauses. This allows the process to locate matching MCV extended statistics and calculate selectivity using the newly introduced function.
  • Additionally, I added 'if' statement in statext_is_compatible_clause_internal() function to determine which columns are included in the clause, find matching extended statistics, and then calculate selectivity through the new function. I did the same in mcv_get_match_bitmap() to check what values are true for (Var op Var).
  • To support this, I created a new enum type to differentiate between OR/AND and (Var op Var) clauses.

Examples:

create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
    Estimated:   33333
    Actual:       100000

explain select * from t where a > b;
`
    Estimated:   33333
    Actual:       100000

create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000  
    Actual:                             100000

explain select * from t where a > b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000  
    Actual:                             100000


If you want to see more examples, see regress tests in the patch.

Previous thread:

This feature was originally developed two years ago in [1], and at that time, the approach was almost the same. My implementation uses dedicated functions and 'if' statements directly for better readability and maintainability. Additionally, there was a bug in the previous approach that has been resolved with my patch. Here’s an example of the bug and its fix:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated:   18
    Actual:           9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated previous patch:  18
    Estimated current patch:      9
    Actual:                                  9


[1]: https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com

I look forward to any feedback or suggestions from the community.

Best regars,
Ilia Evdokimov
Tantor Labs LLC.

Attachment

Re: Add support for (Var op Var) clause in extended MCV statistics

From
Ilia Evdokimov
Date:
Another issue mentioned in [1] involves cases where the clause is in the 
form (A op A). In my view, this isn't related to the current patch, as 
it can be addressed by rewriting the clause, similar to transforming A = 
A into A IS NOT NULL. This adjustment would result in more accurate 
estimation.

[1]: 
https://www.postgresql.org/message-id/7C0F91B5-8A43-428B-8D31-556458720305%40enterprisedb.com

Ilia Evdokimov,
Tantor Labs LLC.




Re: Add support for (Var op Var) clause in extended MCV statistics

From
Alena Rybakina
Date:
Hi! I think your work is important)

I started reviewing it and want to suggest some changes to better code: 
I think we should consider the case where the expression is not neither 
an OpExpr and VarOpVar expression.

Have you tested this code with any benchmarks?

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Add support for (Var op Var) clause in extended MCV statistics

From
Tomas Vondra
Date:
On 8/12/24 13:44, Alena Rybakina wrote:
> Hi! I think your work is important)
> 

I agree, and I'm grateful someone picked up the original patch. I'll try
to help to keep it moving forward. If the thread gets stuck, feel free
to ping me to take a look.

> I started reviewing it and want to suggest some changes to better code:
> I think we should consider the case where the expression is not neither
> an OpExpr and VarOpVar expression.
> 

Do you have some specific type of clauses in mind? Most of the extended
statistics only really handles this type of clauses, so I'm not sure
it's feasible to extend that - at least not in this patch.

> Have you tested this code with any benchmarks?
>
FWIW I think we need to test two things - that it (a) improves the
estimates and (b) does not have significant overhead.



regards

-- 
Tomas Vondra



Re: Add support for (Var op Var) clause in extended MCV statistics

From
Ilia Evdokimov
Date:
On 12.8.24 14:53, Tomas Vondra wrote:

> I agree, and I'm grateful someone picked up the original patch. I'll try
> to help to keep it moving forward. If the thread gets stuck, feel free
> to ping me to take a look.
Good. Thank you!
>> I started reviewing it and want to suggest some changes to better code:
>> I think we should consider the case where the expression is not neither
>> an OpExpr and VarOpVar expression.
>>
> Do you have some specific type of clauses in mind? Most of the extended
> statistics only really handles this type of clauses, so I'm not sure
> it's feasible to extend that - at least not in this patch.

I agree with Alena that we need to consider the following clauses: (Expr 
op Var), (Var op Expr) and (Expr op Expr). And we need to return false 
in these cases because we did it before my patch in

         /* Check if the expression has the right shape */
         if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL))
             return false;

In is_opclause_var_op_var() function it is really useless local Node 
*expr_left, *expr_right variables. However, we can't assign them NULL at 
the begin because if I passed not-null pointers I have to return the 
values. Otherwise remain them NULL.

Nevertheless, thank you for review, Alena.

>> Have you tested this code with any benchmarks?
>>
> FWIW I think we need to test two things - that it (a) improves the
> estimates and (b) does not have significant overhead.
Yes, but only TPC-B. And the performance did not drop. In general, it'd 
be better to do more tests and those listed by Tomas with new attached 
patch.
Attachment

Re: Add support for (Var op Var) clause in extended MCV statistics

From
Tomas Vondra
Date:
On 8/12/24 17:57, Ilia Evdokimov wrote:
> On 12.8.24 14:53, Tomas Vondra wrote:
> 
>> I agree, and I'm grateful someone picked up the original patch. I'll try
>> to help to keep it moving forward. If the thread gets stuck, feel free
>> to ping me to take a look.
> Good. Thank you!
>>> I started reviewing it and want to suggest some changes to better code:
>>> I think we should consider the case where the expression is not neither
>>> an OpExpr and VarOpVar expression.
>>>
>> Do you have some specific type of clauses in mind? Most of the extended
>> statistics only really handles this type of clauses, so I'm not sure
>> it's feasible to extend that - at least not in this patch.
> 
> I agree with Alena that we need to consider the following clauses: (Expr
> op Var), (Var op Expr) and (Expr op Expr). And we need to return false
> in these cases because we did it before my patch in
> 
>         /* Check if the expression has the right shape */
>         if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL))
>             return false;
> 
> In is_opclause_var_op_var() function it is really useless local Node
> *expr_left, *expr_right variables. However, we can't assign them NULL at
> the begin because if I passed not-null pointers I have to return the
> values. Otherwise remain them NULL.
> 
> Nevertheless, thank you for review, Alena.
> 

Ah, right. I agree we should handle clauses with expressions.

I don't recall why I wrote is_opclause_var_op_var() like this, but I
believe this was before we allowed extended statistics on expressions
(which was added in 2021, the patch is from 2020). I don't see why it
could not return expressions, but I haven't tried.

>>> Have you tested this code with any benchmarks?
>>>
>> FWIW I think we need to test two things - that it (a) improves the
>> estimates and (b) does not have significant overhead.
> Yes, but only TPC-B. And the performance did not drop. In general, it'd
> be better to do more tests and those listed by Tomas with new attached
> patch.

Is TPC-B really interesting/useful for this patch? The queries are super
simple, with only a single clause (so it may not even get to the code
handling extended statistics). Did you create any extended stats?

I think you'll need to construct a custom test, with queries that have
multiple (var op var) clauses, extended stats created, etc. And
benchmark that.

FWIW I don't think it makes sense to benchmark the query execution - if
the estimate improves, it's possible to get arbitrary speedup, but
that's expected and mostly mostly irrelevant I think.

What I'd focus on is benchmarking just the query planning - we need the
overhead to be negligible (or at least small) so that it does not hurt
people with already good plans.

BTW can you elaborate why you are interested in this patch? Do you just
think it's interesting/useful, or do you have a workload where it would
actually help? I'm asking because me being uncertain how beneficial this
is in practice (not just being nice in theory) was one of the reasons
why I didn't do more work on this in 2021.


regards

-- 
Tomas Vondra