Thread: Implement predicate propagation for non-equivalence clauses

Implement predicate propagation for non-equivalence clauses

From
Richard Guo
Date:
Hi,

As we know, current planner will generate additional restriction clauses from
equivalence clauses. This will generally lower the total cost because some of
tuples may be filtered out before joins.

In this patch, we are trying to do the similar deduction, from non-equivalence
clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
restrictions on f. It can be turned on/off by GUC enable_predicate_propagation.
This is ported from Greenplum database.

The idea is to collect the non-equivalence clauses in a dedicated list. Then
for each RestrictInfo in the list, we replace its expression with another
equivalent one and make a new RestrictInfo. This is done after all equivalence
classes have been built. For example, if a RestrictInfo stands for A>10, and
if A and B are in the same equivalence class, we generate a new RestrictInfo by
replacing A with B, so we get B>10.

This patch will introduce extra cost for relation scan, due to the cost of
evaluating the new implied quals. Meanwhile, since the extra filter may reduce
the number of tuples returned by the scan, it may lower the cost of following
joins. So, whether we will get a better plan depends on the selectivity of the
implied quals.

Below is a simple demo, in which this patch will deduce an addition qual (b.i <
10 in the first query and b.i > 10 in the second query).

create table a (i int);
create table b (i int);
insert into a select generate_series(1,10000);
insert into b select generate_series(1,10000);
analyze a;
analyze b;


-- low selectivity

set enable_predicate_propagation to off;
explain select * from a,b where a.i = b.i and a.i < 10;
                          QUERY PLAN
---------------------------------------------------------------
 Hash Join  (cost=170.11..352.70 rows=9 width=8)
   Hash Cond: (b.i = a.i)
   ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4)
   ->  Hash  (cost=170.00..170.00 rows=9 width=4)
         ->  Seq Scan on a  (cost=0.00..170.00 rows=9 width=4)
               Filter: (i < 10)
(6 rows)

set enable_predicate_propagation to on;
gpadmin=# explain select * from a,b where a.i = b.i and a.i < 10;
                          QUERY PLAN
---------------------------------------------------------------
 Nested Loop  (cost=0.00..341.24 rows=1 width=8)
   Join Filter: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..170.00 rows=9 width=4)
         Filter: (i < 10)
   ->  Materialize  (cost=0.00..170.04 rows=9 width=4)
         ->  Seq Scan on b  (cost=0.00..170.00 rows=9 width=4)
               Filter: (i < 10)
(7 rows)



-- high selectivity

set enable_predicate_propagation to off;
gpadmin=# explain select * from a,b where a.i = b.i and a.i > 10;
                            QUERY PLAN
-------------------------------------------------------------------
 Hash Join  (cost=270.00..577.36 rows=9990 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..170.00 rows=9990 width=4)
         Filter: (i > 10)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4)
         ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4)
(6 rows)

set enable_predicate_propagation to on;
gpadmin=# explain select * from a,b where a.i = b.i and a.i > 10;
                            QUERY PLAN
------------------------------------------------------------------
 Hash Join  (cost=294.88..602.14 rows=9980 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..170.00 rows=9990 width=4)
         Filter: (i > 10)
   ->  Hash  (cost=170.00..170.00 rows=9990 width=4)
         ->  Seq Scan on b  (cost=0.00..170.00 rows=9990 width=4)
               Filter: (i > 10)
(7 rows)

In general, if the selectivity of the implied qual is low, the extra cost for
relation scan is more likely to be compromised by the cost reduced in join, and
vise versa.

Thanks
Richard
Attachment

Re: Implement predicate propagation for non-equivalence clauses

From
Heikki Linnakangas
Date:
On 05/09/18 09:34, Richard Guo wrote:
> Hi,
> 
> As we know, current planner will generate additional restriction clauses 
> from
> equivalence clauses. This will generally lower the total cost because 
> some of
> tuples may be filtered out before joins.
> 
> In this patch, we are trying to do the similar deduction, from 
> non-equivalence
> clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
> restrictions on f.

I haven't read the patch in detail, but that really only applies under 
special circumstances. Tom caught me making that assumption just 
recently 
(https://www.postgresql.org/message-id/8003.1527092720%40sss.pgh.pa.us). 
I think the restriction here is that f(x) must be an operator that's in 
the same operator family as the = operator. In a quick read-through, 
it's not clear to me what conditions are in the patch now. Please have a 
comment somewhere to list them explicitly.

> This patch will introduce extra cost for relation scan, due to the
> cost of evaluating the new implied quals. Meanwhile, since the extra
> filter may reduce the number of tuples returned by the scan, it may
> lower the cost of following joins. So, whether we will get a better
> plan depends on the selectivity of the implied quals.
Perhaps we should evaluate the selectivity of the clause, and only add 
them if they seem helpful, based on the cost vs. selectivity?

At least in this case from the regression tests:

>  explain (costs off)
>    select * from ec0 a, ec1 b
>    where a.ff = b.ff and a.ff = 43::bigint::int8alias1;
> -                 QUERY PLAN                  
> ----------------------------------------------
> +                              QUERY PLAN                              
> +----------------------------------------------------------------------
>   Nested Loop
>     ->  Index Scan using ec0_pkey on ec0 a
>           Index Cond: (ff = '43'::int8alias1)
>     ->  Index Scan using ec1_pkey on ec1 b
>           Index Cond: (ff = a.ff)
> -         Filter: (f1 < '5'::int8alias1)
> +         Filter: ((f1 < '5'::int8alias1) AND (ff = '43'::int8alias1))
>  (6 rows)

the new qual is redundant with the Index Condition. If we could avoid 
generating such redundant quals, that would be good.

- Heikki


Re: Implement predicate propagation for non-equivalence clauses

From
Tom Lane
Date:
Richard Guo <riguo@pivotal.io> writes:
> In this patch, we are trying to do the similar deduction, from
> non-equivalence
> clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
> restrictions on f.

Uh, *what* restrictions on f()?  In general the above equivalence
does not hold, at least not for any data type more complicated than
integers; and we do not have any semantic model for deciding
which functions it would be correct for.

One simple example to show what I'm talking about is that float8 zero
and minus zero are equal according to float8eq (assuming IEEE float
arithmetic); but they aren't equivalent for any function f() that is
sensitive to the sign or the text representation of the value.
The numeric data type likewise has values that are "equal" without
being identical for all purposes, eg 0.0 vs 0.000.  Or consider
citext.

The existing planner deduction rules for equivalence classes are
carefully designed to ensure that we only generate derived clauses
using operators from the same operator class or family, so that
it's on the opclass author to ensure that the operators have consistent
semantics.  I don't think we can extrapolate from that to any random
function that accepts the datatype.

            regards, tom lane


Re: Implement predicate propagation for non-equivalence clauses

From
Richard Guo
Date:

On Wed, Sep 5, 2018 at 2:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 05/09/18 09:34, Richard Guo wrote:
Hi,

As we know, current planner will generate additional restriction clauses from
equivalence clauses. This will generally lower the total cost because some of
tuples may be filtered out before joins.

In this patch, we are trying to do the similar deduction, from non-equivalence
clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
restrictions on f.

I haven't read the patch in detail, but that really only applies under special circumstances. Tom caught me making that assumption just recently (https://www.postgresql.org/message-id/8003.1527092720%40sss.pgh.pa.us). I think the restriction here is that f(x) must be an operator that's in the same operator family as the = operator. In a quick read-through, it's not clear to me what conditions are in the patch now. Please have a comment somewhere to list them explicitly.

Right. Above all the operator in f(x) should be in the same opfamily as the equivalence class. We neglected that in this patch and it would result in wrong plan. In addition, it should not contain volatile functions or subplans. Will address this in v2 and list the conditions in comment. Thanks!



This patch will introduce extra cost for relation scan, due to the
cost of evaluating the new implied quals. Meanwhile, since the extra
filter may reduce the number of tuples returned by the scan, it may
lower the cost of following joins. So, whether we will get a better
plan depends on the selectivity of the implied quals.
Perhaps we should evaluate the selectivity of the clause, and only add them if they seem helpful, based on the cost vs. selectivity?

At least in this case from the regression tests:

 explain (costs off)
   select * from ec0 a, ec1 b
   where a.ff = b.ff and a.ff = 43::bigint::int8alias1;
-                 QUERY PLAN                  ----------------------------------------------
+                              QUERY PLAN                              +----------------------------------------------------------------------
  Nested Loop
    ->  Index Scan using ec0_pkey on ec0 a
          Index Cond: (ff = '43'::int8alias1)
    ->  Index Scan using ec1_pkey on ec1 b
          Index Cond: (ff = a.ff)
-         Filter: (f1 < '5'::int8alias1)
+         Filter: ((f1 < '5'::int8alias1) AND (ff = '43'::int8alias1))
 (6 rows)

the new qual is redundant with the Index Condition. If we could avoid generating such redundant quals, that would be good.

Nice point. I am not sure how complex to evaluate the selectivity of the new qual before applying it. But that deserves a try.



- Heikki

Re: Implement predicate propagation for non-equivalence clauses

From
Richard Guo
Date:
Hi,

On Wed, Sep 5, 2018 at 2:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <riguo@pivotal.io> writes:
> In this patch, we are trying to do the similar deduction, from
> non-equivalence
> clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
> restrictions on f.

Uh, *what* restrictions on f()?  In general the above equivalence
does not hold, at least not for any data type more complicated than
integers; and we do not have any semantic model for deciding
which functions it would be correct for.

Exactly! The operator in f() should be at least in the same opfamily as the equivalence class containing A,B.
Besides, as far as I can consider, the clause in f() should not contain volatile functions or subplans. Not sure
if these restrictions are enough to make it safe.


One simple example to show what I'm talking about is that float8 zero
and minus zero are equal according to float8eq (assuming IEEE float
arithmetic); but they aren't equivalent for any function f() that is
sensitive to the sign or the text representation of the value.
The numeric data type likewise has values that are "equal" without
being identical for all purposes, eg 0.0 vs 0.000.  Or consider
citext.

Thanks for the example. Heikki materialized this example as:

create table a (f float8);
create table b (f float8);

insert into a values ('0'), ('-0');
insert into b values ('0'), ('-0');

select * from a, b where a.f = b.f and a.f::text <> '-0';

And run that query, this patch would give wrong result. Will address this in v2.


The existing planner deduction rules for equivalence classes are
carefully designed to ensure that we only generate derived clauses
using operators from the same operator class or family, so that
it's on the opclass author to ensure that the operators have consistent
semantics.  I don't think we can extrapolate from that to any random
function that accepts the datatype. 
 
 

                        regards, tom lane

Re: Implement predicate propagation for non-equivalence clauses

From
Richard Guo
Date:
Hi,

Thanks everyone for reviewing. We updated the patch and added more strict
checks for the non-equivalence clauses before deducing new quals, including:

1) The non-equivalence clause for deduction can only be type of OpExpr or
ScalarArrayOpExpr, with two arguments.

2) The operator of the non-equivalence clause must be in the eclass's operator
family, if it intends to deduce new quals based on the eclass.

3) The expression in the non-equivalence clause must be equal to the
EquivalenceMember, if we want to replace that expression to get a new qual.

Thanks
Richard

On Thu, Sep 6, 2018 at 12:01 PM Richard Guo <riguo@pivotal.io> wrote:
Hi,

On Wed, Sep 5, 2018 at 2:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <riguo@pivotal.io> writes:
> In this patch, we are trying to do the similar deduction, from
> non-equivalence
> clauses, that is, A=B AND f(A) implies A=B AND f(A) and f(B), under some
> restrictions on f.

Uh, *what* restrictions on f()?  In general the above equivalence
does not hold, at least not for any data type more complicated than
integers; and we do not have any semantic model for deciding
which functions it would be correct for.

Exactly! The operator in f() should be at least in the same opfamily as the equivalence class containing A,B.
Besides, as far as I can consider, the clause in f() should not contain volatile functions or subplans. Not sure
if these restrictions are enough to make it safe.


One simple example to show what I'm talking about is that float8 zero
and minus zero are equal according to float8eq (assuming IEEE float
arithmetic); but they aren't equivalent for any function f() that is
sensitive to the sign or the text representation of the value.
The numeric data type likewise has values that are "equal" without
being identical for all purposes, eg 0.0 vs 0.000.  Or consider
citext.

Thanks for the example. Heikki materialized this example as:

create table a (f float8);
create table b (f float8);

insert into a values ('0'), ('-0');
insert into b values ('0'), ('-0');

select * from a, b where a.f = b.f and a.f::text <> '-0';

And run that query, this patch would give wrong result. Will address this in v2.


The existing planner deduction rules for equivalence classes are
carefully designed to ensure that we only generate derived clauses
using operators from the same operator class or family, so that
it's on the opclass author to ensure that the operators have consistent
semantics.  I don't think we can extrapolate from that to any random
function that accepts the datatype. 
 
 

                        regards, tom lane

Attachment

Re: Implement predicate propagation for non-equivalence clauses

From
Alexander Kuzmenkov
Date:
Hi Richard,

I took a look at the v2, here are some comments:

* This feature needs tests, especially for the cases where opfamilies or 
data types or collations don't match, and other non-obvious cases where 
it shouldn't work.


* Deducing an inequality to a constant is not always helpful. If we know 
that a = b and a = const1 and b < const2, we will deduce b = const1 from 
the EC, and adding b < const2 doesn't improve selectivity and only makes 
the cost estimate worse. One situation where it does make sense is when 
we can detect contradictions in these clauses, e.g. if we know that 
const1 > const2 and therefore know that the above selection clause is 
always false. Looking at the regression test changes, I see that v2 
doesn't do that. I think the handling of deduced inequalities shoud be 
modeled on the flow of generate_base_implied_equalities -> 
process_implied_equality -> distribute_qual_to_rels. This would allow us 
to correctly handle a deduced const1 < const2 clause and turn it into a 
gating Result qual.


* There are functions named generate_implied_quals, gen_implied_quals 
and gen_implied_qual. The names are confusingly similar, we could use 
something like generate_implied_quals_for_clause and 
generate_one_implied_qual for the latter two.


@@ gen_implied_quals
     else if (clause && IsA(clause, ScalarArrayOpExpr))
     {
* When can the clause be NULL?


@@ gen_implied_quals
     item1 = canonicalize_ec_expression(item1,
                                        exprType((Node *) item1),
                                        collation);
     item2 = canonicalize_ec_expression(item2,
                                        exprType((Node *) item2),
                                        collation);
* Why do we do this? If the collation or type of the original expression 
isn't right and we have to add RelabelType, the resulting expression 
won't match the original clause and we won't be able to substitute it 
with other equavalence members. So we can just check that the type and 
collation match.


* In gen_implied_quals, I'd rename item1 => left, item2 => right, em1 => 
orig_em, em2 => other_em, and same for list cells and types. As it is 
now, em1 can actually match both item1 and item2, and em2 is not related 
to either of them, so it took me some effort to understand what's going on.


* In gen_implied_qual, why do we search the clause for a matching 
subexpression? Reading the thread, I thought that we can only do the 
substitution for OpExprs of the same opfamilies as the generating EC. 
This code looks like it can do the substitution an at arbitrary depth, 
so we might change an argument of some unsuitable function, and the 
result will not be correct. What we should probably do is that after we 
matched one side of OpExpr to one EC member, we just replace it with 
another suitable member and add the resulting clause.


@@ gen_implied_qual
     check_mergejoinable(new_rinfo);
     check_hashjoinable(new_rinfo);
...
     /*
      * If the clause has a mergejoinable operator, set the EquivalenceClass
      * links. Otherwise, a mergejoinable operator with NULL 
left_ec/right_ec
      * will cause update_mergeclause_eclasses fails at assertion.
      */
     if (new_rinfo->mergeopfamilies)
         initialize_mergeclause_eclasses(root, new_rinfo);
* It's not an equality clause, so it is not going to be mergejoinable 
nor hashjoinable and we can skip these checks.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Implement predicate propagation for non-equivalence clauses

From
Andres Freund
Date:
Hi,

On 2018-11-22 19:15:42 +0300, Alexander Kuzmenkov wrote:
> Hi Richard,
> 
> I took a look at the v2, here are some comments:
> 
> * This feature needs tests, especially for the cases where opfamilies or
> data types or collations don't match, and other non-obvious cases where it
> shouldn't work.
> 
> 
> * Deducing an inequality to a constant is not always helpful. If we know
> that a = b and a = const1 and b < const2, we will deduce b = const1 from the
> EC, and adding b < const2 doesn't improve selectivity and only makes the
> cost estimate worse. One situation where it does make sense is when we can
> detect contradictions in these clauses, e.g. if we know that const1 > const2
> and therefore know that the above selection clause is always false. Looking
> at the regression test changes, I see that v2 doesn't do that. I think the
> handling of deduced inequalities shoud be modeled on the flow of
> generate_base_implied_equalities -> process_implied_equality ->
> distribute_qual_to_rels. This would allow us to correctly handle a deduced
> const1 < const2 clause and turn it into a gating Result qual.
> 
> 
> * There are functions named generate_implied_quals, gen_implied_quals and
> gen_implied_qual. The names are confusingly similar, we could use something
> like generate_implied_quals_for_clause and generate_one_implied_qual for the
> latter two.
> 
> 
> @@ gen_implied_quals
>     else if (clause && IsA(clause, ScalarArrayOpExpr))
>     {
> * When can the clause be NULL?
> 
> 
> @@ gen_implied_quals
>     item1 = canonicalize_ec_expression(item1,
>                                        exprType((Node *) item1),
>                                        collation);
>     item2 = canonicalize_ec_expression(item2,
>                                        exprType((Node *) item2),
>                                        collation);
> * Why do we do this? If the collation or type of the original expression
> isn't right and we have to add RelabelType, the resulting expression won't
> match the original clause and we won't be able to substitute it with other
> equavalence members. So we can just check that the type and collation match.
> 
> 
> * In gen_implied_quals, I'd rename item1 => left, item2 => right, em1 =>
> orig_em, em2 => other_em, and same for list cells and types. As it is now,
> em1 can actually match both item1 and item2, and em2 is not related to
> either of them, so it took me some effort to understand what's going on.
> 
> 
> * In gen_implied_qual, why do we search the clause for a matching
> subexpression? Reading the thread, I thought that we can only do the
> substitution for OpExprs of the same opfamilies as the generating EC. This
> code looks like it can do the substitution an at arbitrary depth, so we
> might change an argument of some unsuitable function, and the result will
> not be correct. What we should probably do is that after we matched one side
> of OpExpr to one EC member, we just replace it with another suitable member
> and add the resulting clause.
> 
> 
> @@ gen_implied_qual
>     check_mergejoinable(new_rinfo);
>     check_hashjoinable(new_rinfo);
> ...
>     /*
>      * If the clause has a mergejoinable operator, set the EquivalenceClass
>      * links. Otherwise, a mergejoinable operator with NULL left_ec/right_ec
>      * will cause update_mergeclause_eclasses fails at assertion.
>      */
>     if (new_rinfo->mergeopfamilies)
>         initialize_mergeclause_eclasses(root, new_rinfo);
> * It's not an equality clause, so it is not going to be mergejoinable nor
> hashjoinable and we can skip these checks.

As this review has not been addressed, I'm marking this patch as
returned with feedback for now. Please resubmit once these are
addressed!

Greetings,

Andres Freund