Thread: [PATCH] Erase the distinctClause if the result is unique by definition

[PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Hi:

I wrote a patch to erase the distinctClause if the result is unique by
definition. I noticed this because a user migrated some SQL from Oracle
to PG and found the performance was bad due to this, so I adapted PG to
handle it as well.

This patch doesn't help well-written SQL, but a drawback like this in a
query may not be very obvious, and since the cost of the check is pretty
low, I think it would be ok to add.

Please see the patch for details.   

Thank you. 
Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
I have updated the patch to consider semi/anti joins.

Can anyone help to review this patch?  

Thanks


Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:
Hi Andy,
What might help is to add more description to your email message like giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.
+create table select_distinct_a(a int, b char(20),  c char(20) not null,  d int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+          QUERY PLAN          
+-------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped because (a, b) is a primary key. Is my understanding correct?

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems:
1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no longer be valid and thus execution may produce non-distinct results. PostgreSQL has a similar concept of allowing non-grouping expressions as part of the targetlist when those expressions can be proved to be functionally dependent on the GROUP BY clause. See check_functional_grouping() and its caller; a minimal sketch follows this list. I think DISTINCT elimination should work along similar lines.
2. For the same reason described in check_functional_grouping(), using unique indexes for eliminating DISTINCT should be discouraged.
3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY as well.
4. The patch works only at the query level, but that functionality can be expanded generally to other places which add Unique/HashAggregate/Group nodes if the underlying relation can be proved to produce distinct rows. But that's probably more work since we will have to label paths with unique keys similar to pathkeys.
5. Have you tested this with OUTER joins, which can render the inner side nullable?
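
A minimal sketch of the functional-dependency behaviour mentioned in point 1 (the table fd_t here is hypothetical, only for illustration):

CREATE TABLE fd_t (a int PRIMARY KEY, b int);
-- b is not in the GROUP BY, but this is accepted because b is functionally
-- dependent on the primary key a; check_functional_grouping() proves the
-- dependency and records the pk constraint, so e.g. a view over this query
-- ends up depending on the constraint itself.
SELECT a, b FROM fd_t GROUP BY a;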



--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Hi Ashutosh:
   Thanks for your time. 

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

From this example, it seems that the distinct operation can be dropped because (a, b) is a primary key. Is my understanding correct?

Yes, you are correct. Actually I added them to the commit message,
but it's true that I should have copied them into this email body as
well, so I am copying them now:

[PATCH] Erase the distinctClause if the result is unique by
 definition

For a single relation, we can tell the result is unique if any one of
the following is true:
1. The pk is in the target list.
2. The uk is in the target list and its columns are not null.
3. The columns in the group-by clause are all in the target list.

For a relation join, we can tell it as follows:
if every relation in the jointree yields a unique result set, then
the final result is unique as well, regardless of the join method.
For semi/anti joins, we ignore the right table.
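
To make the rules concrete, here are some sketches against the test table above (the unique constraint in the second one is an assumption added only for illustration):

-- rule 1: the pk (a, b) is in the target list, so DISTINCT is a no-op
SELECT DISTINCT a, b, d FROM select_distinct_a;

-- rule 2: a unique key over a NOT NULL column in the target list
ALTER TABLE select_distinct_a ADD CONSTRAINT uk_c UNIQUE (c);
SELECT DISTINCT c FROM select_distinct_a;

-- rule 3: the group-by columns all appear in the target list, so the
-- grouping step has already made the rows unique
SELECT DISTINCT d, e FROM select_distinct_a GROUP BY d, e;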

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems:
1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no longer be valid and thus execution may produce non-distinct results.

Will this still be an issue if the user doesn't use a "read uncommitted"
isolation level? I suppose it should be ok in that case. But even so,
I should add an isolation level check for this. I just added that to the
patch to continue the discussion of this issue.
 
PostgreSQL has a similar concept of allowing non-grouping expressions as part of the targetlist when those expressions can be proved to be functionally dependent on the GROUP BY clause. See check_functional_grouping() and its caller. I think DISTINCT elimination should work along similar lines.
2. For the same reason described in check_functional_grouping(), using unique indexes for eliminating DISTINCT should be discouraged.
 
I checked the comments of check_functional_grouping; the reason is:

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually I am doubtful about the pg_constraint argument, since we are still
able to get the not-null information from relation->rd_attr->attrs[n].attnotnull,
which is just what this patch does.
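
For example, the catalog exposes that information directly, independent of pg_constraint (a quick check against the test table):

-- attnotnull reflects NOT NULL for user columns even though there is
-- no corresponding pg_constraint row
SELECT attname, attnotnull
FROM pg_attribute
WHERE attrelid = 'select_distinct_a'::regclass AND attnum > 0;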

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY as well

This is a good point. The rules may differ somewhat for joins, so I prefer
to focus on the current case for now.
 
4. The patch works only at the query level, but that functionality can be expanded generally to other places which add Unique/HashAggregate/Group nodes if the underlying relation can be proved to produce distinct rows. But that's probably more work since we will have to label paths with unique keys similar to pathkeys.
 
Do you mean adding some information to PlannerInfo, so that when we would
create a node for Unique/HashAggregate/Group, we could just create a dummy node?
 
5. Have you tested this OUTER joins, which can render inner side nullable?

Yes, that part was missed in the test cases. I have just added it.

Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:


On Sat, Feb 8, 2020 at 12:53 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Hi Ashutosh:
   Thanks for your time. 


[PATCH] Erase the distinctClause if the result is unique by
 definition

I forgot to mention this in the last round of comments. Your patch was actually removing the distinctClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it.
 


However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no longer be valid and thus execution may produce non-distinct results.

Will this still be an issue if the user doesn't use a "read uncommitted"
isolation level? I suppose it should be ok in that case. But even so,
I should add an isolation level check for this. I just added that to the
patch to continue the discussion of this issue.

In PostgreSQL there's no "read uncommitted". But that doesn't matter since a query can be prepared outside a transaction and executed within one or more subsequent transactions.
 
 
PostgreSQL has a similar concept of allowing non-grouping expressions as part of the targetlist when those expressions can be proved to be functionally dependent on the GROUP BY clause. See check_functional_grouping() and its caller. I think DISTINCT elimination should work along similar lines.
2. For the same reason described in check_functional_grouping(), using unique indexes for eliminating DISTINCT should be discouraged.
 
I checked the comments of check_functional_grouping; the reason is:

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually I am doubtful about the pg_constraint argument, since we are still
able to get the not-null information from relation->rd_attr->attrs[n].attnotnull,
which is just what this patch does.

The problem isn't whether not-null-ness can be inferred or not; the problem is whether it can be guaranteed across planning and execution of a query (prepare and execute, for example). The constraintDep machinery registers the constraints used for preparing a plan and invalidates the plan if any of those constraints change after the plan is created.
 

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY as well

This is a good point. The rules may differ somewhat for joins, so I prefer
to focus on the current case for now.

I doubt that, since DISTINCT is ultimately carried out as a grouping operation. But anyway, I won't hang upon that.
 
4. The patch works only at the query level, but that functionality can be expanded generally to other places which add Unique/HashAggregate/Group nodes if the underlying relation can be proved to produce distinct rows. But that's probably more work since we will have to label paths with unique keys similar to pathkeys.
 
Do you mean adding some information to PlannerInfo, so that when we would
create a node for Unique/HashAggregate/Group, we could just create a dummy node?

Not so much PlannerInfo as something along the lines of PathKey. See the PathKey structure and related code. What I envision is that the PathKey class is also annotated with information about whether that PathKey implies uniqueness. E.g. a PathKey derived from a primary index would imply uniqueness; a PathKey derived from, say, a Group operation also implies uniqueness. Then just by looking at the underlying Path we would be able to say whether we need a Group/Unique node on top of it or not. I think that would give it a much wider use case and make it a very useful optimization.

--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
>> On Sat, Feb 8, 2020 at 12:53 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>> Do you mean adding some information to PlannerInfo, so that when we would
>> create a node for Unique/HashAggregate/Group, we could just create a dummy node?

> Not so much PlannerInfo as something along the lines of PathKey. See the
> PathKey structure and related code. What I envision is that the PathKey
> class is also annotated with information about whether that PathKey implies
> uniqueness. E.g. a PathKey derived from a primary index would imply
> uniqueness; a PathKey derived from, say, a Group operation also implies
> uniqueness. Then just by looking at the underlying Path we would be able to
> say whether we need a Group/Unique node on top of it or not. I think that
> would give it a much wider use case and make it a very useful optimization.

FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness.  PathKeys are about sorting,
but it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.
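
For instance (a sketch, reusing the test table from earlier in the thread):

-- with sorting disabled the planner still guarantees uniqueness via a
-- HashAggregate, whose output is unique but carries no sort order
SET enable_sort = off;
EXPLAIN (COSTS OFF) SELECT DISTINCT d FROM select_distinct_a;
--  HashAggregate
--    Group Key: d
--    ->  Seq Scan on select_distinct_a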

I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.

            regards, tom lane



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:


On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:



[PATCH] Erase the distinctClause if the result is unique by
 definition

I forgot to mention this in the last round of comments. Your patch was actually removing the distinctClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it.

Yes, I removed it because that was the easiest way to do it. What is the purpose of keeping the evidence?
 
 


However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no more be valid and thus execution may produce non-distinct results.

Will this still be an issue if user use doesn't use a "read uncommitted" 
isolation level?  I suppose it should be ok for this case.  But even though
I should add an isolation level check for this.  Just added that in the patch
to continue discussing of this issue. 

In PostgreSQL there's no "read uncommitted".

Thanks for the hint; I just noticed that read uncommitted is treated as read
committed in PostgreSQL.
 
But that doesn't matter since a query can be prepared outside a transaction and executed within one or more subsequent transactions.
 
I would suppose that after a DDL, the prepared statement has to be
re-parsed/re-planned if it is not currently executing; otherwise it will
prevent the DDL from happening.

The following is my test. 

postgres=# create table t (a int primary key, b int not null,  c int);
CREATE TABLE
postgres=# insert into t values(1, 1, 1), (2, 2, 2);
INSERT 0 2
postgres=# create unique index t_idx1 on t(b);
CREATE INDEX

postgres=# prepare st as select distinct b from t where c = $1;
PREPARE
postgres=# explain execute st(1);
                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = 1)
(2 rows)
...
postgres=# explain execute st(1);
                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = $1)
(2 rows)

-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE

-- session 1:
postgres=# explain execute st(1);
                         QUERY PLAN
-------------------------------------------------------------
 Unique  (cost=1.03..1.04 rows=1 width=4)
   ->  Sort  (cost=1.03..1.04 rows=1 width=4)
         Sort Key: b
         ->  Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
               Filter: (c = $1)
(5 rows)

-- session 2
postgres=# insert into t values (3, null, 3), (4, null, 3);
INSERT 0 2

-- session 1
postgres=# execute st(3);
 b
---

(1 row)

and if we prepare sql outside a transaction, and execute it in the 
transaction, the other session can't drop the constraint until the 
transaction is ended. 

 

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Julien Rouhaud
Date:
On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote:
> and if we prepare sql outside a transaction, and execute it in the
> transaction, the other session can't drop the constraint until the
> transaction is ended.

And what if you create a view on top of a query containing a distinct clause
rather than using prepared statements?  FWIW your patch doesn't handle such a
case at all, even without needing to drop constraints:

CREATE TABLE t (a int primary key, b int not null,  c int);
INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

I also think this is not the right way to handle this optimization.



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:


On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

And what if you create a view on top of a query containing a distinct clause
rather than using prepared statements?  FWIW your patch doesn't handle such a
case at all, even without needing to drop constraints:
 
CREATE TABLE t (a int primary key, b int not null,  c int);
INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


Thanks for pointing it out. This is unexpected based on my current
understanding; I will check it.
 
I also think this is not the right way to handle this optimization.
 
I started looking at query_is_distinct_for when Tom pointed it out, but I
don't fully understand the context yet. I will look into your finding as well.

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:


On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> and if we prepare sql outside a transaction, and execute it in the
> transaction, the other session can't drop the constraint until the
> transaction is ended.

And what if you create a view on top of a query containing a distinct clause
rather than using prepared statements?  FWIW your patch doesn't handle such a
case at all, even without needing to drop constraints:

CREATE TABLE t (a int primary key, b int not null,  c int);
INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


This error can be fixed with

-       num_of_rtables = bms_num_members(non_semi_anti_relids);
+       num_of_rtables = list_length(query->rtable);

This test case is also added into the patch.
 
I also think this is not the right way to handle this optimization.

Do you have any other concerns?
Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Julien Rouhaud
Date:
On Tue, Feb 11, 2020 at 08:14:14PM +0800, Andy Fan wrote:
> This error can be fixed with
> 
> -       num_of_rtables = bms_num_members(non_semi_anti_relids);
> +       num_of_rtables = list_length(query->rtable);
> 
> This test case is also added into the patch.
> 
> 
> > I also think this is not the right way to handle this optimization.
> >
> 
> Do you have any other concerns?

Yes, it seems to be broken as soon as you alter the view's underlying table:

=# CREATE TABLE t (a int primary key, b int not null,  c int);
CREATE TABLE

=# INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2);
INSERT 0 2

=# CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE INDEX

=# CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
CREATE VIEW

=# EXPLAIN SELECT * FROM v1;
                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=2 width=4)
(1 row)

=# EXPLAIN SELECT DISTINCT b FROM t;
                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=2 width=4)
(1 row)

=# ALTER TABLE t ALTER COLUMN b DROP NOT NULL;
ALTER TABLE

=# EXPLAIN SELECT * FROM v1;
                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=2 width=4)
(1 row)

=# EXPLAIN SELECT DISTINCT b FROM t;
                         QUERY PLAN
-------------------------------------------------------------
 Unique  (cost=1.03..1.04 rows=2 width=4)
   ->  Sort  (cost=1.03..1.03 rows=2 width=4)
         Sort Key: b
         ->  Seq Scan on t  (cost=0.00..1.02 rows=2 width=4)
(4 rows)




Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:


On Mon, Feb 10, 2020 at 10:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness.  PathKeys are about sorting,
but it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.

I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.

Thanks for the pointer. I think there's another problem with my approach. PathKeys are specific to paths, since the order of the result depends upon the Path. But uniqueness is a property of the result, i.e. the relation, and thus should be attached to RelOptInfo, as query_is_distinct_for() does. I think uniqueness should bubble up the RelOptInfo tree, annotating each RelOptInfo with the minimum set of TLEs which make the result from that relation unique. Thus we could eliminate an extra Group/Unique node if the underlying RelOptInfo's unique column set is a subset of the required uniqueness.
--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:


On Tue, Feb 11, 2020 at 8:27 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:


On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:



[PATCH] Erase the distinctClause if the result is unique by
 definition

I forgot to mention this in the last round of comments. Your patch was actually removing the distinctClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it.

Yes, I removed it because that was the easiest way to do it. What is the purpose of keeping the evidence?

Julien's example provides an explanation for this. The Query structure is serialised into a view definition. Removing distinctClause from there means that the view will never try to produce unique results.
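
For example, with the v1 view from Julien's test (a sketch):

-- the view text shown to users is deparsed from the stored Query tree;
-- if the distinctClause were erased there, the DISTINCT would be gone
SELECT pg_get_viewdef('v1'::regclass, true);
-- output keeps the DISTINCT, roughly: SELECT DISTINCT b FROM t;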
 
 

I would suppose that after a DDL, the prepared statement has to be
re-parsed/re-planned if it is not currently executing; otherwise it will
prevent the DDL from happening.

The query will be replanned. I am not sure about reparsed though.
 


-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE

-- session 1:
postgres=# explain execute st(1);
                         QUERY PLAN
-------------------------------------------------------------
 Unique  (cost=1.03..1.04 rows=1 width=4)
   ->  Sort  (cost=1.03..1.04 rows=1 width=4)
         Sort Key: b
         ->  Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
               Filter: (c = $1)
(5 rows)

Since this prepared statement is parameterised, PostgreSQL is replanning it every time it gets executed; it's not using a stored prepared plan. Try without parameters. Also make sure that a prepared plan is used for execution and not a new plan.
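
Something along these lines (a sketch, reusing the table t from the test above):

-- an unparameterised prepared statement is planned once and the stored
-- plan is reused, so a DISTINCT elimination baked into that plan must be
-- invalidated when the NOT NULL constraint goes away
PREPARE st2 AS SELECT DISTINCT b FROM t;
EXPLAIN EXECUTE st2;   -- same stored plan on every execution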
--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Julien Rouhaud
Date:
On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote:
> Julien's example provides an explanation for this. The Query structure is
> serialised into a view definition. Removing distinctClause from there means
> that the view will never try to produce unique results.

And also I think that this approach will have a lot of other unexpected side
effects.  Isn't changing the Query going to affect pg_stat_statements queryid
computation, for instance?



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
On Thu, Feb 13, 2020 at 5:39 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
And also I think that this approach will have a lot of other unexpected side
effects.  Isn't changing the Query going to affect pg_stat_statements queryid
computation, for instance?

Thanks, the two factors above are pretty convincing; so erasing the
distinctClause is not reasonable. I will try another way.

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:


On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:


I forgot to mention this in the last round of comments. Your patch was actually removing the distinctClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it.

Yes, I removed it because that was the easiest way to do it. What is the purpose of keeping the evidence?

Julien's example provides an explanation for this. The Query structure is serialised into a view definition. Removing distinctClause from there means that the view will never try to produce unique results.
 
 
Actually that is not true. If a view is used in a query, its definition is *copied*
into the query tree, so if we modify the query tree, the stored definition of the
view is never touched. The issue Julien reported was caused by a typo.


Since this prepared statement is parameterised, PostgreSQL is replanning it every time it gets executed; it's not using a stored prepared plan. Try without parameters. Also make sure that a prepared plan is used for execution and not a new plan.

Even for a parameterised prepared statement, it is still possible to generate a generic
plan, so it will not be replanned every time. But generic plan or not, after a DDL change
like dropping the NOT NULL constraint, PG will generate a plan based on the stored query
tree. However, the query tree is *copied* again to generate the new plan, so even if I
modified the copied tree, everything would still be ok.
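
On recent releases this behaviour can be pinned down explicitly; a sketch (plan_cache_mode exists in PostgreSQL 12 and later, and st3 is just an illustrative name):

-- force the generic plan so it is built once and then reused verbatim;
-- this is the case where a stale uniqueness assumption would be dangerous
SET plan_cache_mode = force_generic_plan;
PREPARE st3 AS SELECT DISTINCT b FROM t WHERE c = $1;
EXPLAIN EXECUTE st3(1);   -- generic plan from the first execution onwards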

In any case, I agree that modifying the query tree is not a good idea,
so my updated patch doesn't do that any more.

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Hi All:

Here is the updated patch. It uses some functions from query_is_distinct_for.
I check the query's distinctness in create_distinct_paths; if it is distinct
already, we don't generate the distinct paths for it, so the query tree is
left untouched.

Please see if you have any comments. Thanks

Attachment
Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Tom Lane
Date:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> Please see if you have any comments.   Thanks

The cfbot isn't at all happy with this.  Its linux build is complaining
about a possibly-uninitialized variable, and then giving up:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993

The Windows build isn't using -Werror, but it is crashing in at least
two different spots in the regression tests:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778

I've not attempted to identify the cause of that.

At a high level, I'm a bit disturbed that this focuses only on DISTINCT
and doesn't (appear to) have any equivalent intelligence for GROUP BY,
though surely that offers much the same opportunities for optimization.
It seems like it'd be worthwhile to take a couple steps back and see
if we couldn't recast the logic to work for both.

Some other random comments:

* Don't especially like the way you broke query_is_distinct_for()
into two functions, especially when you then introduced a whole
lot of other code in between.  That's just making reviewer's job
harder to see what changed.  It makes the comments a bit disjointed
too, that is where you even had any.  (Zero introductory comment
for query_is_distinct_agg is *not* up to project coding standards.
There are a lot of other undercommented places in this patch, too.)

* Definitely don't like having query_distinct_through_join re-open
all the relations.  The data needed for this should get absorbed
while plancat.c has the relations open at the beginning.  (Non-nullness
of columns, in particular, seems like it'll be useful for other
purposes; I'm a bit surprised the planner isn't using that already.)

* In general, query_distinct_through_join seems hugely messy, expensive,
and not clearly correct.  If it is correct, the existing comments sure
aren't enough to explain what it is doing or why.

* Not entirely convinced that a new GUC is needed for this, but if
it is, you have to document it.

* I wouldn't bother with bms_array_free(), nor with any of the other
cleanup you've got at the bottom of query_distinct_through_join.
The planner leaks *lots* of memory, and this function isn't going to
be called so many times that it'll move the needle.

* There seem to be some pointless #include additions, eg in planner.c
the added code doesn't look to justify any of them.  Please also
avoid unnecessary whitespace changes, those also slow down reviewing.

* I see you decided to add a new regression test file select_distinct_2.
That's a poor choice of name because it conflicts with our rules for the
naming of alternative output files.  Besides which, you forgot to plug
it into the test schedule files, so it isn't actually getting run.
Is there a reason not to just add the new test cases to select_distinct?

* There are some changes in existing regression cases that aren't
visibly related to the stated purpose of the patch, eg it now
notices that "select distinct max(unique2) from tenk1" doesn't
require an explicit DISTINCT step.  That's not wrong, but I wonder
if maybe you should subdivide this patch into more than one patch,
because that must be coming from some separate change.  I'm also
wondering what caused the plan change in expected/join.out.

            regards, tom lane



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Thank you Tom for the review!

On Mon, Mar 2, 2020 at 4:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> Please see if you have any comments.   Thanks

The cfbot isn't at all happy with this.  Its linux build is complaining
about a possibly-uninitialized variable, and then giving up:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993

The Windows build isn't using -Werror, but it is crashing in at least
two different spots in the regression tests:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778

I've not attempted to identify the cause of that.

 
Before I submitted the patch, I made sure "make check-world" was successful, but
since the compile options are not the same, I didn't catch the possibly-uninitialized
variable. As for the crash on Windows, I don't have enough information yet;
I will find a Windows server and reproduce the cases.

I just found the link http://commitfest.cputube.org/ this morning; I will make sure
the next patch can pass that test.


At a high level, I'm a bit disturbed that this focuses only on DISTINCT
and doesn't (appear to) have any equivalent intelligence for GROUP BY,
though surely that offers much the same opportunities for optimization.
It seems like it'd be worthwhile to take a couple steps back and see
if we couldn't recast the logic to work for both.


OK. GROUP BY looks a bit harder than DISTINCT because of aggregate functions.
I will go through the code to see where to add this logic.
 
Some other random comments:

* Don't especially like the way you broke query_is_distinct_for()
into two functions, especially when you then introduced a whole
lot of other code in between. 

I didn't expect this until you pointed it out. In this case I had to
break query_is_distinct_for into two functions, but it is true that the
two functions should be kept together.


That's just making reviewer's job
harder to see what changed.  It makes the comments a bit disjointed
too, that is where you even had any.  (Zero introductory comment
for query_is_distinct_agg is *not* up to project coding standards.
There are a lot of other undercommented places in this patch, too.)

* Definitely don't like having query_distinct_through_join re-open
all the relations.  The data needed for this should get absorbed
while plancat.c has the relations open at the beginning.  (Non-nullness
of columns, in particular, seems like it'll be useful for other
purposes; I'm a bit surprised the planner isn't using that already.)

I can add new attributes to RelOptInfo and fill in the values in the
get_relation_info call.
 
* In general, query_distinct_through_join seems hugely messy, expensive,
and not clearly correct.  If it is correct, the existing comments sure
aren't enough to explain what it is doing or why. 
 
Removing the relation_open calls will make it a bit simpler; I will add more
comments to make it clearer in the following patch.


* There seem to be some pointless #include additions, eg in planner.c
the added code doesn't look to justify any of them.  Please also
avoid unnecessary whitespace changes, those also slow down reviewing.


That may be because I added the header file at one point and then refactored
the code later but forgot to remove the header file accordingly. Do we need
to rely on experience to tell whether a header file is needed or not, or do
we have any tool to tell it automatically?
 
* I see you decided to add a new regression test file select_distinct_2.
That's a poor choice of name because it conflicts with our rules for the
naming of alternative output files.  Besides which, you forgot to plug
it into the test schedule files, so it isn't actually getting run.
Is there a reason not to just add the new test cases to select_distinct?


Adding it to select_distinct.sql is ok for me as well. Actually I have no
obvious reason to add the new file.
 
* There are some changes in existing regression cases that aren't
visibly related to the stated purpose of the patch, eg it now
notices that "select distinct max(unique2) from tenk1" doesn't
require an explicit DISTINCT step.  That's not wrong, but I wonder
if maybe you should subdivide this patch into more than one patch,
because that must be coming from some separate change.  I'm also
wondering what caused the plan change in expected/join.out.

In my view it should be in the same patch; the logic here is that we
have DISTINCT in the SQL and the query is distinct already because of the max
function (the rule is defined in query_is_distinct_agg, which is split out from
the original query_is_distinct_for).


                        regards, tom lane


Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
On Tue, Mar 3, 2020 at 1:24 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Thank you Tom for the review! 

On Mon, Mar 2, 2020 at 4:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

At a high level, I'm a bit disturbed that this focuses only on DISTINCT
and doesn't (appear to) have any equivalent intelligence for GROUP BY,
though surely that offers much the same opportunities for optimization.
It seems like it'd be worthwhile to take a couple steps back and see
if we couldn't recast the logic to work for both.


OK. GROUP BY looks a bit harder than DISTINCT because of aggregate functions.
I will go through the code to see where to add this logic.
 

Can we guarantee any_aggr_func(a) == a if only 1 row is returned? If so, we can do
some work on the pathtarget/reltarget by transforming the Aggref to a raw expr.
I checked the execution path of the aggregate call; it looks like it depends on
the Agg node, which is the very thing we want to remove.

* There are some changes in existing regression cases that aren't
visibly related to the stated purpose of the patch, eg it now
notices that "select distinct max(unique2) from tenk1" doesn't
require an explicit DISTINCT step.  That's not wrong, but I wonder
if maybe you should subdivide this patch into more than one patch,
because that must be coming from some separate change.  I'm also
wondering what caused the plan change in expected/join.out.

In my view it should be in the same patch; the logic here is that we
have DISTINCT in the SQL and the query is distinct already because of the max
function (the rule is defined in query_is_distinct_agg, which is split out from
the original query_is_distinct_for).

I thought I was right until I came to contrib/postgres_fdw/sql/postgres_fdw.sql.
Per my understanding, the result of "select max(a) from t" is unique because of
the aggregate function and the absence of a GROUP BY clause. But in the
postgres_fdw.sql case, Query->hasAggs is true for "select distinct
(select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where t1.c1 = 6)
from ft2 t2 where t2.c2 % 6 = 0 order by 1;"  This looks very strange to me.
Is my understanding wrong, or is there a bug here?

query->hasAggs was set to true in the following call stack:

 pstate->p_hasAggs = true;     

.. 

 qry->hasAggs =  pstate->p_hasAggs;    


0  in check_agglevels_and_constraints of parse_agg.c:343
1  in transformAggregateCall of parse_agg.c:236
2  in ParseFuncOrColumn of parse_func.c:805
3  in transformFuncCall of parse_expr.c:1558
4  in transformExprRecurse of parse_expr.c:265
5  in transformExpr of parse_expr.c:155
6  in transformTargetEntry of parse_target.c:105
7  in transformTargetList of parse_target.c:193
8  in transformSelectStmt of analyze.c:1224
9  in transformStmt of analyze.c:301

You can see the new updated patch, which should fix all the issues you pointed out
except the one about supporting GROUP BY. Another reason this patch will not be
the final one is that the changes to postgres_fdw.out are too arbitrary; I am
uploading it now just for reference. (The newly introduced GUC variable can be
removed at the end; keeping it for now just makes testing easier.)



We can't guarantee any_aggr_func(a) == a when only 1 row is returned, so the above
method doesn't work. Do you have any suggestions for this?
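
To see why it doesn't work, a minimal sketch:

-- even with exactly one input row, agg(a) is not always a:
SELECT a, count(*), min(a) FROM (VALUES (5)) AS v(a) GROUP BY a;
--  a | count | min
-- ---+-------+-----
--  5 |     1 |   5      -- count(*) yields 1, not 5; only min/max pass through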
Attachment
Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Uploading the newest patch so that the cfbot can pass. The last patch failed
because some EXPLAINs were run without (costs off).

I'm still working out how to handle aggregate calls without an aggregation
path. Probably we can get there by hacking the ExprEvalPushStep for the
Aggref node. But since the current patch is not tied closely to this, I am
putting it up for review first.



Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Sat, 7 Mar 2020 at 00:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Upload the newest patch so that the cfbot can pass.  The last patch failed
> because some explain without the (cost off).

I've only really glanced at this patch, but I think we need to do this
in a completely different way.

I've been mentioning UniqueKeys around this mailing list for quite a
while now [1]. To summarise the idea:

1. Add a new List field to RelOptInfo named unique_keys
2. During get_relation_info() process the base relation's unique
indexes and add to the RelOptInfo's unique_keys list the indexed
expressions from each unique index (this may need to be delayed until
check_index_predicates() since predOK is only set there)
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.  For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an equijoin and t1.unique has a unique index). If the
condition was t1.unique = t2.unique then we could take the unique keys
from both sides of the join, and with t1.non_unique = t2.non_unique,
we can take neither.
4. When creating the GROUP BY paths (when there are no aggregates),
don't bother doing anything if the input rel's unique keys are a
subset of the GROUP BY clause. Otherwise, create the group by paths
and tag the new unique keys onto the GROUP BY rel.
5. When creating the DISTINCT paths, don't bother if the input rel's
unique keys are a subset of the distinct clause.

4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY
non_unique will just uniquify once for the GROUP BY and not for the
distinct.  SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do
anything to uniquify the results.

Because both 4 and 5 require that the uniquekeys are a subset of the
distinct/group by clause, an empty uniquekey set (which is trivially a
subset of any clause) would mean that the RelOptInfo returns no more
than 1 row.  That would allow your:

SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
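
To make 4 and 5 concrete, here is a sketch of the queries that would be
affected (the table and column names are illustrative only):

create table t1 (id int primary key, non_unique int);
-- 5: the unique key {id} is a subset of the DISTINCT clause, so no
--    Unique/HashAggregate step would be needed:
select distinct id, non_unique from t1;
-- 4: likewise, no grouping step would be needed here:
select id, non_unique from t1 group by id, non_unique;
-- but this would still require real uniquification:
select distinct non_unique from t1;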

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos.

I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

[1] https://www.postgresql.org/search/?m=1&ln=pgsql-hackers&q=uniquekeys



Hi David:
 
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.

I have some concerns about this method; maybe I misunderstand 
something.  If so, please advise.  

In my current implementation, the uniqueness is calculated for each
BaseRel only, but in your way it looks like we need to calculate the
UniquePathKey for both the BaseRel and the JoinRel.   That makes more of a
difference for multi-table joins.    Another concern is that UniquePathKey
is designed for a general purpose, so we need to maintain it whether or not
there is a distinctClause/groupByClause.  
 
 For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an equijoin and t1.unique has a unique index). 

Thanks for raising this.  My current rule requires that *every* relation yield a 
unique result, *regardless* of the join method.  Actually I want to make
the rule less strict; for example, under some join types we may need only one
table to yield a unique result for the final result to be unique as well. 

As for t1 INNER JOIN t2 ON t1.unique = t2.not_unique, it looks like we can't 
remove the DISTINCT based on this alone: 

create table m1(a int primary key, b int);
create table m2(a int primary key, b int);
insert into m1 values(1, 1), (2, 1);
insert into m2 values(1, 1), (2, 1);
select distinct m1.a from m1, m2 where m1.a = m2.b;
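
Running the join without the DISTINCT shows why: both m2 rows have b = 1,
so the single m1 row with a = 1 matches twice.

select m1.a from m1, m2 where m1.a = m2.b;
-- returns 1 twice, so the DISTINCT above is doing real work.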

 
SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
 
Actually I want to keep the DISTINCT for this case for now.  One reason is that only 1 
row is returned, so erasing the DISTINCT can't help much.   The more important reason is that
Query->hasAggs is true for "select distinct  (select count(*) filter (where t2.c2 = 6 
and t2.c1 < 10) from ft1 t1 where t1.c1 = 6)  from ft2 t2 where t2.c2 % 6 = 0 order by 1;"
(this SQL came from postgres_fdw.sql).  

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos

Thanks for this info.  I guess that patch is not merged so far, but it looks like the cfbot
run for my patch [1] failed due to this :(   Search for 
"explain (costs off) select distinct on(pk1) pk1, pk2 from select_distinct_a;"
 
I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

 I can do this after we have agreement on the UniquePath.  

For my cfbot failure, another strange thing is that "A" appears ahead of "a" after
the ORDER BY.  I still haven't found out why.


Regards 
Andy Fan

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:


On Tue, Mar 10, 2020 at 3:51 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 7 Mar 2020 at 00:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Upload the newest patch so that the cfbot can pass.  The last patch failed
> because some explain without the (cost off).

I've only really glanced at this patch, but I think we need to do this
in a completely different way.

I've been mentioning UniqueKeys around this mailing list for quite a
while now [1]. To summarise the idea:

1. Add a new List field to RelOptInfo named unique_keys
2. During get_relation_info() process the base relation's unique
indexes and add to the RelOptInfo's unique_keys list the indexed
expressions from each unique index (this may need to be delayed until
check_index_predicates() since predOK is only set there)
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),

build_*_join_rel() will be a good place for this. The paths created might take advantage of this information for costing.
 
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.  For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an equijoin and t1.unique has a unique index).

this is interesting.
 
If the
condition was t1.unique = t2.unique then we could take the unique keys
from both sides of the join, and with t1.non_unique = t2.non_unique,
we can take neither.
4. When creating the GROUP BY paths (when there are no aggregates),
don't bother doing anything if the input rel's unique keys are a
subset of the GROUP BY clause. Otherwise, create the group by paths
and tag the new unique keys onto the GROUP BY rel.
5. When creating the DISTINCT paths, don't bother if the input rel's
unique keys are a subset of the distinct clause.

Thanks for laying this out in more detail. Three more cases can be added to this:
6. When creating the RelOptInfo for a grouped/aggregated result, if all the columns of the GROUP BY clause are part of the result i.e. the targetlist, the columns in the GROUP BY clause serve as the unique keys of the result. So the corresponding RelOptInfo can be marked as such.
7. The result of DISTINCT is unique for the columns contained in the DISTINCT clause. Hence we can add those columns to the unique_keys of the RelOptInfo representing the result of the DISTINCT clause.
8. If each partition of a partitioned table has a unique key with the same columns in it and that happens to be a superset of the partition key, then the whole partitioned table gets that unique key as well.

With this we could actually pass the uniqueness information through subquery scans as well, and the overall query will benefit from that.
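
As an illustrative sketch of case 7 (assuming a table t(a int, b int)):
under this scheme the inner DISTINCT would mark the subquery's output as
unique on (b), which in turn would let the outer DISTINCT be elided:

select distinct b from (select distinct b from t) s;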
 

4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY
non_unique will just uniquify once for the GROUP BY and not for the
distinct.  SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do
anything to uniquify the results.

Because both 4 and 5 require that the uniquekeys are a subset of the
distinct/group by clause, an empty uniquekey set would mean that the
RelOptInfo returns no more than 1 row.  That would allow your:

SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos.

I haven't looked at that patch, but as discussed upthread, in this case we want the uniqueness associated with the RelOptInfo and not the path.
 

I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

+1.
--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:
Hi Andy,

On Tue, Mar 10, 2020 at 1:49 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Hi David:
>
>>
>> 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
>> rel itself (I've not looked for the best location in detail),
>> determine if the join can cause rows to be duplicated. If it can't,
>> then add the UniqueKeys from that rel.
>
>
> I have some concerns about this method,  maybe I misunderstand
> something, if so, please advise.
>
> In my current implementation, it calculates the uniqueness for each
> BaseRel only, but in your way,  looks we need to calculate the
> UniquePathKey for both BaseRel and JoinRel.   This makes more
> difference for multi table join.

I didn't understand this concern. I think it would be better to do it
for all kinds of relation types: base, join, etc. This way we are sure
that one method works across the planner to eliminate the need for
Distinct or grouping. If we just implement something for base
relations right now and don't do that for joins, there is a chance
that the method may not work for joins when we come to implement it.

> Another concern is UniquePathKey
> is designed for a general purpose,  we need to maintain it no matter
> distinctClause/groupbyClause.

This should be ok. The time spent annotating a RelOptInfo with
uniqueness information is not going to be a lot, but doing so would help
generic elimination of Distinct/Group/Unique operations. What is
UniquePathKey? I didn't find it in your patch or in the code.

>
>
>>
>>  For example: SELECT * FROM t1
>> INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
>> {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
>> since it's an eqijoin and t1.unique has a unique index).
>
>
> Thanks for raising this.  My current rule requires *every* relation yields a
> unique result and *no matter* with the join method.  Actually I want to make
> the rule less strict, for example, we  may just need 1 table yields unique result
> and the final result will be unique as well under some join type.

That is desirable.

>
> As for the t1 INNER JOIN t2 ON t1.unique = t2.not_unique;  looks we can't
> remove the distinct based on this.
>
> create table m1(a int primary key, b int);
> create table m2(a int primary key, b int);
> insert into m1 values(1, 1), (2, 1);
> insert into m2 values(1, 1), (2, 1);
> select distinct m1.a from m1, m2 where m1.a = m2.b;

IIUC, David's rule is the other way round. "select distinct m2.a from m1,
m2 where m1.a = m2.b" won't need a DISTINCT node, since the result of
joining m1 and m2 has a unique value of m2.a for each row. In your
example the join produces two rows (m1.a, m1.b, m2.a, m2.b): (1, 1,
1, 1) and (1, 1, 2, 1), where m2.a is the unique key.

--
Best Wishes,
Ashutosh Bapat




Hi Tom & David & Bapat:

Thanks for your review so far.  I want to summarize the current issues to help
our following discussion.

1. Shall we bypass the AggNode as well with the same logic?

I think yes, since the rules for bypassing an AggNode and a UniqueNode are exactly the same.
The difficulty in bypassing the AggNode is that the current aggregate function call is closely
coupled with the AggNode.  In the past few days I have made the aggregate call able to
run without an AggNode (at least I tested sum, which has no finalize fn, and avg, which has one).
But there are a few things left to do, like the ACL check, the anynull check and maybe more,
and there is also some MemoryContext mess to fix up.
I still need some time for this goal, so I think its complexity deserves another thread
to discuss it.  Any thoughts?

2. Shall we use the UniquePath as David suggested?

Actually I am leaning this way now.  David, can you share more insights about the 
benefits of UniquePath?  The costing side should be one of them; another may be 
changing a semi join to a normal join, as the current innerrel_is_unique logic does 
(see the sketch after this list).  Any others?

3. Can we make the rule more general?

Currently it requires that every relation yield a unique result.  David & Bapat provided another example: 
select m2.pk from m1, m2 where m1.pk = m2.non_unique_key.  That's interesting and not easy to 
handle in my current framework.  This is another reason I want to adopt the UniquePath framework. 

Do we have any other rules to think about before implementing it? 
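
A sketch of the semi-join point in item 2, assuming t2.pk is a primary key:

select * from t1 where t1.b in (select pk from t2);

Here the inner side is provably unique on pk, so the semi join can be planned
as a normal inner join, as the existing uniqueness proofs already allow.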

Thanks for your feedback. 


This should be ok. The time spent in annotating a RelOptInfo about
uniqueness is not going to be a lot. But doing so would help generic
elimination of Distinct/Group/Unique operations. What is
UniquePathKey; I didn't find this in your patch or in the code.

This is a proposal from David, so it's not in the current patch/code :) 

Regards
Andy Fan 
 

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 1:49 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > In my current implementation, it calculates the uniqueness for each
> > BaseRel only, but in your way,  looks we need to calculate the
> > UniquePathKey for both BaseRel and JoinRel.   This makes more
> > difference for multi table join.
>
> I didn't understand this concern. I think, it would be better to do it
> for all kinds of relation types base, join etc. This way we are sure
> that one method works across the planner to eliminate the need for
> Distinct or grouping. If we just implement something for base
> relations right now and don't do that for joins, there is a chance
> that that method may not work for joins when we come to implement it.

Yeah, it seems to me that we're seeing more and more features that
require knowledge of uniqueness of a RelOptInfo. The skip scans patch
needs to know if a join will cause row duplication so it knows if the
skip scan path can be joined to without messing up the uniqueness of
the skip scan.  Adding more and more places that loop over the rel's
indexlist just does not seem the right way to do it, especially so
when you have to dissect the join rel down to its base rel components
to check which indexes there are. Having the knowledge on-hand at the
RelOptInfo level means we no longer have to look at indexes for unique
proofs.

> > Another concern is UniquePathKey
> > is designed for a general purpose,  we need to maintain it no matter
> > distinctClause/groupbyClause.
>
> This should be ok. The time spent in annotating a RelOptInfo about
> uniqueness is not going to be a lot. But doing so would help generic
> elimination of Distinct/Group/Unique operations. What is
> UniquePathKey; I didn't find this in your patch or in the code.

Possibly a misinterpretation. There is some overlap between this patch
and the skip scan patch; both would like to skip doing explicit work
to implement DISTINCT. Skip scans just go about it by adding new path
types that scan the index and only gather up unique values.  In that
case, the RelOptInfo won't contain the unique keys, but the skip scan
path will. How I imagine both of these patches working together in
create_distinct_paths() is that we first check if the DISTINCT clause
is a subset of a set of the RelOptInfo's unique keys (this patch),
else we check if there are any paths with uniquekeys that we can use
to perform a no-op on the DISTINCT clause (the skip scan patch), and if
none of those apply, we create the required paths to uniquify the
results.



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Tue, 10 Mar 2020 at 21:19, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
>
>
> Actually I want to keep the distinct for this case now.  One reason is there are only 1
> row returned, so the distinct erasing can't help much.   The more important reason is
> Query->hasAggs is true for "select distinct  (select count(*) filter (where t2.c2 = 6
> and t2.c1 < 10) from ft1 t1 where t1.c1 = 6)  from ft2 t2 where t2.c2 % 6 = 0 order by 1;"
> (this sql came from postgres_fdw.sql).

I think that sort of view is part of the problem here.  If you want to
invent some new way to detect uniqueness that does not count that case
then we have more code with more possible places to have bugs.

FWIW, query_is_distinct_for() does detect that case with:

/*
* If we have no GROUP BY, but do have aggregates or HAVING, then the
* result is at most one row so it's surely unique, for any operators.
*/
if (query->hasAggs || query->havingQual)
return true;

which can be seen by the fact that the following finds the unique join on t2.

postgres=# explain verbose select * from t1 inner join (select
count(*) c from t1) t2 on t1.a=t2.c;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=41.91..84.25 rows=13 width=12)
   Output: t1.a, (count(*))
   Inner Unique: true
   Hash Cond: (t1.a = (count(*)))
   ->  Seq Scan on public.t1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t1.a
   ->  Hash  (cost=41.89..41.89 rows=1 width=8)
         Output: (count(*))
         ->  Aggregate  (cost=41.88..41.88 rows=1 width=8)
               Output: count(*)
               ->  Seq Scan on public.t1 t1_1  (cost=0.00..35.50
rows=2550 width=0)
                     Output: t1_1.a
(12 rows)

It will be very simple to add an empty List of UniqueKeys to the GROUP
BY's RelOptInfo to indicate that all expressions are unique.  That way,
any code that checks whether some of the RelOptInfo's unique keys are a
subset of the expressions it would like to prove unique will get a
match.

It does not really matter how much effort is saved in your example
above. The UniqueKey infrastructure won't know how much effort
properly adding all the uniquekeys will save. It should just add all
the keys it can and let whichever code cares about that reap the
benefits.





On Wed, Mar 11, 2020 at 6:49 AM David Rowley <dgrowleyml@gmail.com> wrote:

Now I am convinced that we should maintain UniquePath on RelOptInfo;
I will see how to make it work with the "Index Skip Scan" patch.

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:
On Tue, Mar 10, 2020 at 9:12 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> Hi Tom & David & Bapat:
>
> Thanks for your review so far.  I want to summarize the current issues to help
> our following discussion.
>
> 1. Shall we bypass the AggNode as well with the same logic.
>
> I think yes, since the rules to bypass a AggNode and UniqueNode is exactly same.
> The difficulty of bypassing AggNode is the current aggregation function call is closely
> coupled with AggNode.  In the past few days, I have make the aggregation call can
> run without AggNode (at least I tested sum(without finalized  fn),  avg (with finalized fn)).
> But there are a few things to do, like acl check,  anynull check and maybe more check.
> also there are some MemoryContext mess up need to fix.
> I still need some time for this goal,  so I think the complex of it deserves another thread
> to discuss it,  any thought?

I think if the relation underlying an Agg node is known to be unique
for the given groupByClause, we could safely use the AGG_SORTED strategy.
Even though the input is not ordered, every group contains exactly one
row, so the Agg node can combine/finalize the aggregate result for each
row as it arrives.
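
A sketch of that case (table and column names illustrative, with
t(pk int primary key, b int)):

select pk, sum(b) from t group by pk;

Every group contains exactly one row, so a sorted-style aggregation could
consume the unsorted input directly, finalizing each group as its single
row arrives, with no Sort node required.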

-- 
Best Wishes,
Ashutosh Bapat



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Ashutosh Bapat
Date:
On Wed, Mar 11, 2020 at 4:19 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > > In my current implementation, it calculates the uniqueness for each
> > > BaseRel only, but in your way,  looks we need to calculate the
> > > UniquePathKey for both BaseRel and JoinRel.   This makes more
> > > difference for multi table join.
> >
> > I didn't understand this concern. I think, it would be better to do it
> > for all kinds of relation types base, join etc. This way we are sure
> > that one method works across the planner to eliminate the need for
> > Distinct or grouping. If we just implement something for base
> > relations right now and don't do that for joins, there is a chance
> > that that method may not work for joins when we come to implement it.
>
> Yeah, it seems to me that we're seeing more and more features that
> require knowledge of uniqueness of a RelOptInfo. The skip scans patch
> needs to know if a join will cause row duplication so it knows if the
> skip scan path can be joined to without messing up the uniqueness of
> the skip scan.  Adding more and more places that loop over the rel's
> indexlist just does not seem the right way to do it, especially so
> when you have to dissect the join rel down to its base rel components
> to check which indexes there are. Having the knowledge on-hand at the
> RelOptInfo level means we no longer have to look at indexes for unique
> proofs.

+1. Yep. When we break a join down to its base relations, partitioned
relations pose another challenge: the partitioned relation may not
have an index on it per se, but each partition may have one whose
index key happens to include the partition key. That case would be
easy to track through RelOptInfo instead of breaking a base rel down
into its child rels.

>
> > > Another concern is UniquePathKey
> > > is designed for a general purpose,  we need to maintain it no matter
> > > distinctClause/groupbyClause.
> >
> > This should be ok. The time spent in annotating a RelOptInfo about
> > uniqueness is not going to be a lot. But doing so would help generic
> > elimination of Distinct/Group/Unique operations. What is
> > UniquePathKey; I didn't find this in your patch or in the code.
>
> Possibly a misinterpretation. There is some overlap between this patch
> and the skip scan patch, both would like to skip doing explicit work
> to implement DISTINCT. Skip scans just go about it by adding new path
> types that scan the index and only gathers up unique values.  In that
> case, the RelOptInfo won't contain the unique keys, but the skip scan
> path will. How I imagine both of these patches working together in
> create_distinct_paths() is that we first check if the DISTINCT clause
> is a subset of the a set of the RelOptInfo's unique keys (this patch),
> else we check if there are any paths with uniquekeys that we can use
> to perform a no-op on the DISTINCT clause (the skip scan patch), if
> none of those apply, we create the required paths to uniquify the
> results.

Looks good to me. But I have not seen the index skip patch.

-- 
Best Wishes,
Ashutosh Bapat



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns.  I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t1.a
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
(7 rows)


postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.b = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t2.a
   ->  Hash Join
         Hash Cond: (t1.b = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
(7 rows)


postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it.  I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype from the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

Attachment
Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley <dgrowleyml@gmail.com> wrote:

Thanks for the code!   Here are some points from me. 

1.   For populate_baserel_uniquekeys, we need to handle the "pk = Const" case as well 
(relation_has_unique_index_for has similar logic); currently the following distinct path is still
 there. 

postgres=# explain select distinct b from t100 where pk = 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case,  we can add  both (pk) and (b) as the UniquePaths.  If so we
can get more opportunities to reach our goal.  

2.  As for propagate_unique_keys_to_joinrel, we can add 1 more UniquePath as 
(rel1_unique_paths, rel2_unique_paths) if the current rules don't apply; 
otherwise the following cases can't be handled.

postgres=# explain select distinct t100.pk,  t101.pk from t100, t101;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
         Sort Key: t100.pk, t101.pk
         ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
               ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
               ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
                     ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

But if we add such a rule, the unique paths probably become much longer, so we need
a strategy to tell whether a UniquePath is useful for our query; if not, we can ignore it.
rel->reltarget may be good information for such an optimization.   I think we can take this into 
consideration for populate_baserel_uniquekeys as well. 

For the non-null info, Tom suggested maintaining such info in RelOptInfo. 
I have done that for the not-null info from the base relation's catalog.  I think we can
maintain the same flag for a joinrel, and the not-null info from find_nonnullable_vars as 
well, but I still haven't found a good place to add that so far. 


A small question about the following code:

+       if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
+       {
+
+               add_path(distinct_rel, (Path *)cheapest_input_path);
+
+               /* XXX yeah yeah, need to call the hooks etc. */
+
+               /* Now choose the best path(s) */
+               set_cheapest(distinct_rel);
+
+               return distinct_rel;
+       }

Since we don't create new RelOptInfo/Path,  do we need to call add_path and set_cheapest?


Best Regards
Andy Fan 

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Fri, 13 Mar 2020 at 14:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> 1.   for pupulate_baserel_uniquekeys,  we need handle the "pk = Const" as well.
> (relation_has_unqiue_for has a similar logic) currently the following distinct path is still
>  there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done.

> postgres=# explain select distinct b from t100 where pk = 1;
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Unique  (cost=8.18..8.19 rows=1 width=4)
>    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
>          Sort Key: b
>          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
>                Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case,  we can add  both (pk) and (b) as the UniquePaths.  If so we
> can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people.  The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

> 2.  As for the propagate_unique_keys_to_joinrel,  we can add 1 more UniquePath as
> (rel1_unique_paths,  rel2_unique_paths) if the current rules doesn't apply.
> or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pk,  t101.pk from t100, t101;
>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
>    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
>          Sort Key: t100.pk, t101.pk
>          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
>                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
>                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
>                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above.  If you've just
missed out the join clause and you meant to have "WHERE t100.pk =
t101.pk", then we can likely fix that later with some sort of
functional dependency tracking. Likely we can just add a Relids field
to UniqueKeySet to track which relids are functionally dependent on a
row from the UniqueKeySet's uk_exprs.  That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the
result is a subset of functionally dependent rels.  I'd need to give
that more thought.

Was this a case you had working in your patch?

> But if we add such rule,  the unique paths probably become much longer,  so we need
> a strategy to tell if the UniquePath is useful for our query,  if not,  we can ignore that.
> rel->reltarget maybe a good info for such optimization.   I think we can take this into
> consideration for pupulate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.
propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

> For the non_null info,  Tom suggested to add maintain such info RelOptInfo,
> I have done that for the not_null_info for basic relation catalog,  I think we can
> maintain the same flag for joinrel and the not null info from find_nonnullable_vars as
> well, but I still didn't find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot.  I imagined just
setting the bit in the UniqueKeySet's non_null_keys field
corresponding to the column position from the index.  I could see the
benefit of having a field in RelOptInfo if there was some way to
determine the not-null properties of all columns in the table at once,
but there's not, so we're likely best just looking at the ones that
there are unique indexes on.

> A small question about the following code:
>
> +       if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
> +       {
> +
> +               add_path(distinct_rel, (Path *)cheapest_input_path);
> +
> +               /* XXX yeah yeah, need to call the hooks etc. */
> +
> +               /* Now choose the best path(s) */
> +               set_cheapest(distinct_rel);
> +
> +               return distinct_rel;
> +       }
>
> Since we don't create new RelOptInfo/Path,  do we need to call add_path and set_cheapest?

The distinct_rel already exists. add_path() is the standard way we
have of adding paths to the rel's pathlist.  Why would you want to
bypass that? set_cheapest() is our standard way of looking at the
pathlist and figuring out the least costly one. It's not a very hard
job to do when there's just 1 path. Not sure why you'd want to do it
another way.





On Fri, Mar 13, 2020 at 11:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 13 Mar 2020 at 14:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> 1.   for pupulate_baserel_uniquekeys,  we need handle the "pk = Const" as well.
> (relation_has_unqiue_for has a similar logic) currently the following distinct path is still
>  there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done. 
> postgres=# explain select distinct b from t100 where pk = 1;
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Unique  (cost=8.18..8.19 rows=1 width=4)
>    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
>          Sort Key: b
>          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
>                Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case,  we can add  both (pk) and (b) as the UniquePaths.  If so we
> can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

I think we maintain UniqueKeys even without a distinct_rel, so at this stage, 
can we say b is unique here ("no" is a possible answer)?  If yes, we probably 
need to set that information without considering the distinct clause. 

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people.  The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

 
OK. 
 
> 2.  As for the propagate_unique_keys_to_joinrel,  we can add 1 more UniquePath as
> (rel1_unique_paths,  rel2_unique_paths) if the current rules doesn't apply.
> or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
>    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
>          Sort Key: t100.pk, t101.pk
>          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
>                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
>                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
>                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above.  If you've just
missed out the join clause and you meant to have "WHERE t100.pk =
t101.pk", then we can likely fix that later with some sort of
functional dependency tracking.

In the above case the result should be unique.  The knowledge behind that
is: *if we join 2 unique results with any join method, the result is unique as well*.
In the above example, the final unique key is (t100.pk, t101.pk).   
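
A tiny check of that rule with two rows per side (pk values 1 and 2 in each table):

select t100.pk, t101.pk from t100, t101;
-- yields (1,1), (1,2), (2,1), (2,2)

Every (t100.pk, t101.pk) pair is distinct, as it must be: repeating a pair
would require repeating a pk value on one side.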
 
Likely we can just add a Relids field
to UniqueKeySet to track which relids are functionally dependant on a
row from the UniqueKeySet's uk_exprs.  That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the
result is a subset of functionally dependant rels.  I'd need to give
that more thought.

Was this a case you had working in your patch?

I think we can do that once I adopt your UniqueKey idea, so, no, my previous patch is not as smart
as yours :)  


> But if we add such rule,  the unique paths probably become much longer,  so we need
> a strategy to tell if the UniquePath is useful for our query,  if not,  we can ignore that.
> rel->reltarget maybe a good info for such optimization.   I think we can take this into
> consideration for pupulate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.
propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

The memcpy is not the key concern here.  My main point is that we need
to watch the length of RelOptInfo->uniquekeys.  For example:  
t has 3 unique keys, (uk1), (uk2), (uk3), and the query is
select b from t where m = 1.  In that case there is no need to add those 3
to the UniqueKeys, so we can keep rel->uniquekeys shorter.   

The length of rel->uniquekeys may be a concern if we add the rule 
I suggested above, the (t100.pk, t101.pk) case.   Think about this
example:

1. select .. from t1, t2, t3, t4...;  
2. suppose each table has 2 UniqueKeys, named (t{m}_uk{n});
3. following my rule above, (t1_uk1, t2_uk1) is a UniqueKey for the joinrel;
4. suppose we join in the order (t1 vs t2 vs t3 vs t4).

For (t1 vs t2), we need to add 4 more UniqueKeys to this joinrel: 
(t1_uk1, t2_uk1), (t1_uk1, t2_uk2), (t1_uk2, t2_uk1), (t1_uk2, t2_uk2).

By the time we join the last one, joinrel->uniquekeys will be much longer,
which makes scanning it less efficient. 
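
(To put a number on it: with two unique keys per base rel, this composite
rule alone gives the two-way joinrel 2 * 2 = 4 composite keys and the final
four-way joinrel 2 * 2 * 2 * 2 = 16, before counting any keys carried over
from non-duplicating joins.)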

But this will not be an issue if my rule above should not be considered, so 
we need to talk about that first. 

> For the non_null info,  Tom suggested to add maintain such info RelOptInfo,
> I have done that for the not_null_info for basic relation catalog,  I think we can
> maintain the same flag for joinrel and the not null info from find_nonnullable_vars as
> well, but I still didn't find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot.  I imagined just
setting the bit in the UniqueKeySet's non_null_keys field
corresponding to the column position from the index.  I could see the
benefit of having a field in RelOptInfo if there was some way to
determine the not-null properties of all columns in the table at once,

Do you mean getting the non-null properties from the catalog or from restrictinfo? 
If you mean the catalog, get_relation_info may be a good place for that. 

but there's not, so we're likely best just looking at the ones that
there are unique indexes on.
 
> A small question about the following code:
>
> +       if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
> +       {
> +
> +               add_path(distinct_rel, (Path *)cheapest_input_path);
> +
> +               /* XXX yeah yeah, need to call the hooks etc. */
> +
> +               /* Now choose the best path(s) */
> +               set_cheapest(distinct_rel);
> +
> +               return distinct_rel;
> +       }
>
> Since we don't create new RelOptInfo/Path,  do we need to call add_path and set_cheapest?

The distinct_rel already exists. add_path() is the standard way we
have of adding paths to the rel's pathlist.  Why would you want to
bypass that? set_cheapest() is our standard way of looking at the
pathlist and figuring out the least costly one. It's not a very hard
job to do when there's just 1 path. Not sure why you'd want to do it
another way.

I get the point now.  In this case you create a new RelOptInfo 
named distinct_rel, so we *must* set it.  Could we just return the input_rel 
in this case?  If we can, we don't need that. 
Hi All:

I have re-implemented the patch based on David's suggestions/code, and it looks
like it works well.   The updated patch mainly includes: 

1. Maintain the not_null_colno in RelOptInfo, which includes the not-null info from 
    the catalog and the not-null info derived from Vars.
2. Add the restrictinfo check in populate_baserel_uniquekeys.  If we are sure 
  only 1 row is returned, I add each expr in rel->reltarget->exprs as a unique key,
  e.g. for (select a, b, c from t where pk = 1),  the UKs will be ( (a), (b), (c) ).
3. Postpone the propagate_unique_keys_to_joinrel call to populate_joinrel_with_paths,
  since we know the jointype at that time and so can handle semi/anti joins specially.
4. Add the rule I suggested above: if both of the 2 relations yield a unique result, 
  the join result will be unique as well; the UK can be  ( (rel1_uk1, rel2_uk1) .. ).
5. If the unique key cannot possibly be referenced by others, we can safely ignore 
    it in order to keep (join)rel->uniquekeys short.
6. I only do the not-null check/opfamily check for a uniquekey which comes 
   from a unique index.  I think that should be correct.
7. I defined each uniquekey as a List of Expr, so I didn't introduce a new node type.
8. Check the uniquekeys information before create_distinct_paths and 
   create_group_paths: skip the new paths to be created if the sortgroupclauses
    are unique already, or else create them and add the new uniquekey to the 
    distinctrel/grouprel.

There are some things still in progress, like:
1. Partitioned tables.
2. UNION / UNION ALL.
3. Maybe refactor innerrel_is_unique/query_is_distinct_for to use UniqueKey.
4. If we are sure the GROUP BY clause is unique and we have an aggregate call, maybe we 
should try Bapat's suggestion: we can use sort rather than hash.  The strategy sounds
great, but I haven't checked the details so far. 
5. A clearer commit message. 
6. Anything more? 

Any feedback is welcome.  Thank you for your ideas, suggestions, and demo code!

Best Regards
Andy Fan


Attachment

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Mon, 16 Mar 2020 at 06:01, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code,  Looks it
> works well.   The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null from
>     catalog and the not null from vars.

What about non-nullability that we can derive from means other than
NOT NULL constraints? Where will you track that now that you've
removed the UniqueKeySet type?

Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers.

> 3. postpone the propagate_unique_keys_to_joinrel call to populate_joinrel_with_paths
>   since we know jointype at that time. so we can handle the semi/anti join specially.

ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.

> 4. Add the rule I suggested above,  if both of the 2 relation yields the a unique result,
>   the join result will be unique as well. the UK can be  ( (rel1_uk1, rel1_uk2).. )

I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the unique keys from either side of the
join.  I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins.

> 5. If the unique key is impossible to be referenced by others,  we can safely ignore
>     it in order  to keep the (join)rel->unqiuekeys short.

You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()

> 6. I only consider the not null check/opfamily check for the uniquekey which comes
>    from UniqueIndex.  I think that should be correct.
> 7. I defined each uniquekey as List of Expr,  so I didn't introduce new node type.

Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.



Hi David:

Thanks for your time. 

On Wed, Mar 18, 2020 at 9:56 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 16 Mar 2020 at 06:01, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code,  Looks it
> works well.   The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null from
>     catalog and the not null from vars.

What about non-nullability that we can derive from means other than
NOT NULL constraints? Where will you track that now that you've
removed the UniqueKeySet type?

I tracked it in 'deconstruct_recurse',  just before the distribute_qual_to_rels call. 

+                       ListCell        *lc;
+                       foreach(lc, find_nonnullable_vars(qual))
+                       {
+                               Var *var = lfirst_node(Var, lc);
+                               RelOptInfo *rel = root->simple_rel_array[var->varno];
+                               if (var->varattno > InvalidAttrNumber)
+                                       rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
+                       }
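
As a hypothetical illustration of why non-nullability derived from quals matters (table t and its index are my own example, not from the patch):

create table t (a int, b int);
create unique index t_b_uidx on t (b);
-- b is nullable, so the unique index alone cannot prove distinctness
-- (several NULL values of b may coexist).  But the strict qual "b = 1"
-- lets find_nonnullable_vars() mark b as not null for this query, after
-- which the unique index can prove the result unique and the DISTINCT
-- becomes removable:
select distinct b from t where b = 1;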


Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers.

Currently I use a List of Vars for a UniqueKey; I guess that is OK?
 

> 3. Postpone the propagate_unique_keys_to_joinrel call to populate_joinrel_with_paths,
>   since we know the jointype at that time and can handle the semi/anti join specially.

ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.

> 4. Add the rule I suggested above: if both of the two relations yield a unique result,
>   the join result will be unique as well. The UK can be ((rel1_uk1, rel2_uk1) ..)

I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the uniquekeys from either side of the
join.  I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins.

Some other cases may need this as well :), e.g. select m1.pk, m2.pk from m1, m2
where m1.b = m2.b;

The cartesian product of the unique keys can make the unique-key list too long, so I maintain
a UniqueKeyContext to keep it short. The idea is that if (UK1) is already unique, there is no need
to add another UK such as (UK1, UK2), which is just a superset of it.
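
A self-contained sketch of that case (the table definitions are my own, for illustration):

create table m1 (pk int primary key, b int);
create table m2 (pk int primary key, b int);
-- m1.pk alone is not unique in the join result (one m1 row can match many
-- m2 rows), but each (m1 row, m2 row) pair appears at most once, so the
-- combined key (m1.pk, m2.pk) is unique and this DISTINCT is redundant:
select distinct m1.pk, m2.pk from m1, m2 where m1.b = m2.b;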
 
 
> 5. If the unique key cannot possibly be referenced by others, we can safely ignore
>     it in order to keep the (join)rel->uniquekeys short.

You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()

 
Thanks for the suggestion; I will do so in the v5-xx patch.
 
> 6. I only consider the not null check/opfamily check for the uniquekey which comes
>    from UniqueIndex.  I think that should be correct.
> 7. I defined each uniquekey as a List of Expr, so I didn't introduce a new node type.

Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.

This is a bit embarrassing: I am not sure whether it is safe to ignore it. I removed it for the
following reasons (sorry that I didn't explain it carefully in the last email).

1.  When we check whether a UK is usable, we have a chance to compare the collation info
of the restrictinfo (where uk = 1) or the target list (select uk from t) with the indexinfo's collation;
the target-list case is more troublesome since we need to figure out the default collation for it.
However, relation_has_unique_index_for() is in the same situation as we are, and it ignores
collation as well; see the comment /* XXX at some point we may need to check collations here too. */.
I think there may be some reason why we can ignore it.

2. What we expect from a UK is:
a).  Where m1.uniquekey = m2.b: rows from m2 will not be duplicated by this join clause. Here,
if m1.uk has a different collation, it will raise a runtime error.
b).  Where m1.uniquekey collate 'xxxx' = m2.b: we may not be able to depend on
the runtime error this time. But if we are sure that *if uk is unique at the default collation,
then (uk collate 'other-collation') is unique as well*, then we may safely ignore it too.
c).  select uniquekey from t / select uniquekey collate 'xxxx' from t: this has the same
requirement as item b). (A sketch of these three cases follows after this list.)

3).  It looks like maintaining the collation for our case is a big overall effort, and users rarely
use it. If my expectation in 2.b is not true, I would prefer to detect such a case (the user using
a different collation) and just ignore the UK for it.
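
To make the three cases concrete, a sketch (tables c1/c2 and the "C" collation are my own choices, for illustration only):

create table c1 (uk text primary key);
create table c2 (b text);
-- case a): join clause on the unique key at its index collation
select * from c1, c2 where c1.uk = c2.b;
-- case b): an explicit COLLATE on the unique-key side; the open question is
-- whether uniqueness under the index collation implies uniqueness here too
select * from c1, c2 where c1.uk collate "C" = c2.b;
-- case c): the same question for a re-collated unique key in the target list
select distinct uk collate "C" from c1;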

But after all, I think this should remain an open question for now.

---
Lastly, I am very grateful for your suggestions/feedback; they are really enlightening and constructive.
Thanks also to Tom for the quick review and for suggesting adding a new field to RelOptInfo;
without that I don't think I could have added a new field to such an important struct. And thanks
also to Bapat, who explained things in more detail. I'm now writing the code for the partition
index stuff, which is a bit boring since every partition may have a different unique index. I expect
to finish it in the following 2 days, and hope you can do another round of review then.

Thanks for your feedback!

Best Regards
Andy Fan

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
David Rowley
Date:
On Wed, 18 Mar 2020 at 15:57, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> I'm now writing the code for the partition index stuff, which
> is a bit boring since every partition may have a different unique index.

Why is that case so different?

For a partitioned table to have a valid unique index, a unique index
must exist on each partition having columns that are a superset of the
partition key columns. An IndexOptInfo will exist on the partitioned
table's RelOptInfo, in this case.

At the leaf partition level, wouldn't you just add the uniquekeys the
same as we do for base rels? Maybe only do it if
enable_partitionwise_aggregation is on. Otherwise, I don't think we'll
currently have a need for them.  Currently, we don't do unique joins
for partition-wise joins. Perhaps uniquekeys will be a good way to fix
that omission in the future.



Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
Hi David:

On Wed, Mar 18, 2020 at 12:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 18 Mar 2020 at 15:57, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> I'm now writing the code for the partition index stuff, which
> is a bit boring since every partition may have a different unique index.

Why is that case so different?

For a partitioned table to have a valid unique index, a unique index
must exist on each partition having columns that are a superset of the
partition key columns. An IndexOptInfo will exist on the partitioned
table's RelOptInfo, in this case.

The main differences are caused by the following:

1.  We can create a unique index on only some of the partitions.

create table q100 (a int, b int, c int) partition by range (b);
create table q100_1 partition of q100 for values from (1) to (10);
create table q100_2 partition of q100 for values from (11) to (20);
create unique index q100_1_c on q100_1(c);   -- user may create this index on q100_1 only

2.  The unique index may not contain the partition key, as above.

For the above case, even if the same index existed on all the partitions, we still couldn't
use it, since it is unique on each local partition only.

3.  So a unique index on a partitioned table can be used only if it contains the partition key
AND exists on all the partitions (see the sketch below).
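
By contrast, a hypothetical setup (q200 is my own example) where rule 3 is satisfied:

create table q200 (a int, b int, c int) partition by range (b);
create table q200_1 partition of q200 for values from (1) to (10);
create table q200_2 partition of q200 for values from (11) to (20);
-- this unique index includes the partition key b and is created on the
-- partitioned table itself, so it cascades to every partition and makes
-- (b, c) unique across the whole of q200:
create unique index q200_b_c on q200 (b, c);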

4.  When we apply uniquekey_is_useful_for_rel, I compare the information between ind->indextlist
and rel->reltarget, but the indextlist has the wrong varno, so we have to change the varno of the
childrel's indextlist with ChangeVarNodes, since that varno belongs to the childrel.

5.  When we detect the uk = 1 case, the uk is also presented with parentrel->relid information,
which may require ChangeVarNodes on childrel->indexinfo->indextlist as well.

Even though the rules look long, the run time should be very short, since usually we don't have
many unique indexes on a partitioned table.
 
At the leaf partition level, wouldn't you just add the uniquekeys the
same as we do for base rels?

Yes. But since the uk of a childrel may not be useful for the parent rel (the uk may exist on
only one partition), I think we can bypass this in the childrel case?
 

Best Regards
Andy Fan

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From
Andy Fan
Date:
I have started the new thread [1] to continue talking about this.
Mr. cfbot is happy now.


Thanks