Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From Andy Fan
Subject Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date
Msg-id CAKU4AWpe=mNMy_xs0JMQ=3_SJqAku4RiSZiGEuUwsThHPmUqKw@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Erase the distinctClause if the result is unique by definition  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses Re: [PATCH] Erase the distinctClause if the result is unique by definition
List pgsql-hackers
Hi Ashutosh:
   Thanks for your time. 

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi Andy,
What might help is to add more description to your email message like giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.
+create table select_distinct_a(a int, b char(20),  c char(20) not null,  d int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+          QUERY PLAN          
+-------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped because (a, b) is a primary key. Is my understanding correct?

Yes, you are correct.   Actually I added then to commit message,
but it's true that I should have copied them in this email body as
 well.  so copy it now. 

[PATCH] Erase the distinctClause if the result is unique by
 definition

For a single relation, we can tell it by any one of the following
is true:
1. The pk is in the target list.
2. The uk is in the target list and the columns is not null
3. The columns in group-by clause is also in the target list

for relation join, we can tell it by:
if every relation in the jointree yields a unique result set, then
the final result is unique as well regardless the join method.
for semi/anti join, we will ignore the righttable.

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets dropped between a prepare and execute? The plan will no more be valid and thus execution may produce non-distinct results.

Will this still be an issue if user use doesn't use a "read uncommitted" 
isolation level?  I suppose it should be ok for this case.  But even though
I should add an isolation level check for this.  Just added that in the patch
to continue discussing of this issue. 
 
PostgreSQL has similar concept of allowing non-grouping expression as part of targetlist when those expressions can be proved to be functionally dependent on the GROUP BY clause. See check_functional_grouping() and its caller. I think, DISTINCT elimination should work on similar lines.
2. For the same reason described in check_functional_grouping(), using unique indexes for eliminating DISTINCT should be discouraged.
 
I checked the comments of check_functional_grouping,  the reason is 

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually I am doubtful the reason for pg_constraint since we still be able 
to get the not null information from relation->rd_attr->attrs[n].attnotnull which 
is just what this patch did.   

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY as well

This is a good point.   The rules may have some different for join,  so I prefer 
to to focus on the current one so far. 
 
4. The patch works only at the query level, but that functionality can be expanded generally to other places which add Unique/HashAggregate/Group nodes if the underlying relation can be proved to produce distinct rows. But that's probably more work since we will have to label paths with unique keys similar to pathkeys.
 
Do you mean adding some information into PlannerInfo,  and when we create 
a node for Unique/HashAggregate/Group,  we can just create a dummy node? 
 
5. Have you tested this OUTER joins, which can render inner side nullable?

Yes, that part was missed in the test case.  I just added them.   

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
update the patch with considering the semi/anti join. 

Can anyone help to review this patch?  

Thanks


On Fri, Jan 31, 2020 at 8:39 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Hi:

I wrote a patch to erase the distinctClause if the result is unique by 
definition,  I find this because a user switch this code from oracle 
to PG and find the performance is bad due to this,  so I adapt pg for
this as well. 

This patch doesn't work for a well-written SQL,  but some drawback 
of a SQL may be not very obvious,  since the cost of checking is pretty
low as well,  so I think it would be ok to add.. 

Please see the patch for details.   

Thank you. 


--
--
Best Wishes,
Ashutosh Bapat
Attachment

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Internal key management system
Next
From: legrand legrand
Date:
Subject: Re: POC: rational number type (fractions)