[PATCH] query rewrite for distinct stage under some cases - Mailing list pgsql-hackers

From Andy Fan
Subject [PATCH] query rewrite for distinct stage under some cases
Date
Msg-id CAKU4AWrX6t8MEG34wvRvHZwriyaHgGQ-x6ZycKMhSHYTkUuKuw@mail.gmail.com
Whole thread Raw
List pgsql-hackers

Hi Hackers:

   

This is a patch for unique elimination rewrite for distinct query. 

it will cost much for a big result set and some times it is not 

necessary. The basic idea is the unique node like in the following

can be eliminated.


1. select distinct pk, ... from t;   

2. select distinct uk-col1, uk-col2, ... 

    from t where uk-col1 is not null and uk-col2 is not null;

3. select distinct a, b ....   from t group by a, b; 

4. select distinct t1.pkt2.pk, ... from t1, t2.    


The distinct keyword in above sql is obviously redundant,  

But a SQL may join a lot of tables with tens of columns in target 

list and a number of indexes. Finally the sql is hidden in 

hundreds of SQL in system, it will be hard to find it out.

That's why I want the kernel can keep watching it, 

based on that it will not cost too much.  Oracle has similar rewrite 

as well.


The rule for single relation is:

a). The primary key is choose in target list.

b). The unique key is choose in the target list,  

     and we can tell the result of the related column is not nullable.  

     we can tell it by catalog and qual.

c). The group-by columns is choose in target list.

d). The target list in subquery has a distinct already.  

     (select distinct xxx from (select distinct xxx from t2));  


The rule for multi-relations join is:

e). if any relation yield a unique result,  then the result of join will be 

     unique as well  


If an sql matches any rule of above, we can remove the unique node. 

Rule d) is not so common and complex to implement,  so it is not 

included in this patch. 


Implementation:

f). I choose the target list per table,  if there is hasDistinctOn,  the source

    is the target list intersect distinctClause. or else,  the source is target list only. 

g). the pk/uk columns information is gathered by RelationGetIndexAttrBitmap.  

     a new filed RelationData.plain_uk_ukattrs is added and gathered as well.  

h). As last if any rule matches,   Query->distinctClause & Query->hasDistinctOn 

     will be cleared to avoid generating the related path. 


There are also some fast paths to return earlier:

i). If a table in join-list, but no columns is choose in target list.

j). The join-list contains sub-query.  (this rewrite happens after sub-query pull-up)

k). Based on the cost of the checking, we check group by first and 

     then PK and then UK + not null. 


There is no impact for non-distinct query, as for distinct query, this rule will 

increase the total cost a bit if the distinct can't be removed.  The unique 

check is most expensive, so here is the data to show the impact, a 4 

columns table, no pk,  1 uk with 2 columns.


With this feature disable:   avg plan time:  0.095ms

With this feature enabled:  avg plan time:   0.102ms  


Basically I think the cost would be ok. 


Concurrency:

l).  When we see a pk or uk index, so we remove the index on another session,

I think this would be ok because of MVCC rules.

m). When we are creating an index in another session but it is not completed,  

 suppose we can't get it with RelationGetIndexAttrBitmap.  so it should be ok 

as well. 

 

The behavior can be changed online with enable_unique_elimination, 

it is true by default.


The patch is generated with the latest code on github, 

and the current HEAD is 34a0a81bfb388504deaa51b16a8bb531b827e519.


The make installcheck-world & check-world has pass.  

Test case join.sql and sysview.sql are impacted by this change 

and they are expected,  the changed expected.out file is included in this patch.


Please let me know if you have any questions. 


Thank you 

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Block level parallel vacuum
Next
From: Masahiko Sawada
Date:
Subject: Re: [HACKERS] Block level parallel vacuum