Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From Andy Fan
Subject Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date
Msg-id CAKU4AWp8E6En7-Ox0op4v3-yViGhKA4ic7KCOdAbaVcC0s52DQ@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: [PATCH] Erase the distinctClause if the result is unique by definition  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi David:
 
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.

I have some concerns about this method,  maybe I misunderstand 
something, if so, please advise.  

In my current implementation, it calculates the uniqueness for each
BaseRel only, but in your way,  looks we need to calculate the
UniquePathKey for both BaseRel and JoinRel.   This makes more 
difference for multi table join.    Another concern is UniquePathKey
is designed for a general purpose,  we need to maintain it no matter
distinctClause/groupbyClause.  
 
 For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an eqijoin and t1.unique has a unique index). 

Thanks for raising this.  My current rule requires *every* relation yields a 
unique result and *no matter* with the join method.  Actually I want to make
the rule less strict, for example, we  may just need 1 table yields unique result
and the final result will be unique as well under some join type. 

As for the t1 INNER JOIN t2 ON t1.unique = t2.not_unique;  looks we can't 
remove the distinct based on this. 

create table m1(a int primary key, b int);
create table m2(a int primary key, b int);
insert into m1 values(1, 1), (2, 1);
insert into m2 values(1, 1), (2, 1);
select distinct m1.a from m1, m2 where m1.a = m2.b;

 
SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
 
Actually I want to keep the distinct for this case now.  One reason is there are only 1 
row returned, so the distinct erasing can't help much.   The more important reason is
Query->hasAggs is true for "select distinct  (select count(*) filter (where t2.c2 = 6 
and t2.c1 < 10) from ft1 t1 where t1.c1 = 6)  from ft2 t2 where t2.c2 % 6 = 0 order by 1;"
(this sql came from postgres_fdw.sql).  

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos

Thanks for this info.  I guess this patch is not merged so far, but looks the cfbot
for my patch [1]  failed due to this :(   search 
"explain (costs off) select distinct on(pk1) pk1, pk2 from select_distinct_a;"
 
I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

 I can do this after we have agreement on the UniquePath.  

For my cbbot failure, another strange thing is "A" appear ahead of "a" after
the order by..  Still didn't find out why.


Regards 
Andy Fan

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: reindex concurrently and two toast indexes
Next
From: Peter Eisentraut
Date:
Subject: Re: adding partitioned tables to publications