Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From Andy Fan
Subject Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date
Msg-id CAKU4AWpOgpXqw__4mcXgtZm1rQoVEr3G0na2Oh_UCNcHOxdxqA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns.  I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t1.a
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
(7 rows)


postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.b = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t2.a
   ->  Hash Join
         Hash Cond: (t1.b = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
(7 rows)


postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it.  I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype for the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

Thanks for the code!   Here is some points from me. 

1.   for pupulate_baserel_uniquekeys,  we need handle the "pk = Const" as well. 
(relation_has_unqiue_for has a similar logic) currently the following distinct path is still
 there. 

postgres=# explain select distinct b from t100 where pk = 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case,  we can add  both (pk) and (b) as the UniquePaths.  If so we
can get more opportunities to reach our goal.  

2.  As for the propagate_unique_keys_to_joinrel,  we can add 1 more UniquePath as 
(rel1_unique_paths,  rel2_unique_paths) if the current rules doesn't apply. 
or else the following cases can't be handled.

postgres=# explain select distinct t100.pk,  t101.pk from t100, t101;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
         Sort Key: t100.pk, t101.pk
         ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
               ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
               ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
                     ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

But if we add such rule,  the unique paths probably become much longer,  so we need
a strategy to tell if the UniquePath is useful for our query,  if not,  we can ignore that.
rel->reltarget maybe a good info for such optimization.   I think we can take this into 
consideration for pupulate_baserel_uniquekeys as well. 

For the non_null info,  Tom suggested to add maintain such info RelOptInfo, 
I have done that for the not_null_info for basic relation catalog,  I think we can
maintain the same flag for joinrel and the not null info from find_nonnullable_vars as 
well, but I still didn't find a good place to add that so far. 


A small question about the following code:

+       if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
+       {
+
+               add_path(distinct_rel, (Path *)cheapest_input_path);
+
+               /* XXX yeah yeah, need to call the hooks etc. */
+
+               /* Now choose the best path(s) */
+               set_cheapest(distinct_rel);
+
+               return distinct_rel;
+       }

Since we don't create new RelOptInfo/Path,  do we need to call add_path and set_cheapest?


Best Regards
Andy Fan 

pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: [PATCH] Skip llvm bytecode generation if LLVM is missing
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager