Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From David Rowley
Subject Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date
Msg-id CAApHDvrBXjAvH45dEAZFOk-hzOt1mJC7-fxZ2v49mc5njtA7VQ@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Erase the distinctClause if the result is unique by definition  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: [PATCH] Erase the distinctClause if the result is unique by definition  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns.  I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t1.a
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
(7 rows)


postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t1.b = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)


postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
            QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t2.a
   ->  Hash Join
         Hash Cond: (t1.b = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
(7 rows)


postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
         QUERY PLAN
----------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it.  I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype for the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Add an optional timeout clause to isolationtester step.
Next
From: Pengzhou Tang
Date:
Subject: Additional size of hash table is alway zero for hash aggregates