Thread: Use "unique keys" to enhance outer join removal

Use "unique keys" to enhance outer join removal

From
Antonin Houska
Date:
29 April 2024, 08:10:36
While the "unique keys" feature [1] is still under development, I'm thinking
how it could be used to enhance the removal of useless outer joins. Is
something really bad about the 0002 patch attached?

I recognize it may be weird that a join relation possibly produces non-join
paths (e.g. SeqScan), but right now don't have better idea where the new code
should appear. I considered planning the subqueries before the existing call
of remove_useless_joins(), to make the unique keys available earlier. However
it seems that more items can be added to 'baserestrictinfo' of the subquery
relation after that. Thus by planning the subquery too early we could miss
some opportunities to push clauses down to the subquery.

Please note that this patch depends on [1], which enhances
rel_is_distinct_for() and thus makes join_is_removable() a bit smareter. Also
note that 0001 is actually a minor fix to [1].

[1] https://www.postgresql.org/message-id/7971.1713526758%40antos
-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

From a0be4ee7698ff03d6c22398f20fd2c7efadbff45 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 29 Apr 2024 07:53:00 +0200
Subject: [PATCH 1/2] Undo comment changes.

These belong to the following patch.
---
 src/test/regress/expected/join.out | 11 +++++------
 src/test/regress/sql/join.sql      | 13 ++++++-------
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 571198a86a..fa407d37f5 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5648,12 +5648,11 @@ select d.* from d left join (select distinct * from b) s
  Seq Scan on d
 (1 row)
 
--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition.  (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
                    QUERY PLAN                   
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index bf8fb27072..c4c6c7b8ba 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2047,17 +2047,16 @@ explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id and d.b = s.c_id;
 
--- when the GROUP BY contains a column that is not in the join condition, join
--- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
--- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
--- resulting plan; but this happens too late for join removal in the outer
--- plan level.)
-explain (costs off, verbose on)
+-- join removal is not possible when the GROUP BY contains a column that is
+-- not in the join condition.  (Note: as of 9.6, we notice that b.id is a
+-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
+-- but this happens too late for join removal in the outer plan level.)
+explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
 
 -- similarly, but keying off a DISTINCT clause
-explain (costs off, verbose on)
+explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
 
-- 
2.44.0

From cb0166a3e1f1f5ff88634c38c2474de16825625a Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 29 Apr 2024 08:24:42 +0200
Subject: [PATCH 2/2] Use Unique Keys to remove an useless left join.

To consider an outer join useless, we need to prove that the inner relation is
unique for given join clause(s). We may be unable do that at early stage of
planning, especially if the inner relation is a subquery.

The new "unique keys" feature (still under development) can resolve this
problem by re-trying the removal at later stage, when the unique keys have
(possibly) been initialized.
---
 src/backend/optimizer/path/joinpath.c     | 35 ++++++++++++++++
 src/backend/optimizer/path/joinrels.c     | 10 +++++
 src/backend/optimizer/plan/analyzejoins.c |  4 +-
 src/include/optimizer/paths.h             |  5 +++
 src/include/optimizer/planmain.h          |  1 +
 src/test/regress/expected/join.out        | 50 +++++++----------------
 src/test/regress/sql/join.sql             | 13 +++---
 7 files changed, 73 insertions(+), 45 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 5be8da9e09..3dc9dc7856 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -344,6 +344,41 @@ add_paths_to_joinrel(PlannerInfo *root,
                                jointype, &extra);
 }
 
+/*
+ * add_outer_paths_to_joinrel
+ *
+ *      If the inner relation is not needed but remove_useless_joins() didn't
+ *      have enough information to remove join, try to check again now.
+ */
+void
+add_outer_paths_to_joinrel(PlannerInfo *root, RelOptInfo *outerrel,
+                           RelOptInfo *innerrel, RelOptInfo *joinrel,
+                           SpecialJoinInfo *sjinfo)
+{
+    ListCell    *lc;
+
+    Assert(innerrel->uniquekeys);
+
+    if (!join_is_removable(root, sjinfo))
+        return;
+
+    /* joinrel target should not reference innerrel. */
+    Assert(!bms_overlap(pull_varnos(root, (Node *) joinrel->reltarget->exprs),
+                        pull_varnos(root, (Node *) innerrel->reltarget->exprs)));
+
+    /*
+     * The innerrel cannot cause any row of outerrel to appear multiple times
+     * in the join output, so it's not needed. We ignore innerrel by adding
+     * the outerrel path to the join relation.
+     */
+    foreach(lc, outerrel->pathlist)
+    {
+        Path    *path    = (Path *) lfirst(lc);
+
+        add_path(joinrel, path);
+    }
+}
+
 /*
  * We override the param_source_rels heuristic to accept nestloop paths in
  * which the outer rel satisfies some but not all of the inner path's
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index b5c2b7d084..aa31e400e7 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -940,6 +940,16 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
             if (restriction_is_constant_false(restrictlist, joinrel, false) &&
                 bms_is_subset(rel2->relids, sjinfo->syn_righthand))
                 mark_dummy_rel(rel2);
+
+            /*
+             * If remove_useless_joins() could not remove the join because it
+             * could not find out whether the subquery on its inner side
+             * generates unique rows, try again. Unique keys might have been
+             * generated in between.
+             */
+            if (rel2->rtekind == RTE_SUBQUERY && rel2->uniquekeys)
+                add_outer_paths_to_joinrel(root, rel1, rel2, joinrel, sjinfo);
+
             add_paths_to_joinrel(root, joinrel, rel1, rel2,
                                  JOIN_LEFT, sjinfo,
                                  restrictlist);
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 3029d1a5b0..11e7d7b2ca 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -44,8 +44,6 @@ typedef struct
 bool        enable_self_join_removal;
 
 /* local functions */
-static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-
 static void remove_leftjoinrel_from_query(PlannerInfo *root, int relid,
                                           SpecialJoinInfo *sjinfo);
 static void remove_rel_from_restrictinfo(RestrictInfo *rinfo,
@@ -173,7 +171,7 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
  * have to check that the inner side doesn't generate any variables needed
  * above the join.
  */
-static bool
+bool
 join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
     int            innerrelid;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 4b99ce6557..a32752884b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -98,6 +98,11 @@ extern void add_paths_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
                                  RelOptInfo *outerrel, RelOptInfo *innerrel,
                                  JoinType jointype, SpecialJoinInfo *sjinfo,
                                  List *restrictlist);
+extern void add_outer_paths_to_joinrel(PlannerInfo *root,
+                                       RelOptInfo *outerrel,
+                                       RelOptInfo *innerrel,
+                                       RelOptInfo *joinrel,
+                                       SpecialJoinInfo *sjinfo);
 
 /*
  * joinrels.c
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index f2e3fa4c2e..cf5bee5997 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -103,6 +103,7 @@ extern void match_foreign_keys_to_quals(PlannerInfo *root);
  * prototypes for plan/analyzejoins.c
  */
 extern List *remove_useless_joins(PlannerInfo *root, List *joinlist);
+extern bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 extern void reduce_unique_semijoins(PlannerInfo *root);
 extern bool query_supports_distinctness(Query *query);
 extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index fa407d37f5..08218dc77b 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5648,51 +5648,29 @@ select d.* from d left join (select distinct * from b) s
  Seq Scan on d
 (1 row)
 
--- join removal is not possible when the GROUP BY contains a column that is
--- not in the join condition.  (Note: as of 9.6, we notice that b.id is a
--- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
--- but this happens too late for join removal in the outer plan level.)
-explain (costs off)
+-- when the GROUP BY contains a column that is not in the join condition, join
+-- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
+-- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
+-- resulting plan; but this happens too late for join removal in the outer
+-- plan level.)
+explain (costs off, verbose on)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
-                   QUERY PLAN                   
-------------------------------------------------
- Hash Left Join
+      QUERY PLAN       
+-----------------------
+ Seq Scan on pg_temp.d
    Output: d.a, d.b
-   Inner Unique: true
-   Hash Cond: (d.a = s.id)
-   ->  Seq Scan on pg_temp.d
-         Output: d.a, d.b
-   ->  Hash
-         Output: s.id
-         ->  Subquery Scan on s
-               Output: s.id
-               ->  HashAggregate
-                     Output: b.id, b.c_id
-                     Group Key: b.id
-                     ->  Seq Scan on pg_temp.b
-                           Output: b.id, b.c_id
-(15 rows)
+(2 rows)
 
 -- similarly, but keying off a DISTINCT clause
 explain (costs off, verbose on)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
-                QUERY PLAN                
-------------------------------------------
- Hash Left Join
+      QUERY PLAN       
+-----------------------
+ Seq Scan on pg_temp.d
    Output: d.a, d.b
-   Inner Unique: true
-   Hash Cond: (d.a = s.id)
-   ->  Seq Scan on pg_temp.d
-         Output: d.a, d.b
-   ->  Hash
-         Output: s.id
-         ->  Subquery Scan on s
-               Output: s.id
-               ->  Seq Scan on pg_temp.b
-                     Output: b.id, b.c_id
-(12 rows)
+(2 rows)
 
 -- join removal is not possible here
 explain (costs off)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index c4c6c7b8ba..bf8fb27072 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2047,16 +2047,17 @@ explain (costs off)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id and d.b = s.c_id;
 
--- join removal is not possible when the GROUP BY contains a column that is
--- not in the join condition.  (Note: as of 9.6, we notice that b.id is a
--- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
--- but this happens too late for join removal in the outer plan level.)
-explain (costs off)
+-- when the GROUP BY contains a column that is not in the join condition, join
+-- removal takes place too, due to "unique keys" (Note: as of 9.6, we notice
+-- that b.id is a primary key and so drop b.c_id from the GROUP BY of the
+-- resulting plan; but this happens too late for join removal in the outer
+-- plan level.)
+explain (costs off, verbose on)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
 
 -- similarly, but keying off a DISTINCT clause
-explain (costs off)
+explain (costs off, verbose on)
 select d.* from d left join (select distinct * from b) s
   on d.a = s.id;
 
-- 
2.44.0