Re: A new strategy for pull-up correlated ANY_SUBLINK - Mailing list pgsql-hackers

From Tom Lane
Subject Re: A new strategy for pull-up correlated ANY_SUBLINK
Date
Msg-id 247851.1668293143@sss.pgh.pa.us
Whole thread Raw
In response to A new strategy for pull-up correlated ANY_SUBLINK  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: A new strategy for pull-up correlated ANY_SUBLINK
Re: A new strategy for pull-up correlated ANY_SUBLINK
Re: A new strategy for pull-up correlated ANY_SUBLINK
List pgsql-hackers
Andy Fan <zhihui.fan1213@gmail.com> writes:
> In the past we pull-up the ANY-sublink with 2 steps, the first step is to
> pull up the sublink as a subquery, and the next step is to pull up the
> subquery if it is allowed.  The benefits of this method are obvious,
> pulling up the subquery has more requirements, even if we can just finish
> the first step, we still get huge benefits. However the bad stuff happens
> if varlevelsup = 1 involves, things fail at step 1.

> convert_ANY_sublink_to_join ...

>     if (contain_vars_of_level((Node *) subselect, 1))
>         return NULL;

> In this patch we distinguish the above case and try to pull-up it within
> one step if it is helpful, It looks to me that what we need to do is just
> transform it to EXIST-SUBLINK.

This patch seems awfully messy to me.  The fact that you're having to
duplicate stuff done elsewhere suggests at the least that you've not
plugged the code into the best place.

Looking again at that contain_vars_of_level restriction, I think the
reason for it was just to avoid making a FROM subquery that has outer
references, and the reason we needed to avoid that was merely that we
didn't have LATERAL at the time.  So I experimented with the attached.
It seems to work, in that we don't get wrong answers from any of the
small number of places that are affected.  (I wonder though whether
those test cases still test what they were intended to, particularly
the postgres_fdw one.  We might have to try to hack them some more
to not get affected by this optimization.)  Could do with more test
cases, no doubt.

One thing I'm not at all clear about is whether we need to restrict
the optimization so that it doesn't occur if the subquery contains
outer references falling outside available_rels.  I think that that
case is covered by is_simple_subquery() deciding later to not pull up
the subquery based on LATERAL restrictions, but maybe that misses
something.

I'm also wondering whether the similar restriction in
convert_EXISTS_sublink_to_join could be removed similarly.
In this light it was a mistake for convert_EXISTS_sublink_to_join
to manage the pullup itself rather than doing it in the two-step
fashion that convert_ANY_sublink_to_join does it.

            regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 558e94b845..c07280d836 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -11377,19 +11377,19 @@ CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
-                                 QUERY PLAN
------------------------------------------------------------------------------
- Seq Scan on public.base_tbl
+                                QUERY PLAN
+---------------------------------------------------------------------------
+ Nested Loop Semi Join
    Output: base_tbl.a
-   Filter: (SubPlan 1)
-   SubPlan 1
-     ->  Result
-           Output: base_tbl.a
-           ->  Append
-                 ->  Async Foreign Scan on public.foreign_tbl foreign_tbl_1
-                       Remote SQL: SELECT NULL FROM public.base_tbl
-                 ->  Async Foreign Scan on public.foreign_tbl2 foreign_tbl_2
-                       Remote SQL: SELECT NULL FROM public.base_tbl
+   ->  Seq Scan on public.base_tbl
+         Output: base_tbl.a, base_tbl.b
+         Filter: (base_tbl.a IS NOT NULL)
+   ->  Materialize
+         ->  Append
+               ->  Async Foreign Scan on public.foreign_tbl foreign_tbl_1
+                     Remote SQL: SELECT NULL FROM public.base_tbl
+               ->  Async Foreign Scan on public.foreign_tbl2 foreign_tbl_2
+                     Remote SQL: SELECT NULL FROM public.base_tbl
 (11 rows)

 SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 92e3338584..3d4645a154 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1271,6 +1271,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     JoinExpr   *result;
     Query       *parse = root->parse;
     Query       *subselect = (Query *) sublink->subselect;
+    bool        has_level_1_vars;
     Relids        upper_varnos;
     int            rtindex;
     ParseNamespaceItem *nsitem;
@@ -1283,11 +1284,10 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     Assert(sublink->subLinkType == ANY_SUBLINK);

     /*
-     * The sub-select must not refer to any Vars of the parent query. (Vars of
-     * higher levels should be okay, though.)
+     * If the sub-select refers to any Vars of the parent query, we have to
+     * treat it as LATERAL.  (Vars of higher levels don't matter here.)
      */
-    if (contain_vars_of_level((Node *) subselect, 1))
-        return NULL;
+    has_level_1_vars = contain_vars_of_level((Node *) subselect, 1);

     /*
      * The test expression must contain some Vars of the parent query, else
@@ -1324,7 +1324,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     nsitem = addRangeTableEntryForSubquery(pstate,
                                            subselect,
                                            makeAlias("ANY_subquery", NIL),
-                                           false,
+                                           has_level_1_vars,
                                            false);
     rte = nsitem->p_rte;
     parse->rtable = lappend(parse->rtable, rte);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9358371072..cdaedb92b2 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4657,17 +4657,17 @@ explain (costs off)
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
   where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
-                        QUERY PLAN
-----------------------------------------------------------
- Hash Join
-   Hash Cond: (b.unique2 = a.unique1)
-   ->  Seq Scan on onek b
-         Filter: (SubPlan 1)
-         SubPlan 1
-           ->  Seq Scan on int8_tbl c
-                 Filter: (q1 < b.unique1)
-   ->  Hash
-         ->  Index Only Scan using onek_unique1 on onek a
+                     QUERY PLAN
+----------------------------------------------------
+ Nested Loop
+   ->  Hash Semi Join
+         Hash Cond: (b.unique2 = c.q1)
+         Join Filter: (c.q1 < b.unique1)
+         ->  Seq Scan on onek b
+         ->  Hash
+               ->  Seq Scan on int8_tbl c
+   ->  Index Only Scan using onek_unique1 on onek a
+         Index Cond: (unique1 = b.unique2)
 (9 rows)

 select a.unique1, b.unique2
@@ -6074,8 +6074,8 @@ lateral (select * from int8_tbl t1,
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
-                           QUERY PLAN
------------------------------------------------------------------
+                             QUERY PLAN
+---------------------------------------------------------------------
  Nested Loop
    Output: "*VALUES*".column1, t1.q1, t1.q2, ss2.q1, ss2.q2
    ->  Seq Scan on public.int8_tbl t1
@@ -6087,23 +6087,24 @@ lateral (select * from int8_tbl t1,
          ->  Subquery Scan on ss2
                Output: ss2.q1, ss2.q2
                Filter: (t1.q1 = ss2.q2)
-               ->  Seq Scan on public.int8_tbl t2
+               ->  Result
                      Output: t2.q1, t2.q2
-                     Filter: (SubPlan 3)
-                     SubPlan 3
+                     One-Time Filter: $3
+                     InitPlan 2 (returns $3)
                        ->  Result
-                             Output: t3.q2
-                             One-Time Filter: $4
-                             InitPlan 1 (returns $2)
-                               ->  Result
-                                     Output: GREATEST($0, t2.q2)
-                             InitPlan 2 (returns $4)
-                               ->  Result
-                                     Output: ($3 = 0)
-                             ->  Seq Scan on public.int8_tbl t3
-                                   Output: t3.q1, t3.q2
-                                   Filter: (t3.q2 = $2)
-(27 rows)
+                             Output: ($2 = 0)
+                     ->  Nested Loop Semi Join
+                           Output: t2.q1, t2.q2
+                           Join Filter: (t2.q1 = t3.q2)
+                           ->  Seq Scan on public.int8_tbl t2
+                                 Output: t2.q1, t2.q2
+                                 Filter: ((SubPlan 1) = t2.q1)
+                                 SubPlan 1
+                                   ->  Result
+                                         Output: GREATEST($0, t2.q2)
+                           ->  Seq Scan on public.int8_tbl t3
+                                 Output: t3.q1, t3.q2
+(28 rows)

 select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,

pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: Use fadvise in wal replay
Next
From: vignesh C
Date:
Subject: Re: PGDOCS - Logical replication GUCs - added some xrefs