Thread: A new strategy for pull-up correlated ANY_SUBLINK

A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

02 November 2022, 03:02:58

In the past we pull-up the ANY-sublink with 2 steps, the first step is to
pull up the sublink as a subquery, and the next step is to pull up the
subquery if it is allowed. The benefits of this method are obvious,
pulling up the subquery has more requirements, even if we can just finish
the first step, we still get huge benefits. However the bad stuff happens
if varlevelsup = 1 involves, things fail at step 1.

convert_ANY_sublink_to_join ...

if (contain_vars_of_level((Node *) subselect, 1))
return NULL;

In this patch we distinguish the above case and try to pull-up it within
one step if it is helpful, It looks to me that what we need to do is just
transform it to EXIST-SUBLINK.

The only change is transforming the format of SUBLINK, so outer-join /
pull-up as semi-join is unrelated, so the correctness should not be an
issue.

I can help with the following query very much.

master:
explain (costs off, analyze) select * from tenk1 t1
where hundred in (select hundred from tenk2 t2
where t2.odd = t1.odd
and even in (select even from tenk1 t3
where t3.fivethous = t2.fivethous))
and even > 0;
QUERY PLAN
------------------------------------------------------------------------------------
Seq Scan on tenk1 t1 (actual time=0.023..234.955 rows=10000 loops=1)
Filter: ((even > 0) AND (SubPlan 2))
SubPlan 2
-> Seq Scan on tenk2 t2 (actual time=0.023..0.023 rows=1 loops=10000)
Filter: ((odd = t1.odd) AND (SubPlan 1))
Rows Removed by Filter: 94
SubPlan 1
-> Seq Scan on tenk1 t3 (actual time=0.011..0.011 rows=1 loops=10000)
Filter: (fivethous = t2.fivethous)
Rows Removed by Filter: 94
Planning Time: 0.169 ms
Execution Time: 235.488 ms
(12 rows)

patched:

explain (costs off, analyze) select * from tenk1 t1
where hundred in (select hundred from tenk2 t2
where t2.odd = t1.odd
and even in (select even from tenk1 t3
where t3.fivethous = t2.fivethous))
and even > 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Hash Join (actual time=13.102..17.676 rows=10000 loops=1)
Hash Cond: ((t1.odd = t2.odd) AND (t1.hundred = t2.hundred))
-> Seq Scan on tenk1 t1 (actual time=0.014..1.702 rows=10000 loops=1)
Filter: (even > 0)
-> Hash (actual time=13.080..13.082 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> HashAggregate (actual time=13.041..13.060 rows=100 loops=1)
Group Key: t2.odd, t2.hundred
Batches: 1 Memory Usage: 73kB
-> Hash Join (actual time=8.044..11.296 rows=10000 loops=1)
Hash Cond: ((t3.fivethous = t2.fivethous) AND (t3.even = t2.even))
-> HashAggregate (actual time=4.054..4.804 rows=5000 loops=1)
Group Key: t3.fivethous, t3.even
Batches: 1 Memory Usage: 465kB
-> Seq Scan on tenk1 t3 (actual time=0.002..0.862 rows=10000 loops=1)
-> Hash (actual time=3.962..3.962 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 597kB
-> Seq Scan on tenk2 t2 (actual time=0.004..2.289 rows=10000 loops=1)
Planning Time: 0.426 ms
Execution Time: 18.129 ms
(20 rows)

The execution time is 33ms (patched) VS 235ms (master).
The planning time is 0.426ms (patched) VS 0.169ms (master).

I think the extra planning time comes from the search space increasing a
lot and that's where the better plan comes.

I used below queries to measure how much effort we made but got nothing:
run twice in 1 session and just count the second planning time.

explain (costs off, analyze) select * from tenk1 t1
where
(hundred, odd) in (select hundred, odd from tenk2 t2
where (even, fivethous) in
(select even, fivethous from tenk1 t3));

psql regression -f 1.sql | grep 'Planning Time' | tail -1

master:

Planning Time: 0.430 ms
Planning Time: 0.551 ms
Planning Time: 0.316 ms
Planning Time: 0.342 ms
Planning Time: 0.390 ms

patched:

Planning Time: 0.405 ms
Planning Time: 0.406 ms
Planning Time: 0.433 ms
Planning Time: 0.371 ms
Planning Time: 0.425 ms

I think this can show us the extra planning effort is pretty low.

This topic has been raised many times, at least at [1] [2]. and even MySQL
can support some simple but common cases. I think we can do something
helpful as well. Any feedback is welcome.

[1] https://www.postgresql.org/message-id/3691.1342650974%40sss.pgh.pa.us

[2] https://www.postgresql.org/message-id/CAN_9JTx7N+CxEQLnu_uHxx+EscSgxLLuNgaZT6Sjvdpt7toy3w@mail.gmail.com

--
Best Regards
Andy Fan

Attachment

v1-0001-Pull-up-the-direct-correlated-ANY_SUBLINK-like-EX.patch

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andrey Lepikhov

Date:

02 November 2022, 03:42:28

On 2/11/2022 09:02, Andy Fan wrote:
> In the past we pull-up the ANY-sublink with 2 steps, the first step is to
> pull up the sublink as a subquery, and the next step is to pull up the
> subquery if it is allowed.  The benefits of this method are obvious,
> pulling up the subquery has more requirements, even if we can just finish
> the first step, we still get huge benefits. However the bad stuff happens
> if varlevelsup = 1 involves, things fail at step 1.
> 
> convert_ANY_sublink_to_join ...
> 
>      if (contain_vars_of_level((Node *) subselect, 1))
>          return NULL;
> 
> In this patch we distinguish the above case and try to pull-up it within
> one step if it is helpful, It looks to me that what we need to do is just
> transform it to EXIST-SUBLINK.
Maybe code [1] would be useful for your purposes/tests.
We implemented flattening of correlated subqueries for simple N-J case, 
but found out that in some cases the flattening isn't obvious the best 
solution - we haven't info about cardinality/cost estimations and can do 
worse.
I guess, for more complex flattening procedure (with aggregate function 
in a targetlist of correlated subquery) situation can be even worse.
Maybe your idea has such corner cases too ?

[1] 
https://www.postgresql.org/message-id/flat/CALNJ-vTa5VgvV1NPRHnypdnbx-fhDu7vWp73EkMUbZRpNHTYQQ%40mail.gmail.com

-- 
regards,
Andrey Lepikhov
Postgres Professional

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

02 November 2022, 05:34:54

Hi Andrey:

> In this patch we distinguish the above case and try to pull-up it within
> one step if it is helpful, It looks to me that what we need to do is just
> transform it to EXIST-SUBLINK.
Maybe code [1] would be useful for your purposes/tests.

Looks like we are resolving the same problem, IIUC, great that more

people are interested in it!

We implemented flattening of correlated subqueries for simple N-J case,

I went through the code, and it looks like you tried to do the pull-up by

yourself, which would have many troubles to think about. but I just transformed

it into EXIST sublink after I distinguish it as the case I can improve.

> The only change is transforming the format of SUBLINK, so outer-join /

> pull-up as semi-join is unrelated, so the correctness should not be an
> issue.

That is just a difference, no matter which one is better.

but found out that in some cases the flattening isn't obvious the best
solution - we haven't info about cardinality/cost estimations and can do
worse.

I guess, for more complex flattening procedure (with aggregate function
in a targetlist of correlated subquery) situation can be even worse.
Maybe your idea has such corner cases too ?

In my case, since aggregate function can't be handled by

covert_EXISTS_sublink_to_join, so it is not the target I want to optimize in

this patch. More testing/review on my method would be pretty appreciated.

but I'm not insisting on my method at all. Link [2] might be useful as well.

[2] https://www.postgresql.org/message-id/CAKU4AWpi9oztiomUQt4JCxXEr6EaQ2thY-7JYDm6c9he0A7oCA%40mail.gmail.com

Best Regards

Andy Fan

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Tom Lane

Date:

12 November 2022, 22:45:43

Andy Fan <zhihui.fan1213@gmail.com> writes:
> In the past we pull-up the ANY-sublink with 2 steps, the first step is to
> pull up the sublink as a subquery, and the next step is to pull up the
> subquery if it is allowed.  The benefits of this method are obvious,
> pulling up the subquery has more requirements, even if we can just finish
> the first step, we still get huge benefits. However the bad stuff happens
> if varlevelsup = 1 involves, things fail at step 1.

> convert_ANY_sublink_to_join ...

>     if (contain_vars_of_level((Node *) subselect, 1))
>         return NULL;

> In this patch we distinguish the above case and try to pull-up it within
> one step if it is helpful, It looks to me that what we need to do is just
> transform it to EXIST-SUBLINK.

This patch seems awfully messy to me.  The fact that you're having to
duplicate stuff done elsewhere suggests at the least that you've not
plugged the code into the best place.

Looking again at that contain_vars_of_level restriction, I think the
reason for it was just to avoid making a FROM subquery that has outer
references, and the reason we needed to avoid that was merely that we
didn't have LATERAL at the time.  So I experimented with the attached.
It seems to work, in that we don't get wrong answers from any of the
small number of places that are affected.  (I wonder though whether
those test cases still test what they were intended to, particularly
the postgres_fdw one.  We might have to try to hack them some more
to not get affected by this optimization.)  Could do with more test
cases, no doubt.

One thing I'm not at all clear about is whether we need to restrict
the optimization so that it doesn't occur if the subquery contains
outer references falling outside available_rels.  I think that that
case is covered by is_simple_subquery() deciding later to not pull up
the subquery based on LATERAL restrictions, but maybe that misses
something.

I'm also wondering whether the similar restriction in
convert_EXISTS_sublink_to_join could be removed similarly.
In this light it was a mistake for convert_EXISTS_sublink_to_join
to manage the pullup itself rather than doing it in the two-step
fashion that convert_ANY_sublink_to_join does it.

            regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 558e94b845..c07280d836 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -11377,19 +11377,19 @@ CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
-                                 QUERY PLAN
------------------------------------------------------------------------------
- Seq Scan on public.base_tbl
+                                QUERY PLAN
+---------------------------------------------------------------------------
+ Nested Loop Semi Join
    Output: base_tbl.a
-   Filter: (SubPlan 1)
-   SubPlan 1
-     ->  Result
-           Output: base_tbl.a
-           ->  Append
-                 ->  Async Foreign Scan on public.foreign_tbl foreign_tbl_1
-                       Remote SQL: SELECT NULL FROM public.base_tbl
-                 ->  Async Foreign Scan on public.foreign_tbl2 foreign_tbl_2
-                       Remote SQL: SELECT NULL FROM public.base_tbl
+   ->  Seq Scan on public.base_tbl
+         Output: base_tbl.a, base_tbl.b
+         Filter: (base_tbl.a IS NOT NULL)
+   ->  Materialize
+         ->  Append
+               ->  Async Foreign Scan on public.foreign_tbl foreign_tbl_1
+                     Remote SQL: SELECT NULL FROM public.base_tbl
+               ->  Async Foreign Scan on public.foreign_tbl2 foreign_tbl_2
+                     Remote SQL: SELECT NULL FROM public.base_tbl
 (11 rows)

 SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 92e3338584..3d4645a154 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1271,6 +1271,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     JoinExpr   *result;
     Query       *parse = root->parse;
     Query       *subselect = (Query *) sublink->subselect;
+    bool        has_level_1_vars;
     Relids        upper_varnos;
     int            rtindex;
     ParseNamespaceItem *nsitem;
@@ -1283,11 +1284,10 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     Assert(sublink->subLinkType == ANY_SUBLINK);

     /*
-     * The sub-select must not refer to any Vars of the parent query. (Vars of
-     * higher levels should be okay, though.)
+     * If the sub-select refers to any Vars of the parent query, we have to
+     * treat it as LATERAL.  (Vars of higher levels don't matter here.)
      */
-    if (contain_vars_of_level((Node *) subselect, 1))
-        return NULL;
+    has_level_1_vars = contain_vars_of_level((Node *) subselect, 1);

     /*
      * The test expression must contain some Vars of the parent query, else
@@ -1324,7 +1324,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
     nsitem = addRangeTableEntryForSubquery(pstate,
                                            subselect,
                                            makeAlias("ANY_subquery", NIL),
-                                           false,
+                                           has_level_1_vars,
                                            false);
     rte = nsitem->p_rte;
     parse->rtable = lappend(parse->rtable, rte);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9358371072..cdaedb92b2 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4657,17 +4657,17 @@ explain (costs off)
 select a.unique1, b.unique2
   from onek a left join onek b on a.unique1 = b.unique2
   where b.unique2 = any (select q1 from int8_tbl c where c.q1 < b.unique1);
-                        QUERY PLAN
-----------------------------------------------------------
- Hash Join
-   Hash Cond: (b.unique2 = a.unique1)
-   ->  Seq Scan on onek b
-         Filter: (SubPlan 1)
-         SubPlan 1
-           ->  Seq Scan on int8_tbl c
-                 Filter: (q1 < b.unique1)
-   ->  Hash
-         ->  Index Only Scan using onek_unique1 on onek a
+                     QUERY PLAN
+----------------------------------------------------
+ Nested Loop
+   ->  Hash Semi Join
+         Hash Cond: (b.unique2 = c.q1)
+         Join Filter: (c.q1 < b.unique1)
+         ->  Seq Scan on onek b
+         ->  Hash
+               ->  Seq Scan on int8_tbl c
+   ->  Index Only Scan using onek_unique1 on onek a
+         Index Cond: (unique1 = b.unique2)
 (9 rows)

 select a.unique1, b.unique2
@@ -6074,8 +6074,8 @@ lateral (select * from int8_tbl t1,
                                      where q2 = (select greatest(t1.q1,t2.q2))
                                        and (select v.id=0)) offset 0) ss2) ss
          where t1.q1 = ss.q2) ss0;
-                           QUERY PLAN
------------------------------------------------------------------
+                             QUERY PLAN
+---------------------------------------------------------------------
  Nested Loop
    Output: "*VALUES*".column1, t1.q1, t1.q2, ss2.q1, ss2.q2
    ->  Seq Scan on public.int8_tbl t1
@@ -6087,23 +6087,24 @@ lateral (select * from int8_tbl t1,
          ->  Subquery Scan on ss2
                Output: ss2.q1, ss2.q2
                Filter: (t1.q1 = ss2.q2)
-               ->  Seq Scan on public.int8_tbl t2
+               ->  Result
                      Output: t2.q1, t2.q2
-                     Filter: (SubPlan 3)
-                     SubPlan 3
+                     One-Time Filter: $3
+                     InitPlan 2 (returns $3)
                        ->  Result
-                             Output: t3.q2
-                             One-Time Filter: $4
-                             InitPlan 1 (returns $2)
-                               ->  Result
-                                     Output: GREATEST($0, t2.q2)
-                             InitPlan 2 (returns $4)
-                               ->  Result
-                                     Output: ($3 = 0)
-                             ->  Seq Scan on public.int8_tbl t3
-                                   Output: t3.q1, t3.q2
-                                   Filter: (t3.q2 = $2)
-(27 rows)
+                             Output: ($2 = 0)
+                     ->  Nested Loop Semi Join
+                           Output: t2.q1, t2.q2
+                           Join Filter: (t2.q1 = t3.q2)
+                           ->  Seq Scan on public.int8_tbl t2
+                                 Output: t2.q1, t2.q2
+                                 Filter: ((SubPlan 1) = t2.q1)
+                                 SubPlan 1
+                                   ->  Result
+                                         Output: GREATEST($0, t2.q2)
+                           ->  Seq Scan on public.int8_tbl t3
+                                 Output: t3.q1, t3.q2
+(28 rows)

 select * from (values (0), (1)) v(id),
 lateral (select * from int8_tbl t1,

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Richard Guo

Date:

15 November 2022, 01:02:13

On Sun, Nov 13, 2022 at 6:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looking again at that contain_vars_of_level restriction, I think the
reason for it was just to avoid making a FROM subquery that has outer
references, and the reason we needed to avoid that was merely that we
didn't have LATERAL at the time. So I experimented with the attached.
It seems to work, in that we don't get wrong answers from any of the
small number of places that are affected. (I wonder though whether
those test cases still test what they were intended to, particularly
the postgres_fdw one. We might have to try to hack them some more
to not get affected by this optimization.) Could do with more test
cases, no doubt.

Hmm, it seems there were discussions about this change before, such as
in [1].

One thing I'm not at all clear about is whether we need to restrict
the optimization so that it doesn't occur if the subquery contains
outer references falling outside available_rels. I think that that
case is covered by is_simple_subquery() deciding later to not pull up
the subquery based on LATERAL restrictions, but maybe that misses
something.

I think we need to do this, otherwise we'd encounter the problem
described in [2]. In short, the problem is that the constraints imposed
by LATERAL references may make us fail to find any legal join order. As
an example, consider

explain select * from A where exists
(select * from B where A.i in (select C.i from C where C.j = B.j));
ERROR: failed to build any 3-way joins

[1] https://www.postgresql.org/message-id/flat/CAN_9JTx7N%2BCxEQLnu_uHxx%2BEscSgxLLuNgaZT6Sjvdpt7toy3w%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAMbWs49cvkF9akbomz_fCCKS=D5TY=4KGHEQcfHPZCXS1GVhkA@mail.gmail.com

Thanks
Richard

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

vignesh C

Date:

06 January 2023, 06:16:23

On Sun, 13 Nov 2022 at 04:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andy Fan <zhihui.fan1213@gmail.com> writes:
> > In the past we pull-up the ANY-sublink with 2 steps, the first step is to
> > pull up the sublink as a subquery, and the next step is to pull up the
> > subquery if it is allowed.  The benefits of this method are obvious,
> > pulling up the subquery has more requirements, even if we can just finish
> > the first step, we still get huge benefits. However the bad stuff happens
> > if varlevelsup = 1 involves, things fail at step 1.
>
> > convert_ANY_sublink_to_join ...
>
> >     if (contain_vars_of_level((Node *) subselect, 1))
> >         return NULL;
>
> > In this patch we distinguish the above case and try to pull-up it within
> > one step if it is helpful, It looks to me that what we need to do is just
> > transform it to EXIST-SUBLINK.
>
> This patch seems awfully messy to me.  The fact that you're having to
> duplicate stuff done elsewhere suggests at the least that you've not
> plugged the code into the best place.
>
> Looking again at that contain_vars_of_level restriction, I think the
> reason for it was just to avoid making a FROM subquery that has outer
> references, and the reason we needed to avoid that was merely that we
> didn't have LATERAL at the time.  So I experimented with the attached.
> It seems to work, in that we don't get wrong answers from any of the
> small number of places that are affected.  (I wonder though whether
> those test cases still test what they were intended to, particularly
> the postgres_fdw one.  We might have to try to hack them some more
> to not get affected by this optimization.)  Could do with more test
> cases, no doubt.
>
> One thing I'm not at all clear about is whether we need to restrict
> the optimization so that it doesn't occur if the subquery contains
> outer references falling outside available_rels.  I think that that
> case is covered by is_simple_subquery() deciding later to not pull up
> the subquery based on LATERAL restrictions, but maybe that misses
> something.
>
> I'm also wondering whether the similar restriction in
> convert_EXISTS_sublink_to_join could be removed similarly.
> In this light it was a mistake for convert_EXISTS_sublink_to_join
> to manage the pullup itself rather than doing it in the two-step
> fashion that convert_ANY_sublink_to_join does it.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
b82557ecc2ebbf649142740a1c5ce8d19089f620 ===
=== applying patch ./v2-0001-use-LATERAL-for-ANY_SUBLINK.patch
patching file contrib/postgres_fdw/expected/postgres_fdw.out
...
Hunk #2 FAILED at 6074.
Hunk #3 FAILED at 6087.
2 out of 3 hunks FAILED -- saving rejects to file
src/test/regress/expected/join.out.rej

[1] - http://cfbot.cputube.org/patch_41_3941.log

Regards,
Vignesh

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

vignesh C

Date:

31 January 2023, 17:47:31

On Fri, 6 Jan 2023 at 11:46, vignesh C <vignesh21@gmail.com> wrote:
>
> On Sun, 13 Nov 2022 at 04:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Andy Fan <zhihui.fan1213@gmail.com> writes:
> > > In the past we pull-up the ANY-sublink with 2 steps, the first step is to
> > > pull up the sublink as a subquery, and the next step is to pull up the
> > > subquery if it is allowed.  The benefits of this method are obvious,
> > > pulling up the subquery has more requirements, even if we can just finish
> > > the first step, we still get huge benefits. However the bad stuff happens
> > > if varlevelsup = 1 involves, things fail at step 1.
> >
> > > convert_ANY_sublink_to_join ...
> >
> > >     if (contain_vars_of_level((Node *) subselect, 1))
> > >         return NULL;
> >
> > > In this patch we distinguish the above case and try to pull-up it within
> > > one step if it is helpful, It looks to me that what we need to do is just
> > > transform it to EXIST-SUBLINK.
> >
> > This patch seems awfully messy to me.  The fact that you're having to
> > duplicate stuff done elsewhere suggests at the least that you've not
> > plugged the code into the best place.
> >
> > Looking again at that contain_vars_of_level restriction, I think the
> > reason for it was just to avoid making a FROM subquery that has outer
> > references, and the reason we needed to avoid that was merely that we
> > didn't have LATERAL at the time.  So I experimented with the attached.
> > It seems to work, in that we don't get wrong answers from any of the
> > small number of places that are affected.  (I wonder though whether
> > those test cases still test what they were intended to, particularly
> > the postgres_fdw one.  We might have to try to hack them some more
> > to not get affected by this optimization.)  Could do with more test
> > cases, no doubt.
> >
> > One thing I'm not at all clear about is whether we need to restrict
> > the optimization so that it doesn't occur if the subquery contains
> > outer references falling outside available_rels.  I think that that
> > case is covered by is_simple_subquery() deciding later to not pull up
> > the subquery based on LATERAL restrictions, but maybe that misses
> > something.
> >
> > I'm also wondering whether the similar restriction in
> > convert_EXISTS_sublink_to_join could be removed similarly.
> > In this light it was a mistake for convert_EXISTS_sublink_to_join
> > to manage the pullup itself rather than doing it in the two-step
> > fashion that convert_ANY_sublink_to_join does it.
>
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:
> === Applying patches on top of PostgreSQL commit ID
> b82557ecc2ebbf649142740a1c5ce8d19089f620 ===
> === applying patch ./v2-0001-use-LATERAL-for-ANY_SUBLINK.patch
> patching file contrib/postgres_fdw/expected/postgres_fdw.out
> ...
> Hunk #2 FAILED at 6074.
> Hunk #3 FAILED at 6087.
> 2 out of 3 hunks FAILED -- saving rejects to file
> src/test/regress/expected/join.out.rej

There has been no updates on this thread for some time, so this has
been switched as Returned with Feedback. Feel free to open it in the
next commitfest if you plan to continue on this.

Regards,
Vignesh

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

14 February 2023, 02:11:23

Hi All:

Sorry for the delay. Once I saw Tom's reply on Nov 15, I tried his

suggestion about "whether we need to restrict the optimization so

that it doesn't occur if the subquery contains outer references

falling outside available_rels. " quickly, I'm sure such a restriction

can fix the bad case Richard provided. But even Tom "I'm not at

all clear about ..", I'd like to prepare myself better for this

discussion and that took some time. Then an internal urgent

project occupied my attention, the project is still in-progress

now:(

There has been no updates on this thread for some time, so this has
been switched as Returned with Feedback. Feel free to open it in the
next commitfest if you plan to continue on this.

Thank you vignesh C for this, I didn't give up yet, probably I can

come back in the following month.

Best Regards

Andy Fan

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

05 April 2023, 07:15:42

Hi Tom:

Sorry for the delayed response! I think my knowledge has been refreshed

for this discussion.

One thing I'm not at all clear about is whether we need to restrict
the optimization so that it doesn't occur if the subquery contains
outer references falling outside available_rels. I think that that
case is covered by is_simple_subquery() deciding later to not pull up
the subquery based on LATERAL restrictions, but maybe that misses
something.

I think we need the restriction and that should be enough for this feature

. Given the query Richard provided before:

explain
select * from tenk1 A where exists
(select 1 from tenk2 B
where A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd));

It first can be converted to the below format without any issue.

SELECT * FROM tenk1 A SEMI JOIN tenk2 B
on A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd);

Then without the restriction, since we only pull the varnos from
sublink->testexpr, then it is {A}, so it convert to

SELECT * FROM

(tenk1 A SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C)
ON c.odd = b.odd AND a.hundred = v.hundred)

SEMI JOIN on tenk2 B ON TRUE;

then the above query is NOT A VALID QUERY since:

1. The above query is *not* same as

SELECT * FROM (tenk1 A SEMI JOIN tenk2 B) on true
SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C) v

ON v.odd = b.odd;

2. The above query requires b.odd when B is not available. So it is

right that an optimizer can't generate a plan for it. The fix would
be to do the restriction before applying this optimization.

I'm not sure pull-up-subquery can play any role here, IIUC, the bad thing
happens before pull-up-subquery.

I also write & analyze more test and found no issue by me

1. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up to rarg of the left join since A.hundred is not
available.

2. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
==> should not be pull-up to rarg of the left join since A.odd is not
available.

3. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should be pull-up to rarg of left join.

4. SELECT * FROM tenk1 A INNER JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> pull-up as expected.

5. SELECT * FROM tenk1 A RIGHT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up into larg of left join since b.odd is not
available.

About the existing test case changes because of this patch, they do
requires on the sublink is planned to a subPlan, so I introduces the below

changes to keep the original intention.

Changes
A in (SELECT A FROM ..)
To

(A, random() > 0) in (SELECT a, random() > 0 FROM ..);

I'm also wondering whether the similar restriction in
convert_EXISTS_sublink_to_join could be removed similarly.
In this light it was a mistake for convert_EXISTS_sublink_to_join
to manage the pullup itself rather than doing it in the two-step
fashion that convert_ANY_sublink_to_join does it.

Yes, it is true! I prefer to believe this deserves a separate patch.

Any feedback is welcome!

Best Regards

Andy Fan

Attachment

v3-0001-Pull-up-direct-correlated-ANY_SUBLINK-using-later.patch

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Alena Rybakina

Date:

11 October 2023, 21:01:37

Hi!

I reviewed your patch and it was interesting for me!

Thank you for the explanation. It was really informative for me!

I think we need the restriction and that should be enough for this feature
. Given the query Richard provided before:

explain
select * from tenk1 A where exists
(select 1 from tenk2 B
where A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd));

It first can be converted to the below format without any issue.

SELECT * FROM tenk1 A SEMI JOIN tenk2 B
on A.hundred in (select C.hundred FROM tenk2 C
WHERE c.odd = b.odd);

Then without the restriction, since we only pull the varnos from
sublink->testexpr, then it is {A}, so it convert to

SELECT * FROM
(tenk1 A SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C)
ON c.odd = b.odd AND a.hundred = v.hundred)
SEMI JOIN on tenk2 B ON TRUE;

then the above query is NOT A VALID QUERY since:
1. The above query is *not* same as

SELECT * FROM (tenk1 A SEMI JOIN tenk2 B) on true
SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C) v
ON v.odd = b.odd;

2. The above query requires b.odd when B is not available. So it is
right that an optimizer can't generate a plan for it. The fix would
be to do the restriction before applying this optimization.

I'm not sure pull-up-subquery can play any role here, IIUC, the bad thing
happens before pull-up-subquery.

I also write & analyze more test and found no issue by me

1. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up to rarg of the left join since A.hundred is not
available.

2. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
==> should not be pull-up to rarg of the left join since A.odd is not
available.

3. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should be pull-up to rarg of left join.

4. SELECT * FROM tenk1 A INNER JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> pull-up as expected.

5. SELECT * FROM tenk1 A RIGHT JOIN tenk2 B
ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
==> should not be pull-up into larg of left join since b.odd is not
available.

After reviewing, I want to suggest some changes related to the code and tests.

First of all, I think, it would be better to "treat" change to "consider" and rewrite the pull-up check condition in two lines:/*
* If the sub-select refers to any Vars of the parent query, we so let's
* considering it as LATERAL. (Vars of higher levels don't matter here.)
*/

use_lateral = !bms_is_empty(sub_ref_outer_relids) &&
bms_is_subset(sub_ref_outer_relids, available_rels);if (!use_lateral && !bms_is_empty(sub_ref_outer_relids))
return NULL;

Secondly, I noticed another interesting feature in your patch and I think it could be added to the test.

If we get only one row from the aggregated subquery, we can pull-up it in the subquery scan filter.

postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);                          QUERY PLAN
--------------------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Nested Loop
               -> Seq Scan on tenk2 b
               -> Subquery Scan on "ANY_subquery"
                     Filter: (b.hundred = "ANY_subquery".min)
                     -> Aggregate
                           -> Seq Scan on tenk2 c
                                 Filter: (odd = b.odd)
(10 rows)

It was impossible without your patch:postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                    QUERY PLAN
---------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Seq Scan on tenk2 b
               Filter: (SubPlan 1)
               SubPlan 1
                 -> Aggregate
                       -> Seq Scan on tenk2 c
                             Filter: (odd = b.odd)
(9 rows)

And I found an alternative query, when aggregated sublink will pull-up into JoinExpr condition.explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
                         QUERY PLAN
-------------------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Hash Semi Join
               Hash Cond: (b.hundred = "ANY_subquery".count)
               -> Seq Scan on tenk2 b
               -> Hash
                     -> Subquery Scan on "ANY_subquery"
                           -> HashAggregate
                                 Group Key: c.odd
                                 -> Seq Scan on tenk2 c
(11 rows)

Unfortunately, I found a request when sublink did not pull-up, as in the examples above. I couldn't quite figure out why.create table a (x int, y int, z int, t int);
create table b (x int, t int);
create unique index on a (t, x);
create index on b (t,x);
insert into a select id, id, id, id FROM generate_series(1,100000) As id;
insert into b select id, id FROM generate_series(1,1000) As id;

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a
        on b.x=a.x and
           b.t in
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t = b.t);                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.150..58.512 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (SubPlan 2)
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.023..15.798 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.038..1.042 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.047..0.399 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.018..0.018 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.015..0.016 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan using a_t_x_idx on a a0 (actual time=0.014..0.014 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t = b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.630 ms
Execution Time: 58.941 ms
(23 rows)

I thought it would be:

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a on
        b.x=a.x and
        b.t =
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t <= b.t);                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (b.t = (SubPlan 2))
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.065..1.068 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.025..0.025 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.024..0.024 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan Backward using a_t_x_idx on a a0 (actual time=0.023..0.023 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.689 ms
Execution Time: 68.220 ms
(23 rows)

If you noticed, it became possible after replacing the "in" operator with "=".

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.

-- 
Regards,
Alena Rybakina

Attachment

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

12 October 2023, 07:52:07

Hi Alena,

On Thu, Oct 12, 2023 at 5:01 AM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

Hi!
I reviewed your patch and it was interesting for me!
Thank you for the explanation. It was really informative for me!

Thanks for your interest in this, and I am glad to know it is informative.

Unfortunately, I found a request when sublink did not pull-up, as in the

examples above. I couldn't quite figure out why.

I'm not sure what you mean with the "above", I guess it should be the "below"?

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a
        on b.x=a.x and
           b.t in
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t = b.t);

...

SubPlan 2

Here the sublink can't be pulled up because of its reference to

the LHS of left join, the original logic is that no matter the 'b.t in ..'

returns the true or false, the rows in LHS will be returned. If we

pull it up to LHS, some rows in LHS will be filtered out, which

breaks its original semantics.

I thought it would be:
explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a on
        b.x=a.x and
        b.t =
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t <= b.t);                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (b.t = (SubPlan 2))
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.065..1.068 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.025..0.025 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.024..0.024 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan Backward using a_t_x_idx on a a0 (actual time=0.023..0.023 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.689 ms
Execution Time: 68.220 ms
(23 rows)
If you noticed, it became possible after replacing the "in" operator with "=".

I didn't notice much difference between the 'in' and '=', maybe I

missed something?

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.

Sure, the patch after your modification looks better than the original.

I'm not sure how the test case around "because of got one row" is

relevant to the current changes. After we reach to some agreement

on the above discussion, I think v4 is good for committer to review!

Best Regards

Andy Fan

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Alena Rybakina

Date:

12 October 2023, 23:14:13

On 12.10.2023 10:52, Andy Fan wrote:

Unfortunately, I found a request when sublink did not pull-up, as in the
examples above. I couldn't quite figure out why.
I'm not sure what you mean with the "above", I guess it should be the "below"?

Yes, you are right)

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a
        on b.x=a.x and
           b.t in
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t = b.t);
...
   SubPlan 2

Here the sublink can't be pulled up because of its reference to
the LHS of left join, the original logic is that no matter the 'b.t in ..'
returns the true or false, the rows in LHS will be returned. If we
pull it up to LHS, some rows in LHS will be filtered out, which
breaks its original semantics.

Thanks for the explanation, it became more clear to me here.

I thought it would be:
explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a on
        b.x=a.x and
        b.t =
            (select max(a0.t)
             from a a0
             where a0.x = b.x and
                   a0.t <= b.t);                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (b.t = (SubPlan 2))
   Buffers: shared hit=3546
   -> Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
         Buffers: shared hit=541
   -> Hash (actual time=1.065..1.068 rows=1000 loops=1)
         Buckets: 4096 Batches: 1 Memory Usage: 72kB
         Buffers: shared hit=5
         -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
               Buffers: shared hit=5
   SubPlan 2
     -> Result (actual time=0.025..0.025 rows=1 loops=1000)
           Buffers: shared hit=3000
           InitPlan 1 (returns $2)
             -> Limit (actual time=0.024..0.024 rows=1 loops=1000)
                   Buffers: shared hit=3000
                   -> Index Only Scan Backward using a_t_x_idx on a a0 (actual time=0.023..0.023 rows=1 loops=1000)
                         Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND (x = b.x))
                         Heap Fetches: 1000
                         Buffers: shared hit=3000
Planning Time: 0.689 ms
Execution Time: 68.220 ms
(23 rows)
If you noticed, it became possible after replacing the "in" operator with "=".
I didn't notice much difference between the 'in' and '=', maybe I
missed something?

It seems to me that the expressions "=" and "IN" are equivalent here due to the fact that the aggregated subquery returns only one value, and the result with the "IN" operation can be considered as the intersection of elements on the left and right. In this query, we have some kind of set on the left, among which there will be found or not only one element on the right. In general, this expression can be considered as b=const, so push down will be applied to b and we can filter b during its scanning by the subquery's result.
But I think your explanation is necessary here, that this is all possible, because we can pull up the sublink here, since filtering is allowed on the right side (the nullable side) and does not break the semantics of LHS. But in contrast, I also added two queries where pull-up is impossible and it is not done here. Otherwise if filtering was applied on the left it would be mistake.

To be honest, I'm not sure if this explanation is needed in the test anymore, so I didn't add it.
explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                           QUERY PLAN
-----------------------------------------------------------------
Nested Loop Left Join
   Join Filter: (SubPlan 2)
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Seq Scan on tenk2 b
   SubPlan 2
     -> Result
           InitPlan 1 (returns $1)
             -> Limit
                   -> Index Scan using tenk2_hundred on tenk2 c
                         Index Cond: (hundred IS NOT NULL)
                         Filter: (odd = b.odd)
(12 rows)

explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
            QUERY PLAN
-----------------------------------
Nested Loop Left Join
   Join Filter: (hashed SubPlan 1)
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Seq Scan on tenk2 b
   SubPlan 1
     -> HashAggregate
           Group Key: c.odd
           -> Seq Scan on tenk2 c
(9 rows)

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.
Sure, the patch after your modification looks better than the original.
I'm not sure how the test case around "because of got one row" is
relevant to the current changes. After we reach to some agreement
on the above discussion, I think v4 is good for committer to review!

Thank you!) I am ready to discuss it.

-- 
Regards,
Alena Rybakina

Attachment

pull-up.diff

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

13 October 2023, 07:04:45

It seems to me that the expressions "=" and "IN" are equivalent here due to the fact that the aggregated subquery returns only one value, and the result with the "IN" operation can be considered as the intersection of elements on the left and right. In this query, we have some kind of set on the left, among which there will be found or not only one element on the right.

Yes, they are equivalent at the final result, but there are some

differences at the execution level. the '=' case will be transformed

to a Subplan whose subPlanType is EXPR_SUBLINK, so if there

is more than 1 rows is returned in the subplan, error will be raised.

select * from tenk1 where
ten = (select ten from tenk1 i where i.two = tenk1.two );

ERROR: more than one row returned by a subquery used as an expression

However the IN case would not.

select * from tenk1 where
ten = (select ten from tenk1 i where i.two = tenk1.two ) is OK.

I think the test case you added is not related to this feature. the

difference is there even without the patch. so I kept the code

you changed, but not for the test case.

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.
Sure, the patch after your modification looks better than the original.
I'm not sure how the test case around "because of got one row" is
relevant to the current changes. After we reach to some agreement
on the above discussion, I think v4 is good for committer to review!

Thank you!) I am ready to discuss it.

Actually I meant to discuss the "Unfortunately, I found a request..", looks

we have reached an agreement there:)

Best Regards

Andy Fan

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

13 October 2023, 07:29:25

Hi Tom,

Would you like to have a look at this? The change is not big and the

optimization has also been asked for many times. The attached is the

v5 version and I also try my best to write a good commit message.

Here is the commit fest entry:

https://commitfest.postgresql.org/45/4268/

Attachment

v5-0001-Pull-up-ANY-SUBLINK-with-the-necessary-lateral-su.patch

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Alena Rybakina

Date:

13 October 2023, 08:39:41

On 13.10.2023 10:04, Andy Fan wrote:

It seems to me that the expressions "=" and "IN" are equivalent here due to the fact that the aggregated subquery returns only one value, and the result with the "IN" operation can be considered as the intersection of elements on the left and right. In this query, we have some kind of set on the left, among which there will be found or not only one element on the right.

Yes, they are equivalent at the final result, but there are some
differences at the execution level. the '=' case will be transformed
to a Subplan whose subPlanType is EXPR_SUBLINK, so if there
is more than 1 rows is returned in the subplan, error will be raised.

select * from tenk1 where
ten = (select ten from tenk1 i where i.two = tenk1.two );

ERROR: more than one row returned by a subquery used as an expression

However the IN case would not.
select * from tenk1 where
ten = (select ten from tenk1 i where i.two = tenk1.two ) is OK.

I think the test case you added is not related to this feature. the
difference is there even without the patch. so I kept the code
you changed, but not for the test case.

Yes, I understand and agree with you that we should delete the last queries, except to one.

The query below have a different result compared to master, and it is correct.

Without your patch:

explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                                 QUERY PLAN
-----------------------------------------------------------------------------
Nested Loop Left Join
   -> Seq Scan on tenk1 a
   -> Materialize
         -> Seq Scan on tenk2 b
               Filter: (SubPlan 2)
               SubPlan 2
                 -> Result
                       InitPlan 1 (returns $1)
                         -> Limit
                               -> Index Scan using tenk2_hundred on tenk2 c
                                     Index Cond: (hundred IS NOT NULL)
                                     Filter: (odd = b.odd)
(12 rows)

After your patch:

postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);

                           QUERY PLAN
--------------------------------------------------------------
Nested Loop Left Join
    -> Seq Scan on tenk1 a
    -> Materialize
          -> Nested Loop
                -> Seq Scan on tenk2 b
                -> Subquery Scan on "ANY_subquery"
                      Filter: (b.hundred = "ANY_subquery".min)
                      -> Aggregate
                            -> Seq Scan on tenk2 c
                                  Filter: (odd = b.odd)
(10 rows)

I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.
Sure, the patch after your modification looks better than the original.
I'm not sure how the test case around "because of got one row" is
relevant to the current changes. After we reach to some agreement
on the above discussion, I think v4 is good for committer to review!

Thank you!) I am ready to discuss it.

Actually I meant to discuss the "Unfortunately, I found a request..", looks
we have reached an agreement there:)

Yes, we have)

-- 
Regards,
Alena Rybakina

Attachment

pull-up.diff

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

vignesh C

Date:

26 January 2024, 12:46:02

On Fri, 13 Oct 2023 at 14:09, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> On 13.10.2023 10:04, Andy Fan wrote:
>>
>> It seems to me that the expressions "=" and "IN" are equivalent here due to the fact that the aggregated subquery
returnsonly one value, and the result with the "IN" operation can be considered as the intersection of elements on the
leftand right. In this query, we have some kind of set on the left, among which there will be found or not only one
elementon the right. 
>
>
> Yes, they are equivalent at the final result, but there are some
> differences at the execution level.  the '=' case will be transformed
> to a Subplan whose subPlanType is EXPR_SUBLINK, so if there
> is more than 1 rows is returned in the subplan, error will be raised.
>
> select * from tenk1 where
>   ten =  (select ten from tenk1 i where i.two = tenk1.two );
>
> ERROR:  more than one row returned by a subquery used as an expression
>
> However the IN case would not.
> select * from tenk1 where
>   ten =  (select ten from tenk1 i where i.two = tenk1.two ) is OK.
>
>
> I think the test case you added is not related to this feature. the
> difference is there even without the patch.  so I kept the code
> you changed, but not for the test  case.
>
> Yes, I understand and agree with you that we should delete the last queries, except to one.
>
> The query below have a different result compared to master, and it is correct.
>
>
> Without your patch:
>
> explain (costs off)
> +SELECT * FROM tenk1 A LEFT JOIN tenk2 B
> ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
>                                  QUERY PLAN
> -----------------------------------------------------------------------------
>  Nested Loop Left Join
>    ->  Seq Scan on tenk1 a
>    ->  Materialize
>          ->  Seq Scan on tenk2 b
>                Filter: (SubPlan 2)
>                SubPlan 2
>                  ->  Result
>                        InitPlan 1 (returns $1)
>                          ->  Limit
>                                ->  Index Scan using tenk2_hundred on tenk2 c
>                                      Index Cond: (hundred IS NOT NULL)
>                                      Filter: (odd = b.odd)
> (12 rows)
>
>
> After your patch:
>
> postgres=# explain (costs off)
> SELECT * FROM tenk1 A LEFT JOIN tenk2 B
> ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
>
>                            QUERY PLAN
> --------------------------------------------------------------
>   Nested Loop Left Join
>     ->  Seq Scan on tenk1 a
>     ->  Materialize
>           ->  Nested Loop
>                 ->  Seq Scan on tenk2 b
>                 ->  Subquery Scan on "ANY_subquery"
>                       Filter: (b.hundred = "ANY_subquery".min)
>                       ->  Aggregate
>                             ->  Seq Scan on tenk2 c
>                                   Filter: (odd = b.odd)
> (10 rows)
>
>
>>> I took the liberty of adding this to your patch and added myself as reviewer, if you don't mind.
>>
>> Sure, the patch after your modification looks better than the original.
>> I'm not sure how the test case around "because of got one row" is
>> relevant to the current changes.  After we reach to some agreement
>> on the above discussion, I think v4 is good for committer to review!
>>
>>
>> Thank you!) I am ready to discuss it.
>
>
> Actually I meant to discuss the "Unfortunately, I found a request..", looks
> we have reached an agreement there:)
>
> Yes, we have)

Hi Andy Fan,

If the changes of Alena are ok, can you merge the changes and post an
updated version so that CFBot can apply the patch and verify the
changes. As currently CFBot is trying to apply only Alena's changes
and failing with the following at [1]:
=== Applying patches on top of PostgreSQL commit ID
fba2112b1569fd001a9e54dfdd73fd3cb8f16140 ===
=== applying patch ./pull-up.diff
patching file src/test/regress/expected/subselect.out
Hunk #1 succeeded at 1926 with fuzz 2 (offset -102 lines).
patching file src/test/regress/sql/subselect.sql
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/subselect.sql.rej

[1] - http://cfbot.cputube.org/patch_46_4268.log

Regards,
Vignesh

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Alexander Korotkov

Date:

13 February 2024, 10:50:40

Hi!

> If the changes of Alena are ok, can you merge the changes and post an
> updated version so that CFBot can apply the patch and verify the
> changes. As currently CFBot is trying to apply only Alena's changes
> and failing with the following at [1]:

I think this is a very nice and pretty simple optimization.  I've
merged the changes by Alena, and slightly revised the code changes in
convert_ANY_sublink_to_join().  I'm going to push this if there are no
objections.

------
Regards,
Alexander Korotkov

Attachment

0001-Pull-up-ANY-SUBLINK-with-the-necessary-lateral-su-v6.patch

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andy Fan

Date:

15 February 2024, 09:51:28

Hi Alexander,

> Hi!
>
>> If the changes of Alena are ok, can you merge the changes and post an
>> updated version so that CFBot can apply the patch and verify the
>> changes. As currently CFBot is trying to apply only Alena's changes
>> and failing with the following at [1]:
>
> I think this is a very nice and pretty simple optimization.  I've
> merged the changes by Alena, and slightly revised the code changes in
> convert_ANY_sublink_to_join().  I'm going to push this if there are no
> objections.

Thanks for picking up this! I double checked the patch, it looks good to
me. 

-- 
Best Regards
Andy Fan

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andrei Lepikhov

Date:

01 July 2024, 09:17:50

On 10/12/23 14:52, Andy Fan wrote:
> Here the sublink can't be pulled up because of its reference to
> the  LHS of left join, the original logic is that no matter the 'b.t in ..'
> returns the true or false,  the rows in LHS will be returned.  If we
> pull it up to LHS, some rows in LHS will be filtered out, which
> breaks its original semantics.
Hi,
I spent some time trying to understand your sentence.
I mean the following case:

SELECT * FROM t1 LEFT JOIN t2
   ON t2.x IN (SELECT y FROM t3 WHERE t1.x=t3.x);

I read [1,2,3], but I am still unsure why it is impossible in the case 
of OUTER JOIN. By setting the LATERAL clause, we forbid any clauses from 
the RTE subquery to bubble up as a top-level clause and filter tuples 
from LHS, am I wrong? Does it need more research or you can show some 
case to support your opinion - why this type of transformation must be 
disallowed?

[1] https://www.postgresql.org/message-id/6531.1218473967%40sss.pgh.pa.us
[2] 
https://www.postgresql.org/message-id/BANLkTikGFtGnAaXVh5%3DntRdN%2B4w%2Br%3DNPuw%40mail.gmail.com
[3] https://www.vldb.org/conf/1992/P091.PDF

-- 
regards, Andrei Lepikhov

Re: A new strategy for pull-up correlated ANY_SUBLINK

From

Andrei Lepikhov

Date:

03 July 2024, 08:33:44

On 7/1/24 16:17, Andrei Lepikhov wrote:
> On 10/12/23 14:52, Andy Fan wrote:
>> Here the sublink can't be pulled up because of its reference to
>> the  LHS of left join, the original logic is that no matter the 'b.t 
>> in ..'
>> returns the true or false,  the rows in LHS will be returned.  If we
>> pull it up to LHS, some rows in LHS will be filtered out, which
>> breaks its original semantics.
> Hi,
> I spent some time trying to understand your sentence.
> I mean the following case:
> 
> SELECT * FROM t1 LEFT JOIN t2
>    ON t2.x IN (SELECT y FROM t3 WHERE t1.x=t3.x);
> 
> I read [1,2,3], but I am still unsure why it is impossible in the case 
> of OUTER JOIN. By setting the LATERAL clause, we forbid any clauses from 
> the RTE subquery to bubble up as a top-level clause and filter tuples 
> from LHS, am I wrong? Does it need more research or you can show some 
> case to support your opinion - why this type of transformation must be 
> disallowed?
> 
> [1] https://www.postgresql.org/message-id/6531.1218473967%40sss.pgh.pa.us
> [2] 
> https://www.postgresql.org/message-id/BANLkTikGFtGnAaXVh5%3DntRdN%2B4w%2Br%3DNPuw%40mail.gmail.com
> [3] https://www.vldb.org/conf/1992/P091.PDF
> 

I delved into it a bit more. After reading [4,5] I invented query that 
is analogue of the query above, but with manually pulled-up sublink:

EXPLAIN (COSTS OFF)
SELECT * FROM t1 LEFT JOIN t2 JOIN LATERAL
(SELECT t1.x AS x1, y,x FROM t3) q1 ON (t2.x=q1.y AND q1.x1=q1.x) ON true;

And you can see the plan:

  Nested Loop Left Join
    ->  Seq Scan on t1
    ->  Hash Join
          Hash Cond: (t2.x = t3.y)
          ->  Seq Scan on t2
          ->  Hash
                ->  Seq Scan on t3
                      Filter: (t1.x = x)

Just for fun, I played with MSSQL Server and if I read its explain 
correctly, it also allows pulls-up sublink which mentions LHS:

-------------------------------------
Nested Loops(Left Outer Join, OUTER REFERENCES:(t1.x))
   Table Scan(OBJECT:(t1))
     Hash Match(Right Semi Join, HASH:(t3.y)=(t2.x),
                RESIDUAL:(t2.x=t3.y))
       Table Scan(OBJECT:(t3), WHERE:(t1.x=t3.x))
     Table Scan(OBJECT:(t2))
-------------------------------------

(I cleaned MSSQL explain a little bit for clarity).
So, may we allow references to LHS in such sublink?

[4] 
https://www.postgresql.org/message-id/flat/15523.1372190410%40sss.pgh.pa.us
[5] 
https://www.postgresql.org/message-id/20130617235236.GA1636@jeremyevans.local

-- 
regards, Andrei Lepikhov