Thread: Pulling up direct-correlated ANY_SUBLINK

Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:
Hi,

Currently we do not try to pull up sub-select of type ANY_SUBLINK if it
refers to any Vars of the parent query, as indicated in the code snippet
below:

JoinExpr *
convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
                            Relids available_rels)
{
    ...

    if (contain_vars_of_level((Node *) subselect, 1))
        return NULL;


Why do we have this check?

Can we try to pull up direct-correlated ANY SubLink with the help of
LATERAL? That is, do the pull up in the same way as uncorrelated ANY
SubLink, by adding the SubLink's subselect to the query's rangetable,
but explicitly set LATERAL for its RangeTblEntry, like:

--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1226,13 +1226,6 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
        Assert(sublink->subLinkType == ANY_SUBLINK);

        /*
-        * The sub-select must not refer to any Vars of the parent query. (Vars of
-        * higher levels should be okay, though.)
-        */
-       if (contain_vars_of_level((Node *) subselect, 1))
-               return NULL;
-
-       /*
         * The test expression must contain some Vars of the parent query, else
         * it's not gonna be a join.  (Note that it won't have Vars referring to
         * the subquery, rather Params.)
@@ -1267,7 +1260,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
        rte = addRangeTableEntryForSubquery(pstate,
                                            subselect,
                                            makeAlias("ANY_subquery", NIL),
-                                           false,
+                                          contain_vars_of_level((Node *) subselect, 1), /* lateral */
                                            false);
        parse->rtable = lappend(parse->rtable, rte);
        rtindex = list_length(parse->rtable);


By this way, we can convert the query:

select * from a where a.i = ANY(select i from b where a.j > b.j);

To:

select * from a SEMI JOIN lateral(select * from b where a.j > b.j) sub on a.i = sub.i;


Does this make sense?

Thanks
Richard

Re: Pulling up direct-correlated ANY_SUBLINK

From
Antonin Houska
Date:
Richard Guo <riguo@pivotal.io> wrote:

> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

> By this way, we can convert the query:
>
> select * from a where a.i = ANY(select i from b where a.j > b.j);
>
> To:
>
> select * from a SEMI JOIN lateral(select * from b where a.j > b.j)
> sub on a.i = sub.i;
>

I tried this a few years ago. This is where the problems started:

https://www.postgresql.org/message-id/1386716782.5203.YahooMailNeo%40web162905.mail.bf1.yahoo.com

I'm not sure I remember enough, but the problem has something to do with one
possible strategy to plan SEMI JOIN: unique-ify the inner path and then
perform plain INNER JOIN instead.

I think the problemm was that the WHERE clause of the subquery didn't
participate in the SEMI JOIN evaluation and was used as filter instead. Thus
the clause's Vars were not used in unique keys of the inner path and so the
SEMI JOIN didn't work well.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Pulling up direct-correlated ANY_SUBLINK

From
Tom Lane
Date:
Richard Guo <riguo@pivotal.io> writes:
> Currently we do not try to pull up sub-select of type ANY_SUBLINK if it
> refers to any Vars of the parent query, as indicated in the code snippet
> below:
>     if (contain_vars_of_level((Node *) subselect, 1))
>         return NULL;
> Why do we have this check?

Because the result would not be a join between two independent tables.

> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

Perhaps.  But what's the argument that you'd end up with a better
plan?  LATERAL pretty much constrains things to use a nestloop,
so I'm not sure there's anything fundamentally different.

            regards, tom lane



Re: Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:
Hi Antonin,

On Tue, Sep 10, 2019 at 4:31 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:

> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

> By this way, we can convert the query:
>
> select * from a where a.i = ANY(select i from b where a.j > b.j);
>
> To:
>
> select * from a SEMI JOIN lateral(select * from b where a.j > b.j)
> sub on a.i = sub.i;
>

I tried this a few years ago. This is where the problems started:

https://www.postgresql.org/message-id/1386716782.5203.YahooMailNeo%40web162905.mail.bf1.yahoo.com

Thank you for this link. Good to know the discussions years ago.
 
I'm not sure I remember enough, but the problem has something to do with one
possible strategy to plan SEMI JOIN: unique-ify the inner path and then
perform plain INNER JOIN instead.

I think the problemm was that the WHERE clause of the subquery didn't
participate in the SEMI JOIN evaluation and was used as filter instead. Thus
the clause's Vars were not used in unique keys of the inner path and so the
SEMI JOIN didn't work well.

This used to be a problem until it was fixed by commit 043f6ff0, which
includes the postponed qual from a LATERAL subquery into the quals seen
by make_outerjoininfo().

Thanks
Richard

Re: Pulling up direct-correlated ANY_SUBLINK

From
Antonin Houska
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Can we try to pull up direct-correlated ANY SubLink with the help of
> > LATERAL?
> 
> Perhaps.  But what's the argument that you'd end up with a better
> plan?  LATERAL pretty much constrains things to use a nestloop,
> so I'm not sure there's anything fundamentally different.

I think that subquery pull-up is most beneficial when the queries (both the
subquery and the upper query) contain more than a few tables. In such a case,
if only a few tables reference the upper query (or if just a single one does),
the constraints imposed by LATERAL might be less significant.

Nevertheless, I don't know how to overcome the problems that I mentioned
upthread.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:
Hi Antonin,

On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska <ah@cybertec.at> wrote:

Nevertheless, I don't know how to overcome the problems that I mentioned
upthread.

Do you mean the problem "the WHERE clause of the subquery didn't
participate in the SEMI JOIN evaluation"? Good news is it has been fixed
by commit 043f6ff0 as I mentioned upthread.

Thanks
Richard


Re: Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:
Hi Tom,

On Tue, Sep 10, 2019 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

Perhaps.  But what's the argument that you'd end up with a better
plan?  LATERAL pretty much constrains things to use a nestloop,
so I'm not sure there's anything fundamentally different.

This is a point I didn't think of. In that case if the pull-up mostly
results in a nestloop then we cannot make sure we will get a better
plan. Thank you for pointing it out.

Thanks
Richard

Re: Pulling up direct-correlated ANY_SUBLINK

From
Antonin Houska
Date:
Richard Guo <riguo@pivotal.io> wrote:

> On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska <ah@cybertec.at>
> wrote:
> 
>    
>     Nevertheless, I don't know how to overcome the problems that I
>     mentioned
>     upthread.
> 
> 
> Do you mean the problem "the WHERE clause of the subquery didn't
> participate in the SEMI JOIN evaluation"? Good news is it has been
> fixed
> by commit 043f6ff0 as I mentioned upthread.

Do you say that my old patch (rebased) no longer breaks the regression tests?

(I noticed your other email in the thread which seems to indicate that you're
no lo longer interested to work on the feature, but asking out of curiosity.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:

On Thu, Sep 12, 2019 at 11:35 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:

> On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska <ah@cybertec.at>
> wrote:
>
>   
>     Nevertheless, I don't know how to overcome the problems that I
>     mentioned
>     upthread.
>
>
> Do you mean the problem "the WHERE clause of the subquery didn't
> participate in the SEMI JOIN evaluation"? Good news is it has been
> fixed
> by commit 043f6ff0 as I mentioned upthread.

Do you say that my old patch (rebased) no longer breaks the regression tests?

I think so.
 

(I noticed your other email in the thread which seems to indicate that you're
no lo longer interested to work on the feature, but asking out of curiosity.)

Tom pointed out that even if we pull up the subquery with the help of
LATERAL, we cannot make sure we will end up with a better plan, since
LATERAL pretty much constrains things to use a nestloop. Hmm, I think
what he said makes sense.

Thanks
Richard
 

Re: Pulling up direct-correlated ANY_SUBLINK

From
Andy Fan
Date:


On Tue, Sep 17, 2019 at 4:41 PM Richard Guo <riguo@pivotal.io> wrote:

On Thu, Sep 12, 2019 at 11:35 PM Antonin Houska <ah@cybertec.at> wrote:
Richard Guo <riguo@pivotal.io> wrote:

> On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska <ah@cybertec.at>
> wrote:
>
>   
>     Nevertheless, I don't know how to overcome the problems that I
>     mentioned
>     upthread.
>
>
> Do you mean the problem "the WHERE clause of the subquery didn't
> participate in the SEMI JOIN evaluation"? Good news is it has been
> fixed
> by commit 043f6ff0 as I mentioned upthread.

Do you say that my old patch (rebased) no longer breaks the regression tests?

I think so.
 

(I noticed your other email in the thread which seems to indicate that you're
no lo longer interested to work on the feature, but asking out of curiosity.)

Tom pointed out that even if we pull up the subquery with the help of
LATERAL, we cannot make sure we will end up with a better plan, since
LATERAL pretty much constrains things to use a nestloop. Hmm, I think
what he said makes sense.

Thanks
Richard
 

Even if we can't do this for the general case,  I still think we can do something
for some special cases,  for example:  
select count(*) from j1 where  (i) in  (select i from j2 where j2.im5 = j1.im5); 
can be converted to 
select count(*) from t1 where (i, im5) in (select i, im5 from j2); 

The conversion can happen just before the convert_ANY_sublink_to_join.

@@ -399,6 +483,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
                /* Is it a convertible ANY or EXISTS clause? */
                if (sublink->subLinkType == ANY_SUBLINK)
                {
+                       reduce_sublink_correlation_exprs(root, sublink);
                        if ((j = convert_ANY_sublink_to_join(root, sublink,
                                                                                                 available_rels1)) != NULL)

However we have to do lots of pre checking for this,  the below is 
something I can think for now.

1). It must be an in-subquery.  
2). The op in correlation_expr must be a mergeable op.
3). no aggregation call in subquery->targetList and subquery->havingQual.  
4). no limit/offset cause. 
5). No volatile function involved for safety. 
 
I can't tell how often it is, I just run into this by my own and search the
maillist and get only 1 report [1].  Is it something worth doing or do we have 
a better strategy to handle it?   Thanks!

Re: Pulling up direct-correlated ANY_SUBLINK

From
Richard Guo
Date:

On Tue, Sep 10, 2019 at 9:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

Perhaps.  But what's the argument that you'd end up with a better
plan?  LATERAL pretty much constrains things to use a nestloop,
so I'm not sure there's anything fundamentally different.

Sorry for the noise on replying such an old thread, but recently I
realized that pulling up direct-correlated ANY SubLink with LATERAL may
cause another problem that we cannot find any legal join order due to
the constraints imposed by LATERAL references. Below is an example:

select * from A where exists
    (select * from B where A.i in (select C.i from C where C.j = B.j));

For this query, after we converting the ANY SubLink to a LATERAL
subquery, the subquery cannot be pulled up further into the parent query
because its qual contains lateral reference to 'B', which is outside a
higher semi join. When considering the join of 'A' and the 'subquery',
we decide it's not legal due to the LATERAL reference. As a result, we
end up with not finding any legal join order for level 2.

Thanks
Richard 

Re: Pulling up direct-correlated ANY_SUBLINK

From
Andy Fan
Date:
Hi All:

On Tue, Sep 10, 2019 at 9:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <riguo@pivotal.io> writes:
> Currently we do not try to pull up sub-select of type ANY_SUBLINK if it
> refers to any Vars of the parent query, as indicated in the code snippet
> below:
>     if (contain_vars_of_level((Node *) subselect, 1))
>         return NULL;
> Why do we have this check?

Because the result would not be a join between two independent tables.

I think this situation is caused by we pull-up the ANY-sublink with 2 
steps, the first step is to pull up the sublink as a subquery,  and the
next step is to pull up the subquery if it is allowed.  The benefits of
this method are obvious,  pulling up the subquery has more requirements,
even if we can just finish the first step, we still get huge benefits.  
However the bad stuff happens if varlevelsup = 1 involves,  step 1 fails! 

The solution here is to use the lateral join to overcome the two
independent tables, the issue of this solution includes:

1.  LATERAL pretty much constrains things to use a nestloop like below, 
but this reason is questioned since if we can pull-up the subquery,  if so the
constraint gone. [1]
2.  It has something with unique-ify the inner path. [2] , but Richard thought 
it should be fixed but without an agreement for all people [3].
3.  Richard [4] found it would fail to get a plan for some query. (the error is
below per my testing)

> ERROR:  failed to build any 3-way joins

So back to the root cause of this issue,  IIUC,  if varlevelsup = 1 involves,
can we just bypass the 2-steps method,  just as what we do for EXISTS
sublinks?  If so, we just need to convert the ANY-SUBLINK to EXIST-SUBLINK
under the case. 

The attached is the one commit which includes the 2 methods discussed
here, controlled by different GUC separately, for easy testing.  Per my test, 
Query 2 choosed the Unique Join with the IN-to-EXISTS method, but not
with the Lateral method, and query 3 raises error with the lateral method,
but not with the IN-to-EXISTS method. 




> Can we try to pull up direct-correlated ANY SubLink with the help of
> LATERAL?

Perhaps.  But what's the argument that you'd end up with a better
plan?  LATERAL pretty much constrains things to use a nestloop,
so I'm not sure there's anything fundamentally different.

                        regards, tom lane


--
Best Regards
Andy Fan
Attachment