Thread: Question about pull_up_sublinks_qual_recurse

Question about pull_up_sublinks_qual_recurse

From
Andy Fan
Date:
Hi:

When I was working on another task, the following case caught my mind.

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);

explain (costs off) select * from t1
where exists (select 1 from t2
              where exists (select 1 from t3
                           where t3.c = t1.c
                           and t2.b = t3.b)
and t2.a = t1.a);


I got the plan like this:

            QUERY PLAN
-----------------------------------
 Hash Semi Join
   Hash Cond: (t1.a = t2.a)
   Join Filter: (hashed SubPlan 2)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
   SubPlan 2
     ->  Seq Scan on t3
(8 rows)

Note we CAN'T pull up the inner sublink which produced the SubPlan 2.


I traced the reason is after we pull up the outer sublink, we got:

select * from t1 semi join t2 on t2.a = t1.a AND
exists (select 1 from t3
        where t3.c = t1.c
        and t2.b = t3.b);

Later we tried to pull up the EXISTS sublink to t1 OR t2 separately, since
this subselect referenced to t1 AND t2, so we CAN'T pull up the sublink. I
am thinking why we have to pull up it t1 OR t2 rather than JoinExpr(t1, t2),
I think the latter one is better.

So I changed the code like this,  I got the plan I wanted and 'make
installcheck' didn't find any exception.


                   QUERY PLAN
------------------------------------------------
 Hash Semi Join
   Hash Cond: ((t2.b = t3.b) AND (t1.c = t3.c))
   ->  Hash Semi Join
         Hash Cond: (t1.a = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t3
(9 rows)

@@ -553,10 +553,10 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
  */
  j->quals = pull_up_sublinks_qual_recurse(root,
  j->quals,
- &j->larg,
- available_rels1,
- &j->rarg,
- child_rels);
+ jtlink1,
+ bms_union(available_rels1, child_rels),
+ NULL,
+ NULL);
  /* Return NULL representing constant TRUE */
  return NULL;
  }

Any feedback is welcome. 

--
Best Regards
Andy Fan

Re: Question about pull_up_sublinks_qual_recurse

From
Andy Fan
Date:
Later we tried to pull up the EXISTS sublink to t1 OR t2 separately, since
this subselect referenced to t1 AND t2, so we CAN'T pull up the sublink. I
am thinking why we have to pull up it t1 OR t2 rather than JoinExpr(t1, t2),
I think the latter one is better.

After some more self review,  I find my proposal has the following side effects.

    select * from t1
    where exists (select 1 from t2
                  where exists (select 1 from t3
                               where t3.c = t1.c)
    and t2.a = t1.a);

In the above example,  the innermost sublink will be joined with
SemiJoin (t1 t2) in the patched version,  but joined with t1 in the current
master.  However, even if we set the JoinTree with
SemiJoin(SemiJoin(t1 t2), t3),  the join reorder functions can generate a
path which joins t1 with t3 first and then t2 still.  So any hint about this
patch's self-review is welcome. 

--
Best Regards
Andy Fan

Re: Question about pull_up_sublinks_qual_recurse

From
Tom Lane
Date:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> After some more self review,  I find my proposal has the following side
> effects.

Yeah, I do not think this works at all.  The mechanism as it stands
right now is that we can insert pulled-up semijoins above j->larg
if they only need variables from relations in j->larg, and we can
insert them above j->rarg if they only need variables from relations
in j->rarg.  You can't just ignore that distinction and insert them
somewhere further up the tree.  Per the comment in
pull_up_sublinks_jointree_recurse:

         * Now process qual, showing appropriate child relids as available,
         * and attach any pulled-up jointree items at the right place. In the
         * inner-join case we put new JoinExprs above the existing one (much
         * as for a FromExpr-style join).  In outer-join cases the new
         * JoinExprs must go into the nullable side of the outer join. The
         * point of the available_rels machinations is to ensure that we only
         * pull up quals for which that's okay.

If the pulled-up join doesn't go into the nullable side of the upper
join then you've changed semantics.  In this case, it'd amount to
reassociating a semijoin that was within the righthand side of another
semijoin to above that other semijoin.  The discussion of outer join
reordering in optimizer/README says that that doesn't work, and while
I'm too lazy to construct an example right now, I believe it's true.

            regards, tom lane



Re: Question about pull_up_sublinks_qual_recurse

From
Andy Fan
Date:
Hi Tom:

Thanks for your reply! I have self reviewed the below message at 3 different
time periods to prevent from too inaccurate replies. It may be more detailed
than it really needed, but it probably can show where I am lost.

On Sat, Oct 15, 2022 at 3:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If the pulled-up join doesn't go into the nullable side of the upper
join then you've changed semantics.  In this case, it'd amount to
reassociating a semijoin that was within the righthand side of another
semijoin to above that other semijoin. 

I understand your reply as:

select * from t1 left join t2 on exists (select 1 from t3 where t3.a = t2.a);
= select * from t1 left join (t2 semi join t3 on t3.a = t2.a) on true;  -- go to nullable side
!=  select * from (t1 left join t2 on true) semi join t3 on (t3.a = t2.a);  -- go to above the JoinExpr

I CAN follow the above.  And for this case it is controlled by below code:

pull_up_sublinks_qual_recurse

switch (j->jointype)
{
          case JOIN_INNER:
                       ...
          case JOIN_LEFT:
                j->quals = pull_up_sublinks_qual_recurse(root, j->quals,
                                                                                           &j->rarg,
                                                                                           rightrelids,
                                                                                           NULL, NULL);
              break;
            ...
}

and I didn't change this.   My question is could we assume

A semijoin B ON EXISTS (SELECT 1 FROM C on (Pbc))
=  (A semijoin (B semijoin C on (Pbc))) on TRUE.  (current master did)
=  (A semijoin B ON true) semijoin C on (Pbc)        (my current thinking)

Note that there is no 'left outer join' at this place.   Since there are too
many places called pull_up_sublinks_qual_recurse, to make things 
less confused, I prepared a patch for this one line change to show where
exactly I changed (see patch 2);  I think this is the first place I lost.

The discussion of outer join
reordering in optimizer/README says that that doesn't work,

I think you are talking about the graph "Valid OUTER JOIN Optimizations".
I can follow until below.

"
SEMI joins work a little bit differently.  A semijoin can be reassociated
into or out of the lefthand side of another semijoin, left join, or
antijoin, but not into or out of the righthand side.  ..
"
I am unclear why 
 (A semijoin B on (Pab)) semijoin C on (Pbc)
!= A semijoin (B semijoin C on (Pbc)) on (Pab); 

Seems both return rows from A which match both semijoin (Pab) and
(Pbc).  or I misunderstand the above words in the first place?

At last, when I checked optimizer/README, it looks like we used
a 'nullable side'  while it should be 'nonnullable side'? see patch 1
for details.

--
Best Regards
Andy Fan
Attachment

Re: Question about pull_up_sublinks_qual_recurse

From
Andy Fan
Date:


On Sat, Oct 15, 2022 at 3:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> After some more self review,  I find my proposal has the following side
> effects.

Yeah, I do not think this works at all.  .... 
The discussion of outer join
reordering in optimizer/README says that that doesn't work, and while
I'm too lazy to construct an example right now, I believe it's true.

I came to this topic again recently and have finally figured out the
reason.  It looks to me that semi join is slightly different with outer
join in this case.

The following test case can show why we have to
pull_up_sublinks_qual_recurse to either LHS or RHS rather than the
JoinExpr.

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);

insert into t1 select 1, 1, 2;
insert into t2 select 1, 2, 1;
insert into t2 select 1, 1, 2;
insert into t3 select 1, 1, 2;

select * from t1
where exists (select 1 from t2
-- below references to t1 and t2 at the same time
where exists (select 1 from t3
   where t1.c = t2.c and t2.b = t3.b)  
and t1.a = t2.a);

which can be transformed to

SELECT * FROM t1 SEMI JOIN t2
ON t1.a = t2.a
AND exists (select 1 from t3
where t1.c = t2.c
    and t2.b = t3.b)

Here the semantics of the query is return the rows in T1 iff there is a
row in t2 matches the whole clause (t1.a = t2.a AND exists..);

But If we transform it to

SELECT * FROM (t1 SEMI JOIN t2
ON t1.a = t2.a) SEMI JOIN t3
on t1.c = t2.c and t2.b = t3.b;

The scan on T2 would stop if ONLY (t1.a = t2.a) matches and the following
rows will be ignored. However the matched rows may doesn't match the
exists clause! So in the above example, the correct result set will be 1 
row. If we pull up the sublink above the JoinExpr, no row would be found.

The attached is just a comment and a test case to help understand why we
have to do things like this.
 
--
Best Regards
Andy Fan
Attachment