Thread: BUG #18429: Inconsistent results on similar queries with join lateral

BUG #18429: Inconsistent results on similar queries with join lateral

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18429
Logged by:          Benoît Ryder
Email address:      b.ryder@ateme.com
PostgreSQL version: 15.6
Operating system:   Debian
Description:

Hi,
I stumbled over a behavior difference between two PostgreSQL versions while
migrating from 9.4 to 15.6.
I managed to create a minimal, easy to reproduce setup, with few queries
that should produce the same result, but don't, depending on various
tweaks.
Note that I'm not an SQL/PostgreSQL expert, and may have overlooked
something or misused "join lateral".

Script used: all select queries are expected to return nothing, but they
sometimes return a single row.
```

-- Setup
drop schema weird cascade;
create schema if not exists weird;
create table weird.t (
  wd int not null,
  wt int not null,

  primary key (wd, wt)
);
insert into weird.t values (4, 6);

-- Q1
with c2 as (
  -- Return 1 row `(4, 6)` when fetched separately
  select arrayd.ad d, coalesce(c.t, 0) t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
        order by wt desc limit 1
    ) c on true
)
-- `where` clause should return false: (14 - 6) / 4 = (12 - 6) / 4 → false
select 1 from c2 where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

-- Q2 (simplified sub-query)
with c2 as (
  select wd d, wt t from weird.t
)
select 1 from c2 where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

-- Q3 (sub-select instead of `with`)
select 1 from (
  select arrayd.ad d, coalesce(c.t, 0) t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
        order by wt desc limit 1
    ) c on true
  ) c2
  where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

-- Q4 (remove `order by limit` from Q3)
select 1 from (
  select arrayd.ad d, coalesce(c.t, 0) t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
    ) c on true
  ) c2
  where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

-- Q5 (remove `coalesce` from Q4)
select 1 from (
  select arrayd.ad d, c.t t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
    ) c on true
  ) c2
  where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

```

I ran the same queries on various PostgreSQL server versions, either from
local Debian packages, or using the official docker images.
All queries are run using psql 15.6 (Debian 15.6-0+deb12u1), by copy-pasting
the script above.

Results
```
              Q1 Q2 Q3 Q4 Q5  `select version();`
docker 16.2   ✓  ✓  ✓  ✓  ✓   PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
docker 16.0   ✓  ✓  ✓  ✓  ✓   PostgreSQL 16.0 (Debian 16.0-1.pgdg120+1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
docker 15.6   ✗  ✓  ✗  ✗  ✓   PostgreSQL 15.6 (Debian 15.6-1.pgdg120+2) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
debian 15.6   ✗  ✓  ✗  ✗  ✓   PostgreSQL 15.6 (Debian 15.6-0+deb12u1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
debian  9.4   ✓  ✓  ✗  ✓  ✓   PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu,
compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
```

Note that the trivial version (Q2) and the version without coalesce (Q5) are
correct on every version.
When fetched separately, the inner select for c2 returns the same one-raw
table.

Thanks


Re: BUG #18429: Inconsistent results on similar queries with join lateral

From
"David G. Johnston"
Date:
On Friday, April 12, 2024, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18429
Logged by:          Benoît Ryder
Email address:      b.ryder@ateme.com
PostgreSQL version: 15.6
Operating system:   Debian
Description:       

-- `where` clause should return false: (14 - 6) / 4 = (12 - 6) / 4 → false
select 1 from c2 where (14 - c2.t) / c2.d = (12 - c2.t) / c2.

You are doing integer division here and the right hand side equals, 1.5; I suppose something may have used to round that up to the integer 2 which would make both sides equals but now (v16) rounds it down (or more accurately truncates it) to 1.

David J.
I suspected a rounding issue at some point, but using floor didn't helped.
Displaying the values of `c2.t` shows more:
```
with c2 as (
  select arrayd.ad d, coalesce(c.t, 0) t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
        order by wt desc limit 1
    ) c on true
)
select c2.t, c2.d from c2
  where (14 - c2.t) / c2.d = (12 - c2.t) / c2.d;

 t | d
---+---
 0 | 4
(1 row)

with c2 as (
  select arrayd.ad d, coalesce(c.t, 0) t
    from unnest(ARRAY[4]) as arrayd(ad)
    left join lateral (
      select wt t from weird.t
        where wd = arrayd.ad
        order by wt desc limit 1
    ) c on true
)
select c2.t, c2.d from c2
  where true;

 t | d
---+---
 6 | 4
(1 row)
```

The only difference is the where clause, and it changes the returned value of `c2.t`.
If `c2.t = 0`, the where condition is true with proper integer rounding/truncation, which explains the result from Q1.

From: David G. Johnston <david.g.johnston@gmail.com>
Sent: Friday, April 12, 2024 15:52
To: Benoit Ryder <b.ryder@ateme.com>; pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #18429: Inconsistent results on similar queries with join lateral
 
You don't often get email from david.g.johnston@gmail.com. Learn why this is important
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.

On Friday, April 12, 2024, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18429
Logged by:          Benoît Ryder
Email address:      b.ryder@ateme.com
PostgreSQL version: 15.6
Operating system:   Debian
Description:       

-- `where` clause should return false: (14 - 6) / 4 = (12 - 6) / 4 → false
select 1 from c2 where (14 - c2.t) / c2.d = (12 - c2.t) / c2.

You are doing integer division here and the right hand side equals, 1.5; I suppose something may have used to round that up to the integer 2 which would make both sides equals but now (v16) rounds it down (or more accurately truncates it) to 1.

David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Friday, April 12, 2024, PG Bug reporting form <noreply@postgresql.org>
> wrote:
>> -- `where` clause should return false: (14 - 6) / 4 = (12 - 6) / 4 → false
>> select 1 from c2 where (14 - c2.t) / c2.d = (12 - c2.t) / c2.

> You are doing integer division here and the right hand side equals, 1.5; I
> suppose something may have used to round that up to the integer 2 which
> would make both sides equals but now (v16) rounds it down (or more
> accurately truncates it) to 1.

I don't think that's relevant.  It's clear that the older versions
are returning inconsistent answers.  Comparing EXPLAIN output shows
why --- current HEAD produces (for Q1)

 Nested Loop Left Join  (cost=0.16..3.48 rows=1 width=4)
   Filter: (((14 - COALESCE(t.wt, 0)) / arrayd.ad) = ((12 - COALESCE(t.wt, 0)) / arrayd.ad))
   ->  Function Scan on unnest arrayd  (cost=0.00..0.01 rows=1 width=4)
   ->  Limit  (cost=0.15..3.45 rows=1 width=4)
         ->  Index Only Scan Backward using t_pkey on t  (cost=0.15..36.35 rows=11 width=4)
               Index Cond: (wd = arrayd.ad)

while v15 produces

 Nested Loop Left Join  (cost=0.16..3.50 rows=1 width=4)
   Filter: (((14 - COALESCE(c.t, 0)) / arrayd.ad) = ((12 - COALESCE(c.t, 0)) / arrayd.ad))
   ->  Function Scan on unnest arrayd  (cost=0.00..0.01 rows=1 width=4)
   ->  Subquery Scan on c  (cost=0.15..3.47 rows=1 width=4)
         Filter: (((14 - COALESCE(c.t, 0)) / arrayd.ad) = ((12 - COALESCE(c.t, 0)) / arrayd.ad))
         ->  Limit  (cost=0.15..3.45 rows=1 width=4)
               ->  Index Only Scan Backward using t_pkey on t  (cost=0.15..36.35 rows=11 width=4)
                     Index Cond: (wd = arrayd.ad)

So there's the problem: a copy of the upper WHERE clause is being
inappropriately applied below the outer join, and that filters out the
only row of the "c" subselect.  Then when we re-apply the WHERE at
top level, the COALESCEs produce 0 allowing the condition to evaluate
as true.

(I didn't check Q3-Q5 in detail, but probably they've got variants
of the same issue.)

"git bisect" fingers this commit as the first one producing correct
answers:

commit 2489d76c4906f4461a364ca8ad7e0751ead8aa0d
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Jan 30 13:16:20 2023 -0500

    Make Vars be outer-join-aware.

This is kind of exciting for me, as IIRC it's the first field-detected
bug that that work fixes.  However, I'm not real sure right now how
we might fix it in the back branches ...

            regards, tom lane



> This is kind of exciting for me, as IIRC it's the first field-detected
> bug that that work fixes.  However, I'm not real sure right now how
> we might fix it in the back branches ...

Thanks for the insight! And I'm glad this report can be help. :)

Would you know a way to avoid this bug? I could probably tweak the query
until the result is looking good, but I would prefer to make sure the
bug won't reappear if the query is used with different data or context.



Benoit Ryder <b.ryder@ateme.com> writes:
> Would you know a way to avoid this bug? I could probably tweak the query
> until the result is looking good, but I would prefer to make sure the
> bug won't reappear if the query is used with different data or context.

I'm not seeing a bulletproof way offhand, other than "update to v16".
Disabling nestloop plans fixes your Q4 in the back branches, but not
Q1 (because there's no opportunity for a hash or merge join in Q1).

It looks like the problem is that the old join_clause_is_movable
logic is incorrectly deciding that the WHERE condition can be
pushed down to the sub-select relation.  So we should be able to
fix it there, but I'm not sure how messy that will be or whether
we'll lose the ability to generate some correct plans.  It's hard
to justify putting a huge amount of work into old branches though.

            regards, tom lane



Re: BUG #18429: Inconsistent results on similar queries with join lateral

From
"David G. Johnston"
Date:
On Fri, Apr 12, 2024 at 8:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Benoit Ryder <b.ryder@ateme.com> writes:
> Would you know a way to avoid this bug? I could probably tweak the query
> until the result is looking good, but I would prefer to make sure the
> bug won't reappear if the query is used with different data or context.

I'm not seeing a bulletproof way offhand, other than "update to v16".
Disabling nestloop plans fixes your Q4 in the back branches, but not
Q1 (because there's no opportunity for a hash or merge join in Q1).


Q1 is trivially fixed by specifying "WITH c2 AS MATERIALIZED ( ... )", No?  Which is why it isn't broken in 9.4 and any of the other versions where we materialized CTEs by default.

David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Fri, Apr 12, 2024 at 8:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not seeing a bulletproof way offhand, other than "update to v16".

> Q1 is trivially fixed by specifying "WITH c2 AS MATERIALIZED ( ... )", No?

Q1 as stated, yeah.  But the question was about generically avoiding
this bug, and I'm not yet sure about how to do that in cases not
involving a WITH boundary (eg Q3,Q4).

            regards, tom lane



I wrote:
> It looks like the problem is that the old join_clause_is_movable
> logic is incorrectly deciding that the WHERE condition can be
> pushed down to the sub-select relation.

Nope, that's not accurate: I soon found that the
join_clause_is_movable functions aren't being invoked at all for the
troublesome clause.  It turns out that that clause is generated by
generate_join_implied_equalities (after having been, rather uselessly,
decomposed into an EquivalenceClass), and the reason it gets into the
scan-level condition is that get_baserel_parampathinfo puts it there
without any movability checking.  That happens because
get_baserel_parampathinfo believes this:

     * Add in joinclauses generated by EquivalenceClasses, too.  (These
     * necessarily satisfy join_clause_is_movable_into.)

But this clause *doesn't* satisfy that function; in the back branches
it'll fail the test against nullable_relids.  So I said to myself,
great, we can fix this by filtering the output clauses with
join_clause_is_movable_into.  That takes care of the submitted bug
all right, but it turns out that it also causes some existing
regression test cases to give wrong answers.  And that is because
of a different problem: some clauses generated by
generate_join_implied_equalities don't satisfy the "Clause must
physically reference at least one target rel" heuristic in
join_clause_is_movable_into.  Specifically that fails if we have a
clause generated for a child appendrel member (ie, one arm of a
flattened UNION ALL construct) whose EC member expression is Var-free.

So this seems like a bit of a mess.  We can fix the submitted bug with
the kluge of only testing the nullable_relids condition, as I've done
in the attached patch for v15.  (As join_clause_is_movable_into says,
this condition is conservative and might sometimes reject a clause
that could be evaluated safely, but that's fine for
get_baserel_parampathinfo's purposes.)  But I'm worried about the
physically-reference business, because this function isn't the only
one that assumes that all output from generate_join_implied_equalities
will satisfy join_clause_is_movable_into: there are other functions
that actually assert that.  How come we've not seen assertion
failures, or reports of wrong query results similar to this one?
I can believe that the constant-EC-member situation is impossible
when considering a join as inner_rel, so get_joinrel_parampathinfo's
first usage is probably safe.  But it sure looks like the
generate_join_implied_equalities_for_ecs usage is not.  I wonder
if we can replace the test on clause_relids with something a bit
more reliable, or skip it when considering EC-derived clauses.

Anyway, if we go no further than this then we need the first
attached patch in <= v15, while in v16/HEAD we should at least
modify the comment to reflect reality, as I've done in the
second patch.

Thoughts?  Can anybody devise a test case that triggers the
Asserts I mentioned?

            regards, tom lane

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 3c75fd56f2..0b130c771f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1333,14 +1333,25 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,
     }

     /*
-     * Add in joinclauses generated by EquivalenceClasses, too.  (These
-     * necessarily satisfy join_clause_is_movable_into.)
+     * Add in joinclauses generated by EquivalenceClasses, too.  We do not
+     * check these against join_clause_is_movable_into, because its heuristic
+     * about physically referencing the target rel can fail in certain cases
+     * involving child rels whose EC member is a constant.  However, we must
+     * check its condition about the target rel not being nullable below the
+     * clause.  (The other conditions it checks should be true by construction
+     * for an EC-derived clause.)
      */
-    pclauses = list_concat(pclauses,
-                           generate_join_implied_equalities(root,
-                                                            joinrelids,
-                                                            required_outer,
-                                                            baserel));
+    foreach(lc, generate_join_implied_equalities(root,
+                                                 joinrelids,
+                                                 required_outer,
+                                                 baserel))
+    {
+        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+        /* Target rel must not be nullable below the clause */
+        if (!bms_overlap(baserel->relids, rinfo->nullable_relids))
+            pclauses = lappend(pclauses, rinfo);
+    }

     /* Estimate the number of rows returned by the parameterized scan */
     rows = get_parameterized_baserel_size(root, baserel, pclauses);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 867c6d20cc..b356153900 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5949,6 +5949,37 @@ select * from
  3 | 3
 (6 rows)

+-- check for generation of join EC conditions at wrong level (bug #18429)
+explain (costs off)
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+                                              QUERY PLAN
+-------------------------------------------------------------------------------------------------------
+ Nested Loop Left Join
+   Filter: ((COALESCE(tenk1.hundred, 0) * arrayd.ad) = (COALESCE(tenk1.hundred, 0) * (arrayd.ad + 1)))
+   ->  Function Scan on unnest arrayd
+   ->  Index Scan using tenk1_unique2 on tenk1
+         Index Cond: (unique2 = arrayd.ad)
+(5 rows)
+
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+ ad | h
+----+---
+(0 rows)
+
 -- check the number of columns specified
 SELECT * FROM (int8_tbl i cross join int4_tbl j) ss(a,b,c,d);
 ERROR:  join expression "ss" has 3 columns available but 4 columns specified
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1113e98445..11a857041a 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2029,6 +2029,25 @@ select * from
    (select q1.v)
   ) as q2;

+-- check for generation of join EC conditions at wrong level (bug #18429)
+explain (costs off)
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+
 -- check the number of columns specified
 SELECT * FROM (int8_tbl i cross join int4_tbl j) ss(a,b,c,d);

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 0c125e42e8..7da2f913ce 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1594,8 +1594,11 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,
     }

     /*
-     * Add in joinclauses generated by EquivalenceClasses, too.  (These
-     * necessarily satisfy join_clause_is_movable_into.)
+     * Add in joinclauses generated by EquivalenceClasses, too.  We do not
+     * check these against join_clause_is_movable_into, because its heuristic
+     * about physically referencing the target rel can fail in certain cases
+     * involving child rels whose EC member is a constant.  We must trust the
+     * EC mechanism to not produce joinclauses that don't belong here.
      */
     pclauses = list_concat(pclauses,
                            generate_join_implied_equalities(root,


On Sun, Apr 14, 2024 at 6:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So this seems like a bit of a mess.  We can fix the submitted bug with
the kluge of only testing the nullable_relids condition, as I've done
in the attached patch for v15.  (As join_clause_is_movable_into says,
this condition is conservative and might sometimes reject a clause
that could be evaluated safely, but that's fine for
get_baserel_parampathinfo's purposes.)

I wondered that we could fix this issue by checking em_nullable_relids
in generate_join_implied_equalities_normal when we determine if an EC
member is computable at the join.  Then I realize that this is not easy
to achieve without knowing the exact join(s) where the nulling would
happen, which is exactly what the nullingrel stuff introduced in v16
does.  So your proposed fix seems the right way to go.

Now that we've learned that join_clause_is_movable_into's heuristic
about physically referencing the target rel can fail for EC-derived
clauses, I'm kind of concerned that we may end up with duplicate clauses
in the final plan, since we do not check EC-derived clauses against
join_clause_is_movable_into in get_baserel_parampathinfo while we do in
create_nestloop_path.  What if we have an EC-derived clause that in
get_baserel_parampathinfo it is put into ppi_clauses while in
create_nestloop_path it does not pass the movability checking?  Is it
possible to occur, or is it just my illusion?

Thanks
Richard

On Fri, Apr 12, 2024 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"git bisect" fingers this commit as the first one producing correct
answers:

commit 2489d76c4906f4461a364ca8ad7e0751ead8aa0d
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Jan 30 13:16:20 2023 -0500

    Make Vars be outer-join-aware.

This is kind of exciting for me, as IIRC it's the first field-detected
bug that that work fixes.

+1.  As a big fan of the outer-join-aware-Var work, I'm so excited to
see field-detected bugs it fixes.

Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> I wondered that we could fix this issue by checking em_nullable_relids
> in generate_join_implied_equalities_normal when we determine if an EC
> member is computable at the join.  Then I realize that this is not easy
> to achieve without knowing the exact join(s) where the nulling would
> happen, which is exactly what the nullingrel stuff introduced in v16
> does.  So your proposed fix seems the right way to go.

Yeah.  The reason these queries work correctly in v16+ is that
get_baserel_parampathinfo calls generate_join_implied_equalities
with joinrelids that don't include the outer join's relid, so
the latter won't produce any join clauses that need to be evaluated
above the outer join.  But later, when build_joinrel_restrictlist
calls generate_join_implied_equalities, it does include the outer
join's relid, so we correctly produce and enforce the clause as a
filter clause at the outer join's plan level.

However, pre-v16, those two calls pass the exact same parameters
and necessarily get the exact same clauses.  We could only fix that
with an API change for generate_join_implied_equalities, which seems
dangerous in stable branches.  So I think filtering the clause list
in get_baserel_parampathinfo is the right direction for a solution
there.  We can make it cleaner than in my WIP patch though...

> Now that we've learned that join_clause_is_movable_into's heuristic
> about physically referencing the target rel can fail for EC-derived
> clauses, I'm kind of concerned that we may end up with duplicate clauses
> in the final plan, since we do not check EC-derived clauses against
> join_clause_is_movable_into in get_baserel_parampathinfo while we do in
> create_nestloop_path.  What if we have an EC-derived clause that in
> get_baserel_parampathinfo it is put into ppi_clauses while in
> create_nestloop_path it does not pass the movability checking?  Is it
> possible to occur, or is it just my illusion?

I'm not sure either, but it certainly seems like a hazard.  Also,
get_joinrel_parampathinfo is really depending on getting consistent
results from join_clause_is_movable_into to assign clauses to the
correct join level.  So after sleeping on it I think that "the
results of generate_join_implied_equalities should pass
join_clause_is_movable_into" is an invariant we don't really want to
let go of, meaning that it would be a good idea to fix equivclass.c
to ensure that's true in these child-rel corner cases.  That's not
very hard, requiring just a small hack in create_join_clause, as
attached.  It's not that much of a hack either because there are
other places in equivclass.c that force the relid sets for child
expressions to be more than what pull_varnos would conclude (search
for comments mentioning pull_varnos).

Having done that, we can add code in HEAD/v16 to assert that
join_clause_is_movable_into is true here, while in the older branches
we can use it to filter rather than klugily checking nullable_relids
directly.  So I end with the attached two patches.

I didn't include the new test case in the HEAD/v16 patch; since it
doesn't represent a live bug for those branches I felt like maybe
it's not worth the test cycles going forward.  But there's certainly
room for the opposite opinion.  What do you think?

            regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a619ff9177..1d6bedb399 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1896,6 +1896,21 @@ create_join_clause(PlannerInfo *root,
                                                   rightem->em_relids),
                                         ec->ec_min_security);

+    /*
+     * If either EM is a child, force the clause's clause_relids to include
+     * the relid(s) of the child rel.  In normal cases it would already, but
+     * not if we are considering appendrel child relations with pseudoconstant
+     * translated variables (i.e., UNION ALL sub-selects with constant output
+     * items).  We must do this so that join_clause_is_movable_into() will
+     * think that the clause should be evaluated at the correct place.
+     */
+    if (leftem->em_is_child)
+        rinfo->clause_relids = bms_add_members(rinfo->clause_relids,
+                                               leftem->em_relids);
+    if (rightem->em_is_child)
+        rinfo->clause_relids = bms_add_members(rinfo->clause_relids,
+                                               rightem->em_relids);
+
     /* If it's a child clause, copy the parent's rinfo_serial */
     if (parent_rinfo)
         rinfo->rinfo_serial = parent_rinfo->rinfo_serial;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 0c125e42e8..e05b21c884 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1560,6 +1560,7 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,
     ParamPathInfo *ppi;
     Relids        joinrelids;
     List       *pclauses;
+    List       *eqclauses;
     Bitmapset  *pserials;
     double        rows;
     ListCell   *lc;
@@ -1595,14 +1596,25 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,

     /*
      * Add in joinclauses generated by EquivalenceClasses, too.  (These
-     * necessarily satisfy join_clause_is_movable_into.)
+     * necessarily satisfy join_clause_is_movable_into; but in assert-enabled
+     * builds, let's verify that.)
      */
-    pclauses = list_concat(pclauses,
-                           generate_join_implied_equalities(root,
-                                                            joinrelids,
-                                                            required_outer,
-                                                            baserel,
-                                                            NULL));
+    eqclauses = generate_join_implied_equalities(root,
+                                                 joinrelids,
+                                                 required_outer,
+                                                 baserel,
+                                                 NULL);
+#ifdef USE_ASSERT_CHECKING
+    foreach(lc, eqclauses)
+    {
+        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+        Assert(join_clause_is_movable_into(rinfo,
+                                           baserel->relids,
+                                           joinrelids));
+    }
+#endif
+    pclauses = list_concat(pclauses, eqclauses);

     /* Compute set of serial numbers of the enforced clauses */
     pserials = NULL;
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 9f39f4661a..06f89a6732 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1851,6 +1851,21 @@ create_join_clause(PlannerInfo *root,
                                                   rightem->em_nullable_relids),
                                         ec->ec_min_security);

+    /*
+     * If either EM is a child, force the clause's clause_relids to include
+     * the relid(s) of the child rel.  In normal cases it would already, but
+     * not if we are considering appendrel child relations with pseudoconstant
+     * translated variables (i.e., UNION ALL sub-selects with constant output
+     * items).  We must do this so that join_clause_is_movable_into() will
+     * think that the clause should be evaluated at the correct place.
+     */
+    if (leftem->em_is_child)
+        rinfo->clause_relids = bms_add_members(rinfo->clause_relids,
+                                               leftem->em_relids);
+    if (rightem->em_is_child)
+        rinfo->clause_relids = bms_add_members(rinfo->clause_relids,
+                                               rightem->em_relids);
+
     /* Mark the clause as redundant, or not */
     rinfo->parent_ec = parent_ec;

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 3c75fd56f2..6e1c87a4e2 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1300,6 +1300,7 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,
     ParamPathInfo *ppi;
     Relids        joinrelids;
     List       *pclauses;
+    List       *eqclauses;
     double        rows;
     ListCell   *lc;

@@ -1333,14 +1334,24 @@ get_baserel_parampathinfo(PlannerInfo *root, RelOptInfo *baserel,
     }

     /*
-     * Add in joinclauses generated by EquivalenceClasses, too.  (These
-     * necessarily satisfy join_clause_is_movable_into.)
+     * Add in joinclauses generated by EquivalenceClasses, too.  In principle
+     * these should always satisfy join_clause_is_movable_into; but if we are
+     * below an outer join the clauses might contain Vars that should only be
+     * evaluated above the join, so we have to check.
      */
-    pclauses = list_concat(pclauses,
-                           generate_join_implied_equalities(root,
-                                                            joinrelids,
-                                                            required_outer,
-                                                            baserel));
+    eqclauses = generate_join_implied_equalities(root,
+                                                 joinrelids,
+                                                 required_outer,
+                                                 baserel);
+    foreach(lc, eqclauses)
+    {
+        RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+        if (join_clause_is_movable_into(rinfo,
+                                        baserel->relids,
+                                        joinrelids))
+            pclauses = lappend(pclauses, rinfo);
+    }

     /* Estimate the number of rows returned by the parameterized scan */
     rows = get_parameterized_baserel_size(root, baserel, pclauses);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 867c6d20cc..b356153900 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5949,6 +5949,37 @@ select * from
  3 | 3
 (6 rows)

+-- check for generation of join EC conditions at wrong level (bug #18429)
+explain (costs off)
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+                                              QUERY PLAN
+-------------------------------------------------------------------------------------------------------
+ Nested Loop Left Join
+   Filter: ((COALESCE(tenk1.hundred, 0) * arrayd.ad) = (COALESCE(tenk1.hundred, 0) * (arrayd.ad + 1)))
+   ->  Function Scan on unnest arrayd
+   ->  Index Scan using tenk1_unique2 on tenk1
+         Index Cond: (unique2 = arrayd.ad)
+(5 rows)
+
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+ ad | h
+----+---
+(0 rows)
+
 -- check the number of columns specified
 SELECT * FROM (int8_tbl i cross join int4_tbl j) ss(a,b,c,d);
 ERROR:  join expression "ss" has 3 columns available but 4 columns specified
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1113e98445..11a857041a 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2029,6 +2029,25 @@ select * from
    (select q1.v)
   ) as q2;

+-- check for generation of join EC conditions at wrong level (bug #18429)
+explain (costs off)
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+select * from (
+  select arrayd.ad, coalesce(c.hundred, 0) as h
+  from unnest(array[1]) as arrayd(ad)
+  left join lateral (
+    select hundred from tenk1 where unique2 = arrayd.ad
+  ) c on true
+) c2
+where c2.h * c2.ad = c2.h * (c2.ad + 1);
+
 -- check the number of columns specified
 SELECT * FROM (int8_tbl i cross join int4_tbl j) ss(a,b,c,d);



On Mon, Apr 15, 2024 at 12:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> Now that we've learned that join_clause_is_movable_into's heuristic
> about physically referencing the target rel can fail for EC-derived
> clauses, I'm kind of concerned that we may end up with duplicate clauses
> in the final plan, since we do not check EC-derived clauses against
> join_clause_is_movable_into in get_baserel_parampathinfo while we do in
> create_nestloop_path.  What if we have an EC-derived clause that in
> get_baserel_parampathinfo it is put into ppi_clauses while in
> create_nestloop_path it does not pass the movability checking?  Is it
> possible to occur, or is it just my illusion?

I'm not sure either, but it certainly seems like a hazard.  Also,
get_joinrel_parampathinfo is really depending on getting consistent
results from join_clause_is_movable_into to assign clauses to the
correct join level.  So after sleeping on it I think that "the
results of generate_join_implied_equalities should pass
join_clause_is_movable_into" is an invariant we don't really want to
let go of, meaning that it would be a good idea to fix equivclass.c
to ensure that's true in these child-rel corner cases.  That's not
very hard, requiring just a small hack in create_join_clause, as
attached.  It's not that much of a hack either because there are
other places in equivclass.c that force the relid sets for child
expressions to be more than what pull_varnos would conclude (search
for comments mentioning pull_varnos).

Having done that, we can add code in HEAD/v16 to assert that
join_clause_is_movable_into is true here, while in the older branches
we can use it to filter rather than klugily checking nullable_relids
directly.  So I end with the attached two patches.

+1 to both patches.
 
I didn't include the new test case in the HEAD/v16 patch; since it
doesn't represent a live bug for those branches I felt like maybe
it's not worth the test cycles going forward.  But there's certainly
room for the opposite opinion.  What do you think?

I agree that the new test case for v15 does not seem to be worth
including in v16+.  It seems to me that it would be better if we can
have another new test case to verify that we've included child rel's
em_relids even for appendrel child relations with pseudoconstant
translated variables, i.e. to verify that the change in equivclass.c
takes effect.  Maybe with a query like below:

explain (costs off)
select * from tenk1 t1
  left join lateral
    (select t1.unique1 as t1u, 0 as c
     union all
     select t1.unique1 as t1u, 1 as c) s on true
where t1.unique1 = s.c;

Without the change in equivclass.c, this query would trigger the new
added assert in get_baserel_parampathinfo for v16, and give a wrong plan
for v15.  What do you think?

Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> I agree that the new test case for v15 does not seem to be worth
> including in v16+.  It seems to me that it would be better if we can
> have another new test case to verify that we've included child rel's
> em_relids even for appendrel child relations with pseudoconstant
> translated variables, i.e. to verify that the change in equivclass.c
> takes effect.  Maybe with a query like below:

> explain (costs off)
> select * from tenk1 t1
>   left join lateral
>     (select t1.unique1 as t1u, 0 as c
>      union all
>      select t1.unique1 as t1u, 1 as c) s on true
> where t1.unique1 = s.c;

> Without the change in equivclass.c, this query would trigger the new
> added assert in get_baserel_parampathinfo for v16, and give a wrong plan
> for v15.  What do you think?

I didn't add such a test because there are already several cases
(in foreign_data.sql, IIRC) that trigger the assert, which is
how come I found the problem in the first place.  Admittedly,
those depend on potentially-changeable details of an
information_schema view, so maybe it'd be better to have a
bespoke test.

            regards, tom lane



I didn't add such a test because there are already several cases
(in foreign_data.sql, IIRC) that trigger the assert, which is
how come I found the problem in the first place.  Admittedly,
those depend on potentially-changeable details of an
information_schema view, so maybe it'd be better to have a
bespoke test.

I recently had a situation where having a bespoke test would have been helpful. Attached is a mostly-verbatim formalization of the tests from the original email. Q3 was of particular use to me.
Attachment