Thread: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN

The following bug has been logged on the website:

Bug reference:      17978
Logged by:          Zuming Jiang
Email address:      zuming.jiang@inf.ethz.ch
PostgreSQL version: 16beta1
Operating system:   Ubuntu 20.04
Description:

My fuzzer finds a bug in Postgres, which triggers an unexpected error in
Postgres.

--- Set up database ---
CREATE TABLE t0 (vkey integer, c1 timestamp);
CREATE TABLE t1 (vkey integer, pkey integer);
CREATE TABLE t2 (c11 timestamp, c14 timestamp);
CREATE VIEW t3 AS
 SELECT ref_0.vkey AS c_0
   FROM (t0 ref_0
     LEFT JOIN t2 ref_1 ON ((ref_0.c1 = ref_1.c11)))
  WHERE ((false < ( SELECT (ref_0.c1 <> ref_1.c14) AS c_0
           FROM t2 ref_6
         ORDER BY ref_0.c1, ref_1.c14
        LIMIT 1)));
CREATE VIEW t6 AS SELECT 1 AS c_0, 2 AS c_1, 3 AS c_2;
INSERT INTO t1 VALUES (80, 90000);
ALTER TABLE ONLY t1 ADD CONSTRAINT t1_pkey PRIMARY KEY (pkey);
------------------------

The fuzzer generates Test case:

--- Test case ---
select  
    ref_38.c_0 as c_10 
  from 
    ((t3 as ref_38
        full outer join t6 as ref_39
        on (ref_38.c_0 = ref_39.c_1))
      right outer join t1 as ref_40
      on (ref_38.c_0 = ref_40.vkey));
------------------------

--- Expected behavior ---
The test case should not trigger any error.

--- Actual behavior ---
The test case trigger an error: 

ERROR:  wrong varnullingrels (b) (expected (b 5)) for Var 6/2

--- Postgres version ---
Github commit: 3f1aaaa180689f2015e7f7bd01c9be6d7a993b42
Version: PostgreSQL 16beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

--- Platform information ---
Platform: Ubuntu 20.04
Kernel: Linux 5.4.0-147-generic


PG Bug reporting form <noreply@postgresql.org> writes:
> My fuzzer finds a bug in Postgres, which triggers an unexpected error in
> Postgres.

Thanks for the report!

I initially thought this was something wrong with subplan parameters,
but it reproduces without a subplan.  I reduced it to this test case
in the regression database:

select ss1.unique1
from
  (select t0.unique1
   from onek t0 left join onek t1 on t0.hundred = t1.thousand
   where t0.hundred <> coalesce(t1.tenthous, -1)) as ss1
  left join (select 1 as c_1) as ss2 on ss1.unique1 = ss2.c_1
  right join int4_tbl i4 on ss1.unique1 = i4.f1;

ERROR:  wrong varnullingrels (b) (expected (b 5)) for Var 6/7

Temporarily disabling the setrefs.c crosscheck shows that it's
trying to create this plan:

 Hash Right Join
   Hash Cond: (t1.thousand = t0.hundred)
   Filter: (t0.hundred <> COALESCE(t1.tenthous, '-1'::integer))
   ->  Seq Scan on onek t1
   ->  Hash
         ->  Nested Loop Left Join
               ->  Seq Scan on int4_tbl i4
               ->  Index Scan using onek_unique1 on onek t0
                     Index Cond: (unique1 = i4.f1)

which is the wrong join order: the filter condition can't be
applied at that join level.  So the nullingrel cross-check
has caught a real bug, but why the bug?  Pre-v16, this would
have been prevented by the delay_upper_joins mechanism.
I convinced myself that we didn't need that anymore, but
maybe I was mistaken.  It could also be some smaller problem.
It's curious that the bug doesn't reproduce if you remove the
visibly-useless join to ss2:

regression=# explain (costs off) select ss1.unique1
regression-# from
regression-#   (select t0.unique1
regression(#    from onek t0 left join onek t1 on t0.hundred = t1.thousand
regression(#    where t0.hundred <> coalesce(t1.tenthous, -1)) as ss1
regression-# right join int4_tbl i4 on ss1.unique1 = i4.f1;

 Hash Right Join
   Hash Cond: (t0.unique1 = i4.f1)
   ->  Hash Left Join
         Hash Cond: (t0.hundred = t1.thousand)
         Filter: (t0.hundred <> COALESCE(t1.tenthous, '-1'::integer))
         ->  Seq Scan on onek t0
         ->  Hash
               ->  Seq Scan on onek t1
   ->  Hash
         ->  Seq Scan on int4_tbl i4

So maybe the logic is fundamentally correct but gets confused by
intermediate outer joins.  Needs more time to look at than I have
today, but I'll add an open item.

            regards, tom lane



I wrote:
> ... which is the wrong join order: the filter condition can't be
> applied at that join level.  So the nullingrel cross-check
> has caught a real bug, but why the bug?  Pre-v16, this would
> have been prevented by the delay_upper_joins mechanism.
> I convinced myself that we didn't need that anymore, but
> maybe I was mistaken.  It could also be some smaller problem.
> It's curious that the bug doesn't reproduce if you remove the
> visibly-useless join to ss2:

Ah-hah, I now understand why that is.  Without the join to ss2,
the FROM/WHERE clause is directly below the left join to int4_tbl,
and remove_useless_result_rtes will hoist the problematic WHERE
qual up into the upper left join's quals -- see the para beginning
"This pass also replaces single-child FromExprs with their child node"
in prepjointree.c.  After that, we can see that the left join's quals
reference both sides of the lower left join so identity 3 cannot apply.

With the join to ss2, that intervening join prevents the hoisting from
happening and then we incorrectly conclude that identity 3 can be used.

I'm inclined to think that temporarily hoisting such quals into the
upper left join's qual list is still the best solution, as anything
else would require weird and bug-prone action-at-a-distance checks
during deconstruct_jointree.  However, we need to make it happen in
this case where the hoisting needs to pass a qual from a lower WHERE
in a left join's LHS up to the RHS of a higher left join.

(I think that this is the only missing case.  Intermediate joins
that aren't LEFT will prevent commutation anyway, as will multi-
member FROM joins.)

I'm not sure if the best way is to extend that logic in
remove_useless_result_rtes, or to rip it out and handle the
problem during deconstruct_jointree.  The latter would probably
involve more new code, but it might end up cleaner.  This whole
business of removing trivial FromExprs is a bit outside what
you'd expect remove_useless_result_rtes to do.  If memory serves,
I wrote that logic before inventing the new multi-pass architecture
for deconstruct_jointree; it's possible that that change would
make it easier to deal with this in deconstruct_jointree.

I have no time to write any code today, but that seems like the
direction to pursue.

            regards, tom lane




On Sun, Jun 18, 2023 at 3:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Temporarily disabling the setrefs.c crosscheck shows that it's
trying to create this plan:

which is the wrong join order: the filter condition can't be
applied at that join level.  So the nullingrel cross-check
has caught a real bug, but why the bug? 

FWIW, I found that we have an existing test query in sql/join.sql that
almost exposes this issue.

explain (costs off)
select * from int4_tbl t1
  left join ((select t2.f1 from int4_tbl t2
                left join int4_tbl t3 on t2.f1 > 0
                where t3.f1 is null) s
             left join tenk1 t4 on s.f1 > 1)
    on s.f1 = t1.f1;

If we change the WHERE clause to 't2.f1 != coalesce(t3.f1, 1)', we will
see this issue.

explain (costs off)
select * from int4_tbl t1
  left join ((select t2.f1 from int4_tbl t2
                left join int4_tbl t3 on t2.f1 > 0
                where t2.f1 != coalesce(t3.f1, 1)) s
             left join tenk1 t4 on s.f1 > 1)
    on s.f1 = t1.f1;
ERROR:  wrong varnullingrels (b) (expected (b 5)) for Var 6/1

Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> FWIW, I found that we have an existing test query in sql/join.sql that
> almost exposes this issue.

> explain (costs off)
> select * from int4_tbl t1
>   left join ((select t2.f1 from int4_tbl t2
>                 left join int4_tbl t3 on t2.f1 > 0
>                 where t3.f1 is null) s
>              left join tenk1 t4 on s.f1 > 1)
>     on s.f1 = t1.f1;

> If we change the WHERE clause to 't2.f1 != coalesce(t3.f1, 1)', we will
> see this issue.

Interesting that you should bring up that query, because after
quick-hacking a fix in remove_useless_results(), all the existing
test cases pass except that that one changes plan: it now does the
joins in syntactic order, whereas HEAD finds a plan that joins t4
last.  Joining t4 last has a noticeably better cost estimate than
the syntactic-order plan, so it's kind of sad that we can't find
it anymore.  On the other hand, v15 and before don't find it either,
so this wouldn't be a regression.

According to inspection of the SpecialJoinInfos, HEAD believes
that the t2/t3 outer join of that query could commute according
to identity 3 with the outermost left join between t1 and t2.
It doesn't act on that observation, and I'm not sure it's even
correct to believe that.  This patch doesn't believe that anymore,
but it still believes that the t1/t2 join could commute with the
t2/t4 join.  Why it no longer acts on that is unclear.  There is
probably some buglet or missed opportunity somewhere.

Even odder is that given your modified query in which the WHERE
clause mentions both t2 and t3, it *does* find the plan that
joins t4 last, as shown in the new test case.  This makes me
wonder if it's something triggered by the avoid-clauseless-joins
heuristic, though I can't quite see how that'd apply.

Anyway, what I'm inclined to do is flesh out the attached by updating
the comments for remove_useless_results() and then push it.  Later
on we can look for why it's not finding the better join order; that's
a separable issue, and if it is about avoid-clauseless-joins then we
might choose to live with it rather than incur a lot of planner cost
to fix it.

            regards, tom lane

diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 2f589b1b99..0091eaea23 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -3322,7 +3322,9 @@ remove_useless_results_recurse(PlannerInfo *root, Node *jtnode,
          */
         j->larg = remove_useless_results_recurse(root, j->larg,
                                                  (j->jointype == JOIN_INNER) ?
-                                                 &j->quals : NULL,
+                                                 &j->quals :
+                                                 (j->jointype == JOIN_LEFT) ?
+                                                 parent_quals : NULL,
                                                  dropped_outer_joins);
         j->rarg = remove_useless_results_recurse(root, j->rarg,
                                                  (j->jointype == JOIN_INNER ||
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index cc4c122fdd..f7a34b76f8 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2531,19 +2531,19 @@ select * from int4_tbl t1
     on s.f1 = t1.f1;
                    QUERY PLAN
 -------------------------------------------------
- Nested Loop Left Join
-   Join Filter: (t2.f1 > 1)
-   ->  Hash Right Join
-         Hash Cond: (t2.f1 = t1.f1)
+ Hash Right Join
+   Hash Cond: (t2.f1 = t1.f1)
+   ->  Nested Loop Left Join
+         Join Filter: (t2.f1 > 1)
          ->  Nested Loop Left Join
                Join Filter: (t2.f1 > 0)
                Filter: (t3.f1 IS NULL)
                ->  Seq Scan on int4_tbl t2
                ->  Materialize
                      ->  Seq Scan on int4_tbl t3
-         ->  Hash
-               ->  Seq Scan on int4_tbl t1
-   ->  Seq Scan on tenk1 t4
+         ->  Seq Scan on tenk1 t4
+   ->  Hash
+         ->  Seq Scan on int4_tbl t1
 (13 rows)

 explain (costs off)
@@ -2628,6 +2628,31 @@ select * from onek t1
                            Filter: (two = t2.two)
 (11 rows)

+explain (costs off)
+select * from int4_tbl t1
+  left join ((select t2.f1 from int4_tbl t2
+                left join int4_tbl t3 on t2.f1 > 0
+                where t2.f1 <> coalesce(t3.f1, -1)) s
+             left join tenk1 t4 on s.f1 > 1)
+    on s.f1 = t1.f1;
+                           QUERY PLAN
+-----------------------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (t2.f1 > 1)
+   ->  Hash Right Join
+         Hash Cond: (t2.f1 = t1.f1)
+         ->  Nested Loop Left Join
+               Join Filter: (t2.f1 > 0)
+               Filter: (t2.f1 <> COALESCE(t3.f1, '-1'::integer))
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize
+                     ->  Seq Scan on int4_tbl t3
+         ->  Hash
+               ->  Seq Scan on int4_tbl t1
+   ->  Materialize
+         ->  Seq Scan on tenk1 t4
+(14 rows)
+
 --
 -- check a case where we formerly got confused by conflicting sort orders
 -- in redundant merge join path keys
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index e77e469570..3683a562c2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -528,6 +528,14 @@ select * from onek t1
       (select * from onek t3 where t3.two = t2.two offset 0) s
       on t2.unique1 = 1;

+explain (costs off)
+select * from int4_tbl t1
+  left join ((select t2.f1 from int4_tbl t2
+                left join int4_tbl t3 on t2.f1 > 0
+                where t2.f1 <> coalesce(t3.f1, -1)) s
+             left join tenk1 t4 on s.f1 > 1)
+    on s.f1 = t1.f1;
+
 --
 -- check a case where we formerly got confused by conflicting sort orders
 -- in redundant merge join path keys

I wrote:
> Anyway, what I'm inclined to do is flesh out the attached by updating
> the comments for remove_useless_results() and then push it.  Later
> on we can look for why it's not finding the better join order; that's
> a separable issue, and if it is about avoid-clauseless-joins then we
> might choose to live with it rather than incur a lot of planner cost
> to fix it.

I couldn't resist poking into that, and it seems there's less here
than meets the eye.  I'd been guessing that the presence or absence
of a t2 reference in the WHERE clause was affecting this, but no: the
SpecialJoinInfos look exactly the same for both queries, and the set
of joins considered is the same in both.  What is causing the
different plan shape is that the selectivity estimates for these
WHERE clauses are a lot different:

regression=# explain
regression-# select t2.f1 from int4_tbl t2
regression-#                 left join int4_tbl t3 on t2.f1 > 0
regression-#                 where t3.f1 is null;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..2.49 rows=1 width=4)
   Join Filter: (t2.f1 > 0)
   Filter: (t3.f1 IS NULL)
   ->  Seq Scan on int4_tbl t2  (cost=0.00..1.05 rows=5 width=4)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=4)
         ->  Seq Scan on int4_tbl t3  (cost=0.00..1.05 rows=5 width=4)
(6 rows)

regression=# explain
regression-# select t2.f1 from int4_tbl t2
regression-#                 left join int4_tbl t3 on t2.f1 > 0
regression-#                 where t2.f1 <> coalesce(t3.f1, -1);
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..2.55 rows=8 width=4)
   Join Filter: (t2.f1 > 0)
   Filter: (t2.f1 <> COALESCE(t3.f1, '-1'::integer))
   ->  Seq Scan on int4_tbl t2  (cost=0.00..1.05 rows=5 width=4)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=4)
         ->  Seq Scan on int4_tbl t3  (cost=0.00..1.05 rows=5 width=4)

and that ends up with the other join order looking better.  We can
synthesize a different non-strict, t3-only qual with a similar
selectivity estimate:

regression=# explain
regression-# select t2.f1 from int4_tbl t2
regression-#                 left join int4_tbl t3 on t2.f1 > 0
regression-#                 where    -1 <> coalesce(t3.f1, -1);
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..2.55 rows=8 width=4)
   Join Filter: (t2.f1 > 0)
   Filter: ('-1'::integer <> COALESCE(t3.f1, '-1'::integer))
   ->  Seq Scan on int4_tbl t2  (cost=0.00..1.05 rows=5 width=4)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=4)
         ->  Seq Scan on int4_tbl t3  (cost=0.00..1.05 rows=5 width=4)

and then if you plug that into the whole query you get a t4-last plan:

regression=# explain (costs off)
regression-# select * from int4_tbl t1
regression-#   left join ((select t2.f1 from int4_tbl t2
regression(#                 left join int4_tbl t3 on t2.f1 > 0
regression(#                 where    -1 <> coalesce(t3.f1, -1)   ) s
regression(#              left join tenk1 t4 on s.f1 > 1)
regression-#     on s.f1 = t1.f1;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Nested Loop Left Join
   Join Filter: (t2.f1 > 1)
   ->  Hash Right Join
         Hash Cond: (t2.f1 = t1.f1)
         ->  Nested Loop Left Join
               Join Filter: (t2.f1 > 0)
               Filter: ('-1'::integer <> COALESCE(t3.f1, '-1'::integer))
               ->  Seq Scan on int4_tbl t2
               ->  Materialize
                     ->  Seq Scan on int4_tbl t3
         ->  Hash
               ->  Seq Scan on int4_tbl t1
   ->  Materialize
         ->  Seq Scan on tenk1 t4
(14 rows)

So, nothing to see here after all.

            regards, tom lane




On Tue, Jun 20, 2023 at 4:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> Anyway, what I'm inclined to do is flesh out the attached by updating
> the comments for remove_useless_results() and then push it.  Later
> on we can look for why it's not finding the better join order; that's
> a separable issue, and if it is about avoid-clauseless-joins then we
> might choose to live with it rather than incur a lot of planner cost
> to fix it.

I couldn't resist poking into that, and it seems there's less here
than meets the eye.  I'd been guessing that the presence or absence
of a t2 reference in the WHERE clause was affecting this, but no: the
SpecialJoinInfos look exactly the same for both queries, and the set
of joins considered is the same in both.  What is causing the
different plan shape is that the selectivity estimates for these
WHERE clauses are a lot different:

I was also wondering why the plan changes for this existing query after
applying this fix.  After some investigation I came to the same
conclusion: it's all about different size estimates.

For this query, we'd form joinrel that includes {t1, t2, t3}, with or
without the fix.  However, without the fix we form the joinrel with
{t1/t2} and {t3}, and thus treat 't2.f1 > 0' and 't3.f1 IS NULL' as
restrict clauses, and with that we calculate the joinrel size as 1.
After applying this fix, the t1/t2 join is not legal anymore (which I
think is right), and we form joinrel {t1/t2/t3} with {t1} and {t2/t3}.
This time we treat 't2.f1 = t1.f1' as restrict clause and calculate the
joinrel size as 5.  I manually changed its size to 1 with gdb and then
the final plan changed back to the previous one, ie, the one that joins
t4 last, with exactly the same cost as previously.

Thanks
Richard