Thread: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17978 Logged by: Zuming Jiang Email address: zuming.jiang@inf.ethz.ch PostgreSQL version: 16beta1 Operating system: Ubuntu 20.04 Description: My fuzzer finds a bug in Postgres, which triggers an unexpected error in Postgres. --- Set up database --- CREATE TABLE t0 (vkey integer, c1 timestamp); CREATE TABLE t1 (vkey integer, pkey integer); CREATE TABLE t2 (c11 timestamp, c14 timestamp); CREATE VIEW t3 AS SELECT ref_0.vkey AS c_0 FROM (t0 ref_0 LEFT JOIN t2 ref_1 ON ((ref_0.c1 = ref_1.c11))) WHERE ((false < ( SELECT (ref_0.c1 <> ref_1.c14) AS c_0 FROM t2 ref_6 ORDER BY ref_0.c1, ref_1.c14 LIMIT 1))); CREATE VIEW t6 AS SELECT 1 AS c_0, 2 AS c_1, 3 AS c_2; INSERT INTO t1 VALUES (80, 90000); ALTER TABLE ONLY t1 ADD CONSTRAINT t1_pkey PRIMARY KEY (pkey); ------------------------ The fuzzer generates Test case: --- Test case --- select ref_38.c_0 as c_10 from ((t3 as ref_38 full outer join t6 as ref_39 on (ref_38.c_0 = ref_39.c_1)) right outer join t1 as ref_40 on (ref_38.c_0 = ref_40.vkey)); ------------------------ --- Expected behavior --- The test case should not trigger any error. --- Actual behavior --- The test case trigger an error: ERROR: wrong varnullingrels (b) (expected (b 5)) for Var 6/2 --- Postgres version --- Github commit: 3f1aaaa180689f2015e7f7bd01c9be6d7a993b42 Version: PostgreSQL 16beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit --- Platform information --- Platform: Ubuntu 20.04 Kernel: Linux 5.4.0-147-generic
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > My fuzzer finds a bug in Postgres, which triggers an unexpected error in > Postgres. Thanks for the report! I initially thought this was something wrong with subplan parameters, but it reproduces without a subplan. I reduced it to this test case in the regression database: select ss1.unique1 from (select t0.unique1 from onek t0 left join onek t1 on t0.hundred = t1.thousand where t0.hundred <> coalesce(t1.tenthous, -1)) as ss1 left join (select 1 as c_1) as ss2 on ss1.unique1 = ss2.c_1 right join int4_tbl i4 on ss1.unique1 = i4.f1; ERROR: wrong varnullingrels (b) (expected (b 5)) for Var 6/7 Temporarily disabling the setrefs.c crosscheck shows that it's trying to create this plan: Hash Right Join Hash Cond: (t1.thousand = t0.hundred) Filter: (t0.hundred <> COALESCE(t1.tenthous, '-1'::integer)) -> Seq Scan on onek t1 -> Hash -> Nested Loop Left Join -> Seq Scan on int4_tbl i4 -> Index Scan using onek_unique1 on onek t0 Index Cond: (unique1 = i4.f1) which is the wrong join order: the filter condition can't be applied at that join level. So the nullingrel cross-check has caught a real bug, but why the bug? Pre-v16, this would have been prevented by the delay_upper_joins mechanism. I convinced myself that we didn't need that anymore, but maybe I was mistaken. It could also be some smaller problem. It's curious that the bug doesn't reproduce if you remove the visibly-useless join to ss2: regression=# explain (costs off) select ss1.unique1 regression-# from regression-# (select t0.unique1 regression(# from onek t0 left join onek t1 on t0.hundred = t1.thousand regression(# where t0.hundred <> coalesce(t1.tenthous, -1)) as ss1 regression-# right join int4_tbl i4 on ss1.unique1 = i4.f1; Hash Right Join Hash Cond: (t0.unique1 = i4.f1) -> Hash Left Join Hash Cond: (t0.hundred = t1.thousand) Filter: (t0.hundred <> COALESCE(t1.tenthous, '-1'::integer)) -> Seq Scan on onek t0 -> Hash -> Seq Scan on onek t1 -> Hash -> Seq Scan on int4_tbl i4 So maybe the logic is fundamentally correct but gets confused by intermediate outer joins. Needs more time to look at than I have today, but I'll add an open item. regards, tom lane
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Tom Lane
Date:
I wrote: > ... which is the wrong join order: the filter condition can't be > applied at that join level. So the nullingrel cross-check > has caught a real bug, but why the bug? Pre-v16, this would > have been prevented by the delay_upper_joins mechanism. > I convinced myself that we didn't need that anymore, but > maybe I was mistaken. It could also be some smaller problem. > It's curious that the bug doesn't reproduce if you remove the > visibly-useless join to ss2: Ah-hah, I now understand why that is. Without the join to ss2, the FROM/WHERE clause is directly below the left join to int4_tbl, and remove_useless_result_rtes will hoist the problematic WHERE qual up into the upper left join's quals -- see the para beginning "This pass also replaces single-child FromExprs with their child node" in prepjointree.c. After that, we can see that the left join's quals reference both sides of the lower left join so identity 3 cannot apply. With the join to ss2, that intervening join prevents the hoisting from happening and then we incorrectly conclude that identity 3 can be used. I'm inclined to think that temporarily hoisting such quals into the upper left join's qual list is still the best solution, as anything else would require weird and bug-prone action-at-a-distance checks during deconstruct_jointree. However, we need to make it happen in this case where the hoisting needs to pass a qual from a lower WHERE in a left join's LHS up to the RHS of a higher left join. (I think that this is the only missing case. Intermediate joins that aren't LEFT will prevent commutation anyway, as will multi- member FROM joins.) I'm not sure if the best way is to extend that logic in remove_useless_result_rtes, or to rip it out and handle the problem during deconstruct_jointree. The latter would probably involve more new code, but it might end up cleaner. This whole business of removing trivial FromExprs is a bit outside what you'd expect remove_useless_result_rtes to do. If memory serves, I wrote that logic before inventing the new multi-pass architecture for deconstruct_jointree; it's possible that that change would make it easier to deal with this in deconstruct_jointree. I have no time to write any code today, but that seems like the direction to pursue. regards, tom lane
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Richard Guo
Date:
On Sun, Jun 18, 2023 at 3:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Temporarily disabling the setrefs.c crosscheck shows that it's
trying to create this plan:
which is the wrong join order: the filter condition can't be
applied at that join level. So the nullingrel cross-check
has caught a real bug, but why the bug?
FWIW, I found that we have an existing test query in sql/join.sql that
almost exposes this issue.
explain (costs off)
select * from int4_tbl t1
left join ((select t2.f1 from int4_tbl t2
left join int4_tbl t3 on t2.f1 > 0
where t3.f1 is null) s
left join tenk1 t4 on s.f1 > 1)
on s.f1 = t1.f1;
If we change the WHERE clause to 't2.f1 != coalesce(t3.f1, 1)', we will
see this issue.
explain (costs off)
select * from int4_tbl t1
left join ((select t2.f1 from int4_tbl t2
left join int4_tbl t3 on t2.f1 > 0
where t2.f1 != coalesce(t3.f1, 1)) s
left join tenk1 t4 on s.f1 > 1)
on s.f1 = t1.f1;
ERROR: wrong varnullingrels (b) (expected (b 5)) for Var 6/1
Thanks
Richard
almost exposes this issue.
explain (costs off)
select * from int4_tbl t1
left join ((select t2.f1 from int4_tbl t2
left join int4_tbl t3 on t2.f1 > 0
where t3.f1 is null) s
left join tenk1 t4 on s.f1 > 1)
on s.f1 = t1.f1;
If we change the WHERE clause to 't2.f1 != coalesce(t3.f1, 1)', we will
see this issue.
explain (costs off)
select * from int4_tbl t1
left join ((select t2.f1 from int4_tbl t2
left join int4_tbl t3 on t2.f1 > 0
where t2.f1 != coalesce(t3.f1, 1)) s
left join tenk1 t4 on s.f1 > 1)
on s.f1 = t1.f1;
ERROR: wrong varnullingrels (b) (expected (b 5)) for Var 6/1
Thanks
Richard
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes: > FWIW, I found that we have an existing test query in sql/join.sql that > almost exposes this issue. > explain (costs off) > select * from int4_tbl t1 > left join ((select t2.f1 from int4_tbl t2 > left join int4_tbl t3 on t2.f1 > 0 > where t3.f1 is null) s > left join tenk1 t4 on s.f1 > 1) > on s.f1 = t1.f1; > If we change the WHERE clause to 't2.f1 != coalesce(t3.f1, 1)', we will > see this issue. Interesting that you should bring up that query, because after quick-hacking a fix in remove_useless_results(), all the existing test cases pass except that that one changes plan: it now does the joins in syntactic order, whereas HEAD finds a plan that joins t4 last. Joining t4 last has a noticeably better cost estimate than the syntactic-order plan, so it's kind of sad that we can't find it anymore. On the other hand, v15 and before don't find it either, so this wouldn't be a regression. According to inspection of the SpecialJoinInfos, HEAD believes that the t2/t3 outer join of that query could commute according to identity 3 with the outermost left join between t1 and t2. It doesn't act on that observation, and I'm not sure it's even correct to believe that. This patch doesn't believe that anymore, but it still believes that the t1/t2 join could commute with the t2/t4 join. Why it no longer acts on that is unclear. There is probably some buglet or missed opportunity somewhere. Even odder is that given your modified query in which the WHERE clause mentions both t2 and t3, it *does* find the plan that joins t4 last, as shown in the new test case. This makes me wonder if it's something triggered by the avoid-clauseless-joins heuristic, though I can't quite see how that'd apply. Anyway, what I'm inclined to do is flesh out the attached by updating the comments for remove_useless_results() and then push it. Later on we can look for why it's not finding the better join order; that's a separable issue, and if it is about avoid-clauseless-joins then we might choose to live with it rather than incur a lot of planner cost to fix it. regards, tom lane diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c index 2f589b1b99..0091eaea23 100644 --- a/src/backend/optimizer/prep/prepjointree.c +++ b/src/backend/optimizer/prep/prepjointree.c @@ -3322,7 +3322,9 @@ remove_useless_results_recurse(PlannerInfo *root, Node *jtnode, */ j->larg = remove_useless_results_recurse(root, j->larg, (j->jointype == JOIN_INNER) ? - &j->quals : NULL, + &j->quals : + (j->jointype == JOIN_LEFT) ? + parent_quals : NULL, dropped_outer_joins); j->rarg = remove_useless_results_recurse(root, j->rarg, (j->jointype == JOIN_INNER || diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index cc4c122fdd..f7a34b76f8 100644 --- a/src/test/regress/expected/join.out +++ b/src/test/regress/expected/join.out @@ -2531,19 +2531,19 @@ select * from int4_tbl t1 on s.f1 = t1.f1; QUERY PLAN ------------------------------------------------- - Nested Loop Left Join - Join Filter: (t2.f1 > 1) - -> Hash Right Join - Hash Cond: (t2.f1 = t1.f1) + Hash Right Join + Hash Cond: (t2.f1 = t1.f1) + -> Nested Loop Left Join + Join Filter: (t2.f1 > 1) -> Nested Loop Left Join Join Filter: (t2.f1 > 0) Filter: (t3.f1 IS NULL) -> Seq Scan on int4_tbl t2 -> Materialize -> Seq Scan on int4_tbl t3 - -> Hash - -> Seq Scan on int4_tbl t1 - -> Seq Scan on tenk1 t4 + -> Seq Scan on tenk1 t4 + -> Hash + -> Seq Scan on int4_tbl t1 (13 rows) explain (costs off) @@ -2628,6 +2628,31 @@ select * from onek t1 Filter: (two = t2.two) (11 rows) +explain (costs off) +select * from int4_tbl t1 + left join ((select t2.f1 from int4_tbl t2 + left join int4_tbl t3 on t2.f1 > 0 + where t2.f1 <> coalesce(t3.f1, -1)) s + left join tenk1 t4 on s.f1 > 1) + on s.f1 = t1.f1; + QUERY PLAN +----------------------------------------------------------------- + Nested Loop Left Join + Join Filter: (t2.f1 > 1) + -> Hash Right Join + Hash Cond: (t2.f1 = t1.f1) + -> Nested Loop Left Join + Join Filter: (t2.f1 > 0) + Filter: (t2.f1 <> COALESCE(t3.f1, '-1'::integer)) + -> Seq Scan on int4_tbl t2 + -> Materialize + -> Seq Scan on int4_tbl t3 + -> Hash + -> Seq Scan on int4_tbl t1 + -> Materialize + -> Seq Scan on tenk1 t4 +(14 rows) + -- -- check a case where we formerly got confused by conflicting sort orders -- in redundant merge join path keys diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql index e77e469570..3683a562c2 100644 --- a/src/test/regress/sql/join.sql +++ b/src/test/regress/sql/join.sql @@ -528,6 +528,14 @@ select * from onek t1 (select * from onek t3 where t3.two = t2.two offset 0) s on t2.unique1 = 1; +explain (costs off) +select * from int4_tbl t1 + left join ((select t2.f1 from int4_tbl t2 + left join int4_tbl t3 on t2.f1 > 0 + where t2.f1 <> coalesce(t3.f1, -1)) s + left join tenk1 t4 on s.f1 > 1) + on s.f1 = t1.f1; + -- -- check a case where we formerly got confused by conflicting sort orders -- in redundant merge join path keys
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Tom Lane
Date:
I wrote: > Anyway, what I'm inclined to do is flesh out the attached by updating > the comments for remove_useless_results() and then push it. Later > on we can look for why it's not finding the better join order; that's > a separable issue, and if it is about avoid-clauseless-joins then we > might choose to live with it rather than incur a lot of planner cost > to fix it. I couldn't resist poking into that, and it seems there's less here than meets the eye. I'd been guessing that the presence or absence of a t2 reference in the WHERE clause was affecting this, but no: the SpecialJoinInfos look exactly the same for both queries, and the set of joins considered is the same in both. What is causing the different plan shape is that the selectivity estimates for these WHERE clauses are a lot different: regression=# explain regression-# select t2.f1 from int4_tbl t2 regression-# left join int4_tbl t3 on t2.f1 > 0 regression-# where t3.f1 is null; QUERY PLAN ----------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..2.49 rows=1 width=4) Join Filter: (t2.f1 > 0) Filter: (t3.f1 IS NULL) -> Seq Scan on int4_tbl t2 (cost=0.00..1.05 rows=5 width=4) -> Materialize (cost=0.00..1.07 rows=5 width=4) -> Seq Scan on int4_tbl t3 (cost=0.00..1.05 rows=5 width=4) (6 rows) regression=# explain regression-# select t2.f1 from int4_tbl t2 regression-# left join int4_tbl t3 on t2.f1 > 0 regression-# where t2.f1 <> coalesce(t3.f1, -1); QUERY PLAN ----------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..2.55 rows=8 width=4) Join Filter: (t2.f1 > 0) Filter: (t2.f1 <> COALESCE(t3.f1, '-1'::integer)) -> Seq Scan on int4_tbl t2 (cost=0.00..1.05 rows=5 width=4) -> Materialize (cost=0.00..1.07 rows=5 width=4) -> Seq Scan on int4_tbl t3 (cost=0.00..1.05 rows=5 width=4) and that ends up with the other join order looking better. We can synthesize a different non-strict, t3-only qual with a similar selectivity estimate: regression=# explain regression-# select t2.f1 from int4_tbl t2 regression-# left join int4_tbl t3 on t2.f1 > 0 regression-# where -1 <> coalesce(t3.f1, -1); QUERY PLAN ----------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..2.55 rows=8 width=4) Join Filter: (t2.f1 > 0) Filter: ('-1'::integer <> COALESCE(t3.f1, '-1'::integer)) -> Seq Scan on int4_tbl t2 (cost=0.00..1.05 rows=5 width=4) -> Materialize (cost=0.00..1.07 rows=5 width=4) -> Seq Scan on int4_tbl t3 (cost=0.00..1.05 rows=5 width=4) and then if you plug that into the whole query you get a t4-last plan: regression=# explain (costs off) regression-# select * from int4_tbl t1 regression-# left join ((select t2.f1 from int4_tbl t2 regression(# left join int4_tbl t3 on t2.f1 > 0 regression(# where -1 <> coalesce(t3.f1, -1) ) s regression(# left join tenk1 t4 on s.f1 > 1) regression-# on s.f1 = t1.f1; QUERY PLAN ------------------------------------------------------------------------- Nested Loop Left Join Join Filter: (t2.f1 > 1) -> Hash Right Join Hash Cond: (t2.f1 = t1.f1) -> Nested Loop Left Join Join Filter: (t2.f1 > 0) Filter: ('-1'::integer <> COALESCE(t3.f1, '-1'::integer)) -> Seq Scan on int4_tbl t2 -> Materialize -> Seq Scan on int4_tbl t3 -> Hash -> Seq Scan on int4_tbl t1 -> Materialize -> Seq Scan on tenk1 t4 (14 rows) So, nothing to see here after all. regards, tom lane
Re: BUG #17978: Unexpected error: "wrong varnullingrels (b) (expected (b 5)) for Var 6/2" triggered by JOIN
From
Richard Guo
Date:
On Tue, Jun 20, 2023 at 4:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> Anyway, what I'm inclined to do is flesh out the attached by updating
> the comments for remove_useless_results() and then push it. Later
> on we can look for why it's not finding the better join order; that's
> a separable issue, and if it is about avoid-clauseless-joins then we
> might choose to live with it rather than incur a lot of planner cost
> to fix it.
I couldn't resist poking into that, and it seems there's less here
than meets the eye. I'd been guessing that the presence or absence
of a t2 reference in the WHERE clause was affecting this, but no: the
SpecialJoinInfos look exactly the same for both queries, and the set
of joins considered is the same in both. What is causing the
different plan shape is that the selectivity estimates for these
WHERE clauses are a lot different:
I was also wondering why the plan changes for this existing query after
applying this fix. After some investigation I came to the same
conclusion: it's all about different size estimates.
For this query, we'd form joinrel that includes {t1, t2, t3}, with or
without the fix. However, without the fix we form the joinrel with
{t1/t2} and {t3}, and thus treat 't2.f1 > 0' and 't3.f1 IS NULL' as
restrict clauses, and with that we calculate the joinrel size as 1.
After applying this fix, the t1/t2 join is not legal anymore (which I
think is right), and we form joinrel {t1/t2/t3} with {t1} and {t2/t3}.
This time we treat 't2.f1 = t1.f1' as restrict clause and calculate the
joinrel size as 5. I manually changed its size to 1 with gdb and then
the final plan changed back to the previous one, ie, the one that joins
t4 last, with exactly the same cost as previously.
Thanks
Richard
applying this fix. After some investigation I came to the same
conclusion: it's all about different size estimates.
For this query, we'd form joinrel that includes {t1, t2, t3}, with or
without the fix. However, without the fix we form the joinrel with
{t1/t2} and {t3}, and thus treat 't2.f1 > 0' and 't3.f1 IS NULL' as
restrict clauses, and with that we calculate the joinrel size as 1.
After applying this fix, the t1/t2 join is not legal anymore (which I
think is right), and we form joinrel {t1/t2/t3} with {t1} and {t2/t3}.
This time we treat 't2.f1 = t1.f1' as restrict clause and calculate the
joinrel size as 5. I manually changed its size to 1 with gdb and then
the final plan changed back to the previous one, ie, the one that joins
t4 last, with exactly the same cost as previously.
Thanks
Richard