Thread: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause
The following bug has been logged on the website: Bug reference: 17976 Logged by: Zuming Jiang Email address: zuming.jiang@inf.ethz.ch PostgreSQL version: 16beta1 Operating system: Ubuntu 20.04 Description: My fuzzer finds a logical bug in Postgres, which makes Postgres return inconsistent results. --- Set up database --- create table t0 (c2 text); create table t1 (pkey int4, c4 int4, c5 text, c6 int4, c8 float8, c9 int4); create table t2 (c10 text, c12 timestamp); CREATE VIEW t3 AS SELECT '1' AS c_0 FROM (( SELECT ref_0.c2 AS c_0 FROM t0 ref_0 GROUP BY ref_0.c2) subq_0 FULL JOIN t2 ref_1 ON ((subq_0.c_0 = ref_1.c10))) WHERE (ref_1.c12 > ref_1.c12); CREATE VIEW t4 AS SELECT ref_1.c9 AS c_2, ref_1.c8 AS c_3, ref_1.c4 AS c_4, 1 AS c_6, ref_1.c6 AS c_9 FROM (t3 ref_0 RIGHT JOIN t1 ref_1 ON ((ref_0.c_0 = ref_1.c5))); insert into t1 values (11000, 0, null::text, 0, 0.0, 15); --- The fuzzer generates Test case 1: --- Test case 1 --- select count(*) as c_6 from (t1 as ref_15 left outer join t4 as ref_16 on (ref_15.pkey = ref_16.c_2)) where (case when (((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or ((ref_16.c_9 >= ref_16.c_4) is null)) then ref_16.c_3 else ref_16.c_3 end ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z') and (not (ref_15.c5 like '7%z'))) and ((ref_15.c5 like '7%z') is not null)) then ref_16.c_6 else ref_15.c8 end); --- Because `ref_16.c_9 >= ref_16.c_4` could only be TRUE, FALSE, or NULL, `(((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or ((ref_16.c_9 >= ref_16.c_4) is null))` must be TRUE. Therefore, I replace `(((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or ((ref_16.c_9 >= ref_16.c_4) is null))` with TRUE, and get Test case 2: --- Test case 2 --- select count(*) as c_6 from (t1 as ref_15 left outer join t4 as ref_16 on (ref_15.pkey = ref_16.c_2)) where (case when true then ref_16.c_3 else ref_16.c_3 end ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z') and (not (ref_15.c5 like '7%z'))) and ((ref_15.c5 like '7%z') is not null)) then ref_16.c_6 else ref_15.c8 end); --- --- Expected behavior --- Test case 1 and Test case 2 return the same results. --- Actual behavior --- Test case 1 returns 1, while Test case returns 0. --- Postgres version --- Github commit: 3f1aaaa180689f2015e7f7bd01c9be6d7a993b42 Version: PostgreSQL 16beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit --- Platform information --- Platform: Ubuntu 20.04 Kernel: Linux 5.4.0-147-generic
On Thu, 15 Jun 2023 at 04:27, PG Bug reporting form <noreply@postgresql.org> wrote: > My fuzzer finds a logical bug in Postgres, which makes Postgres return > inconsistent results. > --- Expected behavior --- > Test case 1 and Test case 2 return the same results. > > --- Actual behavior --- > Test case 1 returns 1, while Test case returns 0. Thank you for the report. The first bad commit seems to be b448f1c8 using your setup and the following query: explain select * from (t1 as ref_15 left outer join t4 as ref_16 on (ref_15.pkey = ref_16.c_2)) where (case when (((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or ((ref_16.c_9 >= ref_16.c_4) is null)) then ref_16.c_3 else ref_16.c_3 end ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z') and (not (ref_15.c5 like '7%z'))) and ((ref_15.c5 like '7%z') is not null)) then ref_16.c_6 else ref_15.c8 end); b448f1c8d8 gives: QUERY PLAN -------------------------------------------------------------------------------- Hash Right Join (cost=31.83..71760.52 rows=1881800 width=80) Hash Cond: (ref_1.c9 = ref_15.pkey) -> Nested Loop Left Join (cost=0.00..4895.70 rows=388000 width=24) -> Seq Scan on t1 ref_1 (cost=0.00..19.70 rows=970 width=52) -> Materialize (cost=0.00..27.00 rows=400 width=32) -> Seq Scan on t2 ref_1_1 (cost=0.00..25.00 rows=400 width=32) Filter: (c12 > c12) -> Hash (cost=19.70..19.70 rows=970 width=56) -> Seq Scan on t1 ref_15 (cost=0.00..19.70 rows=970 width=56) (9 rows) whereas the prior commit gives: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -------------------------------------------------------------------------------------------------------------------- Hash Right Join (cost=63.33..6475.96 rows=47 width=80) Hash Cond: (ref_1.c9 = ref_15.pkey) Filter: (CASE WHEN ((ref_1.c6 >= ref_1.c4) OR (ref_1.c6 < ref_1.c4) OR ((ref_1.c6 >= ref_1.c4) IS NULL)) THEN ref_1.c8 ELSE ref_1.c8 END = dcbrt(CASE WHEN ((ref_15.c5 ~~ '7%z'::text) AND (ref_15. c5 !~~ '7%z'::text) AND ((ref_15.c5 ~~ '7%z'::text) IS NOT NULL)) THEN ((1))::double precision ELSE ref_15.c8 END)) -> Nested Loop Left Join (cost=31.50..5898.27 rows=1940 width=24) Join Filter: ('1'::text = ref_1.c5) -> Seq Scan on t1 ref_1 (cost=0.00..19.70 rows=970 width=52) -> Materialize (cost=31.50..59.57 rows=400 width=0) -> Hash Left Join (cost=31.50..57.57 rows=400 width=0) Hash Cond: (ref_1_1.c10 = ref_0.c2) -> Seq Scan on t2 ref_1_1 (cost=0.00..25.00 rows=400 width=32) Filter: (c12 > c12) -> Hash (cost=29.00..29.00 rows=200 width=32) -> HashAggregate (cost=27.00..29.00 rows=200 width=32) Group Key: ref_0.c2 -> Seq Scan on t0 ref_0 (cost=0.00..23.60 rows=1360 width=32) -> Hash (cost=19.70..19.70 rows=970 width=56) -> Seq Scan on t1 ref_15 (cost=0.00..19.70 rows=970 width=56) (17 rows) so it looks like the join filter is being lost somewhere along the way. David
PG Bug reporting form <noreply@postgresql.org> writes: > My fuzzer finds a logical bug in Postgres, which makes Postgres return > inconsistent results. Thanks for the report! I poked at this a little bit. v16 is actually generating an incorrect plan for both query variants; the bad plan for the second variant just happens to not give visibly wrong answers for this input data. regression=# explain (costs off) regression-# select count(*) as c_6 regression-# from regression-# (t1 as ref_15 regression(# left outer join t4 as ref_16 regression(# on (ref_15.pkey = ref_16.c_2)) regression-# where (case when true regression(# then ref_16.c_3 else ref_16.c_3 end regression(# ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z') regression(# and (not (ref_15.c5 like '7%z'))) regression(# and ((ref_15.c5 like '7%z') is not null)) regression(# then ref_16.c_6 else ref_15.c8 end); QUERY PLAN --------------------------------------------------- Aggregate -> Hash Join Hash Cond: (ref_1.c9 = ref_15.pkey) -> Nested Loop Left Join Join Filter: ('1'::text = ref_1.c5) -> Seq Scan on t1 ref_1 -> Materialize -> Seq Scan on t2 ref_1_1 Filter: (c12 > c12) -> Hash -> Seq Scan on t1 ref_15 (11 rows) v15 and before give Aggregate -> Hash Join Hash Cond: (ref_1.c9 = ref_15.pkey) Join Filter: (dcbrt(CASE WHEN ((ref_15.c5 ~~ '7%z'::text) AND (ref_15.c5 !~~ '7%z'::text) AND ((ref_15.c5 ~~ '7%z'::text)IS NOT NULL)) THEN ((1))::double precision ELSE ref_15.c8 END) = ref_1.c8) -> Nested Loop Left Join Join Filter: ('1'::text = ref_1.c5) -> Seq Scan on t1 ref_1 -> Materialize -> Hash Left Join Hash Cond: (ref_1_1.c10 = ref_0.c2) -> Seq Scan on t2 ref_1_1 Filter: (c12 > c12) -> Hash -> HashAggregate Group Key: ref_0.c2 -> Seq Scan on t0 ref_0 -> Hash -> Seq Scan on t1 ref_15 So the good news is that v16 correctly recognizes that the left join to ref_0 can be discarded. (Older versions recognize this if you just select directly from t3 or t4, but fail to make that deduction when it's buried under an additional layer of outer join. I believe this better result is due to the outer-join-aware-Vars changes.) The bad news is that the top-level join filter condition has gone missing. That still happens even with a greatly simplified WHERE condition, ie this isn't really about CASE: regression=# explain (costs off) select count(*) as c_6 from (t1 as ref_15 left outer join t4 as ref_16 on (ref_15.pkey = ref_16.c_2)) where ref_16.c_6 = ref_15.c8; QUERY PLAN --------------------------------------------------- Aggregate -> Hash Right Join Hash Cond: (ref_1.c9 = ref_15.pkey) -> Nested Loop Left Join Join Filter: ('1'::text = ref_1.c5) -> Seq Scan on t1 ref_1 -> Materialize -> Seq Scan on t2 ref_1_1 Filter: (c12 > c12) -> Hash -> Seq Scan on t1 ref_15 (11 rows) We are converting this WHERE condition to an EquivalenceClass with members "ref_16.c_6" and "ref_15.c8", and what seems to be the problem is that analyzejoins.c fails to strip the removed rel(s) from the EquivalenceMember for "ref_16.c_6", so it never looks like we've reached a join level where it's time to enforce that. I'd always kind of wondered how we got away with not updating EquivalenceClasses during join removal, and the answer evidently is that we can't anymore. I've not tried to write a patch yet. regards, tom lane
David Rowley <dgrowleyml@gmail.com> writes: > The first bad commit seems to be b448f1c8 Yeah, I think the proximate cause is the removal of the delay_upper_joins restrictions in analyzejoins.c. That allows us to remove the join to ref_0, which prior versions didn't do; but then that exposes the need to clean up dead references in EquivalenceClasses. I'm wondering a bit now if there are variants of this that'd fail in pre-v16 branches. It's not obvious to me offhand why delay_upper_joins would have prevented all cases where a removable rel could be mentioned in an EquivalenceClass member (presumably via the ph_rels set of a PHV). regards, tom lane
On Thu, Jun 15, 2023 at 7:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd always kind of wondered how we got away with not updating
EquivalenceClasses during join removal, and the answer evidently
is that we can't anymore. I've not tried to write a patch yet.
I wondered about that too, and I thought that the target rel could not
be mentioned in any EC, otherwise join_is_removable should have noticed
that this rel is used above the join and thus decided that the join is
not removable. Hence there is no need to update ECs during join
removal. Apparently this is not right, as shown by this issue.
I went ahead and drafted a patch as attached. But I'm not sure if it
suffices to only update ec_relids, em_relids and ec_sources as the patch
does. Also I'm wondering if any of them would become empty after the
update.
Thanks
Richard
be mentioned in any EC, otherwise join_is_removable should have noticed
that this rel is used above the join and thus decided that the join is
not removable. Hence there is no need to update ECs during join
removal. Apparently this is not right, as shown by this issue.
I went ahead and drafted a patch as attached. But I'm not sure if it
suffices to only update ec_relids, em_relids and ec_sources as the patch
does. Also I'm wondering if any of them would become empty after the
update.
Thanks
Richard
Attachment
Richard Guo <guofenglinux@gmail.com> writes: > I went ahead and drafted a patch as attached. But I'm not sure if it > suffices to only update ec_relids, em_relids and ec_sources as the patch > does. Also I'm wondering if any of them would become empty after the > update. Thanks. I pushed this after making some changes: * I'm not convinced that it's safe to update only the ECs listed in the baserel's eclass_indexes; that would miss any ECs that mention only the outer join's relid and not the baserel's. Perhaps that's impossible but I don't feel comfortable about it. It shouldn't cost much more to just scan the whole eq_classes list here. * I pushed the update into a separate function for cosmetic reasons (mostly to make it easy to add comments similar to those for remove_rel_from_restrictinfo). * I added logic to remove dead EquivalenceMembers entirely. This is partly to make sure we don't generate bogus joinclauses using them, but mostly to save cycles in later examinations of the EC. * To be on the safe side I made it clear the ec_derived lists. We shouldn't really need whatever is in there anymore anyway, so it's probably not worth fixing those RestrictInfos. I noticed in testing that this frequently makes the whole EquivalenceClass a dead letter (with 0 or 1 surviving member), but sadly we can't remove it from eq_classes unless we want to rebuild all the eclass_indexes sets. Probably not worth it. regards, tom lane
On Fri, Jun 16, 2023 at 1:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Richard Guo <guofenglinux@gmail.com> writes: > > I went ahead and drafted a patch as attached. But I'm not sure if it > > suffices to only update ec_relids, em_relids and ec_sources as the patch > > does. Also I'm wondering if any of them would become empty after the > > update. > > Thanks. I pushed this after making some changes: > Does this commit close the open item: "join removal can no longer skip updating EquivalenceClasses"? If so, can we update it? -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > Does this commit close the open item: "join removal can no longer skip > updating EquivalenceClasses"? If so, can we update it? Yeah. I'd left that open because there are several perhaps-overlapping issues in the same area, but that bug as-stated is gone. regards, tom lane
On Mon, Jun 19, 2023 at 6:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > Does this commit close the open item: "join removal can no longer skip > > updating EquivalenceClasses"? If so, can we update it? > > Yeah. > Done. -- With Regards, Amit Kapila.