Thread: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

PG Bug reporting form

Date:

14 June 2023, 18:59:47

The following bug has been logged on the website:

Bug reference:      17976
Logged by:          Zuming Jiang
Email address:      zuming.jiang@inf.ethz.ch
PostgreSQL version: 16beta1
Operating system:   Ubuntu 20.04
Description:

My fuzzer finds a logical bug in Postgres, which makes Postgres return
inconsistent results.

--- Set up database ---
create table t0 (c2 text);
create table t1 (pkey int4, c4 int4, c5 text, c6 int4, c8 float8, c9
int4);
create table t2 (c10 text, c12 timestamp);
CREATE VIEW t3 AS
 SELECT '1' AS c_0
   FROM (( SELECT ref_0.c2 AS c_0
           FROM t0 ref_0
          GROUP BY ref_0.c2) subq_0
     FULL JOIN t2 ref_1 ON ((subq_0.c_0 = ref_1.c10)))
  WHERE (ref_1.c12 > ref_1.c12);
CREATE VIEW t4 AS
 SELECT
    ref_1.c9 AS c_2,
    ref_1.c8 AS c_3,
    ref_1.c4 AS c_4,
    1 AS c_6,
    ref_1.c6 AS c_9
   FROM (t3 ref_0
     RIGHT JOIN t1 ref_1 ON ((ref_0.c_0 = ref_1.c5)));
insert into t1 values (11000, 0, null::text, 0, 0.0, 15);
---

The fuzzer generates Test case 1:

--- Test case 1 ---
select count(*) as c_6
  from
    (t1 as ref_15
      left outer join t4 as ref_16
      on (ref_15.pkey = ref_16.c_2))
  where (case when (((ref_16.c_9 >= ref_16.c_4)
                   or (not (ref_16.c_9 >= ref_16.c_4)))
                   or ((ref_16.c_9 >= ref_16.c_4) is null))
              then ref_16.c_3 else ref_16.c_3 end
        ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z')
                                       and (not (ref_15.c5 like '7%z')))
                                       and ((ref_15.c5 like '7%z') is not
null))
                                  then ref_16.c_6 else ref_15.c8 end);
---

Because `ref_16.c_9 >= ref_16.c_4` could only be TRUE, FALSE, or NULL,
`(((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or
((ref_16.c_9 >= ref_16.c_4) is null))` must be TRUE. Therefore, I replace
`(((ref_16.c_9 >= ref_16.c_4) or (not (ref_16.c_9 >= ref_16.c_4))) or
((ref_16.c_9 >= ref_16.c_4) is null))` with TRUE, and get Test case 2:

--- Test case 2 ---
select count(*) as c_6
  from
    (t1 as ref_15
      left outer join t4 as ref_16
      on (ref_15.pkey = ref_16.c_2))
  where (case when true
              then ref_16.c_3 else ref_16.c_3 end
        ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z')
                                       and (not (ref_15.c5 like '7%z')))
                                       and ((ref_15.c5 like '7%z') is not
null))
                                  then ref_16.c_6 else ref_15.c8 end);
---

--- Expected behavior ---
Test case 1 and Test case 2 return the same results.

--- Actual behavior ---
Test case 1 returns 1, while Test case returns 0.

--- Postgres version ---
Github commit: 3f1aaaa180689f2015e7f7bd01c9be6d7a993b42
Version: PostgreSQL 16beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

--- Platform information ---
Platform: Ubuntu 20.04
Kernel: Linux 5.4.0-147-generic

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

David Rowley

Date:

15 June 2023, 02:22:55

On Thu, 15 Jun 2023 at 04:27, PG Bug reporting form
<noreply@postgresql.org> wrote:
> My fuzzer finds a logical bug in Postgres, which makes Postgres return
> inconsistent results.
> --- Expected behavior ---
> Test case 1 and Test case 2 return the same results.
>
> --- Actual behavior ---
> Test case 1 returns 1, while Test case returns 0.

Thank you for the report.

The first bad commit seems to be b448f1c8

using your setup and the following query:

explain select *
  from
    (t1 as ref_15
      left outer join t4 as ref_16
      on (ref_15.pkey = ref_16.c_2))
  where (case when (((ref_16.c_9 >= ref_16.c_4)
                   or (not (ref_16.c_9 >= ref_16.c_4)))
                   or ((ref_16.c_9 >= ref_16.c_4) is null))
              then ref_16.c_3 else ref_16.c_3 end
        ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z')
                                       and (not (ref_15.c5 like '7%z')))
                                       and ((ref_15.c5 like '7%z') is not
null))
                                  then ref_16.c_6 else ref_15.c8 end);

b448f1c8d8 gives:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Hash Right Join  (cost=31.83..71760.52 rows=1881800 width=80)
   Hash Cond: (ref_1.c9 = ref_15.pkey)
   ->  Nested Loop Left Join  (cost=0.00..4895.70 rows=388000 width=24)
         ->  Seq Scan on t1 ref_1  (cost=0.00..19.70 rows=970 width=52)
         ->  Materialize  (cost=0.00..27.00 rows=400 width=32)
               ->  Seq Scan on t2 ref_1_1  (cost=0.00..25.00 rows=400 width=32)
                     Filter: (c12 > c12)
   ->  Hash  (cost=19.70..19.70 rows=970 width=56)
         ->  Seq Scan on t1 ref_15  (cost=0.00..19.70 rows=970 width=56)
(9 rows)

whereas the prior commit gives:



          QUERY PLAN


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=63.33..6475.96 rows=47 width=80)
   Hash Cond: (ref_1.c9 = ref_15.pkey)
   Filter: (CASE WHEN ((ref_1.c6 >= ref_1.c4) OR (ref_1.c6 < ref_1.c4)
OR ((ref_1.c6 >= ref_1.c4) IS NULL)) THEN ref_1.c8 ELSE ref_1.c8 END =
dcbrt(CASE WHEN ((ref_15.c5 ~~ '7%z'::text) AND (ref_15.
c5 !~~ '7%z'::text) AND ((ref_15.c5 ~~ '7%z'::text) IS NOT NULL)) THEN
((1))::double precision ELSE ref_15.c8 END))
   ->  Nested Loop Left Join  (cost=31.50..5898.27 rows=1940 width=24)
         Join Filter: ('1'::text = ref_1.c5)
         ->  Seq Scan on t1 ref_1  (cost=0.00..19.70 rows=970 width=52)
         ->  Materialize  (cost=31.50..59.57 rows=400 width=0)
               ->  Hash Left Join  (cost=31.50..57.57 rows=400 width=0)
                     Hash Cond: (ref_1_1.c10 = ref_0.c2)
                     ->  Seq Scan on t2 ref_1_1  (cost=0.00..25.00
rows=400 width=32)
                           Filter: (c12 > c12)
                     ->  Hash  (cost=29.00..29.00 rows=200 width=32)
                           ->  HashAggregate  (cost=27.00..29.00
rows=200 width=32)
                                 Group Key: ref_0.c2
                                 ->  Seq Scan on t0 ref_0
(cost=0.00..23.60 rows=1360 width=32)
   ->  Hash  (cost=19.70..19.70 rows=970 width=56)
         ->  Seq Scan on t1 ref_15  (cost=0.00..19.70 rows=970 width=56)
(17 rows)

so it looks like the join filter is being lost somewhere along the way.

David

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Tom Lane

Date:

15 June 2023, 02:34:23

PG Bug reporting form <noreply@postgresql.org> writes:
> My fuzzer finds a logical bug in Postgres, which makes Postgres return
> inconsistent results.

Thanks for the report!

I poked at this a little bit.  v16 is actually generating an incorrect
plan for both query variants; the bad plan for the second variant just
happens to not give visibly wrong answers for this input data.

regression=# explain (costs off)
regression-# select count(*) as c_6
regression-#   from
regression-#     (t1 as ref_15
regression(#       left outer join t4 as ref_16
regression(#       on (ref_15.pkey = ref_16.c_2))
regression-#   where (case when true
regression(#               then ref_16.c_3 else ref_16.c_3 end
regression(#         ) = pg_catalog.dcbrt(case when (((ref_15.c5 like '7%z')
regression(#                                        and (not (ref_15.c5 like '7%z')))
regression(#                                        and ((ref_15.c5 like '7%z') is not null))
regression(#                                   then ref_16.c_6 else ref_15.c8 end);
                    QUERY PLAN
---------------------------------------------------
 Aggregate
   ->  Hash Join
         Hash Cond: (ref_1.c9 = ref_15.pkey)
         ->  Nested Loop Left Join
               Join Filter: ('1'::text = ref_1.c5)
               ->  Seq Scan on t1 ref_1
               ->  Materialize
                     ->  Seq Scan on t2 ref_1_1
                           Filter: (c12 > c12)
         ->  Hash
               ->  Seq Scan on t1 ref_15
(11 rows)

v15 and before give

 Aggregate
   ->  Hash Join
         Hash Cond: (ref_1.c9 = ref_15.pkey)
         Join Filter: (dcbrt(CASE WHEN ((ref_15.c5 ~~ '7%z'::text) AND (ref_15.c5 !~~ '7%z'::text) AND ((ref_15.c5 ~~
'7%z'::text)IS NOT NULL)) THEN ((1))::double precision ELSE ref_15.c8 END) = ref_1.c8) 
         ->  Nested Loop Left Join
               Join Filter: ('1'::text = ref_1.c5)
               ->  Seq Scan on t1 ref_1
               ->  Materialize
                     ->  Hash Left Join
                           Hash Cond: (ref_1_1.c10 = ref_0.c2)
                           ->  Seq Scan on t2 ref_1_1
                                 Filter: (c12 > c12)
                           ->  Hash
                                 ->  HashAggregate
                                       Group Key: ref_0.c2
                                       ->  Seq Scan on t0 ref_0
         ->  Hash
               ->  Seq Scan on t1 ref_15

So the good news is that v16 correctly recognizes that the left join
to ref_0 can be discarded.  (Older versions recognize this if you
just select directly from t3 or t4, but fail to make that deduction
when it's buried under an additional layer of outer join.  I believe
this better result is due to the outer-join-aware-Vars changes.)
The bad news is that the top-level join filter condition has gone
missing.  That still happens even with a greatly simplified WHERE
condition, ie this isn't really about CASE:

regression=# explain (costs off)
select count(*) as c_6
  from
    (t1 as ref_15
      left outer join t4 as ref_16
      on (ref_15.pkey = ref_16.c_2))
  where ref_16.c_6 = ref_15.c8;
                    QUERY PLAN
---------------------------------------------------
 Aggregate
   ->  Hash Right Join
         Hash Cond: (ref_1.c9 = ref_15.pkey)
         ->  Nested Loop Left Join
               Join Filter: ('1'::text = ref_1.c5)
               ->  Seq Scan on t1 ref_1
               ->  Materialize
                     ->  Seq Scan on t2 ref_1_1
                           Filter: (c12 > c12)
         ->  Hash
               ->  Seq Scan on t1 ref_15
(11 rows)

We are converting this WHERE condition to an EquivalenceClass with
members "ref_16.c_6" and "ref_15.c8", and what seems to be the
problem is that analyzejoins.c fails to strip the removed rel(s)
from the EquivalenceMember for "ref_16.c_6", so it never looks
like we've reached a join level where it's time to enforce that.

I'd always kind of wondered how we got away with not updating
EquivalenceClasses during join removal, and the answer evidently
is that we can't anymore.  I've not tried to write a patch yet.

            regards, tom lane

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Tom Lane

Date:

15 June 2023, 02:55:27

David Rowley <dgrowleyml@gmail.com> writes:
> The first bad commit seems to be b448f1c8

Yeah, I think the proximate cause is the removal of the delay_upper_joins
restrictions in analyzejoins.c.  That allows us to remove the join to
ref_0, which prior versions didn't do; but then that exposes the need to
clean up dead references in EquivalenceClasses.  I'm wondering a bit
now if there are variants of this that'd fail in pre-v16 branches.
It's not obvious to me offhand why delay_upper_joins would have prevented
all cases where a removable rel could be mentioned in an EquivalenceClass
member (presumably via the ph_rels set of a PHV).

            regards, tom lane

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Richard Guo

Date:

15 June 2023, 10:12:01

On Thu, Jun 15, 2023 at 7:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'd always kind of wondered how we got away with not updating
EquivalenceClasses during join removal, and the answer evidently
is that we can't anymore. I've not tried to write a patch yet.

I wondered about that too, and I thought that the target rel could not
be mentioned in any EC, otherwise join_is_removable should have noticed
that this rel is used above the join and thus decided that the join is
not removable. Hence there is no need to update ECs during join
removal. Apparently this is not right, as shown by this issue.

I went ahead and drafted a patch as attached. But I'm not sure if it
suffices to only update ec_relids, em_relids and ec_sources as the patch
does. Also I'm wondering if any of them would become empty after the
update.

Thanks
Richard

Attachment

v1-0001-Remove-references-to-the-removed-rel-from-EquivalenceClasses.patch

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Tom Lane

Date:

15 June 2023, 22:35:57

Richard Guo <guofenglinux@gmail.com> writes:
> I went ahead and drafted a patch as attached.  But I'm not sure if it
> suffices to only update ec_relids, em_relids and ec_sources as the patch
> does.  Also I'm wondering if any of them would become empty after the
> update.

Thanks.  I pushed this after making some changes:

* I'm not convinced that it's safe to update only the ECs listed
in the baserel's eclass_indexes; that would miss any ECs that mention
only the outer join's relid and not the baserel's.  Perhaps that's
impossible but I don't feel comfortable about it.  It shouldn't cost
much more to just scan the whole eq_classes list here.

* I pushed the update into a separate function for cosmetic reasons
(mostly to make it easy to add comments similar to those for
remove_rel_from_restrictinfo).

* I added logic to remove dead EquivalenceMembers entirely.  This is
partly to make sure we don't generate bogus joinclauses using them,
but mostly to save cycles in later examinations of the EC.

* To be on the safe side I made it clear the ec_derived lists.
We shouldn't really need whatever is in there anymore anyway,
so it's probably not worth fixing those RestrictInfos.

I noticed in testing that this frequently makes the whole
EquivalenceClass a dead letter (with 0 or 1 surviving member),
but sadly we can't remove it from eq_classes unless we want to
rebuild all the eclass_indexes sets.  Probably not worth it.

            regards, tom lane

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Amit Kapila

Date:

19 June 2023, 15:43:48

On Fri, Jun 16, 2023 at 1:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Richard Guo <guofenglinux@gmail.com> writes:
> > I went ahead and drafted a patch as attached.  But I'm not sure if it
> > suffices to only update ec_relids, em_relids and ec_sources as the patch
> > does.  Also I'm wondering if any of them would become empty after the
> > update.
>
> Thanks.  I pushed this after making some changes:
>

Does this commit close the open item: "join removal can no longer skip
updating EquivalenceClasses"? If so, can we update it?

--
With Regards,
Amit Kapila.

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Tom Lane

Date:

19 June 2023, 16:29:55

Amit Kapila <amit.kapila16@gmail.com> writes:
> Does this commit close the open item: "join removal can no longer skip
> updating EquivalenceClasses"? If so, can we update it?

Yeah.  I'd left that open because there are several perhaps-overlapping
issues in the same area, but that bug as-stated is gone.

            regards, tom lane

Re: BUG #17976: Inconsistent results of SELECT using CASE WHEN clause

From

Amit Kapila

Date:

20 June 2023, 12:25:26

On Mon, Jun 19, 2023 at 6:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Does this commit close the open item: "join removal can no longer skip
> > updating EquivalenceClasses"? If so, can we update it?
>
> Yeah.
>

Done.

--
With Regards,
Amit Kapila.