Bugs in planner's equivalence-class processing - Mailing list pgsql-hackers

From Tom Lane
Subject Bugs in planner's equivalence-class processing
Date
Msg-id 27159.1350403012@sss.pgh.pa.us
Whole thread Raw
Responses Re: Bugs in planner's equivalence-class processing  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Bugs in planner's equivalence-class processing  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Bugs in planner's equivalence-class processing  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
I looked into the problem reported in bug #7604 (Bill MacArthur was kind
enough to send me a reproducer off-list).  The error can be demonstrated
by this example in the regression test database:

select f1, unique2, case when unique2 is null then f1 else 0 end from int4_tbl a left join tenk1 b on f1 = unique2
where(case when unique2 is null then f1 else 0 end) = 0;
 

If you look at the output of the basic join:

regression=# select f1, unique2, case when unique2 is null then f1 else 0 end
regression-#   from int4_tbl a left join tenk1 b on f1 = unique2;    f1      | unique2 |    case     
-------------+---------+-------------          0 |       0 |           0     123456 |         |      123456    -123456
|        |     -123456 2147483647 |         |  2147483647-2147483647 |         | -2147483647
 
(5 rows)

it's obvious that only the first of these rows passes the added WHERE
condition, so the query should produce just that row; but what you
actually get in 9.2 is all five rows!  EXPLAIN gives a clue what's
going wrong:
Nested Loop Left Join  (cost=0.00..22.51 rows=1 width=8)  ->  Seq Scan on int4_tbl a  (cost=0.00..1.05 rows=5 width=4)
-> Index Only Scan using tenk1_unique2 on tenk1 b  (cost=0.00..4.28 rows=1 w
 
idth=4)        Index Cond: (unique2 = a.f1)        Filter: (CASE WHEN (unique2 IS NULL) THEN a.f1 ELSE 0 END = 0)
(5 rows)

The filter condition has been dropped down to the scan of tenk1 (relying
on a.f1 being passed in as a parameter), which is wrong since there it
is unable to filter null-extended rows produced by the left join
operator.

At first I didn't see how this could be happening: the code for placing
quals looks at RestrictInfo.nullable_relids, and this qual should have
tenk1 in its nullable_relids since it's above a left join to tenk1.
But gdb soon told me that that wasn't what join_clause_is_movable_to was
seeing.  Eventually I figured out that the equivalence-class machinery
was eating the clause (forming an EquivalenceClass consisting of the
CASE expression and the constant 0) and reconstructing a new clause that
didn't have nullable_relids set.

So this is essentially an oversight in the patch that added tracking of
nullable_relids.  I got confused about the difference between
outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed
because its natural semantic level would be at the join anyway) and
having nonempty nullable_relids, and thought that equivalence-class
processing would never see such clauses because it doesn't accept
outerjoin_delayed clauses.  So there's no code in equivclass.c to
compute nullable_relids sets for constructed clauses.  At some point
it might be worth adding same, but for now I'm just going to tweak
distribute_qual_to_rels to not believe that clauses with nonempty
nullable_relids can be equivalence clauses.

It gets worse though.  Tracking clauses' nullable_relids is new in 9.2,
but variants of this example can be demonstrated to fail as far back
as 7.4.  Here's one:

select q1, unique2, thousand, hundredfrom int8_tbl a left join tenk1 b on q1 = unique2
where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123);

If you look at just the bare join output, there are no rows that
should pass the WHERE clause, but actually what you get in 7.4-9.1
is
q1  | unique2 | thousand | hundred 
-----+---------+----------+---------123 |         |          |        123 |         |          |        
(2 rows)

which is completely broken because it's not even a subset of the bare
join output :-(.  9.1 shows this EXPLAIN output:
Nested Loop Left Join  (cost=0.00..42.48 rows=1 width=20)  Filter: (a.q1 = COALESCE(b.thousand, 123))  ->  Seq Scan on
int8_tbla  (cost=0.00..1.05 rows=5 width=8)  ->  Index Scan using tenk1_unique2 on tenk1 b  (cost=0.00..8.27 rows=1
width=
12)        Index Cond: (a.q1 = unique2)        Filter: (COALESCE(thousand, 123) = COALESCE(hundred, 123))

What's happening here is that because the two WHERE clauses are not
outerjoin_delayed, they're being deconstructed as equivalence clauses,
and then we synthesize an equivalence constraint on the two COALESCE
expressions that can be applied at the tenk1 scan level.  That gets
rid of the row that should join to q1=123, allowing the bogus
null-extended join rows to be formed.

So this shows that, although 9.2 has a more extensive form of the
disease, the use of outerjoin_delayed to decide whether a clause can be
an equivalence clause is really quite wrong.  I believe what I need to
do to fix this is back-port the 9.2 logic that computes a
nullable_relids set for each clause.  We won't need to store it, just
check whether it's empty before letting the clause be treated as an
equivalence clause.

Is anybody concerned about the compatibility implications of fixing this
bug in the back branches?  I'm worried about people complaining that we
broke their application in a minor release.  Maybe they were depending
on incorrect behavior, but they might complain anyway.  On the other
hand, the fact that this hasn't been reported from the field in nine
years suggests that not many people write queries like this.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Satoshi Nagayasu
Date:
Subject: Re: pg_stat_lwlocks view - lwlocks statistics, round 2
Next
From: Tom Lane
Date:
Subject: Re: Global Sequences