Re: BUG #1528: Rows returned that should be excluded by WHERE clause - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: BUG #1528: Rows returned that should be excluded by WHERE clause |
Date | |
Msg-id | 8109.1110411647@sss.pgh.pa.us Whole thread Raw |
Responses |
We are not following the spec for HAVING without GROUP BY
|
List | pgsql-hackers |
I wrote: > I think the problem can be expressed as > regression=# select 2 as id, max(b) from t2 having 2 = 1; > id | max > ----+----- > 2 | > (1 row) > the issue is clearly that the known-false HAVING clause is pushed down > inside the aggregation, as though it were WHERE. The existing code > pushes down HAVING to WHERE if the clause contains no aggregates, but > evidently this is too simplistic. What are the correct conditions for > pushing down HAVING clauses to WHERE? After reading the spec a little, I think that we have oversimplified our handling of aggregate-free HAVING clauses. If you look in planner.c you'll find that such a clause is converted into a WHERE clause, but this is not what the spec says to do, and you can tell the difference in cases like the above. What the spec actually says, or at least implies, is that a HAVING clause is to be evaluated only once per group --- where the "group" is the whole table if there's no GROUP BY clause. The group is to be discarded if the HAVING clause doesn't return true. SQL92 7.8: 1) Let T be the result of the preceding <from clause>, <where clause>, or <group by clause>. If that clauseis not a <group by clause>, then T consists of a single group and does not have a grouping column. 2) The <search condition> is applied to each group of T. The result of the <having clause> is a groupedtable of those groups of T for which the result of the <search condition> is true. So it's clear that what the above case should return is a grouped table having no groups ... ie, no rows out. What we are actually returning is one group containing no rows, which is visibly different because of the presence of the aggregate function in the SELECT list. There are really four cases to think about, depending on whether the query has GROUP BY and on whether it has any aggregates outside the HAVING clause: 1. No GROUP BY, no aggregates Per spec, the HAVING clause should be evaluated once and either we return the whole input or none of it. Since there are no grouped columns and (by assumption) no aggregates in the HAVING clause, the HAVING clause must in fact be variable-free, ie, it's a pseudoconstant clause. (Only pseudoconstant, because it might contain outer-level variables or volatile functions.) I think the correct implementation in this case is to generate a gating Result node with the HAVING clause as a one-time filter, so that we don't evaluate any of the query if the HAVING is false. The current code gets this almost right: it will make a variable-free WHERE clause into a Result gating condition *if it contains no volatile functions*. So it's wrong for the volatile function case but right otherwise. 2. GROUP BY, no aggregates In this case the HAVING clause might contain references to the grouping columns. It is legitimate to push down the HAVING to become WHERE, but *only* if it doesn't contain any volatile functions --- otherwise it might be possible to tell that the HAVING clause was executed more than once. It would be useful to push down the HAVING if, for example, it could become an indexscan qualifier. However if the HAVING condition is expensive to compute (eg it contains a subselect) we'd probably be better off not to push it into WHERE, but to arrange to evaluate it only once per group. Right now the executor cannot support testing such a condition, but I think it would be easy enough to improve nodeGroup.c to allow testing a qual condition for each group. 3. No GROUP BY, has aggregates As in case 1, the HAVING clause must be variable-free, so the best implementation would be to put it into a gating Result node. It would be correct to treat it the same way as we do for a HAVING clause containing aggregates (ie, attach it as a qual condition to the Agg plan node) --- but that would mean computing and throwing away the aggregate result when the HAVING fails, when we could skip computing it altogether. 4. GROUP BY and has aggregates This is really the same as case 2: we could push down the HAVING condition if it contains no volatile functions, but unless it is cheap to evaluate we are probably best off to attach it as a qual condition to the Agg node, ie, evaluate it only once per group. The only difference is that we don't need an executor fix to support this, since Agg does quals already. So, aside from the originally reported bug, there are two other problems in this logic: it isn't ensuring that volatile functions will be evaluated only once per group, and it isn't considering evaluation cost in deciding whether a clause that could be converted to WHERE should be or not. I haven't yet tried to make a patch that fixes all of these things. It'll likely come out complex enough that we don't want to back-patch it into 8.0 or before. If so, I'll try to make a simpler variant that fixes the semantic bugs but doesn't try to be smart about evaluation cost. Comments? regards, tom lane
pgsql-hackers by date: