Re: BUG #1528: Rows returned that should be excluded by WHERE clause - Mailing list pgsql-hackers

From Tom Lane
Subject Re: BUG #1528: Rows returned that should be excluded by WHERE clause
Date
Msg-id 8109.1110411647@sss.pgh.pa.us
Whole thread Raw
Responses We are not following the spec for HAVING without GROUP BY  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> I think the problem can be expressed as

> regression=# select 2 as id, max(b) from t2 having 2 = 1;
>  id | max 
> ----+-----
>   2 |    
> (1 row)

> the issue is clearly that the known-false HAVING clause is pushed down
> inside the aggregation, as though it were WHERE.  The existing code
> pushes down HAVING to WHERE if the clause contains no aggregates, but
> evidently this is too simplistic.  What are the correct conditions for
> pushing down HAVING clauses to WHERE?

After reading the spec a little, I think that we have oversimplified our
handling of aggregate-free HAVING clauses.  If you look in planner.c
you'll find that such a clause is converted into a WHERE clause, but
this is not what the spec says to do, and you can tell the difference
in cases like the above.

What the spec actually says, or at least implies, is that a HAVING
clause is to be evaluated only once per group --- where the "group"
is the whole table if there's no GROUP BY clause.  The group is
to be discarded if the HAVING clause doesn't return true.  SQL92 7.8:
        1) Let T be the result of the preceding <from clause>, <where           clause>, or <group by clause>. If that
clauseis not a <group           by clause>, then T consists of a single group and does not have           a grouping
column.
        2) The <search condition> is applied to each group of T. The result           of the <having clause> is a
groupedtable of those groups of T           for which the result of the <search condition> is true.
 

So it's clear that what the above case should return is a grouped table
having no groups ... ie, no rows out.  What we are actually returning is
one group containing no rows, which is visibly different because of the
presence of the aggregate function in the SELECT list.

There are really four cases to think about, depending on whether the
query has GROUP BY and on whether it has any aggregates outside the
HAVING clause:

1. No GROUP BY, no aggregates

Per spec, the HAVING clause should be evaluated once and either we
return the whole input or none of it.  Since there are no grouped
columns and (by assumption) no aggregates in the HAVING clause, the
HAVING clause must in fact be variable-free, ie, it's a pseudoconstant
clause.  (Only pseudoconstant, because it might contain outer-level
variables or volatile functions.)  I think the correct implementation
in this case is to generate a gating Result node with the HAVING clause
as a one-time filter, so that we don't evaluate any of the query if the
HAVING is false.  The current code gets this almost right: it will make
a variable-free WHERE clause into a Result gating condition *if it
contains no volatile functions*.  So it's wrong for the volatile
function case but right otherwise.

2. GROUP BY, no aggregates

In this case the HAVING clause might contain references to the grouping
columns.  It is legitimate to push down the HAVING to become WHERE,
but *only* if it doesn't contain any volatile functions --- otherwise it
might be possible to tell that the HAVING clause was executed more than
once.  It would be useful to push down the HAVING if, for example, it
could become an indexscan qualifier.  However if the HAVING condition
is expensive to compute (eg it contains a subselect) we'd probably be
better off not to push it into WHERE, but to arrange to evaluate it
only once per group.  Right now the executor cannot support testing
such a condition, but I think it would be easy enough to improve nodeGroup.c
to allow testing a qual condition for each group.

3. No GROUP BY, has aggregates

As in case 1, the HAVING clause must be variable-free, so the best
implementation would be to put it into a gating Result node.  It would
be correct to treat it the same way as we do for a HAVING clause
containing aggregates (ie, attach it as a qual condition to the Agg plan
node) --- but that would mean computing and throwing away the aggregate
result when the HAVING fails, when we could skip computing it altogether.

4. GROUP BY and has aggregates

This is really the same as case 2: we could push down the HAVING
condition if it contains no volatile functions, but unless it is
cheap to evaluate we are probably best off to attach it as a qual
condition to the Agg node, ie, evaluate it only once per group.
The only difference is that we don't need an executor fix to support
this, since Agg does quals already.

So, aside from the originally reported bug, there are two other problems
in this logic: it isn't ensuring that volatile functions will be
evaluated only once per group, and it isn't considering evaluation
cost in deciding whether a clause that could be converted to WHERE
should be or not.

I haven't yet tried to make a patch that fixes all of these things.
It'll likely come out complex enough that we don't want to back-patch
it into 8.0 or before.  If so, I'll try to make a simpler variant that
fixes the semantic bugs but doesn't try to be smart about evaluation
cost.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Thomas Hallgren
Date:
Subject: Re: Runtime accepting build discrepancies
Next
From: Kevin Brown
Date:
Subject: Re: fool-toleranced optimizer