Thread: Optimizing maximum/minimum queries (yet again)

From:
Tom Lane
Date:

We've been talking about this long enough ... let's try to actually do it ...

In the last go-round, the thread starting here,
http://archives.postgresql.org/pgsql-hackers/2004-11/msg00371.php
we spent a lot of time agonizing over GROUP BY and whether the
optimization is usable for anything besides MAX/MIN.  However, no one
presented any concrete reasons for thinking we could use it for any
other aggregates.  And using it with GROUP BY would require some
fundamental restructuring of the way we do grouping --- right now we
have to look at every row anyway to compute the grouping sets, so
there's no possibility of gain from optimizing the aggregates.

But the real issue is that no one seems willing to tackle the complete
problem.  So, let's forget those things and see if there is something
within reach if we just focus on MAX/MIN without grouping.

I realized today that this may not be as hard as I thought.
Specifically, I'm imagining that we could convert

    SELECT min(x), max(y) FROM tab WHERE ...

into sub-selects in a one-row outer query:

    SELECT (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1),
           (SELECT y FROM tab WHERE ... ORDER BY y DESC LIMIT 1);

Breaking it down like that means we can consider each aggregate
independently, which definitely simplifies matters.  This still allows
us to handle complex combinations like

    SELECT min(x), max(y) - min(y), max(z) FROM tab WHERE ...

as well as aggregates in the HAVING clause --- the restriction is just
that each aggregate in the query has to be an optimizable MIN or MAX.
(If any are not, we're going to end up scanning the table anyway, and I
see no point in doing separate subqueries for some of the aggregates in
that case.)  We replace each aggregate call with a sub-SELECT, drop the
outer FROM and WHERE clauses, and there we are.
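Purely as an illustration, the rewrite can be sketched at the SQL-text level; the real transformation would of course operate on the parse tree, and the function names below are hypothetical:

```python
# A toy, string-level version of the rewrite: replace each MIN/MAX call
# with the corresponding ORDER BY ... LIMIT 1 sub-select, and drop the
# outer FROM and WHERE.  The real planner works on the parse tree.

def rewrite_agg(func, col, table, where):
    """Build the sub-select replacing MIN(col) or MAX(col)."""
    direction = "" if func.upper() == "MIN" else " DESC"
    cond = f" WHERE {where}" if where else ""
    return (f"(SELECT {col} FROM {table}{cond} "
            f"ORDER BY {col}{direction} LIMIT 1)")

def rewrite_query(aggs, table, where):
    """aggs is a list of (func, col) pairs taken from the targetlist."""
    subs = ", ".join(rewrite_agg(f, c, table, where) for f, c in aggs)
    return f"SELECT {subs};"

print(rewrite_query([("min", "x"), ("max", "y")], "tab", "z > 0"))
```

Running this prints the one-row outer query with one independent sub-select per aggregate, mirroring the transformation described above.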

So I propose the following restrictions:

1. We will consider only aggregate queries with no GROUP BY clause.

2. The query can only reference one table.  (There doesn't seem to be
any way to handle join conditions reasonably.  We could possibly handle
Cartesian-product cases, but I don't see the value.)

When we have such a query, we'll scan the targetlist (and HAVING clause
if any) to examine each aggregate function.  We can only optimize if every
aggregate is convertible into an index scan, which requires the following
conditions:

1. The aggregate has an associated sort operator according to pg_aggregate
(see below).

2. The aggregate's argument matches a column of some btree index of the
table, and the sort operator matches either the "<" or ">" member of the
column's opclass.

3. If the column is not the first one of its index, then we must have an
indexable equality constraint (x = something) in WHERE for each earlier
column.  We can also make use of an inequality on the target column
itself, if it is of the opposite direction from the aggregate operator
(for example, if the aggregate operator matches the opclass "<", we can
use x > something or x >= something).  Such an inequality will form a
starting boundary for the index scan.

This means that we can for example optimize cases like

    SELECT MIN(x) FROM tab WHERE x > 42

which turns into a forward indexscan starting at the x > 42 boundary, or

    SELECT MIN(x) FROM tab WHERE y = 42

which needs an index on (y,x) and starts at the first y = 42 entry.
(Note: this doesn't work with a constraint like y >= 42 ...)
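The convertibility test above can be sketched as follows; every structure here is a hypothetical stand-in for the real catalog lookups (pg_aggregate, pg_index, the opclass members), and the opclass-membership check is reduced to a placeholder:

```python
# Sketch of the "aggregate -> index scan" convertibility test.
# index_cols is the index's column list; eq_cols is the set of columns
# that have an indexable equality constraint (col = something) in WHERE.

def convertible(agg_sortop, target_col, index_cols, eq_cols):
    """True if an aggregate with sort operator agg_sortop, applied to
    target_col, can become a scan of the given btree index."""
    if target_col not in index_cols:
        return False          # condition 2: argument must be an index column
    pos = index_cols.index(target_col)
    # condition 3: every earlier index column needs an equality clause
    if not all(c in eq_cols for c in index_cols[:pos]):
        return False
    # condition 2 (cont.): the sort operator must be the "<" or ">"
    # member of the column's opclass -- reduced to a placeholder here
    return agg_sortop in ("<", ">")

# MIN(x) with an index on (y, x) and WHERE y = 42: convertible.
print(convertible("<", "x", ["y", "x"], {"y"}))   # True
# The same index without the y = 42 equality clause: not convertible.
print(convertible("<", "x", ["y", "x"], set()))   # False
```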

The fact that an indexscan conversion is possible doesn't necessarily
make it a good idea.  For example consider the identical query

    SELECT MIN(x) FROM tab WHERE y = 42

when we have separate indexes on x and y.  If the first row with y = 42
is far into the x index, the "optimized" version could be far slower than
our normal implementation (which would probably do an indexscan on y,
instead).  In general, any constraints that don't fit into our indexscan
plan have to be looked at with suspicion.  So AFAICS we have to plan the
query both ways and estimate which is cheaper.  But we are only going to
do this for single-table queries, so the extra planner overhead won't be
large.

Catalog representation:

We must have a way to identify an aggregate as being related to an
index's sort order.  I am envisioning adding a column to pg_aggregate
that is either 0 (for a non-max/min aggregate) or the OID of an operator
that represents the sort order associated with the aggregate.
The rule is that these two queries must give the same answer:
    SELECT agg(col) FROM tab;
    SELECT col FROM tab ORDER BY col USING operator LIMIT 1;

that is, the aggregate value is equivalent to the first value in the sort
order induced by the operator.  For example, we associate min(int) with
int4lt and max(int) with int4gt.  (We also assume the aggregate must return
NULL for no input.)
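The rule can be illustrated outside SQL; here lists stand in for table columns and None for NULL, and both sides skip NULLs and return NULL for zero input rows:

```python
# Illustration of the catalog rule: the aggregate's result must equal
# the first value in the sort order induced by its associated operator.

def agg_min(col):
    """SQL-style MIN: ignore NULLs, return NULL (None) for no input."""
    vals = [v for v in col if v is not None]
    return min(vals) if vals else None

def first_by_operator(col, key):
    """First value under the sort order induced by the operator,
    skipping NULLs the way the index scan would have to."""
    vals = sorted((v for v in col if v is not None), key=key)
    return vals[0] if vals else None

col = [7, None, 3, 9]
# min(col) must match ORDER BY col USING "<" LIMIT 1:
print(agg_min(col), first_by_operator(col, key=lambda v: v))  # 3 3
# and both must return NULL for zero input rows:
print(agg_min([]), first_by_operator([], key=lambda v: v))    # None None
```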

The reason for associating with operators, and not directly to opclasses,
is that this will successfully handle both regular and reverse-sort
opclasses.

Adding such a column is just a small matter of programming.  We'd need to
add a clause to CREATE AGGREGATE to let user-defined min/max aggregates
in on the fun, but that's no big deal.

Comments?  Anyone see anything I missed?
        regards, tom lane


From:
Bruno Wolff III
Date:

On Fri, Apr 08, 2005 at 20:50:09 -0400, Tom Lane <> wrote:
> 
> Comments?  Anyone see anything I missed?

It should be possible to make this work for bool_and and bool_or as those
are equivalent to min and max for the boolean type.


From:
Bruno Wolff III
Date:

On Fri, Apr 08, 2005 at 20:50:09 -0400, Tom Lane <> wrote:
> 
>     SELECT (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1),
>            (SELECT y FROM tab WHERE ... ORDER BY y DESC LIMIT 1);
> Comments?  Anyone see anything I missed?

Are NULLs a problem? In the second case above, wouldn't you get NULL
instead of the value returned by max if there were some NULL and some
not NULL values?
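The concern can be seen by simulating PostgreSQL's default NULL ordering, under which NULLs sort as if larger than every non-NULL value and so come first under DESC (a Python illustration, not PostgreSQL code):

```python
# Simulate ORDER BY y DESC LIMIT 1 with PostgreSQL's default ordering,
# where NULLs sort as if larger than every non-NULL value: a NULL row
# is returned instead of the true maximum.

def order_by_desc_limit1(col):
    """ORDER BY col DESC LIMIT 1, NULLs (None) treated as largest."""
    ordered = sorted(col, key=lambda v: (v is None, v), reverse=True)
    return ordered[0] if ordered else None

col = [5, None, 9]
print(order_by_desc_limit1(col))  # None -- but max(col) should be 9
```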


From:
Bruno Wolff III
Date:

On Fri, Apr 08, 2005 at 20:50:09 -0400, Tom Lane <> wrote:
> 
> Comments?  Anyone see anything I missed?

Thinking about the case for NULLs some more, I am wondering if you are
going to treat aggregates with strict state functions differently from
those that aren't strict?  It seems for ones with strict state functions
you need to exclude NULL values when using ORDER BY.  For aggregates
that aren't strict, it may be that returning NULL is the desired
behavior when one of the rows contains a NULL value.


From:
Mark Kirkwood
Date:

Looks great! I had been slowly thinking along similar lines via the 
equivalence:

SELECT min(x) FROM tab WHERE...

SELECT min(x) FROM (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1) AS t


However, it looks like your approach is more flexible than this :-)

best wishes

Mark


Tom Lane wrote:
> 
> I realized today that this may not be as hard as I thought.
> Specifically, I'm imagining that we could convert
> 
>     SELECT min(x), max(y) FROM tab WHERE ...
> 
> into sub-selects in a one-row outer query:
> 
>     SELECT (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1),
>            (SELECT y FROM tab WHERE ... ORDER BY y DESC LIMIT 1);
> 
> Breaking it down like that means we can consider each aggregate
> independently, which definitely simplifies matters.  


From:
Tom Lane
Date:

Bruno Wolff III <> writes:
>   Tom Lane <> wrote:
>> SELECT (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1),
>> (SELECT y FROM tab WHERE ... ORDER BY y DESC LIMIT 1);

> Are NULLs a problem? In the second case above, wouldn't you get NULL
> instead of the value returned by max if there were some NULL and some
> not NULL values?

Hmm ... we might have to hack the LIMIT step to skip over NULLs.
Doesn't seem like an insurmountable issue though.
        regards, tom lane


From:
Tom Lane
Date:

Bruno Wolff III <> writes:
> Thinking about the case for NULLs some more, I am wondering if you are
> going to treat aggregates with strict state functions different than
> those that don't?

We only intend this to support MAX and MIN.  If you're inventing an
aggregate that doesn't play exactly by those rules, you don't get to
take advantage of the optimization.
        regards, tom lane


From:
Tom Lane
Date:

Bruno Wolff III <> writes:
> It should be possible to make this work for bool_and and bool_or as those
> are equivalent to min and max for the boolean type.

This would just be a matter of marking them properly in the catalogs.

However, are they really equivalent in the corner cases?  In particular,
I think boolean AND across zero input rows is probably supposed to
return TRUE, not NULL.
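The corner case is the contrast between a fold-style AND, whose identity element makes it TRUE over zero inputs, and the SQL aggregate convention of returning NULL for no input rows (Python illustration):

```python
# A fold-style AND has identity True, so it yields True over zero
# inputs; SQL aggregates by convention return NULL for no input rows.
# If bool_and were marked as a MIN-style aggregate, it would inherit
# the NULL-on-empty behavior -- the point where the two could disagree.

def fold_and(rows):
    """AND as a fold with identity True (a spec-style AND)."""
    return all(rows)

def sql_style_bool_and(rows):
    """AND with the SQL aggregate convention: NULL (None) for no rows."""
    return all(rows) if rows else None

print(fold_and([]))             # True
print(sql_style_bool_and([]))   # None
```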
        regards, tom lane


From:
Bruno Wolff III
Date:

On Fri, Apr 08, 2005 at 23:40:28 -0400, Tom Lane <> wrote:
> Bruno Wolff III <> writes:
> > It should be possible to make this work for bool_and and bool_or as those
> > are equivalent to min and max for the boolean type.
> 
> This would just be a matter of marking them properly in the catalogs.
> 
> However, are they really equivalent in the corner cases?  In particular,
> I think boolean AND across zero input rows is probably supposed to
> return TRUE, not NULL.

I am not sure what the spec says, but according to how they seem to work,
the answer appears to be that they are equivalent.

area=> select bool_and(true) where false;
 bool_and
----------

(1 row)

area=> select bool_or(true) where false;
 bool_or
---------

(1 row)


From:
Neil Conway
Date:

Tom Lane wrote:
> Specifically, I'm imagining that we could convert
> 
>     SELECT min(x), max(y) FROM tab WHERE ...
> 
> into sub-selects in a one-row outer query:
> 
>     SELECT (SELECT x FROM tab WHERE ... ORDER BY x LIMIT 1),
>            (SELECT y FROM tab WHERE ... ORDER BY y DESC LIMIT 1);

Does this transformation work for a query of the form:
    SELECT min(x), max(y) FROM tab WHERE random() > 0.5;

(which isn't a very useful query, but I'm sure you can imagine a more 
realistic example involving volatile functions in the WHERE clause.)

-Neil


From:
Tom Lane
Date:

Neil Conway <> writes:
> Does this transformation work for a query of the form:
>      SELECT min(x), max(y) FROM tab WHERE random() > 0.5;

I've been going back and forth on that.  We wouldn't lose a lot in the
real world if we simply abandoned the optimization attempt whenever we
find any volatile functions in WHERE.  OTOH you could also argue that
we have never guaranteed that volatile functions in WHERE would be
evaluated at every table row --- consider something like
    SELECT ... WHERE x > 42 AND random() > 0.5;
All that this optimization might do is to further cut the fraction of
table rows at which the volatile function actually gets checked.  So
I'm not seeing that it would break any code that worked reliably before.

Still, if it makes you feel at all uncomfortable, we can just refuse
the optimization in such cases.
        regards, tom lane


From:
Neil Conway
Date:

Tom Lane wrote:
> All that this optimization might do is to further cut the fraction of
> table rows at which the volatile function actually gets checked.  So
> I'm not seeing that it would break any code that worked reliably before.

Hmm; what about
    SELECT min(x), min(x) FROM tab WHERE random() > 0.5;

Applying the optimization would mean the two min(x) expressions would 
likely be different, which seems rather weird.

> Still, if it makes you feel at all uncomfortable, we can just refuse
> the optimization in such cases.

I'd say that's probably safest.

-Neil


From:
Tom Lane
Date:

Neil Conway <> writes:
> Hmm; what about
>      SELECT min(x), min(x) FROM tab WHERE random() > 0.5;
> Applying the optimization would mean the two min(x) expressions would 
> likely be different, which seems rather weird.

Actually not: my expectation is that identical aggregate calls will get
folded into one subplan.  This is what happens now (when you call min(x)
twice it's computed just once) and the subplan mechanism already has the
infrastructure needed to let us keep doing it.  I feel we need to be
able to do this in order to justify the assumption that evaluating each
aggregate separately is OK from the efficiency standpoint.
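The folding of duplicate calls can be sketched as simple memoization keyed by the call's text; the real mechanism compares subplan nodes, not strings:

```python
# Sketch of folding identical aggregate calls: duplicates share one
# computed result, so min(x) appearing twice is evaluated only once.

def evaluate_targetlist(calls, compute, cache=None):
    """Evaluate each aggregate call, reusing results for duplicates."""
    cache = {} if cache is None else cache
    results = []
    for call in calls:
        if call not in cache:
            cache[call] = compute(call)
        results.append(cache[call])
    return results

evaluations = []
def compute(call):
    evaluations.append(call)  # record how many real computations happen
    return 42                 # stand-in for running the subplan

print(evaluate_targetlist(["min(x)", "min(x)"], compute))  # [42, 42]
print(len(evaluations))  # 1 -- the duplicate call was folded
```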

>> Still, if it makes you feel at all uncomfortable, we can just refuse
>> the optimization in such cases.

> I'd say that's probably safest.

I don't have a problem with that, but I haven't quite convinced myself
that we need to expend the cycles to check for it, either ...
        regards, tom lane


From:
Bruno Wolff III
Date:

On Sat, Apr 09, 2005 at 00:57:11 -0400, Tom Lane <> wrote:
> 
> I don't have a problem with that, but I haven't quite convinced myself
> that we need to expend the cycles to check for it, either ...

You could have two different aggregates and end up with values that
couldn't happen if the same set of rows was used to evaluate both.
I would expect that the sequential plan would be better for a volatile
where clause since you are going to execute it for every row anyway.
So disabling the optimization in that case isn't normally going to
slow things down.


From:
Tom Lane
Date:

Bruno Wolff III <> writes:
> Tom Lane <> wrote:
>> I don't have a problem with that, but I haven't quite convinced myself
>> that we need to expend the cycles to check for it, either ...

> I would expect that the sequential plan would be better for a volatile
> where clause since you are going to execute it for every row anyway.

Well, no, wait a minute.  We have never promised that we would
physically evaluate every volatile function at every table row.
What we promise is that we do not assume-without-proof that the
function's value will be the same at every table row.  I don't see
where this optimization breaks that promise.

Obviously, we do make such an assumption for WHERE clauses that actually
get taken into the indexscan condition.  But we already check volatility
before considering a clause as a possible indexscan condition.  The
question here is whether we have to reject the optimization if there are
additional WHERE clauses, not directly related to the proposed
indexscan, that contain volatile functions.  I'm not seeing the argument
that says we must do that.
        regards, tom lane