Thread: Using ORDER BY with AGGREGATE/GROUP BY in a SELECT statement

Using ORDER BY with AGGREGATE/GROUP BY in a SELECT statement

From

"David D. Kilzer"

Date:

11 May 2001, 11:26:25

NOTE: I did a moderate search through the PgSQL mail list archives, 
but couldn't find an example of a question where both ORDER BY and
aggregation were used.  In all examples it was possible to use either
one or the other with a possible subselect.  [Keep reading...I
discovered how to do this with subselects, but I'm wondering if (1) what
I'm doing with ORDER BY and aggregation will ever be allowed and (2)
whether the subselect solution is much more inefficient than being able
to order a pre-aggregate set.]

I need to query a table and return aggregated results.  The aggregation
occurs many ways (I'm joining 7 tables and returning 19 columns
currently):
 o Columns in a GROUP BY clause o SUM() functions o COUNT() functions

However, I wrote a new FUNCTION and a new AGGREGATE to return the *last*
value of a field during the aggregation process (see below).  This means
that I would like to use an ORDER BY clause with the SELECT ... GROUP BY
statement before the results are aggregated.

DROP FUNCTION "lastint4" (int4, int4);
CREATE FUNCTION "lastint4" (int4, int4)       RETURNS int4            AS 'BEGIN RETURN $2; END;'      LANGUAGE
'PLPGSQL';

DROP AGGREGATE lastitem int4;
CREATE AGGREGATE lastitem (         SFUNC1 = lastint4,       BASETYPE = int4,         STYPE1 = int4);

A simple (made-up) example:
 SELECT r.personid           AS personid       ,SUM(r.laps)          AS laps       ,COUNT(DISTINCT r.id) AS nightsraced
     ,lastitem(r.carid)    AS carid   FROM race r  WHERE r.personid = 14
 
GROUP BY r.personid
ORDER BY r.date DESC;

Does PostgreSQL (or ANSI SQL) allow this?  In the simple tests I ran, I
would get errors similar to the following:

ERROR:  Attribute r.carid must be GROUPed or used in an aggregate function

This seems a bit...obvious because r.carid is already being used in an
aggregate function.  I'm guessing that I'm running into the internals
of how PgSQL processes the query.

Hmm...well I just figured out how I could do this as a subselect, but it
seems hugely inefficient and would require PostgreSQL 7.1 or later
(ORDER BY, LIMIT used in a subselect):
 SELECT r.personid               AS personid       ,SUM(r.laps)              AS laps       ,COUNT(DISTINCT r.id)     AS
nightsraced      ,(SELECT r.carid           FROM race r          WHERE r.personid = 14        ORDER BY r.date DESC
   LIMIT 1)               AS carid   FROM race r  WHERE r.personid = 14
 
GROUP BY r.personid
ORDER BY r.date;

Is a subselect like this really that inefficient (assuming appropriate 
indexes on r.date and r.personid)?  I would think doing this during
aggregation would be much more efficient.

I'm using PostgreSQL 7.0.3 on Debian GNU/Linux 2.2r3
(postgresql-7.0.3-4) on a Linux 2.2.1x kernel.

Thanks for any insight you can provide!

Dave

Re: Using ORDER BY with AGGREGATE/GROUP BY in a SELECT statement

From

"David D. Kilzer"

Date:

11 May 2001, 12:52:30

Okay, I found a workaround for PostgreSQL 7.0.3 for the specific query I
was working on below, but I'd still like to know if an ORDER BY clause
will ever be allowed with an aggregated SELECT statement, and whether it
would generally be more efficient to use an ORDER BY in the situation
described below instead of a subselect query.

Also, I realized after I sent the original message below that I could 
work around the "no ORDER BY, LIMIT in subselect" limitation in 
PostgreSQL 7.0.x using a subselect within a subselect, but then I'd
just be abusing the database engine, not finding a real-world 
solution.  :^)

Thanks!

Dave


On Mon, May 07, 2001 at 12:27:59PM -0500, David D. Kilzer wrote:

> NOTE: I did a moderate search through the PgSQL mail list archives, 
> but couldn't find an example of a question where both ORDER BY and
> aggregation were used.  In all examples it was possible to use either
> one or the other with a possible subselect.  [Keep reading...I
> discovered how to do this with subselects, but I'm wondering if (1) what
> I'm doing with ORDER BY and aggregation will ever be allowed and (2)
> whether the subselect solution is much more inefficient than being able
> to order a pre-aggregate set.]
> 
> I need to query a table and return aggregated results.  The aggregation
> occurs many ways (I'm joining 7 tables and returning 19 columns
> currently):
> 
>   o Columns in a GROUP BY clause
>   o SUM() functions
>   o COUNT() functions
> 
> However, I wrote a new FUNCTION and a new AGGREGATE to return the *last*
> value of a field during the aggregation process (see below).  This means
> that I would like to use an ORDER BY clause with the SELECT ... GROUP BY
> statement before the results are aggregated.
> 
> DROP FUNCTION "lastint4" (int4, int4);
> CREATE FUNCTION "lastint4" (int4, int4)
>         RETURNS int4
>              AS 'BEGIN RETURN $2; END;'
>        LANGUAGE 'PLPGSQL';
> 
> DROP AGGREGATE lastitem int4;
> CREATE AGGREGATE lastitem (
>           SFUNC1 = lastint4,
>         BASETYPE = int4,
>           STYPE1 = int4);
> 
> A simple (made-up) example:
> 
>   SELECT r.personid           AS personid
>         ,SUM(r.laps)          AS laps
>         ,COUNT(DISTINCT r.id) AS nightsraced
>         ,lastitem(r.carid)    AS carid
>     FROM race r
>    WHERE r.personid = 14
> GROUP BY r.personid
> ORDER BY r.date DESC;
> 
> Does PostgreSQL (or ANSI SQL) allow this?  In the simple tests I ran, I
> would get errors similar to the following:
> 
> ERROR:  Attribute r.carid must be GROUPed or used in an aggregate function
> 
> This seems a bit...obvious because r.carid is already being used in an
> aggregate function.  I'm guessing that I'm running into the internals
> of how PgSQL processes the query.
> 
> Hmm...well I just figured out how I could do this as a subselect, but it
> seems hugely inefficient and would require PostgreSQL 7.1 or later
> (ORDER BY, LIMIT used in a subselect):
> 
>   SELECT r.personid               AS personid
>         ,SUM(r.laps)              AS laps
>         ,COUNT(DISTINCT r.id)     AS nightsraced
>         ,(SELECT r.carid
>             FROM race r
>            WHERE r.personid = 14 
>         ORDER BY r.date DESC
>            LIMIT 1)               AS carid
>     FROM race r
>    WHERE r.personid = 14
> GROUP BY r.personid
> ORDER BY r.date;
> 
> Is a subselect like this really that inefficient (assuming appropriate 
> indexes on r.date and r.personid)?  I would think doing this during
> aggregation would be much more efficient.
> 
> I'm using PostgreSQL 7.0.3 on Debian GNU/Linux 2.2r3
> (postgresql-7.0.3-4) on a Linux 2.2.1x kernel.
> 
> Thanks for any insight you can provide!
> 
> Dave

Re: Using ORDER BY with AGGREGATE/GROUP BY in a SELECT statement

From

Tom Lane

Date:

11 May 2001, 13:53:42

"David D. Kilzer" <ddkilzer@lubricants-oil.com> writes:
> [ wants to write an aggregate function that returns its last input ]

The SQL model of query processing has a very definite view of the stages
of processing: first group by, then aggregate, and last order by.  Tuple
ordering is irrelevant according to the basic semantics of the language.
Probably the SQL authors would have left out ORDER BY entirely if they
could have got away with it, but instead they made it a vestigial
appendage that is only allowed at the very last instant before query
outputs are forwarded to a client application.

Thus, it is very bad form to write an aggregate that depends on the
order it sees its inputs in.  This won't be changed, because it's part
of the nature of the language.

In PG 7.1 it's possible to hack around this by ordering the result of
a subselect-in-FROM:
SELECT orderedagg(ss.x) FROM (select x from tab order by y) ss;

which is a gross violation of the letter and spirit of the spec, and
should not be expected to be portable to other DBMSes; but it gets the
job done if you are intent on writing an ordering-dependent aggregate.

However, I don't see any good way to combine this with grouping, since
if you apply GROUP BY to the output of the subselect you'll lose the
ordering again.

>   SELECT r.personid               AS personid
>         ,SUM(r.laps)              AS laps
>         ,COUNT(DISTINCT r.id)     AS nightsraced
>         ,(SELECT r.carid
>             FROM race r
>            WHERE r.personid = 14 
>         ORDER BY r.date DESC
>            LIMIT 1)               AS carid
>     FROM race r
>    WHERE r.personid = 14
> GROUP BY r.personid
> ORDER BY r.date;

This is likely to be reasonably efficient, actually, since the subselect
will be evaluated only once per output group --- in fact, as you've
written it it'll only be evaluated once, period, since it has no
dependencies on the outer query.  More usually you'd probably do
       ,(SELECT r2.carid           FROM race r2          WHERE r2.personid = r.personid       ORDER BY r2.date DESC
    LIMIT 1)               AS carid

so that the result tracks the outer query, and in this form it'd be
redone once per output row.
        regards, tom lane

Re: Using ORDER BY with AGGREGATE/GROUP BY in a SELECT statement

From

"David D. Kilzer"

Date:

14 May 2001, 16:46:18

Tom,

Thanks for the reply, and for correcting my broken SQL.  :^)

A couple days after I sent my own follow-up message, I realized how
silly my question was.  I'll let future questions sit a bit longer
next time.

To make up for my silly question, here is a function + aggregate I 
created while working on a different aspect of my original problem.
Perhaps someone will find this useful.

The function, 'booland', returns the logical 'AND' of two input values.

DROP FUNCTION "booland" (bool, bool);

CREATE FUNCTION "booland" (bool, bool)       RETURNS bool            AS 'BEGIN RETURN $1 AND $2; END;'      LANGUAGE
'PLPGSQL';

The aggregate, 'logical_and', returns the logical 'AND' of all values in
a column in an aggregated query.

DROP AGGREGATE logical_and bool;

CREATE AGGREGATE logical_and (         SFUNC1 = booland,       BASETYPE = bool,         STYPE1 = bool,      INITCOND1 =
't'
);

Obviously, this concept could be extended for a logical 'OR' function
and aggregate.

Dave


On Fri, May 11, 2001 at 12:33:52PM -0400, Tom Lane wrote:

> "David D. Kilzer" <ddkilzer@lubricants-oil.com> writes:
> > [ wants to write an aggregate function that returns its last input ]
> 
> The SQL model of query processing has a very definite view of the stages
> of processing: first group by, then aggregate, and last order by.  Tuple
> ordering is irrelevant according to the basic semantics of the language.
> Probably the SQL authors would have left out ORDER BY entirely if they
> could have got away with it, but instead they made it a vestigial
> appendage that is only allowed at the very last instant before query
> outputs are forwarded to a client application.
> 
> Thus, it is very bad form to write an aggregate that depends on the
> order it sees its inputs in.  This won't be changed, because it's part
> of the nature of the language.
> 
> In PG 7.1 it's possible to hack around this by ordering the result of
> a subselect-in-FROM:
> 
>     SELECT orderedagg(ss.x) FROM (select x from tab order by y) ss;
> 
> which is a gross violation of the letter and spirit of the spec, and
> should not be expected to be portable to other DBMSes; but it gets the
> job done if you are intent on writing an ordering-dependent aggregate.
> 
> However, I don't see any good way to combine this with grouping, since
> if you apply GROUP BY to the output of the subselect you'll lose the
> ordering again.
> 
> >   SELECT r.personid               AS personid
> >         ,SUM(r.laps)              AS laps
> >         ,COUNT(DISTINCT r.id)     AS nightsraced
> >         ,(SELECT r.carid
> >             FROM race r
> >            WHERE r.personid = 14 
> >         ORDER BY r.date DESC
> >            LIMIT 1)               AS carid
> >     FROM race r
> >    WHERE r.personid = 14
> > GROUP BY r.personid
> > ORDER BY r.date;
> 
> This is likely to be reasonably efficient, actually, since the subselect
> will be evaluated only once per output group --- in fact, as you've
> written it it'll only be evaluated once, period, since it has no
> dependencies on the outer query.  More usually you'd probably do
> 
>         ,(SELECT r2.carid
>             FROM race r2
>            WHERE r2.personid = r.personid
>         ORDER BY r2.date DESC
>            LIMIT 1)               AS carid
> 
> so that the result tracks the outer query, and in this form it'd be
> redone once per output row.
> 
>             regards, tom lane