Thread: Column aliases in GROUP BY and HAVING

Column aliases in GROUP BY and HAVING

From

"Peter J. Holzer"

Date:

12 October 2020, 17:10:08

In my mental model of how SQL works, the clauses of an SQL query (if
present) are processed in a certain order:

    FROM
    WHERE
    SELECT
    GROUP BY
    HAVING
    ORDER BY
    LIMIT

and each processes the output of the previous one.

However, consider this:

    hjp=> select * from employees;
    ╔════════════╤═══════╤════════════╗
    ║    ssn     │ name  │ other_data ║
    ╟────────────┼───────┼────────────╢
    ║ 1234010400 │ Alice │ (∅)        ║
    ║ 2345180976 │ Bob   │ (∅)        ║
    ║ 2645101276 │ Carol │ (∅)        ║
    ║ 9843100395 │ David │ (∅)        ║
    ╚════════════╧═══════╧════════════╝

    hjp=> select substring(ssn, 9, 2) as year, count(*) from employees group by year;
    ╔══════╤═══════╗
    ║ year │ count ║
    ╟──────┼───────╢
    ║ 95   │     1 ║
    ║ 76   │     2 ║
    ║ 00   │     1 ║
    ╚══════╧═══════╝
    (3 rows)

In the GROUP BY clause I can use the alias year which was defined
earlier in SELECT.

HAVING comes after GROUP BY, so I should be able to use that there, too.
Right?

    hjp=> select substring(ssn, 9, 2) as year, count(*) from employees group by year having year > '20';
    ERROR:  column "year" does not exist
    LINE 1: ...ear, count(*) from employees group by year having year > '20...

Wrong. I have to use the whole expression again:

    hjp=> select substring(ssn, 9, 2) as year, count(*) from employees group by year having substring(ssn, 9, 2) >
'20';
    ╔══════╤═══════╗
    ║ year │ count ║
    ╟──────┼───────╢
    ║ 95   │     1 ║
    ║ 76   │     2 ║
    ╚══════╧═══════╝
    (2 rows)


This seems inconsistent to me. Is there a technical or semantic reason
for this or is just "because the standard says so".

        hp

PS: Please no discussions about the appropriateness of using an SSN as
an id. This is a completely made-up example.

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc

Re: Column aliases in GROUP BY and HAVING

From

Tom Lane

Date:

12 October 2020, 17:40:03

"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> In my mental model of how SQL works, the clauses of an SQL query (if
> present) are processed in a certain order:

>     FROM
>     WHERE
>     SELECT
>     GROUP BY
>     HAVING
>     ORDER BY
>     LIMIT

The SELECT list is certainly done after GROUP BY/HAVING.  Consider

SELECT 1/x FROM ... GROUP BY x HAVING x > 0;

One would be justifiably upset to get a divide-by-zero error from this.

Its relationship to ORDER BY is a bit more tenuous, mainly because of
the SQL92 legacy notion that you can ORDER BY a select-list column.
I'm too lazy to check the code right now, but I think our current
policy is that SELECT expressions are evaluated after ORDER BY/LIMIT
unless they are used as GROUP BY or ORDER BY keys.  Without that,
you'd get unhappy performance results from

SELECT id, expensive_function(x) FROM ... ORDER BY id LIMIT 1;

> In the GROUP BY clause I can use the alias year which was defined
> earlier in SELECT. 

This is a pretty unfortunate legacy thing that we support because
backwards compatibility (and because "GROUP BY 1" is so frequently
a handy shortcut).  Semantically, it's a mess, not only because of
the when-to-evaluate confusion but because it's not too clear
whether a column name refers to a SELECT output column or to some
table column emitted by the FROM clause.  We try to limit the
potential for that sort of confusion by only trying to match
SELECT output names to GROUP/ORDER BY items when the latter are
simple identifiers.

> HAVING comes after GROUP BY, so I should be able to use that there, too.
> Right?

No.  We're not going there.  The core reason why not is that HAVING
expressions are seldom plain column names, so it wouldn't work anyway
unless we opened the floodgates on where SELECT output names could be
matched.

> This seems inconsistent to me. Is there a technical or semantic reason
> for this or is just "because the standard says so".

I think SQL versions newer than SQL92 disallow these references entirely.

(I'm being fairly brief here, but this has all been litigated multiple
times before.  See the archives.)

            regards, tom lane

Re: Column aliases in GROUP BY and HAVING

From

"Peter J. Holzer"

Date:

12 October 2020, 20:03:52

On 2020-10-12 10:40:03 -0400, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > In the GROUP BY clause I can use the alias year which was defined
> > earlier in SELECT.
>
> This is a pretty unfortunate legacy thing that we support because
> backwards compatibility (and because "GROUP BY 1" is so frequently
> a handy shortcut).
[...]
> I think SQL versions newer than SQL92 disallow these references entirely.
>
> (I'm being fairly brief here, but this has all been litigated multiple
> times before.  See the archives.)

That's ok, you've answered my question. Thanks.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc