Re: RFE: Column aliases in WHERE clauses - Mailing list pgsql-general

From Tom Lane
Subject Re: RFE: Column aliases in WHERE clauses
Date
Msg-id 25320.1347924765@sss.pgh.pa.us
Whole thread Raw
In response to Re: RFE: Column aliases in WHERE clauses  (Mike Christensen <mike@kitchenpc.com>)
Responses Re: RFE: Column aliases in WHERE clauses
Re: RFE: Column aliases in WHERE clauses
List pgsql-general
Mike Christensen <mike@kitchenpc.com> writes:
> This definitely makes sense in the context of aggregation, but I'm
> wondering if the same argument applies in the use case originally
> posted:

> SELECT left(value, 1) as first_letter
> FROM some_table
> WHERE first_letter > 'a';

> Obviously, you can write this as:

> SELECT left(value, 1) as first_letter
> FROM some_table
> WHERE left(value, 1) > 'a';

> This would run fine, though you'd be doing a sequential scan on the
> entire table, getting the left most character in each value, then
> filtering those results.  This of course assumes you haven't built an
> index on left(value, 1).

> Thus, in theory the compiler *could* resolve the actual definition of
> first_letter and substitute in that expression on the fly.  I'm
> wondering if that concept is actually disallowed by the SQL spec.

Yes, it is.  If you read the spec you'll find that the scope of
visibility of names defined in the SELECT list doesn't include WHERE.

It's easier to understand why this is if you realize that SQL has a very
clear model of a "pipeline" of query execution.  Conceptually, what
happens is:

1. Form the cartesian product of the tables listed in FROM (ie, all
combinations of rows).

2. Apply the WHERE condition to each row from 1, and drop rows that
don't pass it.

3. If there's a GROUP BY, merge the surviving rows into groups.

4. If there's aggregate functions, compute those over the rows in
each group.

5. If there's a HAVING, filter the grouped rows according to that.

6. Evaluate the SELECT expressions for each remaining row.

7. If there's an ORDER BY, evaluate those expressions and sort the
remaining rows accordingly.

(Obviously, implementations try to improve on this - you don't want
to actually form the cartesian product - but that's the conceptual
model.)

The traditional shortcut of doing "ORDER BY select-column-reference"
is okay according to this world view, because the SELECT expressions
are already available when ORDER BY needs them.  However, it's not
sensible to refer to SELECT outputs in WHERE, HAVING, or GROUP BY,
because those steps precede the evaluation of the SELECT expressions.

This isn't just academic nit-picking either, because the SELECT
expressions might not be valid for rows that don't pass WHERE etc.
Consider
    SELECT 1/x AS inverse FROM data WHERE x <> 0;
The implementation *must* apply WHERE before computing the SELECT
expressions, or it'll get zero-divide failures that should not happen.

Now, having said all that, if you try it you'll find that Postgres
does allow select column references in GROUP BY, using the model
you propose above of copying whatever expression is in SELECT into
GROUP BY.  This is, to put it politely, a mistake that we are now
stuck with for backwards-compatibility reasons.  It's not spec compliant
and it doesn't fit the language's conceptual model, but it's been that
way for long enough that we're not likely to take it out.  We are not,
however, gonna introduce the same mistake elsewhere.

> Obviously, it would add complexity (and compile overhead) but would be
> somewhat handy to avoid repeating really complicated expressions.
> Perhaps Common Table Expressions are a better way of doing this thing
> anyhow.

CTEs or sub-selects are a better answer for that.  Each sub-select has
its own instance of the conceptual pipeline.

            regards, tom lane


pgsql-general by date:

Previous
From: Dann Corbit
Date:
Subject: Re: Official C++ API for postgresql?
Next
From: Mike Christensen
Date:
Subject: Re: RFE: Column aliases in WHERE clauses