Thread: ORDER BY and DISTINCT ON

ORDER BY and DISTINCT ON

From

Neil Conway

Date:

12 December 2003, 19:39:52

We reject the following query:

nconway=# create table abc (a int, b int, c int);
CREATE TABLE
nconway=# select distinct on (a) a, b, c from abc order by b, c, a;
ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY
expressions

This works fine, of course:

nconway=# select distinct on (a) a, b, c from abc order by a, b, c;a | b | c 
---+---+---
(0 rows)

src/backend/parser/parse_clause.c notes:
    /*     * If the user writes both DISTINCT ON and ORDER BY, then the     * two expression lists must match (until
oneor the other     * runs out).  Otherwise the ORDER BY requires a different     * sort order than the DISTINCT does,
andwe can't implement     * that with only one sort pass (and if we do two passes, the     * results will be rather
unpredictable).However, it's OK to     * have more DISTINCT ON expressions than ORDER BY     * expressions; we can just
addthe extra DISTINCT values to     * the sort list, much as we did above for ordinary DISTINCT     * fields.     *
*Actually, it'd be OK for the common prefixes of the two     * lists to match in any order, but implementing that check
   * seems like more trouble than it's worth.     */
 

Does this strike anyone else as being wrong?

-Neil

Re: ORDER BY and DISTINCT ON

From

Bruno Wolff III

Date:

13 December 2003, 20:23:16

On Fri, Dec 12, 2003 at 18:39:20 -0500, Neil Conway <neilc@samurai.com> wrote:
>         /*
>          * If the user writes both DISTINCT ON and ORDER BY, then the
>          * two expression lists must match (until one or the other
>          * runs out).  Otherwise the ORDER BY requires a different
>          * sort order than the DISTINCT does, and we can't implement
>          * that with only one sort pass (and if we do two passes, the
>          * results will be rather unpredictable). However, it's OK to
>          * have more DISTINCT ON expressions than ORDER BY
>          * expressions; we can just add the extra DISTINCT values to
>          * the sort list, much as we did above for ordinary DISTINCT
>          * fields.
>          *
>          * Actually, it'd be OK for the common prefixes of the two
>          * lists to match in any order, but implementing that check
>          * seems like more trouble than it's worth.
>          */
> 
> Does this strike anyone else as being wrong?

These seem like reasonable restrictions. There are easy work arounds
for the above restrictions, so the restrictions aren't a significant burden.

In a world with unlimited developer resources it would be nice to be able
to properly handle any order by list. In the real world I doubt that
that benefit is worth having a major developer work on this rather
than working on any of a number of other things which will result in
more benefit.

Re: ORDER BY and DISTINCT ON

From

Greg Stark

Date:

13 December 2003, 23:12:59

Neil Conway <neilc@samurai.com> writes:

> We reject the following query:
> 
> nconway=# create table abc (a int, b int, c int);
> CREATE TABLE
> nconway=# select distinct on (a) a, b, c from abc order by b, c, a;
> ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY
> expressions

What would you expect to happen here?

Do you really want:

select distinct on (b,c,a) a,b,c from abc order by b,c,a;

or is that you want

select * from (select distinct on (a) a,b,c order by a) order by b,c,a;

Ie, pick a random record for each a and then sort by b,c?

Think of DISTINCT ON as a special form of GROUP BY that instead of doing
aggregate just returns the first record.

So, like DISTINCT ON, GROUP BY also insists on the user providing the ORDER BY
clause. I suppose you could argue postgres could implicitly introduce an extra
sort step when the user-provided ORDER BY doesn't match the GROUP BY or
DISTINCT ON clause but it seems like the user is probably confused if he
really wants a random record and then sort on columns that weren't sorted
previous to the DISTINCT ON.

-- 
greg

Re: ORDER BY and DISTINCT ON

From

Neil Conway

Date:

13 December 2003, 23:43:08

Greg Stark <gsstark@mit.edu> writes:
> Do you really want:
>
> select distinct on (b,c,a) a,b,c from abc order by b,c,a;
>
> or is that you want
>
> select * from (select distinct on (a) a,b,c order by a) order by
> b,c,a;

If I understand you correctly, I don't think I would expect either.
  - ORDER BY provides a sort order for the result set
  - DISTINCT ON specifies a list of expressions, and says: "For each    set of rows in the result set for which these
expressionsare all    equal, retain the first row and throw the rest away", where the    "first row" is defined by the
ORDERBY sort order

So I'd expect this query to
  (a) keep at most one of every distinct 'a' value. When throwing out      duplicates, we should keep the row that
wouldcome first as      specified by the ORDER BY sort order

  (b) sort the result set by b,c,a

ISTM this interpretation is pretty logical, and that the current
restriction is made purely for the sake of ease-of-implementation. If
that's the case, we should at least document this restriction, and
perhaps plan on correcting it in the future.

-Neil

Re: ORDER BY and DISTINCT ON

From

Bruno Wolff III

Date:

14 December 2003, 10:41:06

On Sat, Dec 13, 2003 at 22:12:32 -0500, Greg Stark <gsstark@mit.edu> wrote:
> 
> So, like DISTINCT ON, GROUP BY also insists on the user providing the ORDER BY
> clause. I suppose you could argue postgres could implicitly introduce an extra
> sort step when the user-provided ORDER BY doesn't match the GROUP BY or
> DISTINCT ON clause but it seems like the user is probably confused if he
> really wants a random record and then sort on columns that weren't sorted
> previous to the DISTINCT ON.

You can make the result deterministic by using an initial order by
that uses the distinct expressions followed by the order by expressions.
After that is used to get which records will be returned, a second sort
is done using just the expressions on the order by clause.

Re: ORDER BY and DISTINCT ON

From

Tom Lane

Date:

14 December 2003, 19:10:16

Greg Stark <gsstark@mit.edu> writes:
> So, like DISTINCT ON, GROUP BY also insists on the user providing the
> ORDER BY clause. I suppose you could argue postgres could implicitly
> introduce an extra sort step when the user-provided ORDER BY doesn't
> match the GROUP BY or DISTINCT ON clause but it seems like the user is
> probably confused if he really wants a random record and then sort on
> columns that weren't sorted previous to the DISTINCT ON.

This was discussed before --- see the archives.  I believe the
conclusion was that the results would actually be nondeterministic
if we used two sort steps (that's what the code comment means by
"rather unpredictable").  This is not unrelated to the reasons why
people consider DISTINCT ON to be a messy feature ... ideally it
should be orthogonal to ORDER BY, but it simply isn't.
        regards, tom lane

Re: ORDER BY and DISTINCT ON

From

Neil Conway

Date:

14 December 2003, 22:45:23

Tom Lane <tgl@sss.pgh.pa.us> writes:
> This was discussed before --- see the archives.  I believe the
> conclusion was that the results would actually be nondeterministic
> if we used two sort steps (that's what the code comment means by
> "rather unpredictable").

Does the non-determinism you're referring to result from an ORDER BY
on a non-deterministic expression, or the non-determinism that results
from picking an effectively random row because the ORDER BY isn't
sufficient?

I searched the archives and found Stephen Szabo's comment[1] that:
 The query you've written is potential non-deterministic if you have a people_id that has multiple rows with different
lastnames that meet the where clause.

Which seems like an unconvincing justification for rejecting the
query: we accept DISTINCT ON with no ORDER BY clause at all, for
example.

-Neil

[1] http://archives.postgresql.org/pgsql-hackers/2001-07/msg00588.php

Re: ORDER BY and DISTINCT ON

From

Bruno Wolff III

Date:

14 December 2003, 22:51:48

On Sun, Dec 14, 2003 at 18:09:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> This was discussed before --- see the archives.  I believe the
> conclusion was that the results would actually be nondeterministic
> if we used two sort steps (that's what the code comment means by
> "rather unpredictable").  This is not unrelated to the reasons why
> people consider DISTINCT ON to be a messy feature ... ideally it
> should be orthogonal to ORDER BY, but it simply isn't.

If the sort used to select the records sorts on both the distinct
expressions and the order by expressions you will get a sensible
deterministic result. That result can then be sorted using just
the order by expressions.

Re: ORDER BY and DISTINCT ON

From

Tom Lane

Date:

14 December 2003, 23:00:45

Neil Conway <neilc@samurai.com> writes:
> Does the non-determinism you're referring to result from an ORDER BY
> on a non-deterministic expression, or the non-determinism that results
> from picking an effectively random row because the ORDER BY isn't
> sufficient?

The latter --- you don't know which row you'll get, because it depends
on what the sorting procedure does with equal keys.  (I think.  This
argument was a few years ago and I've not bothered to review the
archives.)  With ordinary DISTINCT this does not matter because you
can't tell the difference between "equal" rows anyway --- but with
DISTINCT ON, you can tell the difference.

> Which seems like an unconvincing justification for rejecting the
> query: we accept DISTINCT ON with no ORDER BY clause at all, for
> example.

Well, we invent an ORDER BY clause matching the DISTINCT ON in that
case.  The rationale for doing so is weak, I agree, but since you have
not specified a sort order, you can hardly argue that the result is
wrong.

I think you are correct that this restriction is essentially an
efficiency hack.  But DISTINCT ON is in itself an efficiency hack.
I'm not sure I see the point of allowing a less-efficient variation
of the efficiency hack, which is what we'd have if we supported
DISTINCT ON with a non-matching ORDER BY.  Certainly it doesn't seem
important enough to expend significant implementation effort on.
        regards, tom lane

Re: ORDER BY and DISTINCT ON

From

Tom Lane

Date:

14 December 2003, 23:18:02

Bruno Wolff III <bruno@wolff.to> writes:
> If the sort used to select the records sorts on both the distinct
> expressions and the order by expressions you will get a sensible
> deterministic result.

Sensible in what sense?  ;-)

It seems to me that the existing documentation defines the behavior of
DISTINCT ON as selecting the row within each DISTINCT ON group that is
first according to the ORDER BY columns that are less significant than
the DISTINCT ON keys.  Perhaps this is not clear enough and should be
clarified.  But it doesn't seem very useful to me to extend the behavior
to allow other cases ... what are you really buying if you do so, and
what will it cost in execution time?
        regards, tom lane

Re: ORDER BY and DISTINCT ON

From

Bruno Wolff III

Date:

15 December 2003, 08:15:54

On Sun, Dec 14, 2003 at 22:17:35 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruno Wolff III <bruno@wolff.to> writes:
> > If the sort used to select the records sorts on both the distinct
> > expressions and the order by expressions you will get a sensible
> > deterministic result.
> 
> Sensible in what sense?  ;-)

Doing things as above is pretty much the same as normal distinct on
for purposes of which rows get selected. Of the possible rows that
might get returned for a specific set of values from the distinct on
expressions you will get the row that is first as ordered by the
expressions in the order by clause. If the order by clause isn't selective
enough there may be several rows that could be selected, but that is true
for how distinct on works now.

> It seems to me that the existing documentation defines the behavior of
> DISTINCT ON as selecting the row within each DISTINCT ON group that is
> first according to the ORDER BY columns that are less significant than
> the DISTINCT ON keys.  Perhaps this is not clear enough and should be
> clarified.  But it doesn't seem very useful to me to extend the behavior
> to allow other cases ... what are you really buying if you do so, and
> what will it cost in execution time?

What it buys is saving a bit of typing. The main slow down would be in trying
to determine whether you needed to use two sorts or just one in a particular
case. This shouldn't cost too much time, but might be an issue if there
are a lot of columns in both the distinct on and order by.

I do aggree that this isn't an area I would like to see developers working on.
I would much rather see developers work on things that add real functionallity
rather that that save some typing in unusual cases.

Re: ORDER BY and DISTINCT ON

From

Bruno Wolff III

Date:

15 December 2003, 10:04:17

On Mon, Dec 15, 2003 at 06:14:59 -0600, Bruno Wolff III <bruno@wolff.to> wrote:
> 
> Doing things as above is pretty much the same as normal distinct on
> for purposes of which rows get selected. Of the possible rows that
> might get returned for a specific set of values from the distinct on
> expressions you will get the row that is first as ordered by the
> expressions in the order by clause. If the order by clause isn't selective
> enough there may be several rows that could be selected, but that is true
> for how distinct on works now.

Specifically the interpretation I think makes sense is that
SELECT DISTINCT ON (a, b, c) * FROM tablename ORDER BY d, e, f
should be treated as equvialent to
SELECT * FROM (SELECT DISTINCT ON (a, b, c) FROM tablename ORDER BY a, b, c, d, e, f) AS t ORDER BY d, e, f

Re: ORDER BY and DISTINCT ON

From

Tom Lane

Date:

15 December 2003, 10:53:51

Bruno Wolff III <bruno@wolff.to> writes:
> Specifically the interpretation I think makes sense is that
> SELECT DISTINCT ON (a, b, c) * FROM tablename ORDER BY d, e, f
> should be treated as equvialent to
> SELECT * FROM
>   (SELECT DISTINCT ON (a, b, c) FROM tablename ORDER BY a, b, c, d, e, f) AS t
>   ORDER BY d, e, f

Right, that's the same thing Neil said elsewhere in the thread.  I guess
I agree that this would be cleaner, and also that it doesn't seem like a
high priority problem.

BTW, you could imagine implementing DISTINCT and DISTINCT ON by a hash
table, if you didn't expect too many distinct rows.  The hash key would
be the DISTINCT columns, and in the DISTINCT ON case, when you get a new
row with DISTINCT columns matching an existing entry, you compare the
two rows using the ORDER BY key columns to decide which row to keep.  In
this formulation it's fairly obvious that the DISTINCT and ORDER BY keys
don't need to overlap at all.  This technique does not require presorted
input (good) but also delivers unsorted output (bad).  It could still be
a big win if you expect the DISTINCT filtering to reduce the number of
rows substantially, because you'd end up sorting a much smaller number
of rows.  I didn't get around to trying to implement this when I was
doing hash aggregation for 7.4, but it seems like a reasonable item for
the TODO list.
        regards, tom lane