Re: SELECT DISTINCT ON ... ORDER BY ... - Mailing list pgsql-sql

From Tom Lane
Subject Re: SELECT DISTINCT ON ... ORDER BY ...
Date
Msg-id 25791.917629372@sss.pgh.pa.us
Whole thread Raw
In response to SELECT DISTINCT ON ... ORDER BY ...  (Thomas Metz <tmetz@gsf.de>)
Responses Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...  ("Oliver Elphick" <olly@lfix.co.uk>)
List pgsql-sql
Thomas Metz <tmetz@gsf.de> writes:
> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
> [doesn't work as expected]

There have been related discussions before on pg-hackers mail list;
you might care to check the list archives.  The conclusion I recall
is that it's not real clear how the combination of SELECT DISTINCT
on one column and ORDER BY on another *should* work.  Postgres'
current behavior is clearly wrong IMHO, but there isn't a unique
definition of right behavior, because it's not clear which tuples
should get selected for the sort.

This "SELECT DISTINCT ON attribute" option strikes me as even more
bogus.  Where did we get that from --- is it in the SQL92 standard?
If you SELECT DISTINCT on a subset of the attributes to be returned,
then there's no unique definition of which values get returned in the
other columns.  In Thomas' example:

> Assuming the table TEST as follows:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
> 1      Thomas
> 2      Fenella

> SELECT DISTINCT ON id id, name FROM test;
> produces:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver

There's no justifiable reason for preferring this output over
    1      Thomas
    2      Oliver
or
    1      Alex
    2      Fenella
or
    1      Thomas
    2      Fenella

Any of these are "DISTINCT ON id", but it's purely a matter of
happenstance table order and unspecified implementation choices which
one will appear.  Do we really have (or want) a statement with
inherently undefined behavior?

Anyway, to answer Thomas' question, the way SELECT DISTINCT is
implemented is that first there's a sort on the DISTINCT columns,
then there's a pass that eliminates adjacent duplicates (like the Unix
uniq(1) program).  In the current backend, doing an ORDER BY on another
column overrides the sorting on the DISTINCT columns, so when the
duplicate-eliminator runs it will fail to get rid of duplicates that
don't happen to appear consecutively in its input.  That's pretty
broken, but then the entire concept of combining these two options
doesn't seem well defined; the SELECT DISTINCT doesn't make any promises
about which tuples (with the same DISTINCT columns) it's going to pick,
therefore the result of ordering by some other column isn't clear.

If you're willing to live with poorly defined behavior, the fix
is fairly obvious: run the sort and uniq passes for the DISTINCT
columns, *then* run the sort on the ORDER BY columns --- which
will use whichever tuple the DISTINCT phase selected at random
out of each set with the same DISTINCT value.

I think the issue got put on the back burner last time in hopes that
some definition with consistent behavior would come up, but I haven't
seen any hope that there is one.

            regards, tom lane

pgsql-sql by date:

Previous
From: Remigiusz Sokolowski
Date:
Subject: Re: [SQL] Character type name?? How to lower case it?
Next
From: David Hartwig
Date:
Subject: Re: [SQL] ORDER BY