SELECT DISTINCT ON... ORDER BY... - Mailing list pgsql-general

From Arthur M. Kang
Subject SELECT DISTINCT ON... ORDER BY...
Date
Msg-id NDBBJOJLLCCDNFFLDPPNGEDICAAA.arthur@levelogic.com
Whole thread Raw
Responses Re: SELECT DISTINCT ON... ORDER BY...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Did massive amounts of searching and read Tom Lane's post regarding the subject, but that was dated January of 1999.  Was wondering if anyone know if there was any progress on the issue and what the resulting outcome was.
 
Is there a way to select distinct on one column and sort by another?
 
Any help is appreciated...
 
Arthur
 
Thomas Metz <tmetz@gsf.de> writes:
> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
> [doesn't work as expected]
 
There have been related discussions before on pg-hackers mail list;
you might care to check the list archives.  The conclusion I recall
is that it's not real clear how the combination of SELECT DISTINCT
on one column and ORDER BY on another *should* work.  Postgres'
current behavior is clearly wrong IMHO, but there isn't a unique
definition of right behavior, because it's not clear which tuples
should get selected for the sort.
 
This "SELECT DISTINCT ON attribute" option strikes me as even more
bogus.  Where did we get that from --- is it in the SQL92 standard?
If you SELECT DISTINCT on a subset of the attributes to be returned,
then there's no unique definition of which values get returned in the
other columns.  In Thomas' example:
 
> Assuming the table TEST as follows:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
> 1      Thomas
> 2      Fenella
 
> SELECT DISTINCT ON id id, name FROM test;
> produces:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
 
There's no justifiable reason for preferring this output over
        1      Thomas
        2      Oliver
or
        1      Alex
        2      Fenella
or
        1      Thomas
        2      Fenella
 
Any of these are "DISTINCT ON id", but it's purely a matter of
happenstance table order and unspecified implementation choices which
one will appear.  Do we really have (or want) a statement with
inherently undefined behavior?
 
Anyway, to answer Thomas' question, the way SELECT DISTINCT is
implemented is that first there's a sort on the DISTINCT columns,
then there's a pass that eliminates adjacent duplicates (like the Unix
uniq(1) program).  In the current backend, doing an ORDER BY on another
column overrides the sorting on the DISTINCT columns, so when the
duplicate-eliminator runs it will fail to get rid of duplicates that
don't happen to appear consecutively in its input.  That's pretty
broken, but then the entire concept of combining these two options
doesn't seem well defined; the SELECT DISTINCT doesn't make any promises
about which tuples (with the same DISTINCT columns) it's going to pick,
therefore the result of ordering by some other column isn't clear.
 
If you're willing to live with poorly defined behavior, the fix
is fairly obvious: run the sort and uniq passes for the DISTINCT
columns, *then* run the sort on the ORDER BY columns --- which
will use whichever tuple the DISTINCT phase selected at random
out of each set with the same DISTINCT value.
 
I think the issue got put on the back burner last time in hopes that
some definition with consistent behavior would come up, but I haven't
seen any hope that there is one.
 
                        regards, tom lane

pgsql-general by date:

Previous
From: Lamar Owen
Date:
Subject: Re: Postgres 7.0.2-2 on Red Hat 7.0?
Next
From: "Nikolaus Rumm"
Date:
Subject: getBigDecimal() in JDBC driver not yet implemented ?