Re: SELECT DISTINCT ON ... ORDER BY ... - Mailing list pgsql-sql
From | Tom Lane |
---|---|
Subject | Re: SELECT DISTINCT ON ... ORDER BY ... |
Date | |
Msg-id | 25791.917629372@sss.pgh.pa.us Whole thread Raw |
In response to | SELECT DISTINCT ON ... ORDER BY ... (Thomas Metz <tmetz@gsf.de>) |
Responses |
Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...
|
List | pgsql-sql |
Thomas Metz <tmetz@gsf.de> writes: > SELECT DISTINCT ON id id, name FROM test ORDER BY name; > [doesn't work as expected] There have been related discussions before on pg-hackers mail list; you might care to check the list archives. The conclusion I recall is that it's not real clear how the combination of SELECT DISTINCT on one column and ORDER BY on another *should* work. Postgres' current behavior is clearly wrong IMHO, but there isn't a unique definition of right behavior, because it's not clear which tuples should get selected for the sort. This "SELECT DISTINCT ON attribute" option strikes me as even more bogus. Where did we get that from --- is it in the SQL92 standard? If you SELECT DISTINCT on a subset of the attributes to be returned, then there's no unique definition of which values get returned in the other columns. In Thomas' example: > Assuming the table TEST as follows: > ID NAME > - ----------------- > 1 Alex > 2 Oliver > 1 Thomas > 2 Fenella > SELECT DISTINCT ON id id, name FROM test; > produces: > ID NAME > - ----------------- > 1 Alex > 2 Oliver There's no justifiable reason for preferring this output over 1 Thomas 2 Oliver or 1 Alex 2 Fenella or 1 Thomas 2 Fenella Any of these are "DISTINCT ON id", but it's purely a matter of happenstance table order and unspecified implementation choices which one will appear. Do we really have (or want) a statement with inherently undefined behavior? Anyway, to answer Thomas' question, the way SELECT DISTINCT is implemented is that first there's a sort on the DISTINCT columns, then there's a pass that eliminates adjacent duplicates (like the Unix uniq(1) program). In the current backend, doing an ORDER BY on another column overrides the sorting on the DISTINCT columns, so when the duplicate-eliminator runs it will fail to get rid of duplicates that don't happen to appear consecutively in its input. That's pretty broken, but then the entire concept of combining these two options doesn't seem well defined; the SELECT DISTINCT doesn't make any promises about which tuples (with the same DISTINCT columns) it's going to pick, therefore the result of ordering by some other column isn't clear. If you're willing to live with poorly defined behavior, the fix is fairly obvious: run the sort and uniq passes for the DISTINCT columns, *then* run the sort on the ORDER BY columns --- which will use whichever tuple the DISTINCT phase selected at random out of each set with the same DISTINCT value. I think the issue got put on the back burner last time in hopes that some definition with consistent behavior would come up, but I haven't seen any hope that there is one. regards, tom lane