Thread: ORDER BY and DISTINCT ON
We reject the following query: nconway=# create table abc (a int, b int, c int); CREATE TABLE nconway=# select distinct on (a) a, b, c from abc order by b, c, a; ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions This works fine, of course: nconway=# select distinct on (a) a, b, c from abc order by a, b, c;a | b | c ---+---+--- (0 rows) src/backend/parser/parse_clause.c notes: /* * If the user writes both DISTINCT ON and ORDER BY, then the * two expression lists must match (until oneor the other * runs out). Otherwise the ORDER BY requires a different * sort order than the DISTINCT does, andwe can't implement * that with only one sort pass (and if we do two passes, the * results will be rather unpredictable).However, it's OK to * have more DISTINCT ON expressions than ORDER BY * expressions; we can just addthe extra DISTINCT values to * the sort list, much as we did above for ordinary DISTINCT * fields. * *Actually, it'd be OK for the common prefixes of the two * lists to match in any order, but implementing that check * seems like more trouble than it's worth. */ Does this strike anyone else as being wrong? -Neil
On Fri, Dec 12, 2003 at 18:39:20 -0500, Neil Conway <neilc@samurai.com> wrote: > /* > * If the user writes both DISTINCT ON and ORDER BY, then the > * two expression lists must match (until one or the other > * runs out). Otherwise the ORDER BY requires a different > * sort order than the DISTINCT does, and we can't implement > * that with only one sort pass (and if we do two passes, the > * results will be rather unpredictable). However, it's OK to > * have more DISTINCT ON expressions than ORDER BY > * expressions; we can just add the extra DISTINCT values to > * the sort list, much as we did above for ordinary DISTINCT > * fields. > * > * Actually, it'd be OK for the common prefixes of the two > * lists to match in any order, but implementing that check > * seems like more trouble than it's worth. > */ > > Does this strike anyone else as being wrong? These seem like reasonable restrictions. There are easy work arounds for the above restrictions, so the restrictions aren't a significant burden. In a world with unlimited developer resources it would be nice to be able to properly handle any order by list. In the real world I doubt that that benefit is worth having a major developer work on this rather than working on any of a number of other things which will result in more benefit.
Neil Conway <neilc@samurai.com> writes: > We reject the following query: > > nconway=# create table abc (a int, b int, c int); > CREATE TABLE > nconway=# select distinct on (a) a, b, c from abc order by b, c, a; > ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY > expressions What would you expect to happen here? Do you really want: select distinct on (b,c,a) a,b,c from abc order by b,c,a; or is that you want select * from (select distinct on (a) a,b,c order by a) order by b,c,a; Ie, pick a random record for each a and then sort by b,c? Think of DISTINCT ON as a special form of GROUP BY that instead of doing aggregate just returns the first record. So, like DISTINCT ON, GROUP BY also insists on the user providing the ORDER BY clause. I suppose you could argue postgres could implicitly introduce an extra sort step when the user-provided ORDER BY doesn't match the GROUP BY or DISTINCT ON clause but it seems like the user is probably confused if he really wants a random record and then sort on columns that weren't sorted previous to the DISTINCT ON. -- greg
Greg Stark <gsstark@mit.edu> writes: > Do you really want: > > select distinct on (b,c,a) a,b,c from abc order by b,c,a; > > or is that you want > > select * from (select distinct on (a) a,b,c order by a) order by > b,c,a; If I understand you correctly, I don't think I would expect either. - ORDER BY provides a sort order for the result set - DISTINCT ON specifies a list of expressions, and says: "For each set of rows in the result set for which these expressionsare all equal, retain the first row and throw the rest away", where the "first row" is defined by the ORDERBY sort order So I'd expect this query to (a) keep at most one of every distinct 'a' value. When throwing out duplicates, we should keep the row that wouldcome first as specified by the ORDER BY sort order (b) sort the result set by b,c,a ISTM this interpretation is pretty logical, and that the current restriction is made purely for the sake of ease-of-implementation. If that's the case, we should at least document this restriction, and perhaps plan on correcting it in the future. -Neil
On Sat, Dec 13, 2003 at 22:12:32 -0500, Greg Stark <gsstark@mit.edu> wrote: > > So, like DISTINCT ON, GROUP BY also insists on the user providing the ORDER BY > clause. I suppose you could argue postgres could implicitly introduce an extra > sort step when the user-provided ORDER BY doesn't match the GROUP BY or > DISTINCT ON clause but it seems like the user is probably confused if he > really wants a random record and then sort on columns that weren't sorted > previous to the DISTINCT ON. You can make the result deterministic by using an initial order by that uses the distinct expressions followed by the order by expressions. After that is used to get which records will be returned, a second sort is done using just the expressions on the order by clause.
Greg Stark <gsstark@mit.edu> writes: > So, like DISTINCT ON, GROUP BY also insists on the user providing the > ORDER BY clause. I suppose you could argue postgres could implicitly > introduce an extra sort step when the user-provided ORDER BY doesn't > match the GROUP BY or DISTINCT ON clause but it seems like the user is > probably confused if he really wants a random record and then sort on > columns that weren't sorted previous to the DISTINCT ON. This was discussed before --- see the archives. I believe the conclusion was that the results would actually be nondeterministic if we used two sort steps (that's what the code comment means by "rather unpredictable"). This is not unrelated to the reasons why people consider DISTINCT ON to be a messy feature ... ideally it should be orthogonal to ORDER BY, but it simply isn't. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > This was discussed before --- see the archives. I believe the > conclusion was that the results would actually be nondeterministic > if we used two sort steps (that's what the code comment means by > "rather unpredictable"). Does the non-determinism you're referring to result from an ORDER BY on a non-deterministic expression, or the non-determinism that results from picking an effectively random row because the ORDER BY isn't sufficient? I searched the archives and found Stephen Szabo's comment[1] that: The query you've written is potential non-deterministic if you have a people_id that has multiple rows with different lastnames that meet the where clause. Which seems like an unconvincing justification for rejecting the query: we accept DISTINCT ON with no ORDER BY clause at all, for example. -Neil [1] http://archives.postgresql.org/pgsql-hackers/2001-07/msg00588.php
On Sun, Dec 14, 2003 at 18:09:33 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > This was discussed before --- see the archives. I believe the > conclusion was that the results would actually be nondeterministic > if we used two sort steps (that's what the code comment means by > "rather unpredictable"). This is not unrelated to the reasons why > people consider DISTINCT ON to be a messy feature ... ideally it > should be orthogonal to ORDER BY, but it simply isn't. If the sort used to select the records sorts on both the distinct expressions and the order by expressions you will get a sensible deterministic result. That result can then be sorted using just the order by expressions.
Neil Conway <neilc@samurai.com> writes: > Does the non-determinism you're referring to result from an ORDER BY > on a non-deterministic expression, or the non-determinism that results > from picking an effectively random row because the ORDER BY isn't > sufficient? The latter --- you don't know which row you'll get, because it depends on what the sorting procedure does with equal keys. (I think. This argument was a few years ago and I've not bothered to review the archives.) With ordinary DISTINCT this does not matter because you can't tell the difference between "equal" rows anyway --- but with DISTINCT ON, you can tell the difference. > Which seems like an unconvincing justification for rejecting the > query: we accept DISTINCT ON with no ORDER BY clause at all, for > example. Well, we invent an ORDER BY clause matching the DISTINCT ON in that case. The rationale for doing so is weak, I agree, but since you have not specified a sort order, you can hardly argue that the result is wrong. I think you are correct that this restriction is essentially an efficiency hack. But DISTINCT ON is in itself an efficiency hack. I'm not sure I see the point of allowing a less-efficient variation of the efficiency hack, which is what we'd have if we supported DISTINCT ON with a non-matching ORDER BY. Certainly it doesn't seem important enough to expend significant implementation effort on. regards, tom lane
Bruno Wolff III <bruno@wolff.to> writes: > If the sort used to select the records sorts on both the distinct > expressions and the order by expressions you will get a sensible > deterministic result. Sensible in what sense? ;-) It seems to me that the existing documentation defines the behavior of DISTINCT ON as selecting the row within each DISTINCT ON group that is first according to the ORDER BY columns that are less significant than the DISTINCT ON keys. Perhaps this is not clear enough and should be clarified. But it doesn't seem very useful to me to extend the behavior to allow other cases ... what are you really buying if you do so, and what will it cost in execution time? regards, tom lane
On Sun, Dec 14, 2003 at 22:17:35 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > If the sort used to select the records sorts on both the distinct > > expressions and the order by expressions you will get a sensible > > deterministic result. > > Sensible in what sense? ;-) Doing things as above is pretty much the same as normal distinct on for purposes of which rows get selected. Of the possible rows that might get returned for a specific set of values from the distinct on expressions you will get the row that is first as ordered by the expressions in the order by clause. If the order by clause isn't selective enough there may be several rows that could be selected, but that is true for how distinct on works now. > It seems to me that the existing documentation defines the behavior of > DISTINCT ON as selecting the row within each DISTINCT ON group that is > first according to the ORDER BY columns that are less significant than > the DISTINCT ON keys. Perhaps this is not clear enough and should be > clarified. But it doesn't seem very useful to me to extend the behavior > to allow other cases ... what are you really buying if you do so, and > what will it cost in execution time? What it buys is saving a bit of typing. The main slow down would be in trying to determine whether you needed to use two sorts or just one in a particular case. This shouldn't cost too much time, but might be an issue if there are a lot of columns in both the distinct on and order by. I do aggree that this isn't an area I would like to see developers working on. I would much rather see developers work on things that add real functionallity rather that that save some typing in unusual cases.
On Mon, Dec 15, 2003 at 06:14:59 -0600, Bruno Wolff III <bruno@wolff.to> wrote: > > Doing things as above is pretty much the same as normal distinct on > for purposes of which rows get selected. Of the possible rows that > might get returned for a specific set of values from the distinct on > expressions you will get the row that is first as ordered by the > expressions in the order by clause. If the order by clause isn't selective > enough there may be several rows that could be selected, but that is true > for how distinct on works now. Specifically the interpretation I think makes sense is that SELECT DISTINCT ON (a, b, c) * FROM tablename ORDER BY d, e, f should be treated as equvialent to SELECT * FROM (SELECT DISTINCT ON (a, b, c) FROM tablename ORDER BY a, b, c, d, e, f) AS t ORDER BY d, e, f
Bruno Wolff III <bruno@wolff.to> writes: > Specifically the interpretation I think makes sense is that > SELECT DISTINCT ON (a, b, c) * FROM tablename ORDER BY d, e, f > should be treated as equvialent to > SELECT * FROM > (SELECT DISTINCT ON (a, b, c) FROM tablename ORDER BY a, b, c, d, e, f) AS t > ORDER BY d, e, f Right, that's the same thing Neil said elsewhere in the thread. I guess I agree that this would be cleaner, and also that it doesn't seem like a high priority problem. BTW, you could imagine implementing DISTINCT and DISTINCT ON by a hash table, if you didn't expect too many distinct rows. The hash key would be the DISTINCT columns, and in the DISTINCT ON case, when you get a new row with DISTINCT columns matching an existing entry, you compare the two rows using the ORDER BY key columns to decide which row to keep. In this formulation it's fairly obvious that the DISTINCT and ORDER BY keys don't need to overlap at all. This technique does not require presorted input (good) but also delivers unsorted output (bad). It could still be a big win if you expect the DISTINCT filtering to reduce the number of rows substantially, because you'd end up sorting a much smaller number of rows. I didn't get around to trying to implement this when I was doing hash aggregation for 7.4, but it seems like a reasonable item for the TODO list. regards, tom lane