Thread: Finding the "most recent" rows
I have a table (representing a set of observations) with datetime fields and a non-unique place field. e.g. create table obs ( the_time datetime, the_place char(8), ...other fields... ) I'd like an efficient way to pull out the most recent row (i.e. highest datatime) belonging to *each* of a number of places selected by a simple query. e.g. given a table such as: the_time the_place ... 0910 London 1130 London 0910 Paris 0930 London 0840 Paris 1020 London 0740 Paris I'd like to select: 1130 London 0910 Paris Most of my attempts at this (as an SQL novice) feel very clumsy and inefficient. Is there an efficient way of doing this in SQL? -- Julian Scarfe
Try SELECT the_place, max(the_time) FROM the_place GROUP BY the_place; Julian Scarfe wrote: > > I have a table (representing a set of observations) with datetime fields and a > non-unique place field. > > e.g. > create table obs ( > the_time datetime, > the_place char(8), > ...other fields... > ) > > I'd like an efficient way to pull out the most recent row (i.e. highest > datatime) belonging to *each* of a number of places selected by a simple > query. > > e.g. given a table such as: > > the_time the_place ... > 0910 London > 1130 London > 0910 Paris > 0930 London > 0840 Paris > 1020 London > 0740 Paris > > I'd like to select: > 1130 London > 0910 Paris > > Most of my attempts at this (as an SQL novice) feel very clumsy and > inefficient. Is there an efficient way of doing this in SQL? > -- > > Julian Scarfe -- Chris Bitmead http://www.bigfoot.com/~chris.bitmead mailto:chris.bitmead@bigfoot.com
Julian Scarfe wrote: <blockquote type="CITE">I have a table (representing a set of observations) with datetime fields anda <br />non-unique place field. <p>e.g. <br />create table obs ( <br />the_time datetime, <br />the_place char(8), <br/>...other fields... <br />) <p>I'd like an efficient way to pull out the most recent row (i.e. highest <br />datatime)belonging to *each* of a number of places selected by a simple <br />query. <p>e.g. given a table such as: <p>the_time the_place ... <br />0910 London <br />1130 London <br />0910 Paris <br />0930 London <br />0840 Paris <br />1020 London <br />0740 Paris <p>I'd like to select: <br />1130 London <br />0910 Paris <p>Most of my attempts at this (as an SQL novice) feel very clumsy and <br />inefficient.Is there an efficient way of doing this in SQL? <br />-- <p>Julian Scarfe</blockquote> If I understund theproblem try this: <br /> <tt></tt><p><tt>test=> create table test (time datetime default now(), place char(16));</tt><br/><tt>CREATE</tt><br /><tt>test=> insert into test (place) values ('London');</tt><br /><tt>INSERT 1948241</tt><br /><tt>test=> insert into test (place) values ('Paris');</tt><br /><tt>INSERT 194825 1</tt><br /><tt>.............</tt><br/><tt>test=> insert into test (place) values ('Berlin');</tt><br /><tt>INSERT 194835 1</tt><br/><tt>test=> insert into test (place) values ('London');</tt><br /><tt>INSERT 194836 1</tt><br /><tt>test=>insert into test (place) values ('Berlin');</tt><br /><tt>INSERT 194837 1</tt><br /><tt>test=> select *from test;</tt><br /><tt>time |place</tt><br /><tt>-----------------------------+----------------</tt><br/><tt>Thu 22 Apr 17:33:23 1999 EEST|London</tt><br /><tt>Thu22 Apr 17:33:30 1999 EEST|Paris</tt><br /><tt>Thu 22 Apr 17:33:36 1999 EEST|London</tt><br /><tt>Thu 22 Apr 17:33:491999 EEST|Madrid</tt><br /><tt>Thu 22 Apr 17:33:54 1999 EEST|Paris</tt><br /><tt>Thu 22 Apr 17:34:03 1999 EEST|Berlin</tt><br/><tt>Thu 22 Apr 17:34:05 1999 EEST|Madrid</tt><br /><tt>Thu 22 Apr 17:34:08 1999 EEST|London</tt><br/><tt>Thu 22 Apr 17:34:12 1999 EEST|Paris</tt><br /><tt>Thu 22 Apr 17:34:14 1999 EEST|Madrid</tt><br /><tt>Thu22 Apr 17:34:16 1999 EEST|Paris</tt><br /><tt>Thu 22 Apr 17:34:20 1999 EEST|Berlin</tt><br /><tt>Thu 22 Apr 17:34:221999 EEST|London</tt><br /><tt>Thu 22 Apr 17:34:31 1999 EEST|Berlin</tt><br /><tt>(14 rows)</tt><tt></tt><p><tt>test=>select place, time from test t where time = (select max(s.time) from test s where s.place= t.place) order by place;</tt><br /><tt>place |time</tt><br /><tt>----------------+-----------------------------</tt><br/><tt>Berlin |Thu 22 Apr 17:34:31 1999 EEST</tt><br/><tt>London |Thu 22 Apr 17:34:22 1999 EEST</tt><br /><tt>Madrid |Thu 22 Apr 17:34:14 1999EEST</tt><br /><tt>Paris |Thu 22 Apr 17:34:16 1999 EEST</tt><br /><tt>(4 rows)</tt><tt></tt><p><tt>test=></tt><p>Iuse PostgreSQL 6.5.0 beta1 on Red Hat LINUX 5.2 with 2.2.2 kernel. <p>-- <br/> Best,<br /> George Moga,<br /> george@flex.ro<br /> Braila, ROMANIA <br />
Chris Bitmead <chris.bitmead@bigfoot.com> writes: > Julian Scarfe wrote: >> I'd like an efficient way to pull out the most recent row (i.e. highest >> datatime) belonging to *each* of a number of places selected by a simple >> query. > > Try > SELECT the_place, max(the_time) FROM the_place GROUP BY the_place; But I'll bet he wants the whole row containing the max time, not just the place and time columns. I've run into similar problems and never felt like I had a clean solution, either. You could do something like SELECT * FROM table AS t1 WHERE NOT EXISTS(SELECT * FROM table AS t2 WHERE t2.place = t1.place AND t2.time > t1.time); but this is ugly, and probably horribly inefficient as well. (It might not be unacceptably slow if the table has indexes on place and time, but it sure looks like a brute-force approach.) What you'd really like is something like a SELECT DISTINCT with a user- specifiable row comparison operator; then you'd just "ORDER BY place, time" and make a comparator that discards all but the last row for each place value. Hmm ... a little experimentation suggests that SELECT DISTINCT ON place * FROM table ORDER BY place, time DESC; might do the right thing. It *seems* to select the first row for each value of place. I've never seen a spec for this feature, however, so I'm not sure if it's reliable or not... regards, tom lane
I'd like an efficient way to pull out the most recent row (i.e. highest datatime) belonging to *each* of a number ofplaces selected by a simple query. The "Practical SQL Handbook" has a description of exactly what you are looking for (don't have it handy or I'd give you the page number). They discuss two ways to do it. One uses the HAVING clause with GROUP BY (I think that is the section of the book to look in), but I don't think psql supports this. The other way uses a subselect which is supported by psql. The script at the bottom illustrates some of the ideas. Cheers, Brook =========================================================================== /* -*- C -*-* recent.sql*/ /** find the most recent entry (order) for each group (customer)*/ -- create tables drop sequence invoices_id_seq; drop table invoices; create table invoices (id serial,customer int,order_no int, unique (customer, order_no) ); insert into invoices (customer, order_no) values (1, 1); insert into invoices (customer, order_no) values (1, 2); insert into invoices (customer, order_no) values (1, 3); insert into invoices (customer, order_no) values (2, 1); insert into invoices (customer, order_no) values (2, 2); insert into invoices (customer, order_no) values (3, 1); select * from invoices order by customer, order_no; select * from invoices rwhere order_no = (select max (order_no) from invoices r2 where r.customer = r2.customer and order_no< 3)order by r.customer, r.order_no;
Tom Lane wrote: > SELECT DISTINCT ON place * FROM table ORDER BY place, time DESC; Clever. But why doesn't this work.... select title, summary, time from story t where time = (select max(s.time) from story s GROUP BY s.title); ERROR: parser: Subselect has too many or too few fields.
Chris Bitmead <chris.bitmead@bigfoot.com> writes: > Clever. But why doesn't this work.... > select title, summary, time from story t where time = (select > max(s.time) from story s GROUP BY s.title); > ERROR: parser: Subselect has too many or too few fields. A subselect used in an expression has to return exactly one value; yours will return as many tuples as there are distinct titles. I think you meant select title, summary, time from story t where time = (select max(s.time) from story s WHERE s.title = t.title); Here the subselect should give a single result each time it's executed. Unfortunately, it's gonna be executed once for each tuple scanned by the outer select :-( regards, tom lane
Tom Lane wrote: > > Chris Bitmead <chris.bitmead@bigfoot.com> writes: > > Clever. But why doesn't this work.... > > > select title, summary, time from story t where time = (select > > max(s.time) from story s GROUP BY s.title); > > ERROR: parser: Subselect has too many or too few fields. > > A subselect used in an expression has to return exactly one value; > yours will return as many tuples as there are distinct titles. Why doesn't replacing "=" with "IN" produce a result? It wouldn't be the desired result, but I thought this was legal.
Your script doesn't show how to get the "most recent" rows. The output from the 1st SQL is : id|customer|order_no --+--------+--------1| 1| 12| 1| 23| 1| 34| 2| 15| 2| 26| 3| 1 The output from the 2nd SQL is: id|customer|order_no --+--------+--------2| 1| 25| 2| 26| 3| 1 What are you trying to show here? I think you could just select the highest id, which means the latest data input. Regards, Chai Brook Milligan wrote: > I'd like an efficient way to pull out the most recent row (i.e. highest > datatime) belonging to *each* of a number of places selected by a simple > query. > > The "Practical SQL Handbook" has a description of exactly what you are > looking for (don't have it handy or I'd give you the page number). > They discuss two ways to do it. One uses the HAVING clause with GROUP > BY (I think that is the section of the book to look in), but I don't > think psql supports this. The other way uses a subselect which is > supported by psql. > > The script at the bottom illustrates some of the ideas. > > Cheers, > Brook > > =========================================================================== > /* -*- C -*- > * recent.sql > */ > > /* > * find the most recent entry (order) for each group (customer) > */ > > -- create tables > > drop sequence invoices_id_seq; > drop table invoices; > create table invoices > ( > id serial, > customer int, > order_no int, > > unique (customer, order_no) > ); > > insert into invoices (customer, order_no) values (1, 1); > insert into invoices (customer, order_no) values (1, 2); > insert into invoices (customer, order_no) values (1, 3); > insert into invoices (customer, order_no) values (2, 1); > insert into invoices (customer, order_no) values (2, 2); > insert into invoices (customer, order_no) values (3, 1); > > select * from invoices order by customer, order_no; > > select * from invoices r > where order_no = (select max (order_no) from invoices r2 where r.customer = r2.customer and order_no < 3) > order by r.customer, r.order_no;
Chris Bitmead <chris.bitmead@bigfoot.com> writes: >>>> select title, summary, time from story t where time = (select >>>> max(s.time) from story s GROUP BY s.title); > Why doesn't replacing "=" with "IN" produce a result? It wouldn't be the > desired result, but I thought this was legal. I thought so too (on both counts). Are you saying it doesn't work? What happens? Which version are you using? regards, tom lane
Your script doesn't show how to get the "most recent" rows. True enough, but that's just because of the < 3 condition within the subselect. Take that out and you'll get all the most recent rows (in this example that means largest order_no, but the same works with dates or whatever). What are you trying to show here? The first query was just showing the table, the second the action of the relevant select. I thought this was what you wanted and that you would recognize the effect of the < 3 condition. I just happened to have this example already that illustrated the general idea and expected that you would get the idea. Sorry if I should have been more complete. I think you could just select the highest id, which means the latest data input. In this particular case the id and the order_no column happen to yield the same results (they are both ordered in the same way). That generally won't be the case. But, if you want the highest id within each customer category a simple change to the select below will do it. In any case, to get the largest (or most recent or whatever) anything in each category you need to do a condition involving a subselect with an aggregate. Something like the following will give you the largest (without additional constraint) order_no within each customer. select * from invoices rwhere order_no = (select max (order_no) from invoices r2 where r.customer = r2.customer)order byr.customer, r.order_no; Sorry for the confusion. Cheers, Brook
Tom Lane wrote: > > Chris Bitmead <chris.bitmead@bigfoot.com> writes: > >>>> select title, summary, time from story t where time = (select > >>>> max(s.time) from story s GROUP BY s.title); > > > Why doesn't replacing "=" with "IN" produce a result? It wouldn't be the > > desired result, but I thought this was legal. > > I thought so too (on both counts). Are you saying it doesn't work? > What happens? Which version are you using? httpd=> select title, summary, time from story t where time IN (select max(s.time) from story s GROUP BY s.title); ERROR: parser: Subselect has too many or too few fields. I'm using postgresql-snap-990329.tgz
Tom Lane wrote: > > SELECT DISTINCT ON place * FROM table ORDER BY place, time DESC; > > might do the right thing. It *seems* to select the first row for each > value of place. I've never seen a spec for this feature, however, so > I'm not sure if it's reliable or not... Works a treat when I try it on the "real" database (using 6.4.2). It seems very efficient. I didn't realise that "DISTINCT ON" worked that way with an "ORDER BY". I must say I'm overwhelmed by the variety of constructive responses to my request. Thank you all very much. It beats the hell out of sitting as no 17 in a premium rate telephone queue for support from a vendor of a "commercial" product. -- Julian Scarfe
Tom Lane wrote: > > SELECT DISTINCT ON place * FROM table ORDER BY place, time DESC; > > might do the right thing. It *seems* to select the first row for each > value of place. I've never seen a spec for this feature, however,so > I'm not sure if it's reliable or not... Cool idea. Is DISTINCT guarranteed to choose the first row that matches or can it choose any row? Cheers, Brook
Chris Bitmead <chris.bitmead@bigfoot.com> writes: >>>>>>> select title, summary, time from story t where time = (select >>>>>>> max(s.time) from story s GROUP BY s.title); >> >>>> Why doesn't replacing "=" with "IN" produce a result? It wouldn't be the >>>> desired result, but I thought this was legal. >> >> I thought so too (on both counts). Are you saying it doesn't work? >> What happens? Which version are you using? > httpd=> select title, summary, time from story t where time IN (select > max(s.time) from story s GROUP BY s.title); > ERROR: parser: Subselect has too many or too few fields. > I'm using postgresql-snap-990329.tgz Yeah, I see it too. This looks like a definite bug to me, but I have other bugs to squash right now :-(. Anyone else want to jump on this? regards, tom lane
Brook Milligan <brook@trillium.NMSU.Edu> writes: > Tom Lane wrote: >> >> SELECT DISTINCT ON place * FROM table ORDER BY place, time DESC; >> >> might do the right thing. It *seems* to select the first row for each >> value of place. I've never seen a spec for this feature, however, so >> I'm not sure if it's reliable or not... > Cool idea. Is DISTINCT guarranteed to choose the first row that > matches or can it choose any row? I dunno, that's why I said I was unsure that this method was reliable. By experimentation it seems that Postgres' DISTINCT code works that way, but I have no idea whether the SQL spec mandates it or allows any row within a group to be chosen. I recall now that when I first heard of "SELECT DISTINCT ON field" I objected that the results weren't well-defined (since it's not clear how DISTINCT will choose which tuple to return). It might be that the SQL spec requires the first tuple to be chosen for each value of "field", which would allow the user to control the results by inserting a preceding ORDER BY step --- or skip the ORDER BY, if he doesn't really care which tuple he gets. That'd actually be a pretty cool design. Anyone have an SQL spec handy to check it? regards, tom lane
> Chris Bitmead <chris.bitmead@bigfoot.com> writes: > >>>>>>> select title, summary, time from story t where time = (select > >>>>>>> max(s.time) from story s GROUP BY s.title); > >> > >>>> Why doesn't replacing "=" with "IN" produce a result? It wouldn't be the > >>>> desired result, but I thought this was legal. > >> > >> I thought so too (on both counts). Are you saying it doesn't work? > >> What happens? Which version are you using? > > > httpd=> select title, summary, time from story t where time IN (select > > max(s.time) from story s GROUP BY s.title); > > ERROR: parser: Subselect has too many or too few fields. This is not legal. If you use GROUP BY, the field must be in the target list. In this case, s.title is not in the target list of the subselect. I realize it can't be in the subselect target list because you can only have one column in the target list, but that is the case. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
How to calculate percentages? What is the correct SQL? CREATE TABLE poll (candidate text, votes int4); I want to do something like (this is wrong)... SELECT candidate, votes, (votes / sum(votes)) * 100) AS percent FROM poll; Fred Smith | 500 | 25 Bill Bloggs | 1000 | 50 Jim Jones | 500 | 25 -- Chris Bitmead http://www.bigfoot.com/~chris.bitmead mailto:chris.bitmead@bigfoot.com
I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when you don't use "ON". Like this... SELECT DISTINCT category.image FROM story, category* WHERE story.category = category.oid ORDER BY datetime DESC; The DISTINCT here has no effect on the output other than changing the ordering. There are duplicates! 2nd question: Why does the following query result in duplicates even though I use DISTINCT ON? If I change the ORDER BY to be on image, then there are no duplicates but that isn't what I want. I want the time to be the sort order because I want the X most recent images but only unique ones. Is this a bug? It certainly seems wierd that DISTINCT would return duplicates. Why should it be up to the user to order the output with reference to the DISTINCT clause? Shouldn't the database take care of that? ... SELECT DISTINCT ON image category.image FROM story, category* WHERE story.category = category.oid ORDER BY datetime DESC; image -------------------- /icon/canon.gif /icon/arca-swiss.gif /icon/canon.gif /icon/hasselblad.gif /icon/nikon.gif /icon/olympus.gif (6 rows) -- Chris Bitmead http://www.bigfoot.com/~chris.bitmead mailto:chris.bitmead@bigfoot.com
Chris Bitmead <chris.bitmead@bigfoot.com> writes: > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when > you don't use "ON". Means the equality test is taken across all columns of the output. "ON" restricts the test for "duplicate row" to look at just some of the columns (thus creating the issue of which tuple out of a group of "duplicates" gets through the filter to determine the values of the other columns). > Like this... > SELECT DISTINCT category.image FROM story, category* WHERE > story.category = category.oid ORDER BY datetime DESC; > The DISTINCT here has no effect on the output other than changing the > ordering. There are duplicates! This is a known bug, or at least IMHO it's a bug; it's triggered by your use of ORDER BY with DISTINCT. You need to sayORDER BY category.image, datetime DESC to make the above example work. The reason is that DISTINCT is implemented by a sort followed by an adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix shell programming). When you put on an explicit ORDER BY clause, you override the sort order that the DISTINCT wants, and so the duplicate filter doesn't necessarily spot the duplicates. You have to make sure to sort in an order that will keep the unwanted duplicates together, ie, list all the DISTINCT columns first in the ORDER BY. A straightforward solution would be to apply the user-specified ORDER BY sort *after* the DISTINCT-generated sort and dup-filter steps. I used to think this was the right fix. However, this would break that nifty technique of letting a user-specified ORDER BY resolve the ambiguity of DISTINCT ON --- the user ordering has to happen before the dup-filter for that to work. Alternatively, the system could alter the user's ORDER BY to ensure the DISTINCT columns are the primary sort keys --- ie, silently change your ORDER BY in the above example. I don't think I like that too much either. For one thing, people would complain that the resulting order wasn't what they asked for. For another, it seems possible that there are applications where applying the dup-filter to columns that aren't the primary keys might be useful. Let's see, if you didSELECT DISTINCT category.image FROM story, category* WHEREstory.category= category.oid ORDER BY datetime DESC, category.image; then you'd get each image listed only once *per date* (I think). Doesn't that sound like it could be a useful behavior? So right at the moment, I think the system ought not tinker with the user's ORDER BY. But it probably should emit a warning message if the ORDER BY fails to mention all the columns being DISTINCTed on. If you aren't sorting by a DISTINCT column at all, you won't even get reproducible results, let alone desirable ones. > 2nd question: Why does the following query result in duplicates even > though I use DISTINCT ON? Same deal. regards, tom lane
Chris Bitmead wrote: > How to calculate percentages? What is the correct SQL? > > CREATE TABLE poll (candidate text, votes int4); > > I want to do something like (this is wrong)... > SELECT candidate, votes, (votes / sum(votes)) * 100) AS percent FROM > poll; > > Fred Smith | 500 | 25 > Bill Bloggs | 1000 | 50 > Jim Jones | 500 | 25 > > -- > Chris Bitmead > http://www.bigfoot.com/~chris.bitmead > mailto:chris.bitmead@bigfoot.com I always solve by a temporary table select sum(votes) as sumvotes into tmp from poll; SELECT candidate, votes, (float4(votes)/float4(sumvotes))*100 as percent from poll, tmp; candidate|votes| percent ---------+-----+---------------- Fred | 500|23.8095238804817 Bill | 1000|47.6190477609634 James | 600| 28.571429848671 (3 rows) The problem is now cutting to only to 2 decimal point (ie: 23.80, 47.61, 28.57) and we need some further help. Nuch
Nuchanard Chiannilkulchai ha scritto:
Chris Bitmead wrote:In v6.5 you can use decimal type.> How to calculate percentages? What is the correct SQL?
>
> CREATE TABLE poll (candidate text, votes int4);
>
> I want to do something like (this is wrong)...
> SELECT candidate, votes, (votes / sum(votes)) * 100) AS percent FROM
> poll;
>
> Fred Smith | 500 | 25
> Bill Bloggs | 1000 | 50
> Jim Jones | 500 | 25
>
> --
> Chris Bitmead
> http://www.bigfoot.com/~chris.bitmead
> mailto:chris.bitmead@bigfoot.comI always solve by a temporary table
select sum(votes) as sumvotes into tmp from poll;
SELECT candidate, votes, (float4(votes)/float4(sumvotes))*100
as percent from poll, tmp;
candidate|votes| percent
---------+-----+----------------
Fred | 500|23.8095238804817
Bill | 1000|47.6190477609634
James | 600| 28.571429848671
(3 rows)The problem is now cutting to only to 2 decimal point (ie: 23.80,
47.61, 28.57)
and we need some further help.Nuch
otherwise try this:
select candidate, votes, substr(cast(percents as text),1,position('.' in cast(pe
rcents as text))) ||
substr(cast(percents as text),position('.' in cast(percents as text)) + 1, 2) as
percent from test;
candidate|votes|percent
---------+-----+-------
Fred | 500| 23.80
Bill | 1000| 47.61
James | 600| 28.57
(3 rows)
or this ...
select candidate, votes, format(percents,'###,##') from test;
candidate|votes|format
---------+-----+------
Fred | 500| 23,80
Bill | 1000| 47,61
James | 600| 28,57
(3 rows)
-- NB: Here in Italy we use comma instead of decimal point.
Attached file contains:
format() function
text(float8) function (to cast percents to text)
José
At 07:39 +0300 on 23/04/1999, Chris Bitmead wrote: > httpd=> select title, summary, time from story t where time IN (select > max(s.time) from story s GROUP BY s.title); > ERROR: parser: Subselect has too many or too few fields. Of course it does. Since you group by a field which is not in the select list, Postgres adds it silently to the fields to be selected. Thus the subselect has two fields in each row, not just a single time field. I think the proper syntax by the standard would be something like: SELECT title, summary, timeFROM story tWHERE (time, title) IN ( SELECT s.title, max( s.time ) FROM story s GROUP BYs.title); But I'm not sure Postgres even supports this format (of comparing against several fields). In any case, the best would be to select just one tuple in the subselect and have it return only the time, by constraining the title. But syntactically, the above is what you were trying to do. Anyway, Postgres adds the group field to the query, which the standard requires and common practice doesn't. I think perhaps after doing the grouping, Postgres should drop that field, since it wasn't originally requested. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma
At 19:29 +0300 on 23/04/1999, Tom Lane wrote: > I recall now that when I first heard of "SELECT DISTINCT ON field" > I objected that the results weren't well-defined (since it's not clear > how DISTINCT will choose which tuple to return). It might be that the > SQL spec requires the first tuple to be chosen for each value of > "field", which would allow the user to control the results by inserting > a preceding ORDER BY step --- or skip the ORDER BY, if he doesn't really > care which tuple he gets. That'd actually be a pretty cool design. > Anyone have an SQL spec handy to check it? DISTINCT ON is a non-standard structure. Your initial assertion, that it will return different results in different implementations holds true. Standard SQL should return the same result sets (order not important unless ORDER BY was mentioned). Tables are considered as unordered sets. There can not be anything in the standard that relies on the order in the table. And what do you mean by inserting a preceding ORDER BY step? Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma
Herouth Maoz wrote: > Tables are considered as unordered sets. There can not be anything in the > standard that relies on the order in the table. And what do you mean by > inserting a preceding ORDER BY step? He was suggesting that an ORDER BY combined with a DISTINCT might be a good way of solving the difficult problem of finding the, say, max(field) in different groups in the table. (And it does work for postgresql) Like a SELECT ...GROUP BY except getting back not just the group fields but all fields. You say that a table is "considered an unordered set". But surely a table is not an unordered set if you've specified an ORDER BY clause? This idea is so nice, it would be ashame to dismiss it too quickly. -- Chris Bitmead http://www.bigfoot.com/~chris.bitmead mailto:chris.bitmead@bigfoot.com
At 17:28 +0300 on 29/04/1999, Chris Bitmead wrote: > > You say that a table is "considered an unordered set". But surely a > table is not an unordered set if you've specified an ORDER BY clause? > This idea is so nice, it would be ashame to dismiss it too quickly. Charming, but alas, non standard. Moreover, what if you want to get the maximal value, but order in an ascending in the output? No, if you want compatible SQL, you won't use DISTINCT ON, and you will specify the exact method of picking the distinguished value. Furthermore, ORDER BY has to be a late stage, if it should work on the entire results of unions, for example, or the result of groups. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma
Nice summary of the issue. > Chris Bitmead <chris.bitmead@bigfoot.com> writes: > > I'm a bit confused about DISTINCT. Firstly, what does DISTINCT mean when > > you don't use "ON". > > Means the equality test is taken across all columns of the output. > "ON" restricts the test for "duplicate row" to look at just some > of the columns (thus creating the issue of which tuple out of a > group of "duplicates" gets through the filter to determine the values > of the other columns). > > > Like this... > > SELECT DISTINCT category.image FROM story, category* WHERE > > story.category = category.oid ORDER BY datetime DESC; > > The DISTINCT here has no effect on the output other than changing the > > ordering. There are duplicates! > > This is a known bug, or at least IMHO it's a bug; it's triggered by your > use of ORDER BY with DISTINCT. You need to say > ORDER BY category.image, datetime DESC > to make the above example work. > > The reason is that DISTINCT is implemented by a sort followed by an > adjacent-duplicate-elimination pass (just like "sort | uniq" in Unix > shell programming). When you put on an explicit ORDER BY clause, > you override the sort order that the DISTINCT wants, and so the > duplicate filter doesn't necessarily spot the duplicates. You have > to make sure to sort in an order that will keep the unwanted duplicates > together, ie, list all the DISTINCT columns first in the ORDER BY. > > A straightforward solution would be to apply the user-specified ORDER BY > sort *after* the DISTINCT-generated sort and dup-filter steps. I used > to think this was the right fix. However, this would break that nifty > technique of letting a user-specified ORDER BY resolve the ambiguity of > DISTINCT ON --- the user ordering has to happen before the dup-filter > for that to work. > > Alternatively, the system could alter the user's ORDER BY to ensure the > DISTINCT columns are the primary sort keys --- ie, silently change your > ORDER BY in the above example. I don't think I like that too much > either. For one thing, people would complain that the resulting order > wasn't what they asked for. For another, it seems possible that there > are applications where applying the dup-filter to columns that aren't > the primary keys might be useful. Let's see, if you did > SELECT DISTINCT category.image FROM story, category* WHERE > story.category = category.oid ORDER BY datetime DESC, category.image; > then you'd get each image listed only once *per date* (I think). > Doesn't that sound like it could be a useful behavior? > > So right at the moment, I think the system ought not tinker with the > user's ORDER BY. But it probably should emit a warning message if the > ORDER BY fails to mention all the columns being DISTINCTed on. If you > aren't sorting by a DISTINCT column at all, you won't even get > reproducible results, let alone desirable ones. > > > 2nd question: Why does the following query result in duplicates even > > though I use DISTINCT ON? > > Same deal. > > regards, tom lane > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026