Thread: DISTINCT ON: speak now or forever hold your peace
If I don't hear loud hollers very soon, I'm going to eliminate the DISTINCT ON "feature" for 7.0. As previously discussed, this feature is not standard SQL and has no clear semantic interpretation. I hadn't been planning to touch DISTINCT before 7.0, but changed my mind when I noticed this little gem: create table foo1 (f1 int, f2 int, f3 int); insert into foo1 values(1,2,3); insert into foo1 values(1,2,null); insert into foo1 values(1,null,2); insert into foo1 values(1,2,4); insert into foo1 values(1,2,4); select * from foo1;f1 | f2 | f3 ----+----+---- 1 | 2 | 3 1 | 2 | 1 | | 2 1 | 2 | 4 1 | 2 | 4 (5 rows) select distinct * from foo1;f1 | f2 | f3 ----+----+---- 1 | 2 | 3 1 | 2 | 4 1 | 2 | (3 rows) Didn't know that (NULL, 2) == (2, NULL), did you? The problem here is that nodeUnique is doing a bitwise comparison of the tuple contents (which is bad enough --- not all datatypes think equality is bitwise), and it's neglecting to include the null-field bitmap in what it compares. Rather than just band-aid the null-field problem, I'm going to fix it right. As long as I have to touch it, I'll deal with DISTINCT ON too. regards, tom lane
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. I don't feel overly strongly about this, but if I remember right you can do some pretty cool things with this feature, provided you do define some semantics clearly. Like I think you can find the first tuple (given some ORDER BY clause) that fulfills some criteria. I think it is SELECT DISTINCT ON name name, age ORDER BY age; will get the youngest person. This might not be clearly specified now, but as long as it's useful, how about clearly defining it? I don't know that there is an easy way of doing this in standard SQL. I don't see any problems with useful extensions to SQL. If people want standards, they don't have to use it.
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. > I grepped our source code and found this query: INSERT INTO temp_sales SELECT DISTINCT on key supplysources.supplysource, incharges.supply, targets.target, incharges.saledate, incharges.supplyunit, '', incharges.quantity, incharges.company, incharges.costcntr, 'Replenish', incharges.price, '','','', 0, text(supplysources.supplysource)|| text(incharges.supply)|| text(targets.target)|| text(incharges.saledate) as key FROM supplysources, incharges, targets WHERE supplysources.warehouse = incharges.warehouse AND (targets.site,targets.area) = (incharges.site,incharges.area); What happens is that a large charges file which is transferred to a mainframe ERP application is first brought into PostgreSQL. Depending upon certain race conditions, duplicate "sales" records can appear in the data file. We use DISTINCT ON to pick (as you point out) an arbitrary record when duplicates appear. I suppose we could do a DELETE ... WHERE NOT EXISTS after the import. Using DISTINCT ON just saves a step. I don't have any arguments beyond the grounds that we're using it in existing code as a duplicate record filter - :-( Just FYI, Mike Mascari
At 13:41 25/01/00 +1100, Chris Bitmead wrote: >Tom Lane wrote: >> >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > >I don't feel overly strongly about this, but if I remember right you can >do some pretty cool things with this feature, provided you do define >some semantics clearly. Like I think you can find the first tuple >(given some ORDER BY clause) that fulfills some criteria. I think it is > >SELECT DISTINCT ON name name, age ORDER BY age; > >will get the youngest person. This might not be clearly specified now, >but >as long as it's useful, how about clearly defining it? I don't know that >there is an easy way of doing this in standard SQL. I don't know about PGSQL, but in other systems, I use: Select <whatever> from <wherever> order by age asc limit to 1 row; I *think* the PGSQL syntax is: Select <whatever> from <wherever> order by age asc limit 1; I have no idea if the optimizer does 'fast first' optimizations, so I don't know how quick this would be on a large table. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote: > >I don't feel overly strongly about this, but if I remember right you can > >do some pretty cool things with this feature, provided you do define > >some semantics clearly. Like I think you can find the first tuple > >(given some ORDER BY clause) that fulfills some criteria. I think it is > > > >SELECT DISTINCT ON name name, age ORDER BY age; > > > >will get the youngest person. This might not be clearly specified now, > >but > >as long as it's useful, how about clearly defining it? I don't know that > >there is an easy way of doing this in standard SQL. > > I don't know about PGSQL, but in other systems, I use: > > Select <whatever> from <wherever> order by age asc limit to 1 row; > > I *think* the PGSQL syntax is: > > Select <whatever> from <wherever> order by age asc limit 1; I think what I really meant was... SELECT DISTINCT ON firstname firstname, age ORDER BY age. Which would find the youngest person called "fred", the youngest person called "paul", the youngest person called "jim" etc etc. which your limit example wouldn't do.
> INSERT INTO temp_sales > SELECT DISTINCT on key supplysources.supplysource, > incharges.supply, targets.target, incharges.saledate, > incharges.supplyunit, '', incharges.quantity, incharges.company, > incharges.costcntr, 'Replenish', incharges.price, '','','', 0, > text(supplysources.supplysource)|| > text(incharges.supply)|| > text(targets.target)|| > text(incharges.saledate) as key > FROM supplysources, incharges, targets WHERE > supplysources.warehouse = incharges.warehouse AND > (targets.site,targets.area) = (incharges.site,incharges.area); > > What happens is that a large charges file which is transferred to > a mainframe ERP application is first brought into PostgreSQL. > Depending upon certain race conditions, duplicate "sales" records > can appear in the data file. We use DISTINCT ON to pick (as you > point out) an arbitrary record when duplicates appear. I suppose > we could do a DELETE ... WHERE NOT EXISTS after the import. Using > DISTINCT ON just saves a step. I don't have any arguments beyond > the grounds that we're using it in existing code as a duplicate > record filter - :-( Why not just throw a NOTICE and keep the feature. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Chris Bitmead <chris@bitmead.com> writes: > Tom Lane wrote: >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > I don't feel overly strongly about this, but if I remember right you can > do some pretty cool things with this feature, provided you do define > some semantics clearly. We did talk about that, but I didn't hear any strong support for doing it, as opposed to pulling the feature completely... in particular, I didn't hear anyone volunteering to do the work... > as long as it's useful, how about clearly defining it? I don't know that > there is an easy way of doing this in standard SQL. I don't see any > problems with useful extensions to SQL. The only reason it came to my notice in the first place was people posting questions asking why they weren't getting the results they expected from it (whatever the heck those were; they weren't what you actually get from the current implementation, anyway). The problem with a poorly-specified nonstandard feature is support costs: you have to document it, answer questions about it, keep it working, etc. In this case we'd also have to define how it should work and alter the existing code to produce reasonable and predictable results. The existing code is not merely unpredictable, it is definitely broken. For example: regression=# select q1,q2 from int8_tbl; q1 | q2 ------------------+------------------- 123 | 456 123 | 45678901234567894567890123456789| 1234567890123456789 | 45678901234567894567890123456789 | -4567890123456789 (5 rows) regression=# select distinct on q1 q1,q2 from int8_tbl; q1 | q2 ------------------+----- 123 | 4564567890123456789 | 123 (2 rows) -- OK so far, but: regression=# select distinct on q1 q1,q2 from int8_tbl order by q2; q1 | q2 ------------------+-------------------4567890123456789 | -4567890123456789 123 | 4564567890123456789| 4567890123456789 (3 rows) -- which is not "distinct on q1" by my notions... In short, it's not clear to me that supporting DISTINCT ON is a good use of our limited resources. I'm willing to pull it out, but not to fix it. Does someone else want to take responsibility for it? regards, tom lane
> In short, it's not clear to me that supporting DISTINCT ON is a good use > of our limited resources. I'm willing to pull it out, but not to fix it. > Does someone else want to take responsibility for it? OK, we can disable it and put it on the TODO list. It is already there. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Hi, This morning when I was inserting data to a table, after 15 minutes running, I received this ERROR: cannot flush block 9782 of tbs_billing_record to stable store. What does it mean? Regards, Chai
At 14:09 25/01/00 +1100, Chris Bitmead wrote: >Philip Warner wrote: >> >I don't feel overly strongly about this, but if I remember right you can >> >do some pretty cool things with this feature, provided you do define >> >some semantics clearly. Like I think you can find the first tuple >> >(given some ORDER BY clause) that fulfills some criteria. I think it is >> > >> >SELECT DISTINCT ON name name, age ORDER BY age; >> > >> >will get the youngest person. This might not be clearly specified now, >> >but >> >as long as it's useful, how about clearly defining it? I don't know that >> >there is an easy way of doing this in standard SQL. >> >> I don't know about PGSQL, but in other systems, I use: >> >> Select <whatever> from <wherever> order by age asc limit to 1 row; >> >> I *think* the PGSQL syntax is: >> >> Select <whatever> from <wherever> order by age asc limit 1; > >I think what I really meant was... > >SELECT DISTINCT ON firstname firstname, age ORDER BY age. > >Which would find the youngest person called "fred", the youngest person >called "paul", the youngest person called "jim" etc etc. which your >limit example wouldn't do. Just a thought, but would I be right in saying that this could be easily done with the addition of a new aggregate function 'FIRST', which simply returns the first value sent to it? Since the aggregates operate a row at a time, you are guaranteed a consistent set of values, I think. eg. SELECT firstname, FIRST(age) ORDER BY age. Just an idea, but it seems like a cute solution for the more general problem. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > Just a thought, but would I be right in saying that this could be easily > done with the addition of a new aggregate function 'FIRST', which simply > returns the first value sent to it? Since the aggregates operate a row at > a time, you are guaranteed a consistent set of values, I think. No, because the system doesn't guarantee to deliver tuples to the aggregate in any particular order. The physical order of tuples on the disk will depend on the order in which they were last updated (plus random perturbations introduced by VACUUM); and the order in which they are processed by a query will depend on the query plan chosen by the planner (sequential scan vs. indexscan, etc). SQL in general doesn't believe that tuple ordering has any semantic significance --- you can ask for ORDER BY, but that's only honored at the end stage of SELECT result delivery, not necessarily anywhere in the bowels of a query. Also, if you are going to use an aggregate, ISTM you might as well use one like MIN() or MAX() and get a fully predictable result. I suppose a FIRST() aggregate would save a few cycles by not running comparisons, but I bet it'd be tough to notice any difference. regards, tom lane
At 12:01 AM 1/25/00 -0500, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >> Just a thought, but would I be right in saying that this could be easily >> done with the addition of a new aggregate function 'FIRST', which simply >> returns the first value sent to it? Since the aggregates operate a row at >> a time, you are guaranteed a consistent set of values, I think. > >No, because the system doesn't guarantee to deliver tuples to the >aggregate in any particular order. In fact, this is a KEY notion behind RDBMS systems...queries return an unordered set, pure and simple, unless you supply an "order" clause (though "group by" appears to be implemented by ordering in various RDBMS systems, that's an efficiency hack not a given AFAIK). >SQL in general doesn't believe that tuple ordering has any semantic >significance --- you can ask for ORDER BY, but that's only honored >at the end stage of SELECT result delivery, not necessarily anywhere >in the bowels of a query. Yep... - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
At 21:20 24/01/00 -0800, Don Baccus wrote: >At 12:01 AM 1/25/00 -0500, Tom Lane wrote: >>Philip Warner <pjw@rhyme.com.au> writes: >>> Just a thought, but would I be right in saying that this could be easily >>> done with the addition of a new aggregate function 'FIRST', which simply >>> returns the first value sent to it? Since the aggregates operate a row at >>> a time, you are guaranteed a consistent set of values, I think. >> >>No, because the system doesn't guarantee to deliver tuples to the >>aggregate in any particular order. > >In fact, this is a KEY notion behind RDBMS systems...queries return an >unordered set, pure and simple, unless you supply an "order" clause >(though "group by" appears to be implemented by ordering in various >RDBMS systems, that's an efficiency hack not a given AFAIK). That's why there was an 'order by' in the query (there should have also been a 'group by', to make it valid SQL). I had (naievely) assumed that the aggregates would be passed the rows in sorted order. >>SQL in general doesn't believe that tuple ordering has any semantic >>significance --- you can ask for ORDER BY, but that's only honored >>at the end stage of SELECT result delivery, not necessarily anywhere >>in the bowels of a query. > >Yep... > Out of curiosity, does the SQL spec give any rules or guidelines about when aggregates should be applied to resultant rows? Or is it one of the implementation-dependant things? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote: > >In fact, this is a KEY notion behind RDBMS systems...queries return an > >unordered set, pure and simple, unless you supply an "order" clause > >(though "group by" appears to be implemented by ordering in various > >RDBMS systems, that's an efficiency hack not a given AFAIK). On second thoughts, maybe I agree DISTINCT ON should be dumped. In my example, if I wanted the youngest people with a given name sorted oldest first things would be confusing. It's an inconsistant command, I can see that now.
Philip Warner <pjw@rhyme.com.au> writes: >>> SQL in general doesn't believe that tuple ordering has any semantic >>> significance --- you can ask for ORDER BY, but that's only honored >>> at the end stage of SELECT result delivery, not necessarily anywhere >>> in the bowels of a query. > Out of curiosity, does the SQL spec give any rules or guidelines about when > aggregates should be applied to resultant rows? Or is it one of the > implementation-dependant things? Well, SQL's conceptual model is perfectly clear about the processing work flow of a SELECT: after you've constructed the tuple set (which might involve joining multiple relations), you apply the WHERE condition to filter out uninteresting tuples. Then you apply GROUP BY (if given) to divvy the tuples into groups. Then you apply HAVING to eliminate uninteresting groups. Then you apply aggregate functions (if any) to individual groups, or to the whole filtered result set if no groups. Finally you apply ORDER BY to whatever's left. Note that at no point except the final output is there any notion of the tuples being generated or processed in a particular order. Given appropriate hardware, much of this could be done in parallel. Parallel or not, an implementation is free to choose the processing order for its convenience. regards, tom lane
Tom Lane wrote: > Chris Bitmead <chris@bitmead.com> writes: > > Tom Lane wrote: > >> If I don't hear loud hollers very soon, I'm going to eliminate the > >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature > >> is not standard SQL and has no clear semantic interpretation. > > > I don't feel overly strongly about this, but if I remember right you can > > do some pretty cool things with this feature, provided you do define > > some semantics clearly. > > We did talk about that, but I didn't hear any strong support for doing > it, as opposed to pulling the feature completely... in particular, > I didn't hear anyone volunteering to do the work... I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it happened again last night, FATAL 1: VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2 ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT ... INTO ... I can get rid of them. A few table renames later I'm back in business. So how do I do this without DISTINCT? Better would be of course to get rid of the BlowawayRelationBuffers problem. I cannot reproduce it yet, but I think I can make it happen by running vacuum and updating tables that are linked by triggers. Does this sound plausible? And yes, I'm running on Alpha, i.e. 64 bits. Adriaan
Adriaan Joubert <a.joubert@albourne.com> writes: > I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it > happened again last night, > FATAL 1: VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2 > ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT > ... INTO ... I can get rid of them. A few table renames later I'm back in > business. So how do I do this without DISTINCT? I didn't say anything about eliminating SELECT DISTINCT; I was talking about SELECT DISTINCT ON, which is a whole 'nother thing. > Better would be of course to get rid of the BlowawayRelationBuffers problem. Quite so. I think this is fixed for 7.0 ... that error message indicates a buffer refcount leak, and I fixed several problems along that line a couple months ago. We shall see if I got 'em all or not ... regards, tom lane
At 01:12 25/01/00 -0500, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >>>> SQL in general doesn't believe that tuple ordering has any semantic >>>> significance --- you can ask for ORDER BY, but that's only honored >>>> at the end stage of SELECT result delivery, not necessarily anywhere >>>> in the bowels of a query. > >> Out of curiosity, does the SQL spec give any rules or guidelines about when >> aggregates should be applied to resultant rows? Or is it one of the >> implementation-dependant things? > >Well, SQL's conceptual model is perfectly clear about the processing >work flow of a SELECT: after you've constructed the tuple set (which >might involve joining multiple relations), you apply the WHERE condition >to filter out uninteresting tuples. Then you apply GROUP BY (if given) >to divvy the tuples into groups. Then you apply HAVING to eliminate >uninteresting groups. Then you apply aggregate functions (if any) to >individual groups, or to the whole filtered result set if no groups. >Finally you apply ORDER BY to whatever's left. Thanks, but now I'm confused. I would have thought that aggregates went *before* the having clause, since at least one DB I know allows: select job_type,avg(age) from <wherever> where <stuff> group by job_type having avg(age) > 70; ie. the use of aggregate results in the 'having' clause. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > Thanks, but now I'm confused. I would have thought that aggregates went > *before* the having clause, Arrgh. You are right of course. Shouldn't pontificate late at night... regards, tom lane
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. I use 'distinct on' to get the most recent reports for each of a group of locations. E.g.: create table reports ( location varchar(16), report_time datetime, report_text text); select distinct on location * from reports where location ~~ 'Lond%' order by location, reporttime desc; to get the tuples that offer the most recent reports for each of London, Londonderry, Londy etc. Is there an alternative? Julian Scarfe
On Mon, Jan 24, 2000 at 10:55:02PM -0500, Bruce Momjian wrote: > > In short, it's not clear to me that supporting DISTINCT ON is a good use > > of our limited resources. I'm willing to pull it out, but not to fix it. > > Does someone else want to take responsibility for it? > > OK, we can disable it and put it on the TODO list. It is already there. Or, as you suggested earlier, throw a NOTICE instead of disabling it. But then I doubt Tom talked about physically removing the code anyway. Disabling certainly is enough. Michael -- Michael Meskes | Go SF 49ers! Th.-Heuss-Str. 61, D-41812 Erkelenz | Go Rhein Fire! Tel.: (+49) 2431/72651 | Use Debian GNU/Linux! Email: Michael@Fam-Meskes.De | Use PostgreSQL!
Re: [SQL] Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace
From
Peter Eisentraut
Date:
On Mon, 24 Jan 2000, Bruce Momjian wrote: > > In short, it's not clear to me that supporting DISTINCT ON is a good use > > of our limited resources. I'm willing to pull it out, but not to fix it. > > Does someone else want to take responsibility for it? > > OK, we can disable it and put it on the TODO list. It is already there. > Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all cases? -- Peter Eisentraut Sernanders vaeg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
On Tue, 25 Jan 2000, Julian Scarfe wrote: > Tom Lane wrote: > > > > If I don't hear loud hollers very soon, I'm going to eliminate the > > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > > is not standard SQL and has no clear semantic interpretation. > > I use 'distinct on' to get the most recent reports for each of a group of > locations. E.g.: > > create table reports ( > location varchar(16), > report_time datetime, > report_text text); > > select distinct on location * from reports where location ~~ 'Lond%' order by > location, reporttime desc; > > to get the tuples that offer the most recent reports for each of London, > Londonderry, Londy etc. > > Is there an alternative? select location, max(report_time) from reports group by location This also has the negligible advantage that it gives you determinate results. -- Peter Eisentraut Sernanders vaeg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
At 05:00 PM 1/25/00 +1100, Philip Warner wrote: >Out of curiosity, does the SQL spec give any rules or guidelines about when >aggregates should be applied to resultant rows? Or is it one of the >implementation-dependant things? Well...my copy of Date's now actually in the mail, rather than in Boston, but at the moment that's not much help! "order by" happens after everything, AFAIK. For instance... select ... union select ... order by orders the result of the union, which pretty much implies that aggregates will happen first. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
On 2000-01-25, Chris Bitmead mentioned: > I think what I really meant was... > > SELECT DISTINCT ON firstname firstname, age ORDER BY age. > > Which would find the youngest person called "fred", the youngest person > called "paul", the youngest person called "jim" etc etc. which your > limit example wouldn't do. SELECT firstname, min(age) FROM x GROUP BY firstname [ ORDER BY min(age) ] -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
On 2000-01-24, Tom Lane mentioned: > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. Our documents say that DISTINCT ON is equivalent to GROUP BY. I still don't see why that wouldn't be true. You can always rewrite select distinct on a a,b from test as select a, xxx(b) from test group by a where xxx is some aggregate function (presumably min or max). You can also rewrite select distinct on a a,b,c from test as select a, b, c from test group by a, b, c or using some aggregates here as well. At least you can control your results that way. -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes: > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still > don't see why that wouldn't be true. You can always rewrite > select distinct on a a,b from test > as > select a, xxx(b) from test group by a > where xxx is some aggregate function (presumably min or max). Not really. Look at Julian's example. He can't rewrite as select a, min(b), min(c) from test group by a because the idea is to get the c that corresponds to the min b. If you do it with two independent aggregates then the b and c you get back may be from different tuples. I could imagine fixing this with a two-input aggregate, say select a, min(b), keyofmin(b, c) from test group by a where keyofmin is defined to return the c associated with the min b. But that'd be a pain to implement, first because we have no support for multi-argument aggregates, and second because you'd need a ton of separate keyofmin implementations for the cross-product of the data types you might want to deal with. So this is nearly as klugy as the SELECT DISTINCT ON approach --- and not any more standard, either. regards, tom lane
At 19:34 26/01/00 +0100, Peter Eisentraut wrote: >On 2000-01-24, Tom Lane mentioned: > >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > >Our documents say that DISTINCT ON is equivalent to GROUP BY. I still >don't see why that wouldn't be true. You can always rewrite > >select distinct on a a,b from test > >as > >select a, xxx(b) from test group by a > >where xxx is some aggregate function (presumably min or max). > >You can also rewrite > >select distinct on a a,b,c from test > >as > >select a, b, c from test group by a, b, c > >or using some aggregates here as well. At least you can control your >results that way. I only learned about DISTINCT ON in this discussion, but my impression is that it has one advantage over GROUP BY, i that it produces a 'consistent' tuple. ie. the tuple it produces is guaranteed to exist in the database, whereas using GROUP BY with aggregates will not produce a 'real' row: f1|f2|f3 -------- 1 2 3 1 3 2 2 3 1 2 1 3 3 1 2 3 2 1 'select distinct on f1 f1,f2,f3 from <somewhere> order by f1, f2, f3' will produce (I think): f1|f2|f3 -------- 1 2 3 2 1 3 3 1 2 ...where each resulting tuple actually exists in the DB. I can't see a way of doing this with aggreagates unless a 'FIRST' function is defined, but that (as already discussed) will not obey the 'order by' clause. Maybe an 'ANY' aggregate would do it, but then it would not always produce the same results. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
On 2000-01-26, Tom Lane mentioned: > Peter Eisentraut <peter_e@gmx.net> writes: > > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still > > don't see why that wouldn't be true. You can always rewrite > > select distinct on a a,b from test > > as > > select a, xxx(b) from test group by a > > where xxx is some aggregate function (presumably min or max). > > Not really. Look at Julian's example. He can't rewrite as > > select a, min(b), min(c) from test group by a > > because the idea is to get the c that corresponds to the min b. I see. I believe what you want is this: select one.a, two.b, two.c from (select a, min(b) as "min_b" from test group by a) as one, (select b, c from test) as two where one."min_b" = two.b Not sure if this is completely legal as it stands but at least the idea would be to join the grouped select with the plain one to get the c corresponding to the minimum b. But of course we don't offer that, so it's distinct on until then. (It would really surprise me if the distinct on functionality was not at all possible to emulate using SQL, since in my experience it is fairly complete with regards to querying options at least.) -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
At 23:28 27/01/00 +0100, Peter Eisentraut wrote: > >select one.a, two.b, two.c >from > (select a, min(b) as "min_b" from test group by a) as one, > (select b, c from test) as two >where one."min_b" = two.b > >Not sure if this is completely legal as it stands but at least the idea >would be to join the grouped select with the plain one to get the c >corresponding to the minimum b. But of course we don't offer that, so it's >distinct on until then. (It would really surprise me if the distinct on >functionality was not at all possible to emulate using SQL, since in my >experience it is fairly complete with regards to querying options at >least.) You are quite right - with a complete SQL impleentation, DISTINCT ON becomes superfluous. Although it may give the optimizer usefull hints as to how to approach the query. You actually have to be a bit more carefull to avoid duplicates, something like: Select a,b,c >From (Select a, min(b) from test group by a) as one(a,min_b), (Select b, c from test t2 where t2.a = one.a limit to1 row) as two This is legal on the DB I use most of the time. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/