Thread: DISTINCT ON: speak now or forever hold your peace
If I don't hear loud hollers very soon, I'm going to eliminate the DISTINCT ON "feature" for 7.0. As previously discussed, this feature is not standard SQL and has no clear semantic interpretation. I hadn't been planning to touch DISTINCT before 7.0, but changed my mind when I noticed this little gem: create table foo1 (f1 int, f2 int, f3 int); insert into foo1 values(1,2,3); insert into foo1 values(1,2,null); insert into foo1 values(1,null,2); insert into foo1 values(1,2,4); insert into foo1 values(1,2,4); select * from foo1;f1 | f2 | f3 ----+----+---- 1 | 2 | 3 1 | 2 | 1 | | 2 1 | 2 | 4 1 | 2 | 4 (5 rows) select distinct * from foo1;f1 | f2 | f3 ----+----+---- 1 | 2 | 3 1 | 2 | 4 1 | 2 | (3 rows) Didn't know that (NULL, 2) == (2, NULL), did you? The problem here is that nodeUnique is doing a bitwise comparison of the tuple contents (which is bad enough --- not all datatypes think equality is bitwise), and it's neglecting to include the null-field bitmap in what it compares. Rather than just band-aid the null-field problem, I'm going to fix it right. As long as I have to touch it, I'll deal with DISTINCT ON too. regards, tom lane
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. I don't feel overly strongly about this, but if I remember right you can do some pretty cool things with this feature, provided you do define some semantics clearly. Like I think you can find the first tuple (given some ORDER BY clause) that fulfills some criteria. I think it is SELECT DISTINCT ON name name, age ORDER BY age; will get the youngest person. This might not be clearly specified now, but as long as it's useful, how about clearly defining it? I don't know that there is an easy way of doing this in standard SQL. I don't see any problems with useful extensions to SQL. If people want standards, they don't have to use it.
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. > I grepped our source code and found this query: INSERT INTO temp_sales SELECT DISTINCT on key supplysources.supplysource, incharges.supply, targets.target, incharges.saledate, incharges.supplyunit, '', incharges.quantity, incharges.company, incharges.costcntr, 'Replenish', incharges.price, '','','', 0, text(supplysources.supplysource)|| text(incharges.supply)|| text(targets.target)|| text(incharges.saledate) as key FROM supplysources, incharges, targets WHERE supplysources.warehouse = incharges.warehouse AND (targets.site,targets.area) = (incharges.site,incharges.area); What happens is that a large charges file which is transferred to a mainframe ERP application is first brought into PostgreSQL. Depending upon certain race conditions, duplicate "sales" records can appear in the data file. We use DISTINCT ON to pick (as you point out) an arbitrary record when duplicates appear. I suppose we could do a DELETE ... WHERE NOT EXISTS after the import. Using DISTINCT ON just saves a step. I don't have any arguments beyond the grounds that we're using it in existing code as a duplicate record filter - :-( Just FYI, Mike Mascari
At 13:41 25/01/00 +1100, Chris Bitmead wrote: >Tom Lane wrote: >> >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > >I don't feel overly strongly about this, but if I remember right you can >do some pretty cool things with this feature, provided you do define >some semantics clearly. Like I think you can find the first tuple >(given some ORDER BY clause) that fulfills some criteria. I think it is > >SELECT DISTINCT ON name name, age ORDER BY age; > >will get the youngest person. This might not be clearly specified now, >but >as long as it's useful, how about clearly defining it? I don't know that >there is an easy way of doing this in standard SQL. I don't know about PGSQL, but in other systems, I use: Select <whatever> from <wherever> order by age asc limit to 1 row; I *think* the PGSQL syntax is: Select <whatever> from <wherever> order by age asc limit 1; I have no idea if the optimizer does 'fast first' optimizations, so I don't know how quick this would be on a large table. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote: > >I don't feel overly strongly about this, but if I remember right you can > >do some pretty cool things with this feature, provided you do define > >some semantics clearly. Like I think you can find the first tuple > >(given some ORDER BY clause) that fulfills some criteria. I think it is > > > >SELECT DISTINCT ON name name, age ORDER BY age; > > > >will get the youngest person. This might not be clearly specified now, > >but > >as long as it's useful, how about clearly defining it? I don't know that > >there is an easy way of doing this in standard SQL. > > I don't know about PGSQL, but in other systems, I use: > > Select <whatever> from <wherever> order by age asc limit to 1 row; > > I *think* the PGSQL syntax is: > > Select <whatever> from <wherever> order by age asc limit 1; I think what I really meant was... SELECT DISTINCT ON firstname firstname, age ORDER BY age. Which would find the youngest person called "fred", the youngest person called "paul", the youngest person called "jim" etc etc. which your limit example wouldn't do.
> INSERT INTO temp_sales > SELECT DISTINCT on key supplysources.supplysource, > incharges.supply, targets.target, incharges.saledate, > incharges.supplyunit, '', incharges.quantity, incharges.company, > incharges.costcntr, 'Replenish', incharges.price, '','','', 0, > text(supplysources.supplysource)|| > text(incharges.supply)|| > text(targets.target)|| > text(incharges.saledate) as key > FROM supplysources, incharges, targets WHERE > supplysources.warehouse = incharges.warehouse AND > (targets.site,targets.area) = (incharges.site,incharges.area); > > What happens is that a large charges file which is transferred to > a mainframe ERP application is first brought into PostgreSQL. > Depending upon certain race conditions, duplicate "sales" records > can appear in the data file. We use DISTINCT ON to pick (as you > point out) an arbitrary record when duplicates appear. I suppose > we could do a DELETE ... WHERE NOT EXISTS after the import. Using > DISTINCT ON just saves a step. I don't have any arguments beyond > the grounds that we're using it in existing code as a duplicate > record filter - :-( Why not just throw a NOTICE and keep the feature. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Chris Bitmead <chris@bitmead.com> writes: > Tom Lane wrote: >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > I don't feel overly strongly about this, but if I remember right you can > do some pretty cool things with this feature, provided you do define > some semantics clearly. We did talk about that, but I didn't hear any strong support for doing it, as opposed to pulling the feature completely... in particular, I didn't hear anyone volunteering to do the work... > as long as it's useful, how about clearly defining it? I don't know that > there is an easy way of doing this in standard SQL. I don't see any > problems with useful extensions to SQL. The only reason it came to my notice in the first place was people posting questions asking why they weren't getting the results they expected from it (whatever the heck those were; they weren't what you actually get from the current implementation, anyway). The problem with a poorly-specified nonstandard feature is support costs: you have to document it, answer questions about it, keep it working, etc. In this case we'd also have to define how it should work and alter the existing code to produce reasonable and predictable results. The existing code is not merely unpredictable, it is definitely broken. For example: regression=# select q1,q2 from int8_tbl; q1 | q2 ------------------+------------------- 123 | 456 123 | 45678901234567894567890123456789| 1234567890123456789 | 45678901234567894567890123456789 | -4567890123456789 (5 rows) regression=# select distinct on q1 q1,q2 from int8_tbl; q1 | q2 ------------------+----- 123 | 4564567890123456789 | 123 (2 rows) -- OK so far, but: regression=# select distinct on q1 q1,q2 from int8_tbl order by q2; q1 | q2 ------------------+-------------------4567890123456789 | -4567890123456789 123 | 4564567890123456789| 4567890123456789 (3 rows) -- which is not "distinct on q1" by my notions... In short, it's not clear to me that supporting DISTINCT ON is a good use of our limited resources. I'm willing to pull it out, but not to fix it. Does someone else want to take responsibility for it? regards, tom lane
> In short, it's not clear to me that supporting DISTINCT ON is a good use > of our limited resources. I'm willing to pull it out, but not to fix it. > Does someone else want to take responsibility for it? OK, we can disable it and put it on the TODO list. It is already there. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Hi, This morning when I was inserting data to a table, after 15 minutes running, I received this ERROR: cannot flush block 9782 of tbs_billing_record to stable store. What does it mean? Regards, Chai
Tom Lane wrote: > Chris Bitmead <chris@bitmead.com> writes: > > Tom Lane wrote: > >> If I don't hear loud hollers very soon, I'm going to eliminate the > >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature > >> is not standard SQL and has no clear semantic interpretation. > > > I don't feel overly strongly about this, but if I remember right you can > > do some pretty cool things with this feature, provided you do define > > some semantics clearly. > > We did talk about that, but I didn't hear any strong support for doing > it, as opposed to pulling the feature completely... in particular, > I didn't hear anyone volunteering to do the work... I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it happened again last night, FATAL 1: VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2 ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT ... INTO ... I can get rid of them. A few table renames later I'm back in business. So how do I do this without DISTINCT? Better would be of course to get rid of the BlowawayRelationBuffers problem. I cannot reproduce it yet, but I think I can make it happen by running vacuum and updating tables that are linked by triggers. Does this sound plausible? And yes, I'm running on Alpha, i.e. 64 bits. Adriaan
Adriaan Joubert <a.joubert@albourne.com> writes: > I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it > happened again last night, > FATAL 1: VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2 > ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT > ... INTO ... I can get rid of them. A few table renames later I'm back in > business. So how do I do this without DISTINCT? I didn't say anything about eliminating SELECT DISTINCT; I was talking about SELECT DISTINCT ON, which is a whole 'nother thing. > Better would be of course to get rid of the BlowawayRelationBuffers problem. Quite so. I think this is fixed for 7.0 ... that error message indicates a buffer refcount leak, and I fixed several problems along that line a couple months ago. We shall see if I got 'em all or not ... regards, tom lane
Tom Lane wrote: > > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. I use 'distinct on' to get the most recent reports for each of a group of locations. E.g.: create table reports ( location varchar(16), report_time datetime, report_text text); select distinct on location * from reports where location ~~ 'Lond%' order by location, reporttime desc; to get the tuples that offer the most recent reports for each of London, Londonderry, Londy etc. Is there an alternative? Julian Scarfe
Re: [SQL] Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace
From
Peter Eisentraut
Date:
On Mon, 24 Jan 2000, Bruce Momjian wrote: > > In short, it's not clear to me that supporting DISTINCT ON is a good use > > of our limited resources. I'm willing to pull it out, but not to fix it. > > Does someone else want to take responsibility for it? > > OK, we can disable it and put it on the TODO list. It is already there. > Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all cases? -- Peter Eisentraut Sernanders vaeg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
On Tue, 25 Jan 2000, Julian Scarfe wrote: > Tom Lane wrote: > > > > If I don't hear loud hollers very soon, I'm going to eliminate the > > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > > is not standard SQL and has no clear semantic interpretation. > > I use 'distinct on' to get the most recent reports for each of a group of > locations. E.g.: > > create table reports ( > location varchar(16), > report_time datetime, > report_text text); > > select distinct on location * from reports where location ~~ 'Lond%' order by > location, reporttime desc; > > to get the tuples that offer the most recent reports for each of London, > Londonderry, Londy etc. > > Is there an alternative? select location, max(report_time) from reports group by location This also has the negligible advantage that it gives you determinate results. -- Peter Eisentraut Sernanders vaeg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
on 25/1/00 10:59, Peter Eisentraut at e99re41@DoCS.UU.SE wrote: > On Tue, 25 Jan 2000, Julian Scarfe wrote: >> I use 'distinct on' to get the most recent reports for each of a group of >> locations. E.g.: >> >> create table reports ( >> location varchar(16), >> report_time datetime, >> report_text text); >> >> select distinct on location * from reports where location ~~ 'Lond%' order by >> location, reporttime desc;<pgsql-hackers@postgreSQL.org> >> >> to get the tuples that offer the most recent reports for each of London, >> Londonderry, Londy etc. >> >> Is there an alternative? > > select location, max(report_time) from reports group by location > > This also has the negligible advantage that it gives you determinate > results. But I want report_text *too* -- that's the important bit, and it's not an aggregate or common column in the group. So: > Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all > cases? I don't think so. It has to be something like: select * from reports as r1 where r1.report_time = (select max(report_time) from reports as r2 where r2.location = r1.location) and r1.location ~~ 'Lond%'; However, to my surprise, that's as quick as doing the 'distinct on'. host=> explain select * from reports as r1 where r1.report_time = (select max(report_time) from reports as r2 where r2.location = r1.location) and r1.location ~~ 'Lond%'; NOTICE: QUERY PLAN: Index Scan using reports_by_location_issuetime on reports r1 (cost=186.24 size=2 width=334) SubPlan -> Aggregate (cost=3.62 size=0 width=0) -> Index Scan using reports_by_location_report_timeon reports r2 (cost=3.62 size=13 width=8) EXPLAIN host=> explain select distinct on location * from reports where location ~~ 'Lond%' order by location, report_time desc; NOTICE: QUERY PLAN: Unique (cost=186.24 size=0 width=0) -> Sort (cost=186.24 size=0 width=0) -> Index Scan using reports_by_location_issuetimeon reports (cost=186.24 size=2 width=334) I'm surprise that the "reports_by_location_issuetime" index got used, since it involves the issuetime field which is not the same as the report_time field. I edited it and numerous other fields from the quoted 'create table' for simplicity. There's also a "reports_by_location_report_time" which indexes on, obviously, (location, report_time). Comments? Julian Scarfe, learning...
Julian Scarfe <jscarfe@callnetuk.com> writes: >>> I use 'distinct on' to get the most recent reports for each of a group of >>> locations. E.g.: >>> >>> select distinct on location * from reports where location ~~ 'Lond%' >>> order by location, reporttime desc; >>> >>> to get the tuples that offer the most recent reports for each of London, >>> Londonderry, Londy etc. >>> >>> Is there an alternative? >> >> select location, max(report_time) from reports group by location > But I want report_text *too* -- that's the important bit, and it's not an > aggregate or common column in the group. Right. That is the motivation that was given for keeping DISTINCT ON and trying to clean it up: if you ORDER BY the DISTINCT field plus enough other fields to give a unique ordering, and then rely on the (undocumented) implementation property that you'll get the first tuple out of each group with the same DISTINCT field, then you can get useful effects like this one. It's a kluge, and it's not readily extensible to cases such as needing to apply the DISTINCT filter across two fields, but it does solve problems that you can't straightforwardly rewrite into GROUP BY. Tell you what: there are two things that are really bugging me about DISTINCT ON. If I could get consensus on patching these things, I'd be willing to do that rather than yank it. One gripe is the lame-brain syntax: because there is no separator between the DISTINCT ON column name and the first select targetlist item, the DISTINCT ON target cannot be anything more complex than an unadorned column name. For example "SELECT DISTINCT ON A+B+C, ..." must be interpreted as DISTINCT ON A with a first target of +B+C; if we tried to allow more complex expressions then it becomes ambiguous whether that was meant or DISTINCT ON A+B with a first target of +C. To fix this we need some kind of syntactic separator. The cleanest idea that comes to my mind is to require parentheses around the ON target: SELECT DISTINCT ON (expression) target1, ... One immediate advantage of allowing an expression is that you can do distinct-on-two-fields in a rather klugy way, eg SELECT DISTINCT ON (field1 || ' ' || field2) ... We might someday extend it to allow multiple DISTINCT fields, eg, SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... but I'm not promising to do that now (unless it's really easy ;-)). The other gripe is the bad interaction with ORDER BY, as was illustrated in my example of last night: if you ORDER BY something that's not the DISTINCT field, it doesn't work. We could get something that kind of works by applying two sorting passes: first sort by the DISTINCT field, then run the duplicate-elimination filter, then re-sort by the ORDER BY field(s). The trouble with that is that it means the user no longer has control of the sort order within the same-DISTINCT-field-value groups during the first sort phase, so there's no way for him to control which tuple gets kept from each group. What I'd prefer to do is put in an error check that says "if you use both DISTINCT ON and ORDER BY, then the DISTINCT ON expression must be the first ORDER BY item". That way the ORDER BY ordering can be the same one used for the duplicate-elimination pass, and we don't have the semantic problem. Note that Julian's example meets this constraint. (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try to ORDER BY an expression that doesn't appear in the target list. SQL92 avoids the issue by not allowing you to ORDER BY expressions that aren't in the target list, period. We do allow that --- but not when you use DISTINCT. Essentially, I want to enforce that same restriction for DISTINCT ON.) The other piece of the puzzle would be to document that DISTINCT ON keeps the first tuple out of each set with the same DISTINCT ON value. Does that sound like a plan? regards, tom lane
> (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try > to ORDER BY an expression that doesn't appear in the target list. > SQL92 avoids the issue by not allowing you to ORDER BY expressions that > aren't in the target list, period. We do allow that --- but not when > you use DISTINCT. Essentially, I want to enforce that same restriction > for DISTINCT ON.) > > The other piece of the puzzle would be to document that DISTINCT ON > keeps the first tuple out of each set with the same DISTINCT ON value. > > Does that sound like a plan? Yes, very clear. Good. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On 2000-01-25, Chris Bitmead mentioned: > I think what I really meant was... > > SELECT DISTINCT ON firstname firstname, age ORDER BY age. > > Which would find the youngest person called "fred", the youngest person > called "paul", the youngest person called "jim" etc etc. which your > limit example wouldn't do. SELECT firstname, min(age) FROM x GROUP BY firstname [ ORDER BY min(age) ] -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
On 2000-01-24, Tom Lane mentioned: > If I don't hear loud hollers very soon, I'm going to eliminate the > DISTINCT ON "feature" for 7.0. As previously discussed, this feature > is not standard SQL and has no clear semantic interpretation. Our documents say that DISTINCT ON is equivalent to GROUP BY. I still don't see why that wouldn't be true. You can always rewrite select distinct on a a,b from test as select a, xxx(b) from test group by a where xxx is some aggregate function (presumably min or max). You can also rewrite select distinct on a a,b,c from test as select a, b, c from test group by a, b, c or using some aggregates here as well. At least you can control your results that way. -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes: > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still > don't see why that wouldn't be true. You can always rewrite > select distinct on a a,b from test > as > select a, xxx(b) from test group by a > where xxx is some aggregate function (presumably min or max). Not really. Look at Julian's example. He can't rewrite as select a, min(b), min(c) from test group by a because the idea is to get the c that corresponds to the min b. If you do it with two independent aggregates then the b and c you get back may be from different tuples. I could imagine fixing this with a two-input aggregate, say select a, min(b), keyofmin(b, c) from test group by a where keyofmin is defined to return the c associated with the min b. But that'd be a pain to implement, first because we have no support for multi-argument aggregates, and second because you'd need a ton of separate keyofmin implementations for the cross-product of the data types you might want to deal with. So this is nearly as klugy as the SELECT DISTINCT ON approach --- and not any more standard, either. regards, tom lane
At 19:34 26/01/00 +0100, Peter Eisentraut wrote: >On 2000-01-24, Tom Lane mentioned: > >> If I don't hear loud hollers very soon, I'm going to eliminate the >> DISTINCT ON "feature" for 7.0. As previously discussed, this feature >> is not standard SQL and has no clear semantic interpretation. > >Our documents say that DISTINCT ON is equivalent to GROUP BY. I still >don't see why that wouldn't be true. You can always rewrite > >select distinct on a a,b from test > >as > >select a, xxx(b) from test group by a > >where xxx is some aggregate function (presumably min or max). > >You can also rewrite > >select distinct on a a,b,c from test > >as > >select a, b, c from test group by a, b, c > >or using some aggregates here as well. At least you can control your >results that way. I only learned about DISTINCT ON in this discussion, but my impression is that it has one advantage over GROUP BY, i that it produces a 'consistent' tuple. ie. the tuple it produces is guaranteed to exist in the database, whereas using GROUP BY with aggregates will not produce a 'real' row: f1|f2|f3 -------- 1 2 3 1 3 2 2 3 1 2 1 3 3 1 2 3 2 1 'select distinct on f1 f1,f2,f3 from <somewhere> order by f1, f2, f3' will produce (I think): f1|f2|f3 -------- 1 2 3 2 1 3 3 1 2 ...where each resulting tuple actually exists in the DB. I can't see a way of doing this with aggreagates unless a 'FIRST' function is defined, but that (as already discussed) will not obey the 'order by' clause. Maybe an 'ANY' aggregate would do it, but then it would not always produce the same results. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Tom Lane wrote: > SELECT DISTINCT ON (expression) target1, ... > One immediate advantage of allowing an expression is that you can do > distinct-on-two-fields in a rather klugy way, eg > SELECT DISTINCT ON (field1 || ' ' || field2) ... As long as we're fixing the syntax, I'm wondering if it wouldn't be more logical to have DISTINCT ON somewhere later in the syntax. I'm wondering if that might also avoid the need for () as a side effect. Like this perhaps.... SELECT x, y, z FROM foo WHERE z DISTINCT ON x, y ORDER BY x, y; > What I'd prefer to do is put in an > error check that says "if you use both DISTINCT ON and ORDER BY, then > the DISTINCT ON expression must be the first ORDER BY item". Better, but still a little kludgy. What about a syntax that really supports everything you want? Like maybe... SELECT x, y, z FROM foo DISTINCT ON x, y DESC ORDER BY z ASC; Distinct on now has a similar syntax to the order by clause. What this means is, do the DISTINCT ON test by ordering DESC (so you get the distinct item with the largest value of x, y), and then order the final result by z ascending. Unless I'm missing something that gives everybody what they want. HANG ON, I've got a better idea..... The other alternative is to make DISTINCT ON a bit like GROUP BY. So you would have something like SELECT x, y, max(z) AS mmm FROM foo DISTINCT ON x, y ORDER BY mmm; Like GROUP BY where you group by the non-aggregate fields in the SELECT clause, you use DISTINCT ON, to mention the non-aggregate clauses in the SELECT. The aggregate clause are used to select WHICH of the resulting fields are selected for presentation. This would have the benefit of being more general so you could select, say the city with the longest name in each state... SELECT state, zipcode, longest_string(cityname) FROM cities DISTINCT ON state ORDER BY zipcode; I don't know how far I'm stepping here from the SQL paradigm, but it sure seems cool.... Chris.
Chris Bitmead <chris@bitmead.com> writes: > As long as we're fixing the syntax, I'm wondering if it wouldn't be more > logical to have DISTINCT ON somewhere later in the syntax. Well, SELECT DISTINCT is that way because SQL92 says so. Putting the DISTINCT ON variant somewhere else might be logically purer, but I think it'd be confusing. Also, isn't the reason we have DISTINCT ON at all that it's there to be compatible with MySQL or someone? I figured adding parens would be about the least-surprising variant syntax for a person used to those other products. regards, tom lane
> Chris Bitmead <chris@bitmead.com> writes: > > As long as we're fixing the syntax, I'm wondering if it wouldn't be more > > logical to have DISTINCT ON somewhere later in the syntax. > > Well, SELECT DISTINCT is that way because SQL92 says so. Putting the > DISTINCT ON variant somewhere else might be logically purer, but I think > it'd be confusing. > > Also, isn't the reason we have DISTINCT ON at all that it's there to > be compatible with MySQL or someone? I figured adding parens would be > about the least-surprising variant syntax for a person used to those > other products. > Makes sense to me. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
I saw someone mention here about porting Ars Digita to PostgreSQL. Can they elaborate on this? Is there a web page? When it's done will all versions of Ars Digita work with postgres? How advanced? How long to go?
At 04:41 PM 1/27/00 +1100, Chris Bitmead wrote: > >I saw someone mention here about porting Ars Digita to PostgreSQL. Can >they elaborate on this? Is there a web page? When it's done will all >versions of Ars Digita work with postgres? How advanced? How long to go? Check out http://dsl-dhogaza.pacifier.net:2000 for a snapshot of the work-in-progress. Feel free to poke around and play. Much of it is there. That's my box, it will be moving to a larger box over the next few weeks, enough funding for that popped up, apparently. That version is based on ACS 2.4, mostly ported by me. Ben Adida (an early arsDigita employee now on his own), me and a few others plan to move what we've got to ACS3 over the next few days. We also will be moving to the open source version of AOLserver, which is now in beta (they're using a postgres-like definition of beta, it appears to be very stable). We really need some of the upcoming V7.0 features, like "group by" that doesn't always return a row, avg(numeric), and other bug fixes. We'll love having referential integrity, too. And >8KB query returns will be great, means we can define all text columns as "text" and not worry about selects dying if we select more than one column. We're REALLY eager for TOAST and outer joins :) So, target is: Linux RH6.1 AOLserver 3.0b beta PG V7.0 beta ACS 3.1 By mid-February. We've been using sourceforge but they're quickly getting overloaded, bogged down, and maybe just a teensy bit flaky. We plan to move to start using the software management features of the ACS + CVS + other stuff (later to be moved to the ACS) and move to our own box as soon as we get things up and running. I've manhandled the AOLserver postgres driver, too, and it's now considerably more robust. It will be distributed with whatever comes after AOLserver 3.0b beta. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
I wrote: > To fix this we need some kind of syntactic separator. The cleanest > idea that comes to my mind is to require parentheses around the ON > target: > SELECT DISTINCT ON (expression) target1, ... > One immediate advantage of allowing an expression is that you can do > distinct-on-two-fields in a rather klugy way, eg > SELECT DISTINCT ON (field1 || ' ' || field2) ... > We might someday extend it to allow multiple DISTINCT fields, eg, > SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... > but I'm not promising to do that now (unless it's really easy ;-)). FYI, I have committed this change. It did turn out to be just as easy to support multiple DISTINCT ON expressions, so 7.0 will acceptSELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... > What I'd prefer to do is put in an > error check that says "if you use both DISTINCT ON and ORDER BY, then > the DISTINCT ON expression must be the first ORDER BY item". More generally, if you use both then the DISTINCT and ORDER lists must match until one or the other runs out. regards, tom lane
On 2000-01-26, Tom Lane mentioned: > Peter Eisentraut <peter_e@gmx.net> writes: > > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still > > don't see why that wouldn't be true. You can always rewrite > > select distinct on a a,b from test > > as > > select a, xxx(b) from test group by a > > where xxx is some aggregate function (presumably min or max). > > Not really. Look at Julian's example. He can't rewrite as > > select a, min(b), min(c) from test group by a > > because the idea is to get the c that corresponds to the min b. I see. I believe what you want is this: select one.a, two.b, two.c from (select a, min(b) as "min_b" from test group by a) as one, (select b, c from test) as two where one."min_b" = two.b Not sure if this is completely legal as it stands but at least the idea would be to join the grouped select with the plain one to get the c corresponding to the minimum b. But of course we don't offer that, so it's distinct on until then. (It would really surprise me if the distinct on functionality was not at all possible to emulate using SQL, since in my experience it is fairly complete with regards to querying options at least.) -- Peter Eisentraut Sernanders väg 10:115 peter_e@gmx.net 75262 Uppsala http://yi.org/peter-e/ Sweden
At 23:28 27/01/00 +0100, Peter Eisentraut wrote: > >select one.a, two.b, two.c >from > (select a, min(b) as "min_b" from test group by a) as one, > (select b, c from test) as two >where one."min_b" = two.b > >Not sure if this is completely legal as it stands but at least the idea >would be to join the grouped select with the plain one to get the c >corresponding to the minimum b. But of course we don't offer that, so it's >distinct on until then. (It would really surprise me if the distinct on >functionality was not at all possible to emulate using SQL, since in my >experience it is fairly complete with regards to querying options at >least.) You are quite right - with a complete SQL impleentation, DISTINCT ON becomes superfluous. Although it may give the optimizer usefull hints as to how to approach the query. You actually have to be a bit more carefull to avoid duplicates, something like: Select a,b,c >From (Select a, min(b) from test group by a) as one(a,min_b), (Select b, c from test t2 where t2.a = one.a limit to1 row) as two This is legal on the DB I use most of the time. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.C.N. 008 659 498) | /(@) ______---_ Tel: +61-03-5367 7422 | _________ \ Fax: +61-03-5367 7430 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/