Thread: Re: [SQL] DISTINCT ON: speak now or forever hold your peace
Julian Scarfe <jscarfe@callnetuk.com> writes: >>> I use 'distinct on' to get the most recent reports for each of a group of >>> locations. E.g.: >>> >>> select distinct on location * from reports where location ~~ 'Lond%' >>> order by location, reporttime desc; >>> >>> to get the tuples that offer the most recent reports for each of London, >>> Londonderry, Londy etc. >>> >>> Is there an alternative? >> >> select location, max(report_time) from reports group by location > But I want report_text *too* -- that's the important bit, and it's not an > aggregate or common column in the group. Right. That is the motivation that was given for keeping DISTINCT ON and trying to clean it up: if you ORDER BY the DISTINCT field plus enough other fields to give a unique ordering, and then rely on the (undocumented) implementation property that you'll get the first tuple out of each group with the same DISTINCT field, then you can get useful effects like this one. It's a kluge, and it's not readily extensible to cases such as needing to apply the DISTINCT filter across two fields, but it does solve problems that you can't straightforwardly rewrite into GROUP BY. Tell you what: there are two things that are really bugging me about DISTINCT ON. If I could get consensus on patching these things, I'd be willing to do that rather than yank it. One gripe is the lame-brain syntax: because there is no separator between the DISTINCT ON column name and the first select targetlist item, the DISTINCT ON target cannot be anything more complex than an unadorned column name. For example "SELECT DISTINCT ON A+B+C, ..." must be interpreted as DISTINCT ON A with a first target of +B+C; if we tried to allow more complex expressions then it becomes ambiguous whether that was meant or DISTINCT ON A+B with a first target of +C. To fix this we need some kind of syntactic separator. The cleanest idea that comes to my mind is to require parentheses around the ON target: SELECT DISTINCT ON (expression) target1, ... One immediate advantage of allowing an expression is that you can do distinct-on-two-fields in a rather klugy way, eg SELECT DISTINCT ON (field1 || ' ' || field2) ... We might someday extend it to allow multiple DISTINCT fields, eg, SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... but I'm not promising to do that now (unless it's really easy ;-)). The other gripe is the bad interaction with ORDER BY, as was illustrated in my example of last night: if you ORDER BY something that's not the DISTINCT field, it doesn't work. We could get something that kind of works by applying two sorting passes: first sort by the DISTINCT field, then run the duplicate-elimination filter, then re-sort by the ORDER BY field(s). The trouble with that is that it means the user no longer has control of the sort order within the same-DISTINCT-field-value groups during the first sort phase, so there's no way for him to control which tuple gets kept from each group. What I'd prefer to do is put in an error check that says "if you use both DISTINCT ON and ORDER BY, then the DISTINCT ON expression must be the first ORDER BY item". That way the ORDER BY ordering can be the same one used for the duplicate-elimination pass, and we don't have the semantic problem. Note that Julian's example meets this constraint. (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try to ORDER BY an expression that doesn't appear in the target list. SQL92 avoids the issue by not allowing you to ORDER BY expressions that aren't in the target list, period. We do allow that --- but not when you use DISTINCT. Essentially, I want to enforce that same restriction for DISTINCT ON.) The other piece of the puzzle would be to document that DISTINCT ON keeps the first tuple out of each set with the same DISTINCT ON value. Does that sound like a plan? regards, tom lane
> (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try > to ORDER BY an expression that doesn't appear in the target list. > SQL92 avoids the issue by not allowing you to ORDER BY expressions that > aren't in the target list, period. We do allow that --- but not when > you use DISTINCT. Essentially, I want to enforce that same restriction > for DISTINCT ON.) > > The other piece of the puzzle would be to document that DISTINCT ON > keeps the first tuple out of each set with the same DISTINCT ON value. > > Does that sound like a plan? Yes, very clear. Good. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Tom Lane wrote: > SELECT DISTINCT ON (expression) target1, ... > One immediate advantage of allowing an expression is that you can do > distinct-on-two-fields in a rather klugy way, eg > SELECT DISTINCT ON (field1 || ' ' || field2) ... As long as we're fixing the syntax, I'm wondering if it wouldn't be more logical to have DISTINCT ON somewhere later in the syntax. I'm wondering if that might also avoid the need for () as a side effect. Like this perhaps.... SELECT x, y, z FROM foo WHERE z DISTINCT ON x, y ORDER BY x, y; > What I'd prefer to do is put in an > error check that says "if you use both DISTINCT ON and ORDER BY, then > the DISTINCT ON expression must be the first ORDER BY item". Better, but still a little kludgy. What about a syntax that really supports everything you want? Like maybe... SELECT x, y, z FROM foo DISTINCT ON x, y DESC ORDER BY z ASC; Distinct on now has a similar syntax to the order by clause. What this means is, do the DISTINCT ON test by ordering DESC (so you get the distinct item with the largest value of x, y), and then order the final result by z ascending. Unless I'm missing something that gives everybody what they want. HANG ON, I've got a better idea..... The other alternative is to make DISTINCT ON a bit like GROUP BY. So you would have something like SELECT x, y, max(z) AS mmm FROM foo DISTINCT ON x, y ORDER BY mmm; Like GROUP BY where you group by the non-aggregate fields in the SELECT clause, you use DISTINCT ON, to mention the non-aggregate clauses in the SELECT. The aggregate clause are used to select WHICH of the resulting fields are selected for presentation. This would have the benefit of being more general so you could select, say the city with the longest name in each state... SELECT state, zipcode, longest_string(cityname) FROM cities DISTINCT ON state ORDER BY zipcode; I don't know how far I'm stepping here from the SQL paradigm, but it sure seems cool.... Chris.
Chris Bitmead <chris@bitmead.com> writes: > As long as we're fixing the syntax, I'm wondering if it wouldn't be more > logical to have DISTINCT ON somewhere later in the syntax. Well, SELECT DISTINCT is that way because SQL92 says so. Putting the DISTINCT ON variant somewhere else might be logically purer, but I think it'd be confusing. Also, isn't the reason we have DISTINCT ON at all that it's there to be compatible with MySQL or someone? I figured adding parens would be about the least-surprising variant syntax for a person used to those other products. regards, tom lane
> Chris Bitmead <chris@bitmead.com> writes: > > As long as we're fixing the syntax, I'm wondering if it wouldn't be more > > logical to have DISTINCT ON somewhere later in the syntax. > > Well, SELECT DISTINCT is that way because SQL92 says so. Putting the > DISTINCT ON variant somewhere else might be logically purer, but I think > it'd be confusing. > > Also, isn't the reason we have DISTINCT ON at all that it's there to > be compatible with MySQL or someone? I figured adding parens would be > about the least-surprising variant syntax for a person used to those > other products. > Makes sense to me. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
I saw someone mention here about porting Ars Digita to PostgreSQL. Can they elaborate on this? Is there a web page? When it's done will all versions of Ars Digita work with postgres? How advanced? How long to go?
At 04:41 PM 1/27/00 +1100, Chris Bitmead wrote: > >I saw someone mention here about porting Ars Digita to PostgreSQL. Can >they elaborate on this? Is there a web page? When it's done will all >versions of Ars Digita work with postgres? How advanced? How long to go? Check out http://dsl-dhogaza.pacifier.net:2000 for a snapshot of the work-in-progress. Feel free to poke around and play. Much of it is there. That's my box, it will be moving to a larger box over the next few weeks, enough funding for that popped up, apparently. That version is based on ACS 2.4, mostly ported by me. Ben Adida (an early arsDigita employee now on his own), me and a few others plan to move what we've got to ACS3 over the next few days. We also will be moving to the open source version of AOLserver, which is now in beta (they're using a postgres-like definition of beta, it appears to be very stable). We really need some of the upcoming V7.0 features, like "group by" that doesn't always return a row, avg(numeric), and other bug fixes. We'll love having referential integrity, too. And >8KB query returns will be great, means we can define all text columns as "text" and not worry about selects dying if we select more than one column. We're REALLY eager for TOAST and outer joins :) So, target is: Linux RH6.1 AOLserver 3.0b beta PG V7.0 beta ACS 3.1 By mid-February. We've been using sourceforge but they're quickly getting overloaded, bogged down, and maybe just a teensy bit flaky. We plan to move to start using the software management features of the ACS + CVS + other stuff (later to be moved to the ACS) and move to our own box as soon as we get things up and running. I've manhandled the AOLserver postgres driver, too, and it's now considerably more robust. It will be distributed with whatever comes after AOLserver 3.0b beta. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Serviceand other goodies at http://donb.photo.net.
I wrote: > To fix this we need some kind of syntactic separator. The cleanest > idea that comes to my mind is to require parentheses around the ON > target: > SELECT DISTINCT ON (expression) target1, ... > One immediate advantage of allowing an expression is that you can do > distinct-on-two-fields in a rather klugy way, eg > SELECT DISTINCT ON (field1 || ' ' || field2) ... > We might someday extend it to allow multiple DISTINCT fields, eg, > SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... > but I'm not promising to do that now (unless it's really easy ;-)). FYI, I have committed this change. It did turn out to be just as easy to support multiple DISTINCT ON expressions, so 7.0 will acceptSELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... > What I'd prefer to do is put in an > error check that says "if you use both DISTINCT ON and ORDER BY, then > the DISTINCT ON expression must be the first ORDER BY item". More generally, if you use both then the DISTINCT and ORDER lists must match until one or the other runs out. regards, tom lane