Re: [SQL] DISTINCT ON: speak now or forever hold your peace - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [SQL] DISTINCT ON: speak now or forever hold your peace |
Date | |
Msg-id | 13995.948820606@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your
peace
Re: [SQL] DISTINCT ON: speak now or forever hold your peace |
List | pgsql-hackers |
Julian Scarfe <jscarfe@callnetuk.com> writes: >>> I use 'distinct on' to get the most recent reports for each of a group of >>> locations. E.g.: >>> >>> select distinct on location * from reports where location ~~ 'Lond%' >>> order by location, reporttime desc; >>> >>> to get the tuples that offer the most recent reports for each of London, >>> Londonderry, Londy etc. >>> >>> Is there an alternative? >> >> select location, max(report_time) from reports group by location > But I want report_text *too* -- that's the important bit, and it's not an > aggregate or common column in the group. Right. That is the motivation that was given for keeping DISTINCT ON and trying to clean it up: if you ORDER BY the DISTINCT field plus enough other fields to give a unique ordering, and then rely on the (undocumented) implementation property that you'll get the first tuple out of each group with the same DISTINCT field, then you can get useful effects like this one. It's a kluge, and it's not readily extensible to cases such as needing to apply the DISTINCT filter across two fields, but it does solve problems that you can't straightforwardly rewrite into GROUP BY. Tell you what: there are two things that are really bugging me about DISTINCT ON. If I could get consensus on patching these things, I'd be willing to do that rather than yank it. One gripe is the lame-brain syntax: because there is no separator between the DISTINCT ON column name and the first select targetlist item, the DISTINCT ON target cannot be anything more complex than an unadorned column name. For example "SELECT DISTINCT ON A+B+C, ..." must be interpreted as DISTINCT ON A with a first target of +B+C; if we tried to allow more complex expressions then it becomes ambiguous whether that was meant or DISTINCT ON A+B with a first target of +C. To fix this we need some kind of syntactic separator. The cleanest idea that comes to my mind is to require parentheses around the ON target: SELECT DISTINCT ON (expression) target1, ... One immediate advantage of allowing an expression is that you can do distinct-on-two-fields in a rather klugy way, eg SELECT DISTINCT ON (field1 || ' ' || field2) ... We might someday extend it to allow multiple DISTINCT fields, eg, SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ... but I'm not promising to do that now (unless it's really easy ;-)). The other gripe is the bad interaction with ORDER BY, as was illustrated in my example of last night: if you ORDER BY something that's not the DISTINCT field, it doesn't work. We could get something that kind of works by applying two sorting passes: first sort by the DISTINCT field, then run the duplicate-elimination filter, then re-sort by the ORDER BY field(s). The trouble with that is that it means the user no longer has control of the sort order within the same-DISTINCT-field-value groups during the first sort phase, so there's no way for him to control which tuple gets kept from each group. What I'd prefer to do is put in an error check that says "if you use both DISTINCT ON and ORDER BY, then the DISTINCT ON expression must be the first ORDER BY item". That way the ORDER BY ordering can be the same one used for the duplicate-elimination pass, and we don't have the semantic problem. Note that Julian's example meets this constraint. (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try to ORDER BY an expression that doesn't appear in the target list. SQL92 avoids the issue by not allowing you to ORDER BY expressions that aren't in the target list, period. We do allow that --- but not when you use DISTINCT. Essentially, I want to enforce that same restriction for DISTINCT ON.) The other piece of the puzzle would be to document that DISTINCT ON keeps the first tuple out of each set with the same DISTINCT ON value. Does that sound like a plan? regards, tom lane
pgsql-hackers by date: