Thread: Re: [SQL] DISTINCT ON: speak now or forever hold your peace

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 15:17:56

Julian Scarfe <jscarfe@callnetuk.com> writes:
>>> I use 'distinct on' to get the most recent reports for each of a group of
>>> locations.  E.g.:
>>> 
>>> select distinct on location * from reports where location ~~ 'Lond%'
>>> order by location, reporttime desc;
>>> 
>>> to get the tuples that offer the most recent reports for each of London,
>>> Londonderry, Londy etc.
>>> 
>>> Is there an alternative?
>> 
>> select location, max(report_time) from reports group by location

> But I want report_text *too* -- that's the important bit, and it's not an
> aggregate or common column in the group.

Right.  That is the motivation that was given for keeping DISTINCT ON
and trying to clean it up: if you ORDER BY the DISTINCT field plus
enough other fields to give a unique ordering, and then rely on the
(undocumented) implementation property that you'll get the first tuple
out of each group with the same DISTINCT field, then you can get useful
effects like this one.

It's a kluge, and it's not readily extensible to cases such as needing to
apply the DISTINCT filter across two fields, but it does solve problems
that you can't straightforwardly rewrite into GROUP BY.

Tell you what: there are two things that are really bugging me about
DISTINCT ON.  If I could get consensus on patching these things,
I'd be willing to do that rather than yank it.

One gripe is the lame-brain syntax: because there is no separator
between the DISTINCT ON column name and the first select targetlist
item, the DISTINCT ON target cannot be anything more complex than an
unadorned column name.  For example "SELECT DISTINCT ON A+B+C, ..."
must be interpreted as DISTINCT ON A with a first target of +B+C;
if we tried to allow more complex expressions then it becomes ambiguous
whether that was meant or DISTINCT ON A+B with a first target of +C.
To fix this we need some kind of syntactic separator.  The cleanest
idea that comes to my mind is to require parentheses around the ON
target:    SELECT DISTINCT ON (expression) target1, ...
One immediate advantage of allowing an expression is that you can do
distinct-on-two-fields in a rather klugy way, eg    SELECT DISTINCT ON (field1 || ' ' || field2) ...
We might someday extend it to allow multiple DISTINCT fields, eg,    SELECT DISTINCT ON (expr1 [, expr2 ...]) target1,
...
but I'm not promising to do that now (unless it's really easy ;-)).

The other gripe is the bad interaction with ORDER BY, as was illustrated
in my example of last night: if you ORDER BY something that's not the
DISTINCT field, it doesn't work.  We could get something that kind of
works by applying two sorting passes: first sort by the DISTINCT field,
then run the duplicate-elimination filter, then re-sort by the ORDER BY
field(s).  The trouble with that is that it means the user no longer has
control of the sort order within the same-DISTINCT-field-value groups
during the first sort phase, so there's no way for him to control which
tuple gets kept from each group.  What I'd prefer to do is put in an
error check that says "if you use both DISTINCT ON and ORDER BY, then
the DISTINCT ON expression must be the first ORDER BY item".  That way
the ORDER BY ordering can be the same one used for the
duplicate-elimination pass, and we don't have the semantic problem.
Note that Julian's example meets this constraint.

(BTW, ordinary SELECT DISTINCT has this same sort of problem if you try
to ORDER BY an expression that doesn't appear in the target list.
SQL92 avoids the issue by not allowing you to ORDER BY expressions that
aren't in the target list, period.  We do allow that --- but not when
you use DISTINCT.  Essentially, I want to enforce that same restriction
for DISTINCT ON.)

The other piece of the puzzle would be to document that DISTINCT ON
keeps the first tuple out of each set with the same DISTINCT ON value.

Does that sound like a plan?
        regards, tom lane

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

25 January 2000, 20:44:55

> (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try
> to ORDER BY an expression that doesn't appear in the target list.
> SQL92 avoids the issue by not allowing you to ORDER BY expressions that
> aren't in the target list, period.  We do allow that --- but not when
> you use DISTINCT.  Essentially, I want to enforce that same restriction
> for DISTINCT ON.)
> 
> The other piece of the puzzle would be to document that DISTINCT ON
> keeps the first tuple out of each set with the same DISTINCT ON value.
> 
> Does that sound like a plan?

Yes, very clear.  Good.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

26 January 2000, 20:57:14

Tom Lane wrote:
>      SELECT DISTINCT ON (expression) target1, ...
> One immediate advantage of allowing an expression is that you can do
> distinct-on-two-fields in a rather klugy way, eg
>                 SELECT DISTINCT ON (field1 || ' ' || field2) ...

As long as we're fixing the syntax, I'm wondering if it wouldn't be more
logical to have DISTINCT ON somewhere later in the syntax. I'm wondering
if that might also avoid the need for () as a side effect. Like this
perhaps....

SELECT x, y, z FROM foo WHERE z DISTINCT ON x, y ORDER BY x, y;

> What I'd prefer to do is put in an
> error check that says "if you use both DISTINCT ON and ORDER BY, then
> the DISTINCT ON expression must be the first ORDER BY item".  

Better, but still a little kludgy. What about a syntax that really
supports everything you want? Like maybe...

SELECT x, y, z FROM foo DISTINCT ON x, y DESC ORDER BY z ASC;

Distinct on now has a similar syntax to the order by clause. What this
means is, do the DISTINCT ON test by ordering DESC (so you get the
distinct item with the largest value of x, y), and then order the final
result by z ascending.

Unless I'm missing something that gives everybody what they want.

HANG ON, I've got a better idea.....

The other alternative is to make DISTINCT ON a bit like GROUP BY. So you
would have
something like

SELECT x, y, max(z) AS mmm FROM foo DISTINCT ON x, y ORDER BY mmm;

Like GROUP BY where you group by the non-aggregate fields in the SELECT
clause, you use DISTINCT ON, to mention the non-aggregate clauses in the
SELECT. The aggregate clause are used to select WHICH of the resulting
fields are selected for presentation. This would have the benefit of
being more general so you could select, say the city with the longest
name in each state...

SELECT state, zipcode, longest_string(cityname) FROM cities DISTINCT ON
state ORDER BY zipcode;

I don't know how far I'm stepping here from the SQL paradigm, but it
sure seems cool....

Chris.

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

26 January 2000, 21:06:14

Chris Bitmead <chris@bitmead.com> writes:
> As long as we're fixing the syntax, I'm wondering if it wouldn't be more
> logical to have DISTINCT ON somewhere later in the syntax.

Well, SELECT DISTINCT is that way because SQL92 says so.  Putting the
DISTINCT ON variant somewhere else might be logically purer, but I think
it'd be confusing.

Also, isn't the reason we have DISTINCT ON at all that it's there to
be compatible with MySQL or someone?  I figured adding parens would be
about the least-surprising variant syntax for a person used to those
other products.
        regards, tom lane

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

26 January 2000, 21:33:13

> Chris Bitmead <chris@bitmead.com> writes:
> > As long as we're fixing the syntax, I'm wondering if it wouldn't be more
> > logical to have DISTINCT ON somewhere later in the syntax.
> 
> Well, SELECT DISTINCT is that way because SQL92 says so.  Putting the
> DISTINCT ON variant somewhere else might be logically purer, but I think
> it'd be confusing.
> 
> Also, isn't the reason we have DISTINCT ON at all that it's there to
> be compatible with MySQL or someone?  I figured adding parens would be
> about the least-surprising variant syntax for a person used to those
> other products.
> 

Makes sense to me.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Ars Digita and PostgreSQL

From

Chris Bitmead

Date:

27 January 2000, 03:44:04

I saw someone mention here about porting Ars Digita to PostgreSQL. Can
they elaborate on this? Is there a web page? When it's done will all
versions of Ars Digita work with postgres? How advanced? How long to go?

Re: [HACKERS] Ars Digita and PostgreSQL

From

Don Baccus

Date:

27 January 2000, 14:46:15

At 04:41 PM 1/27/00 +1100, Chris Bitmead wrote:
>
>I saw someone mention here about porting Ars Digita to PostgreSQL. Can
>they elaborate on this? Is there a web page? When it's done will all
>versions of Ars Digita work with postgres? How advanced? How long to go?

Check out http://dsl-dhogaza.pacifier.net:2000 for a snapshot of the
work-in-progress.  Feel free to poke around and play.  Much of it is
there.

That's my box, it will be moving to a larger box over the next 
few weeks, enough funding for that popped up, apparently.

That version is based on ACS 2.4, mostly ported by me.  Ben Adida
(an early arsDigita employee now on his own), me and a few others
plan to move what we've got to ACS3 over the next few days.

We also will be moving to the open source version of AOLserver,
which is now in beta (they're using a postgres-like definition
of beta, it appears to be very stable).

We really need some of the upcoming V7.0 features, like "group
by" that doesn't always return a row, avg(numeric), and other
bug fixes.  We'll love having referential integrity, too.  And
>8KB query returns will be great, means we can define all text
columns as "text" and not worry about selects dying if we select
more than one column.  We're REALLY eager for TOAST and outer
joins :)

So, target is:

Linux RH6.1
AOLserver 3.0b beta
PG V7.0 beta
ACS 3.1

By mid-February.  

We've been using sourceforge but they're quickly getting overloaded,
bogged down, and maybe just a teensy bit flaky.  We plan to move to
start using the software management features of the ACS + CVS + other
stuff (later to be moved to the ACS) and move to our own box as soon
as we get things up and running.

I've manhandled the AOLserver postgres driver, too, and it's now
considerably more robust.  It will be distributed with whatever
comes after AOLserver 3.0b beta.

- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

27 January 2000, 16:34:17

I wrote:
> To fix this we need some kind of syntactic separator.  The cleanest
> idea that comes to my mind is to require parentheses around the ON
> target:
>         SELECT DISTINCT ON (expression) target1, ...
> One immediate advantage of allowing an expression is that you can do
> distinct-on-two-fields in a rather klugy way, eg
>         SELECT DISTINCT ON (field1 || ' ' || field2) ...
> We might someday extend it to allow multiple DISTINCT fields, eg,
>         SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ...
> but I'm not promising to do that now (unless it's really easy ;-)).

FYI, I have committed this change.  It did turn out to be just as easy
to support multiple DISTINCT ON expressions, so 7.0 will acceptSELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ...

> What I'd prefer to do is put in an
> error check that says "if you use both DISTINCT ON and ORDER BY, then
> the DISTINCT ON expression must be the first ORDER BY item".

More generally, if you use both then the DISTINCT and ORDER lists must
match until one or the other runs out.
        regards, tom lane