Thread: DISTINCT ON: speak now or forever hold your peace

DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

24 January 2000, 20:50:41

If I don't hear loud hollers very soon, I'm going to eliminate the
DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
is not standard SQL and has no clear semantic interpretation.

I hadn't been planning to touch DISTINCT before 7.0, but changed my
mind when I noticed this little gem:

create table foo1 (f1 int, f2 int, f3 int);
insert into foo1 values(1,2,3);
insert into foo1 values(1,2,null);
insert into foo1 values(1,null,2);
insert into foo1 values(1,2,4);
insert into foo1 values(1,2,4);

select * from foo1;f1 | f2 | f3
----+----+---- 1 |  2 |  3 1 |  2 | 1 |    |  2 1 |  2 |  4 1 |  2 |  4
(5 rows)

select distinct * from foo1;f1 | f2 | f3
----+----+---- 1 |  2 |  3 1 |  2 |  4 1 |  2 |
(3 rows)

Didn't know that (NULL, 2) == (2, NULL), did you?  The problem here
is that nodeUnique is doing a bitwise comparison of the tuple contents
(which is bad enough --- not all datatypes think equality is bitwise),
and it's neglecting to include the null-field bitmap in what it
compares.  Rather than just band-aid the null-field problem, I'm
going to fix it right.  As long as I have to touch it, I'll deal
with DISTINCT ON too.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

24 January 2000, 21:44:42

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

I don't feel overly strongly about this, but if I remember right you can
do some pretty cool things with this feature, provided you do define
some semantics clearly. Like I think you can find the first tuple
(given some ORDER BY clause) that fulfills some criteria. I think it is

SELECT DISTINCT ON name name, age ORDER BY age;

will get the youngest person. This might not be clearly specified now,
but
as long as it's useful, how about clearly defining it? I don't know that
there is an easy way of doing this in standard SQL. I don't see any
problems with useful extensions to SQL. If people want standards, they
don't have to use it.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Mike Mascari

Date:

24 January 2000, 21:55:42

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.
> 

I grepped our source code and found this query:

INSERT INTO temp_sales
SELECT DISTINCT on key supplysources.supplysource, 
incharges.supply, targets.target, incharges.saledate, 
incharges.supplyunit, '', incharges.quantity, incharges.company, 
incharges.costcntr, 'Replenish', incharges.price, '','','', 0, 
text(supplysources.supplysource)||
text(incharges.supply)||
text(targets.target)||
text(incharges.saledate) as key 
FROM supplysources, incharges, targets WHERE 
supplysources.warehouse = incharges.warehouse AND 
(targets.site,targets.area) = (incharges.site,incharges.area);

What happens is that a large charges file which is transferred to
a mainframe ERP application is first brought into PostgreSQL.
Depending upon certain race conditions, duplicate "sales" records
can appear in the data file. We use DISTINCT ON to pick (as you
point out) an arbitrary record when duplicates appear. I suppose
we could do a DELETE ... WHERE NOT EXISTS after the import. Using
DISTINCT ON just saves a step. I don't have any arguments beyond
the grounds that we're using it in existing code as a duplicate
record filter - :-(

Just FYI, 

Mike Mascari

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

24 January 2000, 22:04:45

At 13:41 25/01/00 +1100, Chris Bitmead wrote:
>Tom Lane wrote:
>> 
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.
>
>I don't feel overly strongly about this, but if I remember right you can
>do some pretty cool things with this feature, provided you do define
>some semantics clearly. Like I think you can find the first tuple
>(given some ORDER BY clause) that fulfills some criteria. I think it is
>
>SELECT DISTINCT ON name name, age ORDER BY age;
>
>will get the youngest person. This might not be clearly specified now,
>but
>as long as it's useful, how about clearly defining it? I don't know that
>there is an easy way of doing this in standard SQL. 

I don't know about PGSQL, but in other systems, I use:
   Select <whatever> from <wherever> order by age asc limit to 1 row;

I *think* the PGSQL syntax is:
   Select <whatever> from <wherever> order by age asc limit 1;

I have no idea if the optimizer does 'fast first' optimizations, so I don't
know how quick this would be on a large table.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

24 January 2000, 22:12:42

Philip Warner wrote:
> >I don't feel overly strongly about this, but if I remember right you can
> >do some pretty cool things with this feature, provided you do define
> >some semantics clearly. Like I think you can find the first tuple
> >(given some ORDER BY clause) that fulfills some criteria. I think it is
> >
> >SELECT DISTINCT ON name name, age ORDER BY age;
> >
> >will get the youngest person. This might not be clearly specified now,
> >but
> >as long as it's useful, how about clearly defining it? I don't know that
> >there is an easy way of doing this in standard SQL.
> 
> I don't know about PGSQL, but in other systems, I use:
> 
>     Select <whatever> from <wherever> order by age asc limit to 1 row;
> 
> I *think* the PGSQL syntax is:
> 
>     Select <whatever> from <wherever> order by age asc limit 1;

I think what I really meant was...

SELECT DISTINCT ON firstname firstname, age ORDER BY age.

Which would find the youngest person called "fred", the youngest person
called "paul", the youngest person called "jim" etc etc. which your
limit example wouldn't do.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

24 January 2000, 22:13:43

> INSERT INTO temp_sales
> SELECT DISTINCT on key supplysources.supplysource, 
> incharges.supply, targets.target, incharges.saledate, 
> incharges.supplyunit, '', incharges.quantity, incharges.company, 
> incharges.costcntr, 'Replenish', incharges.price, '','','', 0, 
> text(supplysources.supplysource)||
> text(incharges.supply)||
> text(targets.target)||
> text(incharges.saledate) as key 
> FROM supplysources, incharges, targets WHERE 
> supplysources.warehouse = incharges.warehouse AND 
> (targets.site,targets.area) = (incharges.site,incharges.area);
> 
> What happens is that a large charges file which is transferred to
> a mainframe ERP application is first brought into PostgreSQL.
> Depending upon certain race conditions, duplicate "sales" records
> can appear in the data file. We use DISTINCT ON to pick (as you
> point out) an arbitrary record when duplicates appear. I suppose
> we could do a DELETE ... WHERE NOT EXISTS after the import. Using
> DISTINCT ON just saves a step. I don't have any arguments beyond
> the grounds that we're using it in existing code as a duplicate
> record filter - :-(

Why not just throw a NOTICE and keep the feature.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

24 January 2000, 22:46:43

Chris Bitmead <chris@bitmead.com> writes:
> Tom Lane wrote:
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.

> I don't feel overly strongly about this, but if I remember right you can
> do some pretty cool things with this feature, provided you do define
> some semantics clearly.

We did talk about that, but I didn't hear any strong support for doing
it, as opposed to pulling the feature completely... in particular,
I didn't hear anyone volunteering to do the work...

> as long as it's useful, how about clearly defining it? I don't know that
> there is an easy way of doing this in standard SQL. I don't see any
> problems with useful extensions to SQL.

The only reason it came to my notice in the first place was people
posting questions asking why they weren't getting the results they
expected from it (whatever the heck those were; they weren't what you
actually get from the current implementation, anyway).  The problem
with a poorly-specified nonstandard feature is support costs: you
have to document it, answer questions about it, keep it working, etc.
In this case we'd also have to define how it should work and alter
the existing code to produce reasonable and predictable results.  The
existing code is not merely unpredictable, it is definitely broken.
For example:

regression=# select q1,q2 from int8_tbl;       q1        |        q2
------------------+-------------------             123 |               456             123 |
45678901234567894567890123456789|               1234567890123456789 |  45678901234567894567890123456789 |
-4567890123456789
(5 rows)

regression=# select distinct on q1 q1,q2 from int8_tbl;       q1        | q2
------------------+-----             123 | 4564567890123456789 | 123
(2 rows)

-- OK so far, but:

regression=# select distinct on q1 q1,q2 from int8_tbl order by q2;       q1        |        q2
------------------+-------------------4567890123456789 | -4567890123456789             123 |
4564567890123456789|  4567890123456789

(3 rows)

-- which is not "distinct on q1" by my notions...

In short, it's not clear to me that supporting DISTINCT ON is a good use
of our limited resources.  I'm willing to pull it out, but not to fix it.
Does someone else want to take responsibility for it?
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

24 January 2000, 22:57:42

> In short, it's not clear to me that supporting DISTINCT ON is a good use
> of our limited resources.  I'm willing to pull it out, but not to fix it.
> Does someone else want to take responsibility for it?

OK, we can disable it and put it on the TODO list.  It is already there.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Cannot flush block 9782

From

Chairudin Sentosa Harjo

Date:

24 January 2000, 23:06:43

Hi,

This morning when I was inserting data to a table, after 15 minutes
running,
I received this

ERROR: cannot flush block 9782 of tbs_billing_record to stable store.

What does it mean?

Regards,
Chai

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

24 January 2000, 23:06:43

At 14:09 25/01/00 +1100, Chris Bitmead wrote:
>Philip Warner wrote:
>> >I don't feel overly strongly about this, but if I remember right you can
>> >do some pretty cool things with this feature, provided you do define
>> >some semantics clearly. Like I think you can find the first tuple
>> >(given some ORDER BY clause) that fulfills some criteria. I think it is
>> >
>> >SELECT DISTINCT ON name name, age ORDER BY age;
>> >
>> >will get the youngest person. This might not be clearly specified now,
>> >but
>> >as long as it's useful, how about clearly defining it? I don't know that
>> >there is an easy way of doing this in standard SQL.
>> 
>> I don't know about PGSQL, but in other systems, I use:
>> 
>>     Select <whatever> from <wherever> order by age asc limit to 1 row;
>> 
>> I *think* the PGSQL syntax is:
>> 
>>     Select <whatever> from <wherever> order by age asc limit 1;
>
>I think what I really meant was...
>
>SELECT DISTINCT ON firstname firstname, age ORDER BY age.
>
>Which would find the youngest person called "fred", the youngest person
>called "paul", the youngest person called "jim" etc etc. which your
>limit example wouldn't do.

Just a thought, but would I be right in saying that this could be easily
done with the addition of a new aggregate function 'FIRST', which simply
returns the first value sent to it? Since the aggregates operate a row at a
time, you are guaranteed a consistent set of values, I think.

eg.
   SELECT firstname, FIRST(age) ORDER BY age.

Just an idea, but it seems like a cute solution for the more general problem.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 00:02:44

Philip Warner <pjw@rhyme.com.au> writes:
> Just a thought, but would I be right in saying that this could be easily
> done with the addition of a new aggregate function 'FIRST', which simply
> returns the first value sent to it? Since the aggregates operate a row at
> a time, you are guaranteed a consistent set of values, I think.

No, because the system doesn't guarantee to deliver tuples to the
aggregate in any particular order.  The physical order of tuples
on the disk will depend on the order in which they were last updated
(plus random perturbations introduced by VACUUM); and the order in
which they are processed by a query will depend on the query plan
chosen by the planner (sequential scan vs. indexscan, etc).
SQL in general doesn't believe that tuple ordering has any semantic
significance --- you can ask for ORDER BY, but that's only honored
at the end stage of SELECT result delivery, not necessarily anywhere
in the bowels of a query.

Also, if you are going to use an aggregate, ISTM you might as well
use one like MIN() or MAX() and get a fully predictable result.
I suppose a FIRST() aggregate would save a few cycles by not running
comparisons, but I bet it'd be tough to notice any difference.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Don Baccus

Date:

25 January 2000, 00:23:43

At 12:01 AM 1/25/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> Just a thought, but would I be right in saying that this could be easily
>> done with the addition of a new aggregate function 'FIRST', which simply
>> returns the first value sent to it? Since the aggregates operate a row at
>> a time, you are guaranteed a consistent set of values, I think.
>
>No, because the system doesn't guarantee to deliver tuples to the
>aggregate in any particular order.

In fact, this is a KEY notion behind RDBMS systems...queries return an
unordered set, pure and simple, unless you supply an "order" clause
(though "group by" appears to be implemented by ordering in various 
RDBMS systems, that's an efficiency hack not a given AFAIK).

>SQL in general doesn't believe that tuple ordering has any semantic
>significance --- you can ask for ORDER BY, but that's only honored
>at the end stage of SELECT result delivery, not necessarily anywhere
>in the bowels of a query.

Yep...



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

25 January 2000, 01:00:44

At 21:20 24/01/00 -0800, Don Baccus wrote:
>At 12:01 AM 1/25/00 -0500, Tom Lane wrote:
>>Philip Warner <pjw@rhyme.com.au> writes:
>>> Just a thought, but would I be right in saying that this could be easily
>>> done with the addition of a new aggregate function 'FIRST', which simply
>>> returns the first value sent to it? Since the aggregates operate a row at
>>> a time, you are guaranteed a consistent set of values, I think.
>>
>>No, because the system doesn't guarantee to deliver tuples to the
>>aggregate in any particular order.
>
>In fact, this is a KEY notion behind RDBMS systems...queries return an
>unordered set, pure and simple, unless you supply an "order" clause
>(though "group by" appears to be implemented by ordering in various 
>RDBMS systems, that's an efficiency hack not a given AFAIK).

That's why there was an 'order by' in the query (there should have also
been a 'group by', to make it valid SQL). I had (naievely) assumed that the
aggregates would be passed the rows in sorted order.

>>SQL in general doesn't believe that tuple ordering has any semantic
>>significance --- you can ask for ORDER BY, but that's only honored
>>at the end stage of SELECT result delivery, not necessarily anywhere
>>in the bowels of a query.
>
>Yep...
>

Out of curiosity, does the SQL spec give any rules or guidelines about when
aggregates should be applied to resultant rows? Or is it one of the
implementation-dependant things?



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

25 January 2000, 01:10:44

Philip Warner wrote:

> >In fact, this is a KEY notion behind RDBMS systems...queries return an
> >unordered set, pure and simple, unless you supply an "order" clause
> >(though "group by" appears to be implemented by ordering in various
> >RDBMS systems, that's an efficiency hack not a given AFAIK).

On second thoughts, maybe I agree DISTINCT ON should be dumped. In my
example, if I wanted the youngest people with a given name sorted oldest
first things would be confusing. It's an inconsistant command, I can see
that now.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 01:13:45

Philip Warner <pjw@rhyme.com.au> writes:
>>> SQL in general doesn't believe that tuple ordering has any semantic
>>> significance --- you can ask for ORDER BY, but that's only honored
>>> at the end stage of SELECT result delivery, not necessarily anywhere
>>> in the bowels of a query.

> Out of curiosity, does the SQL spec give any rules or guidelines about when
> aggregates should be applied to resultant rows? Or is it one of the
> implementation-dependant things?

Well, SQL's conceptual model is perfectly clear about the processing
work flow of a SELECT: after you've constructed the tuple set (which
might involve joining multiple relations), you apply the WHERE condition
to filter out uninteresting tuples.  Then you apply GROUP BY (if given)
to divvy the tuples into groups.  Then you apply HAVING to eliminate
uninteresting groups.  Then you apply aggregate functions (if any) to
individual groups, or to the whole filtered result set if no groups.
Finally you apply ORDER BY to whatever's left.

Note that at no point except the final output is there any notion
of the tuples being generated or processed in a particular order.
Given appropriate hardware, much of this could be done in parallel.
Parallel or not, an implementation is free to choose the processing
order for its convenience.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Adriaan Joubert

Date:

25 January 2000, 02:16:45

Tom Lane wrote:

> Chris Bitmead <chris@bitmead.com> writes:
> > Tom Lane wrote:
> >> If I don't hear loud hollers very soon, I'm going to eliminate the
> >> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> >> is not standard SQL and has no clear semantic interpretation.
>
> > I don't feel overly strongly about this, but if I remember right you can
> > do some pretty cool things with this feature, provided you do define
> > some semantics clearly.
>
> We did talk about that, but I didn't hear any strong support for doing
> it, as opposed to pulling the feature completely... in particular,
> I didn't hear anyone volunteering to do the work...

I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it
happened again last night,
FATAL 1:  VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2
). I then have loads of duplicates in a table and by doing a SELECT DISTINCT
... INTO ... I can get rid of them. A few table renames later I'm back in
business. So how do I do this without DISTINCT?

Better would be of course to get rid of the BlowawayRelationBuffers problem.
I cannot reproduce it yet, but I think I can make it happen by running vacuum
and updating tables that are linked by triggers. Does this sound plausible?
And yes, I'm running on Alpha, i.e. 64 bits.

Adriaan

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 02:45:45

Adriaan Joubert <a.joubert@albourne.com> writes:
> I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it
> happened again last night,
> FATAL 1:  VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2
> ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT
> ... INTO ... I can get rid of them. A few table renames later I'm back in
> business. So how do I do this without DISTINCT?

I didn't say anything about eliminating SELECT DISTINCT; I was talking
about SELECT DISTINCT ON, which is a whole 'nother thing.

> Better would be of course to get rid of the BlowawayRelationBuffers problem.

Quite so.  I think this is fixed for 7.0 ... that error message
indicates a buffer refcount leak, and I fixed several problems
along that line a couple months ago.  We shall see if I got 'em
all or not ...
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

25 January 2000, 02:55:45

At 01:12 25/01/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>>>> SQL in general doesn't believe that tuple ordering has any semantic
>>>> significance --- you can ask for ORDER BY, but that's only honored
>>>> at the end stage of SELECT result delivery, not necessarily anywhere
>>>> in the bowels of a query.
>
>> Out of curiosity, does the SQL spec give any rules or guidelines about when
>> aggregates should be applied to resultant rows? Or is it one of the
>> implementation-dependant things?
>
>Well, SQL's conceptual model is perfectly clear about the processing
>work flow of a SELECT: after you've constructed the tuple set (which
>might involve joining multiple relations), you apply the WHERE condition
>to filter out uninteresting tuples.  Then you apply GROUP BY (if given)
>to divvy the tuples into groups.  Then you apply HAVING to eliminate
>uninteresting groups.  Then you apply aggregate functions (if any) to
>individual groups, or to the whole filtered result set if no groups.
>Finally you apply ORDER BY to whatever's left.

Thanks, but now I'm confused. I would have thought that aggregates went
*before* the having clause, since at least one DB I know allows:

select job_type,avg(age) from <wherever> where <stuff> group by job_type
having avg(age) > 70;

ie. the use of aggregate results in the 'having' clause.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 03:22:45

Philip Warner <pjw@rhyme.com.au> writes:
> Thanks, but now I'm confused. I would have thought that aggregates went
> *before* the having clause,

Arrgh.  You are right of course.  Shouldn't pontificate late at night...
        regards, tom lane

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Julian Scarfe

Date:

25 January 2000, 03:30:46

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

I use 'distinct on' to get the most recent reports for each of a group of
locations.  E.g.:

create table reports (
location     varchar(16),
report_time  datetime,
report_text  text);

select distinct on location * from reports where location ~~ 'Lond%' order by
location, reporttime desc;

to get the tuples that offer the most recent reports for each of London,
Londonderry, Londy etc.

Is there an alternative?

Julian Scarfe

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Michael Meskes

Date:

25 January 2000, 05:12:46

On Mon, Jan 24, 2000 at 10:55:02PM -0500, Bruce Momjian wrote:
> > In short, it's not clear to me that supporting DISTINCT ON is a good use
> > of our limited resources.  I'm willing to pull it out, but not to fix it.
> > Does someone else want to take responsibility for it?
> 
> OK, we can disable it and put it on the TODO list.  It is already there.

Or, as you suggested earlier, throw a NOTICE instead of disabling it. But
then I doubt Tom talked about physically removing the code anyway. Disabling
certainly is enough.

Michael
-- 
Michael Meskes                         | Go SF 49ers!
Th.-Heuss-Str. 61, D-41812 Erkelenz    | Go Rhein Fire!
Tel.: (+49) 2431/72651                 | Use Debian GNU/Linux!
Email: Michael@Fam-Meskes.De           | Use PostgreSQL!

Re: [SQL] Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

25 January 2000, 05:30:47

On Mon, 24 Jan 2000, Bruce Momjian wrote:

> > In short, it's not clear to me that supporting DISTINCT ON is a good use
> > of our limited resources.  I'm willing to pull it out, but not to fix it.
> > Does someone else want to take responsibility for it?
> 
> OK, we can disable it and put it on the TODO list.  It is already there.
> 

Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all
cases?

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

25 January 2000, 05:59:47

On Tue, 25 Jan 2000, Julian Scarfe wrote:

> Tom Lane wrote:
> > 
> > If I don't hear loud hollers very soon, I'm going to eliminate the
> > DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> > is not standard SQL and has no clear semantic interpretation.
> 
> I use 'distinct on' to get the most recent reports for each of a group of
> locations.  E.g.:
> 
> create table reports (
> location     varchar(16),
> report_time  datetime,
> report_text  text);
> 
> select distinct on location * from reports where location ~~ 'Lond%' order by
> location, reporttime desc;
> 
> to get the tuples that offer the most recent reports for each of London,
> Londonderry, Londy etc.
> 
> Is there an alternative?

select location, max(report_time) from reports group by location

This also has the negligible advantage that it gives you determinate
results.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Don Baccus

Date:

25 January 2000, 10:44:51

At 05:00 PM 1/25/00 +1100, Philip Warner wrote:

>Out of curiosity, does the SQL spec give any rules or guidelines about when
>aggregates should be applied to resultant rows? Or is it one of the
>implementation-dependant things?

Well...my copy of Date's now actually in the mail, rather than in 
Boston, but at the moment that's not much help!

"order by" happens after everything, AFAIK.  For instance...

select ...
union
select ...
order by

orders the result of the union, which pretty much implies that
aggregates will happen first.

- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

26 January 2000, 13:29:29

On 2000-01-25, Chris Bitmead mentioned:

> I think what I really meant was...
> 
> SELECT DISTINCT ON firstname firstname, age ORDER BY age.
> 
> Which would find the youngest person called "fred", the youngest person
> called "paul", the youngest person called "jim" etc etc. which your
> limit example wouldn't do.

SELECT firstname, min(age) FROM x GROUP BY firstname [ ORDER BY min(age) ]

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

26 January 2000, 13:30:10

On 2000-01-24, Tom Lane mentioned:

> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
don't see why that wouldn't be true. You can always rewrite

select distinct on a a,b from test

as

select a, xxx(b) from test group by a

where xxx is some aggregate function (presumably min or max).

You can also rewrite

select distinct on a a,b,c from test

as

select a, b, c from test group by a, b, c

or using some aggregates here as well. At least you can control your
results that way.

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

26 January 2000, 13:41:10

Peter Eisentraut <peter_e@gmx.net> writes:
> Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
> don't see why that wouldn't be true. You can always rewrite
> select distinct on a a,b from test
> as
> select a, xxx(b) from test group by a
> where xxx is some aggregate function (presumably min or max).

Not really.  Look at Julian's example.  He can't rewrite as
select a, min(b), min(c) from test group by a

because the idea is to get the c that corresponds to the min b.
If you do it with two independent aggregates then the b and c
you get back may be from different tuples.

I could imagine fixing this with a two-input aggregate, say
select a, min(b), keyofmin(b, c) from test group by a

where keyofmin is defined to return the c associated with the min b.
But that'd be a pain to implement, first because we have no support
for multi-argument aggregates, and second because you'd need a ton
of separate keyofmin implementations for the cross-product of the
data types you might want to deal with.  So this is nearly as
klugy as the SELECT DISTINCT ON approach --- and not any more
standard, either.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

26 January 2000, 17:24:12

At 19:34 26/01/00 +0100, Peter Eisentraut wrote:
>On 2000-01-24, Tom Lane mentioned:
>
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.
>
>Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
>don't see why that wouldn't be true. You can always rewrite
>
>select distinct on a a,b from test
>
>as
>
>select a, xxx(b) from test group by a
>
>where xxx is some aggregate function (presumably min or max).
>
>You can also rewrite
>
>select distinct on a a,b,c from test
>
>as
>
>select a, b, c from test group by a, b, c
>
>or using some aggregates here as well. At least you can control your
>results that way.


I only learned about DISTINCT ON in this discussion, but my impression is
that it has one advantage over GROUP BY, i that it produces a 'consistent'
tuple. ie. the tuple it produces is guaranteed to exist in the database,
whereas using GROUP BY with aggregates will not produce a 'real' row:

f1|f2|f3
--------
1  2  3
1  3  2
2  3  1
2  1  3
3  1  2
3  2  1

'select distinct on f1 f1,f2,f3 from <somewhere> order by f1, f2, f3' will
produce (I think):

f1|f2|f3
--------
1  2  3
2  1  3
3  1  2

...where each resulting tuple actually exists in the DB. I can't see a way
of doing this with aggreagates unless a 'FIRST' function is defined, but
that (as already discussed) will not obey the 'order by' clause. Maybe an
'ANY' aggregate would do it, but then it would not always produce the same
results.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

27 January 2000, 17:22:06

On 2000-01-26, Tom Lane mentioned:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
> > don't see why that wouldn't be true. You can always rewrite
> > select distinct on a a,b from test
> > as
> > select a, xxx(b) from test group by a
> > where xxx is some aggregate function (presumably min or max).
> 
> Not really.  Look at Julian's example.  He can't rewrite as
> 
>     select a, min(b), min(c) from test group by a
> 
> because the idea is to get the c that corresponds to the min b.

I see. I believe what you want is this:

select one.a, two.b, two.c
from   (select a, min(b) as "min_b" from test group by a) as one,   (select b, c from test) as two
where one."min_b" = two.b

Not sure if this is completely legal as it stands but at least the idea
would be to join the grouped select with the plain one to get the c
corresponding to the minimum b. But of course we don't offer that, so it's
distinct on until then. (It would really surprise me if the distinct on
functionality was not at all possible to emulate using SQL, since in my
experience it is fairly complete with regards to querying options at
least.)

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

27 January 2000, 20:22:21

At 23:28 27/01/00 +0100, Peter Eisentraut wrote:
>
>select one.a, two.b, two.c
>from
>    (select a, min(b) as "min_b" from test group by a) as one,
>    (select b, c from test) as two
>where one."min_b" = two.b
>
>Not sure if this is completely legal as it stands but at least the idea
>would be to join the grouped select with the plain one to get the c
>corresponding to the minimum b. But of course we don't offer that, so it's
>distinct on until then. (It would really surprise me if the distinct on
>functionality was not at all possible to emulate using SQL, since in my
>experience it is fairly complete with regards to querying options at
>least.)

You are quite right - with a complete SQL impleentation, DISTINCT ON
becomes superfluous. Although it may give the optimizer usefull hints as to
how to approach the query. You actually have to be a bit more carefull to
avoid duplicates, something like:

Select   a,b,c
>From    (Select a, min(b) from test group by a) as one(a,min_b),   (Select b, c from test t2 where t2.a = one.a limit
to1 row) as two
 

This is legal on the DB I use most of the time.






----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/