Thread: DISTINCT ON: speak now or forever hold your peace

DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

24 January 2000, 20:50:41

If I don't hear loud hollers very soon, I'm going to eliminate the
DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
is not standard SQL and has no clear semantic interpretation.

I hadn't been planning to touch DISTINCT before 7.0, but changed my
mind when I noticed this little gem:

create table foo1 (f1 int, f2 int, f3 int);
insert into foo1 values(1,2,3);
insert into foo1 values(1,2,null);
insert into foo1 values(1,null,2);
insert into foo1 values(1,2,4);
insert into foo1 values(1,2,4);

select * from foo1;f1 | f2 | f3
----+----+---- 1 |  2 |  3 1 |  2 | 1 |    |  2 1 |  2 |  4 1 |  2 |  4
(5 rows)

select distinct * from foo1;f1 | f2 | f3
----+----+---- 1 |  2 |  3 1 |  2 |  4 1 |  2 |
(3 rows)

Didn't know that (NULL, 2) == (2, NULL), did you?  The problem here
is that nodeUnique is doing a bitwise comparison of the tuple contents
(which is bad enough --- not all datatypes think equality is bitwise),
and it's neglecting to include the null-field bitmap in what it
compares.  Rather than just band-aid the null-field problem, I'm
going to fix it right.  As long as I have to touch it, I'll deal
with DISTINCT ON too.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

24 January 2000, 21:44:42

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

I don't feel overly strongly about this, but if I remember right you can
do some pretty cool things with this feature, provided you do define
some semantics clearly. Like I think you can find the first tuple
(given some ORDER BY clause) that fulfills some criteria. I think it is

SELECT DISTINCT ON name name, age ORDER BY age;

will get the youngest person. This might not be clearly specified now,
but
as long as it's useful, how about clearly defining it? I don't know that
there is an easy way of doing this in standard SQL. I don't see any
problems with useful extensions to SQL. If people want standards, they
don't have to use it.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Mike Mascari

Date:

24 January 2000, 21:55:42

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.
> 

I grepped our source code and found this query:

INSERT INTO temp_sales
SELECT DISTINCT on key supplysources.supplysource, 
incharges.supply, targets.target, incharges.saledate, 
incharges.supplyunit, '', incharges.quantity, incharges.company, 
incharges.costcntr, 'Replenish', incharges.price, '','','', 0, 
text(supplysources.supplysource)||
text(incharges.supply)||
text(targets.target)||
text(incharges.saledate) as key 
FROM supplysources, incharges, targets WHERE 
supplysources.warehouse = incharges.warehouse AND 
(targets.site,targets.area) = (incharges.site,incharges.area);

What happens is that a large charges file which is transferred to
a mainframe ERP application is first brought into PostgreSQL.
Depending upon certain race conditions, duplicate "sales" records
can appear in the data file. We use DISTINCT ON to pick (as you
point out) an arbitrary record when duplicates appear. I suppose
we could do a DELETE ... WHERE NOT EXISTS after the import. Using
DISTINCT ON just saves a step. I don't have any arguments beyond
the grounds that we're using it in existing code as a duplicate
record filter - :-(

Just FYI, 

Mike Mascari

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

24 January 2000, 22:04:45

At 13:41 25/01/00 +1100, Chris Bitmead wrote:
>Tom Lane wrote:
>> 
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.
>
>I don't feel overly strongly about this, but if I remember right you can
>do some pretty cool things with this feature, provided you do define
>some semantics clearly. Like I think you can find the first tuple
>(given some ORDER BY clause) that fulfills some criteria. I think it is
>
>SELECT DISTINCT ON name name, age ORDER BY age;
>
>will get the youngest person. This might not be clearly specified now,
>but
>as long as it's useful, how about clearly defining it? I don't know that
>there is an easy way of doing this in standard SQL. 

I don't know about PGSQL, but in other systems, I use:
   Select <whatever> from <wherever> order by age asc limit to 1 row;

I *think* the PGSQL syntax is:
   Select <whatever> from <wherever> order by age asc limit 1;

I have no idea if the optimizer does 'fast first' optimizations, so I don't
know how quick this would be on a large table.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

24 January 2000, 22:12:42

Philip Warner wrote:
> >I don't feel overly strongly about this, but if I remember right you can
> >do some pretty cool things with this feature, provided you do define
> >some semantics clearly. Like I think you can find the first tuple
> >(given some ORDER BY clause) that fulfills some criteria. I think it is
> >
> >SELECT DISTINCT ON name name, age ORDER BY age;
> >
> >will get the youngest person. This might not be clearly specified now,
> >but
> >as long as it's useful, how about clearly defining it? I don't know that
> >there is an easy way of doing this in standard SQL.
> 
> I don't know about PGSQL, but in other systems, I use:
> 
>     Select <whatever> from <wherever> order by age asc limit to 1 row;
> 
> I *think* the PGSQL syntax is:
> 
>     Select <whatever> from <wherever> order by age asc limit 1;

I think what I really meant was...

SELECT DISTINCT ON firstname firstname, age ORDER BY age.

Which would find the youngest person called "fred", the youngest person
called "paul", the youngest person called "jim" etc etc. which your
limit example wouldn't do.

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

24 January 2000, 22:13:43

> INSERT INTO temp_sales
> SELECT DISTINCT on key supplysources.supplysource, 
> incharges.supply, targets.target, incharges.saledate, 
> incharges.supplyunit, '', incharges.quantity, incharges.company, 
> incharges.costcntr, 'Replenish', incharges.price, '','','', 0, 
> text(supplysources.supplysource)||
> text(incharges.supply)||
> text(targets.target)||
> text(incharges.saledate) as key 
> FROM supplysources, incharges, targets WHERE 
> supplysources.warehouse = incharges.warehouse AND 
> (targets.site,targets.area) = (incharges.site,incharges.area);
> 
> What happens is that a large charges file which is transferred to
> a mainframe ERP application is first brought into PostgreSQL.
> Depending upon certain race conditions, duplicate "sales" records
> can appear in the data file. We use DISTINCT ON to pick (as you
> point out) an arbitrary record when duplicates appear. I suppose
> we could do a DELETE ... WHERE NOT EXISTS after the import. Using
> DISTINCT ON just saves a step. I don't have any arguments beyond
> the grounds that we're using it in existing code as a duplicate
> record filter - :-(

Why not just throw a NOTICE and keep the feature.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

24 January 2000, 22:46:43

Chris Bitmead <chris@bitmead.com> writes:
> Tom Lane wrote:
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.

> I don't feel overly strongly about this, but if I remember right you can
> do some pretty cool things with this feature, provided you do define
> some semantics clearly.

We did talk about that, but I didn't hear any strong support for doing
it, as opposed to pulling the feature completely... in particular,
I didn't hear anyone volunteering to do the work...

> as long as it's useful, how about clearly defining it? I don't know that
> there is an easy way of doing this in standard SQL. I don't see any
> problems with useful extensions to SQL.

The only reason it came to my notice in the first place was people
posting questions asking why they weren't getting the results they
expected from it (whatever the heck those were; they weren't what you
actually get from the current implementation, anyway).  The problem
with a poorly-specified nonstandard feature is support costs: you
have to document it, answer questions about it, keep it working, etc.
In this case we'd also have to define how it should work and alter
the existing code to produce reasonable and predictable results.  The
existing code is not merely unpredictable, it is definitely broken.
For example:

regression=# select q1,q2 from int8_tbl;       q1        |        q2
------------------+-------------------             123 |               456             123 |
45678901234567894567890123456789|               1234567890123456789 |  45678901234567894567890123456789 |
-4567890123456789
(5 rows)

regression=# select distinct on q1 q1,q2 from int8_tbl;       q1        | q2
------------------+-----             123 | 4564567890123456789 | 123
(2 rows)

-- OK so far, but:

regression=# select distinct on q1 q1,q2 from int8_tbl order by q2;       q1        |        q2
------------------+-------------------4567890123456789 | -4567890123456789             123 |
4564567890123456789|  4567890123456789

(3 rows)

-- which is not "distinct on q1" by my notions...

In short, it's not clear to me that supporting DISTINCT ON is a good use
of our limited resources.  I'm willing to pull it out, but not to fix it.
Does someone else want to take responsibility for it?
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

24 January 2000, 22:57:42

> In short, it's not clear to me that supporting DISTINCT ON is a good use
> of our limited resources.  I'm willing to pull it out, but not to fix it.
> Does someone else want to take responsibility for it?

OK, we can disable it and put it on the TODO list.  It is already there.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Cannot flush block 9782

From

Chairudin Sentosa Harjo

Date:

24 January 2000, 23:06:43

Hi,

This morning when I was inserting data to a table, after 15 minutes
running,
I received this

ERROR: cannot flush block 9782 of tbs_billing_record to stable store.

What does it mean?

Regards,
Chai

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Adriaan Joubert

Date:

25 January 2000, 02:16:45

Tom Lane wrote:

> Chris Bitmead <chris@bitmead.com> writes:
> > Tom Lane wrote:
> >> If I don't hear loud hollers very soon, I'm going to eliminate the
> >> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> >> is not standard SQL and has no clear semantic interpretation.
>
> > I don't feel overly strongly about this, but if I remember right you can
> > do some pretty cool things with this feature, provided you do define
> > some semantics clearly.
>
> We did talk about that, but I didn't hear any strong support for doing
> it, as opposed to pulling the feature completely... in particular,
> I didn't hear anyone volunteering to do the work...

I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it
happened again last night,
FATAL 1:  VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2
). I then have loads of duplicates in a table and by doing a SELECT DISTINCT
... INTO ... I can get rid of them. A few table renames later I'm back in
business. So how do I do this without DISTINCT?

Better would be of course to get rid of the BlowawayRelationBuffers problem.
I cannot reproduce it yet, but I think I can make it happen by running vacuum
and updating tables that are linked by triggers. Does this sound plausible?
And yes, I'm running on Alpha, i.e. 64 bits.

Adriaan

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 02:45:45

Adriaan Joubert <a.joubert@albourne.com> writes:
> I've been using DISTINCT to fix things when vacuum gets into a mess (yes, it
> happened again last night,
> FATAL 1:  VACUUM (vc_rpfheap): BlowawayRelationBuffers returned -2
> ). I then have loads of duplicates in a table and by doing a SELECT DISTINCT
> ... INTO ... I can get rid of them. A few table renames later I'm back in
> business. So how do I do this without DISTINCT?

I didn't say anything about eliminating SELECT DISTINCT; I was talking
about SELECT DISTINCT ON, which is a whole 'nother thing.

> Better would be of course to get rid of the BlowawayRelationBuffers problem.

Quite so.  I think this is fixed for 7.0 ... that error message
indicates a buffer refcount leak, and I fixed several problems
along that line a couple months ago.  We shall see if I got 'em
all or not ...
        regards, tom lane

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Julian Scarfe

Date:

25 January 2000, 03:30:46

Tom Lane wrote:
> 
> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

I use 'distinct on' to get the most recent reports for each of a group of
locations.  E.g.:

create table reports (
location     varchar(16),
report_time  datetime,
report_text  text);

select distinct on location * from reports where location ~~ 'Lond%' order by
location, reporttime desc;

to get the tuples that offer the most recent reports for each of London,
Londonderry, Londy etc.

Is there an alternative?

Julian Scarfe

Re: [SQL] Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

25 January 2000, 05:30:47

On Mon, 24 Jan 2000, Bruce Momjian wrote:

> > In short, it's not clear to me that supporting DISTINCT ON is a good use
> > of our limited resources.  I'm willing to pull it out, but not to fix it.
> > Does someone else want to take responsibility for it?
> 
> OK, we can disable it and put it on the TODO list.  It is already there.
> 

Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all
cases?

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

25 January 2000, 05:59:47

On Tue, 25 Jan 2000, Julian Scarfe wrote:

> Tom Lane wrote:
> > 
> > If I don't hear loud hollers very soon, I'm going to eliminate the
> > DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> > is not standard SQL and has no clear semantic interpretation.
> 
> I use 'distinct on' to get the most recent reports for each of a group of
> locations.  E.g.:
> 
> create table reports (
> location     varchar(16),
> report_time  datetime,
> report_text  text);
> 
> select distinct on location * from reports where location ~~ 'Lond%' order by
> location, reporttime desc;
> 
> to get the tuples that offer the most recent reports for each of London,
> Londonderry, Londy etc.
> 
> Is there an alternative?

select location, max(report_time) from reports group by location

This also has the negligible advantage that it gives you determinate
results.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Julian Scarfe

Date:

25 January 2000, 08:31:49

on 25/1/00 10:59, Peter Eisentraut at e99re41@DoCS.UU.SE wrote:

> On Tue, 25 Jan 2000, Julian Scarfe wrote:

>> I use 'distinct on' to get the most recent reports for each of a group of
>> locations.  E.g.:
>> 
>> create table reports (
>> location     varchar(16),
>> report_time  datetime,
>> report_text  text);
>> 
>> select distinct on location * from reports where location ~~ 'Lond%' order by
>> location, reporttime desc;<pgsql-hackers@postgreSQL.org>
>> 
>> to get the tuples that offer the most recent reports for each of London,
>> Londonderry, Londy etc.
>> 
>> Is there an alternative?
> 
> select location, max(report_time) from reports group by location
> 
> This also has the negligible advantage that it gives you determinate
> results.

But I want report_text *too* -- that's the important bit, and it's not an
aggregate or common column in the group. So:

> Am I wrong or can DISTINCT ON not be rewritten in terms of GROUP BY in all
> cases?

I don't think so.

It has to be something like:

select * from reports as r1 where r1.report_time = (select max(report_time)
from reports as r2 where r2.location = r1.location) and r1.location ~~
'Lond%';

However, to my surprise, that's as quick as doing the 'distinct on'.


host=> explain select * from reports as r1 where r1.report_time = (select
max(report_time) from reports as r2 where r2.location = r1.location) and
r1.location ~~ 'Lond%';
NOTICE:  QUERY PLAN:

Index Scan using reports_by_location_issuetime on reports r1  (cost=186.24
size=2 width=334) SubPlan   ->  Aggregate  (cost=3.62 size=0 width=0)         ->  Index Scan using
reports_by_location_report_timeon reports r2
 
(cost=3.62 size=13 width=8)

EXPLAIN

host=> explain select distinct on location * from reports where location ~~
'Lond%' order by location, report_time desc;
NOTICE:  QUERY PLAN:

Unique  (cost=186.24 size=0 width=0) ->  Sort  (cost=186.24 size=0 width=0)       ->  Index Scan using
reports_by_location_issuetimeon reports
 
(cost=186.24 size=2 width=334)

I'm surprise that the "reports_by_location_issuetime" index got used, since
it involves the issuetime field which is not the same as the report_time
field. I edited it and numerous other fields from the quoted 'create table'
for simplicity. There's also a "reports_by_location_report_time" which
indexes on, obviously, (location, report_time). Comments?

Julian Scarfe, learning...

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

25 January 2000, 12:17:56

Julian Scarfe <jscarfe@callnetuk.com> writes:
>>> I use 'distinct on' to get the most recent reports for each of a group of
>>> locations.  E.g.:
>>> 
>>> select distinct on location * from reports where location ~~ 'Lond%'
>>> order by location, reporttime desc;
>>> 
>>> to get the tuples that offer the most recent reports for each of London,
>>> Londonderry, Londy etc.
>>> 
>>> Is there an alternative?
>> 
>> select location, max(report_time) from reports group by location

> But I want report_text *too* -- that's the important bit, and it's not an
> aggregate or common column in the group.

Right.  That is the motivation that was given for keeping DISTINCT ON
and trying to clean it up: if you ORDER BY the DISTINCT field plus
enough other fields to give a unique ordering, and then rely on the
(undocumented) implementation property that you'll get the first tuple
out of each group with the same DISTINCT field, then you can get useful
effects like this one.

It's a kluge, and it's not readily extensible to cases such as needing to
apply the DISTINCT filter across two fields, but it does solve problems
that you can't straightforwardly rewrite into GROUP BY.

Tell you what: there are two things that are really bugging me about
DISTINCT ON.  If I could get consensus on patching these things,
I'd be willing to do that rather than yank it.

One gripe is the lame-brain syntax: because there is no separator
between the DISTINCT ON column name and the first select targetlist
item, the DISTINCT ON target cannot be anything more complex than an
unadorned column name.  For example "SELECT DISTINCT ON A+B+C, ..."
must be interpreted as DISTINCT ON A with a first target of +B+C;
if we tried to allow more complex expressions then it becomes ambiguous
whether that was meant or DISTINCT ON A+B with a first target of +C.
To fix this we need some kind of syntactic separator.  The cleanest
idea that comes to my mind is to require parentheses around the ON
target:    SELECT DISTINCT ON (expression) target1, ...
One immediate advantage of allowing an expression is that you can do
distinct-on-two-fields in a rather klugy way, eg    SELECT DISTINCT ON (field1 || ' ' || field2) ...
We might someday extend it to allow multiple DISTINCT fields, eg,    SELECT DISTINCT ON (expr1 [, expr2 ...]) target1,
...
but I'm not promising to do that now (unless it's really easy ;-)).

The other gripe is the bad interaction with ORDER BY, as was illustrated
in my example of last night: if you ORDER BY something that's not the
DISTINCT field, it doesn't work.  We could get something that kind of
works by applying two sorting passes: first sort by the DISTINCT field,
then run the duplicate-elimination filter, then re-sort by the ORDER BY
field(s).  The trouble with that is that it means the user no longer has
control of the sort order within the same-DISTINCT-field-value groups
during the first sort phase, so there's no way for him to control which
tuple gets kept from each group.  What I'd prefer to do is put in an
error check that says "if you use both DISTINCT ON and ORDER BY, then
the DISTINCT ON expression must be the first ORDER BY item".  That way
the ORDER BY ordering can be the same one used for the
duplicate-elimination pass, and we don't have the semantic problem.
Note that Julian's example meets this constraint.

(BTW, ordinary SELECT DISTINCT has this same sort of problem if you try
to ORDER BY an expression that doesn't appear in the target list.
SQL92 avoids the issue by not allowing you to ORDER BY expressions that
aren't in the target list, period.  We do allow that --- but not when
you use DISTINCT.  Essentially, I want to enforce that same restriction
for DISTINCT ON.)

The other piece of the puzzle would be to document that DISTINCT ON
keeps the first tuple out of each set with the same DISTINCT ON value.

Does that sound like a plan?
        regards, tom lane

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

25 January 2000, 17:44:55

> (BTW, ordinary SELECT DISTINCT has this same sort of problem if you try
> to ORDER BY an expression that doesn't appear in the target list.
> SQL92 avoids the issue by not allowing you to ORDER BY expressions that
> aren't in the target list, period.  We do allow that --- but not when
> you use DISTINCT.  Essentially, I want to enforce that same restriction
> for DISTINCT ON.)
> 
> The other piece of the puzzle would be to document that DISTINCT ON
> keeps the first tuple out of each set with the same DISTINCT ON value.
> 
> Does that sound like a plan?

Yes, very clear.  Good.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

26 January 2000, 13:29:29

On 2000-01-25, Chris Bitmead mentioned:

> I think what I really meant was...
> 
> SELECT DISTINCT ON firstname firstname, age ORDER BY age.
> 
> Which would find the youngest person called "fred", the youngest person
> called "paul", the youngest person called "jim" etc etc. which your
> limit example wouldn't do.

SELECT firstname, min(age) FROM x GROUP BY firstname [ ORDER BY min(age) ]

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

26 January 2000, 13:30:10

On 2000-01-24, Tom Lane mentioned:

> If I don't hear loud hollers very soon, I'm going to eliminate the
> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
> is not standard SQL and has no clear semantic interpretation.

Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
don't see why that wouldn't be true. You can always rewrite

select distinct on a a,b from test

as

select a, xxx(b) from test group by a

where xxx is some aggregate function (presumably min or max).

You can also rewrite

select distinct on a a,b,c from test

as

select a, b, c from test group by a, b, c

or using some aggregates here as well. At least you can control your
results that way.

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

26 January 2000, 13:41:10

Peter Eisentraut <peter_e@gmx.net> writes:
> Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
> don't see why that wouldn't be true. You can always rewrite
> select distinct on a a,b from test
> as
> select a, xxx(b) from test group by a
> where xxx is some aggregate function (presumably min or max).

Not really.  Look at Julian's example.  He can't rewrite as
select a, min(b), min(c) from test group by a

because the idea is to get the c that corresponds to the min b.
If you do it with two independent aggregates then the b and c
you get back may be from different tuples.

I could imagine fixing this with a two-input aggregate, say
select a, min(b), keyofmin(b, c) from test group by a

where keyofmin is defined to return the c associated with the min b.
But that'd be a pain to implement, first because we have no support
for multi-argument aggregates, and second because you'd need a ton
of separate keyofmin implementations for the cross-product of the
data types you might want to deal with.  So this is nearly as
klugy as the SELECT DISTINCT ON approach --- and not any more
standard, either.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

26 January 2000, 17:24:12

At 19:34 26/01/00 +0100, Peter Eisentraut wrote:
>On 2000-01-24, Tom Lane mentioned:
>
>> If I don't hear loud hollers very soon, I'm going to eliminate the
>> DISTINCT ON "feature" for 7.0.  As previously discussed, this feature
>> is not standard SQL and has no clear semantic interpretation.
>
>Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
>don't see why that wouldn't be true. You can always rewrite
>
>select distinct on a a,b from test
>
>as
>
>select a, xxx(b) from test group by a
>
>where xxx is some aggregate function (presumably min or max).
>
>You can also rewrite
>
>select distinct on a a,b,c from test
>
>as
>
>select a, b, c from test group by a, b, c
>
>or using some aggregates here as well. At least you can control your
>results that way.


I only learned about DISTINCT ON in this discussion, but my impression is
that it has one advantage over GROUP BY, i that it produces a 'consistent'
tuple. ie. the tuple it produces is guaranteed to exist in the database,
whereas using GROUP BY with aggregates will not produce a 'real' row:

f1|f2|f3
--------
1  2  3
1  3  2
2  3  1
2  1  3
3  1  2
3  2  1

'select distinct on f1 f1,f2,f3 from <somewhere> order by f1, f2, f3' will
produce (I think):

f1|f2|f3
--------
1  2  3
2  1  3
3  1  2

...where each resulting tuple actually exists in the DB. I can't see a way
of doing this with aggreagates unless a 'FIRST' function is defined, but
that (as already discussed) will not obey the 'order by' clause. Maybe an
'ANY' aggregate would do it, but then it would not always produce the same
results.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Chris Bitmead

Date:

26 January 2000, 17:57:14

Tom Lane wrote:
>      SELECT DISTINCT ON (expression) target1, ...
> One immediate advantage of allowing an expression is that you can do
> distinct-on-two-fields in a rather klugy way, eg
>                 SELECT DISTINCT ON (field1 || ' ' || field2) ...

As long as we're fixing the syntax, I'm wondering if it wouldn't be more
logical to have DISTINCT ON somewhere later in the syntax. I'm wondering
if that might also avoid the need for () as a side effect. Like this
perhaps....

SELECT x, y, z FROM foo WHERE z DISTINCT ON x, y ORDER BY x, y;

> What I'd prefer to do is put in an
> error check that says "if you use both DISTINCT ON and ORDER BY, then
> the DISTINCT ON expression must be the first ORDER BY item".  

Better, but still a little kludgy. What about a syntax that really
supports everything you want? Like maybe...

SELECT x, y, z FROM foo DISTINCT ON x, y DESC ORDER BY z ASC;

Distinct on now has a similar syntax to the order by clause. What this
means is, do the DISTINCT ON test by ordering DESC (so you get the
distinct item with the largest value of x, y), and then order the final
result by z ascending.

Unless I'm missing something that gives everybody what they want.

HANG ON, I've got a better idea.....

The other alternative is to make DISTINCT ON a bit like GROUP BY. So you
would have
something like

SELECT x, y, max(z) AS mmm FROM foo DISTINCT ON x, y ORDER BY mmm;

Like GROUP BY where you group by the non-aggregate fields in the SELECT
clause, you use DISTINCT ON, to mention the non-aggregate clauses in the
SELECT. The aggregate clause are used to select WHICH of the resulting
fields are selected for presentation. This would have the benefit of
being more general so you could select, say the city with the longest
name in each state...

SELECT state, zipcode, longest_string(cityname) FROM cities DISTINCT ON
state ORDER BY zipcode;

I don't know how far I'm stepping here from the SQL paradigm, but it
sure seems cool....

Chris.

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

26 January 2000, 18:06:14

Chris Bitmead <chris@bitmead.com> writes:
> As long as we're fixing the syntax, I'm wondering if it wouldn't be more
> logical to have DISTINCT ON somewhere later in the syntax.

Well, SELECT DISTINCT is that way because SQL92 says so.  Putting the
DISTINCT ON variant somewhere else might be logically purer, but I think
it'd be confusing.

Also, isn't the reason we have DISTINCT ON at all that it's there to
be compatible with MySQL or someone?  I figured adding parens would be
about the least-surprising variant syntax for a person used to those
other products.
        regards, tom lane

Re: [HACKERS] Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Bruce Momjian

Date:

26 January 2000, 18:33:13

> Chris Bitmead <chris@bitmead.com> writes:
> > As long as we're fixing the syntax, I'm wondering if it wouldn't be more
> > logical to have DISTINCT ON somewhere later in the syntax.
> 
> Well, SELECT DISTINCT is that way because SQL92 says so.  Putting the
> DISTINCT ON variant somewhere else might be logically purer, but I think
> it'd be confusing.
> 
> Also, isn't the reason we have DISTINCT ON at all that it's there to
> be compatible with MySQL or someone?  I figured adding parens would be
> about the least-surprising variant syntax for a person used to those
> other products.
> 

Makes sense to me.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026

Ars Digita and PostgreSQL

From

Chris Bitmead

Date:

27 January 2000, 00:44:04

I saw someone mention here about porting Ars Digita to PostgreSQL. Can
they elaborate on this? Is there a web page? When it's done will all
versions of Ars Digita work with postgres? How advanced? How long to go?

Re: [HACKERS] Ars Digita and PostgreSQL

From

Don Baccus

Date:

27 January 2000, 11:46:15

At 04:41 PM 1/27/00 +1100, Chris Bitmead wrote:
>
>I saw someone mention here about porting Ars Digita to PostgreSQL. Can
>they elaborate on this? Is there a web page? When it's done will all
>versions of Ars Digita work with postgres? How advanced? How long to go?

Check out http://dsl-dhogaza.pacifier.net:2000 for a snapshot of the
work-in-progress.  Feel free to poke around and play.  Much of it is
there.

That's my box, it will be moving to a larger box over the next 
few weeks, enough funding for that popped up, apparently.

That version is based on ACS 2.4, mostly ported by me.  Ben Adida
(an early arsDigita employee now on his own), me and a few others
plan to move what we've got to ACS3 over the next few days.

We also will be moving to the open source version of AOLserver,
which is now in beta (they're using a postgres-like definition
of beta, it appears to be very stable).

We really need some of the upcoming V7.0 features, like "group
by" that doesn't always return a row, avg(numeric), and other
bug fixes.  We'll love having referential integrity, too.  And
>8KB query returns will be great, means we can define all text
columns as "text" and not worry about selects dying if we select
more than one column.  We're REALLY eager for TOAST and outer
joins :)

So, target is:

Linux RH6.1
AOLserver 3.0b beta
PG V7.0 beta
ACS 3.1

By mid-February.  

We've been using sourceforge but they're quickly getting overloaded,
bogged down, and maybe just a teensy bit flaky.  We plan to move to
start using the software management features of the ACS + CVS + other
stuff (later to be moved to the ACS) and move to our own box as soon
as we get things up and running.

I've manhandled the AOLserver postgres driver, too, and it's now
considerably more robust.  It will be distributed with whatever
comes after AOLserver 3.0b beta.

- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.

Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From

Tom Lane

Date:

27 January 2000, 13:34:17

I wrote:
> To fix this we need some kind of syntactic separator.  The cleanest
> idea that comes to my mind is to require parentheses around the ON
> target:
>         SELECT DISTINCT ON (expression) target1, ...
> One immediate advantage of allowing an expression is that you can do
> distinct-on-two-fields in a rather klugy way, eg
>         SELECT DISTINCT ON (field1 || ' ' || field2) ...
> We might someday extend it to allow multiple DISTINCT fields, eg,
>         SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ...
> but I'm not promising to do that now (unless it's really easy ;-)).

FYI, I have committed this change.  It did turn out to be just as easy
to support multiple DISTINCT ON expressions, so 7.0 will acceptSELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ...

> What I'd prefer to do is put in an
> error check that says "if you use both DISTINCT ON and ORDER BY, then
> the DISTINCT ON expression must be the first ORDER BY item".

More generally, if you use both then the DISTINCT and ORDER lists must
match until one or the other runs out.
        regards, tom lane

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Peter Eisentraut

Date:

27 January 2000, 17:22:06

On 2000-01-26, Tom Lane mentioned:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > Our documents say that DISTINCT ON is equivalent to GROUP BY. I still
> > don't see why that wouldn't be true. You can always rewrite
> > select distinct on a a,b from test
> > as
> > select a, xxx(b) from test group by a
> > where xxx is some aggregate function (presumably min or max).
> 
> Not really.  Look at Julian's example.  He can't rewrite as
> 
>     select a, min(b), min(c) from test group by a
> 
> because the idea is to get the c that corresponds to the min b.

I see. I believe what you want is this:

select one.a, two.b, two.c
from   (select a, min(b) as "min_b" from test group by a) as one,   (select b, c from test) as two
where one."min_b" = two.b

Not sure if this is completely legal as it stands but at least the idea
would be to join the grouped select with the plain one to get the c
corresponding to the minimum b. But of course we don't offer that, so it's
distinct on until then. (It would really surprise me if the distinct on
functionality was not at all possible to emulate using SQL, since in my
experience it is fairly complete with regards to querying options at
least.)

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden

Re: [HACKERS] DISTINCT ON: speak now or forever hold your peace

From

Philip Warner

Date:

27 January 2000, 20:22:21

At 23:28 27/01/00 +0100, Peter Eisentraut wrote:
>
>select one.a, two.b, two.c
>from
>    (select a, min(b) as "min_b" from test group by a) as one,
>    (select b, c from test) as two
>where one."min_b" = two.b
>
>Not sure if this is completely legal as it stands but at least the idea
>would be to join the grouped select with the plain one to get the c
>corresponding to the minimum b. But of course we don't offer that, so it's
>distinct on until then. (It would really surprise me if the distinct on
>functionality was not at all possible to emulate using SQL, since in my
>experience it is fairly complete with regards to querying options at
>least.)

You are quite right - with a complete SQL impleentation, DISTINCT ON
becomes superfluous. Although it may give the optimizer usefull hints as to
how to approach the query. You actually have to be a bit more carefull to
avoid duplicates, something like:

Select   a,b,c
>From    (Select a, min(b) from test group by a) as one(a,min_b),   (Select b, c from test t2 where t2.a = one.a limit
to1 row) as two
 

This is legal on the DB I use most of the time.






----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/