Thread: DISTINCT ON

DISTINCT ON

From

"Jeremy Palmer"

Date:

18 November 2005, 18:49:57

Hi,

I have a table:

observation (  id int4 NOT NULL [PRIMARY KEY],  vector_id NOT NULL [FORGIEN KEY],  obs_type VARCHAR(4) NOT NULL,  date
TIMESTAMPNULL
 
)

I need to select the newest observation id, classify by type, for each
vector (there can be multiple observation ids on each vector). I have read
the postgresql manual and see that there is a "DISTINCT ON" statement which
can do this.

i.e.

SELECT DISTINCT ON (vector_id, obs_type)      id
FROM   observation
ORDER  BY vector_id,      obs_type,      date DESC;

However the documentation also states that "DISTINCT ON" is not part of the
SQL standard and should be avoided when possible, stating that aggregations
and sub-queries should be used instead...

How would this be done in this scenario?

Thanks for you help
Jeremy

Re: DISTINCT ON

From

Michael Glaesemann

Date:

18 November 2005, 19:28:03

On Nov 19, 2005, at 7:49 , Jeremy Palmer wrote:

> SELECT DISTINCT ON (vector_id, obs_type)
>        id
> FROM   observation
> ORDER  BY vector_id,
>        obs_type,
>        date DESC;
>
> However the documentation also states that "DISTINCT ON" is not  
> part of the
> SQL standard and should be avoided when possible, stating that  
> aggregations
> and sub-queries should be used instead...
>
> How would this be done in this scenario?


Something like:

select max(date), id
from observation
group by vector_id, obs_type;

Do test to see which is better in your situation.

Michael Glaesemann
grzm myrealbox com

Re: DISTINCT ON

From

"Jeremy Palmer"

Date:

18 November 2005, 22:50:58

Thanks for the reply.

Unfortunately that does not work as "id" column needs to be included in the
group by statement or be used in an aggregate function. If I did this it
definitely would note return the correct answer, as the "id" column is the
primary key for the table.

Any further suggestions?

-----Original Message-----
From: Michael Glaesemann [mailto:grzm@myrealbox.com] 
Sent: Saturday, 19 November 2005 12:28 p.m.
To: Jeremy Palmer
Cc: pgsql-sql@postgresql.org
Subject: Re: [SQL] DISTINCT ON

On Nov 19, 2005, at 7:49 , Jeremy Palmer wrote:

> SELECT DISTINCT ON (vector_id, obs_type)
>        id
> FROM   observation
> ORDER  BY vector_id,
>        obs_type,
>        date DESC;
>
> However the documentation also states that "DISTINCT ON" is not  
> part of the
> SQL standard and should be avoided when possible, stating that  
> aggregations
> and sub-queries should be used instead...
>
> How would this be done in this scenario?

Something like:

select max(date), id
from observation
group by vector_id, obs_type;

Do test to see which is better in your situation.

Michael Glaesemann
grzm myrealbox com

Re: DISTINCT ON

From

Michael Glaesemann

Date:

18 November 2005, 23:06:50


> -----Original Message-----
> From: Michael Glaesemann [mailto:grzm@myrealbox.com]
> Sent: Saturday, 19 November 2005 12:28 p.m.

> On Nov 19, 2005, at 7:49 , Jeremy Palmer wrote:
>
>> SELECT DISTINCT ON (vector_id, obs_type)
>>        id
>> FROM   observation
>> ORDER  BY vector_id,
>>        obs_type,
>>        date DESC;
>>
>> However the documentation also states that "DISTINCT ON" is not
>> part of the
>> SQL standard and should be avoided when possible, stating that
>> aggregations
>> and sub-queries should be used instead...

> Something like:
>
> select max(date), id
> from observation
> group by vector_id, obs_type;

On Nov 19, 2005, at 11:50 , Jeremy Palmer wrote:

> Unfortunately that does not work as "id" column needs to be  
> included in the
> group by statement or be used in an aggregate function. If I did  
> this it
> definitely would note return the correct answer, as the "id" column  
> is the
> primary key for the table.

[Please don't top post. It makes the post harder to read. I've  
reordered the post for readability.]

Try something like this:

select id
from (select max(date) as date    vector_id, obs_typefrom observationgroup by vector_id, obs_type) latest_observations
join observation using (date, vector_id, obs_type)

Michael Glaesemann
grzm myrealbox com

Re: DISTINCT ON

From

"Jeremy Palmer"

Date:

19 November 2005, 00:06:39

> -----Original Message-----
> From: Michael Glaesemann [mailto:grzm@myrealbox.com]
> Sent: Saturday, 19 November 2005 4:07 p.m.
> To: Jeremy Palmer
> Cc: pgsql-sql@postgresql.org
> Subject: Re: [SQL] DISTINCT ON
> 
> [Please don't top post. It makes the post harder to read. I've
> reordered the post for readability.]
> 
> Try something like this:
> 
> select id
> from (
>     select max(date) as date
>         vector_id, obs_type
>     from observation
>     group by vector_id, obs_type
>     ) latest_observations
> join observation using (date, vector_id, obs_type)
> 
> Michael Glaesemann
> grzm myrealbox com

Thanks again for the help. This query looks close to what I need. I think
the only change is to use an explicit join criteria, as the max date alias
can't be used in the "using" clause - only table columns can.

Interesting enough, on my server the "distinct on" clause that I originally
ran takes 10% of execution time that the query you provided does.

Thanks for your help.

Jeremy Palmer

Re: DISTINCT ON

From

Bruno Wolff III

Date:

19 November 2005, 10:28:06

On Sat, Nov 19, 2005 at 17:06:27 +1300, Jeremy Palmer <palmerj@xtra.co.nz> wrote:
> 
> Interesting enough, on my server the "distinct on" clause that I originally
> ran takes 10% of execution time that the query you provided does.

If DISTINCT ON wasn't useful, I doubt it would have been added.