Thread: SELECT DISTINCT and ORDER BY

SELECT DISTINCT and ORDER BY

From
Joerg Eriskat
Date:
Hi,

I'm not sure it is a bug, so I'm asking the SQL gurus.  I think
there is a inconsistency with SELECT DISTINCT queries when used in
combination with ORDER BY and column alias names (sorry for the rather
long example):      

create table t1 (f1 int4);
insert into t1 values (1);
insert into t1 values (2);
insert into t1 values (2); -- double 

create table t2 (f1 int4);
insert into t2 values (1);
insert into t2 values (2);
insert into t2 values (3);

-- this query seems ok:
select distinct a.f1 as x1, b.f1 as x2  into t3  from t1 a, t2 b;

-- this query does not select *distinct*:
select distinct a.f1 as x1, b.f1 as x2  into t4  from t1 a, t2 b  order by a.f1, b.f1; 
-- but this one does:
select distinct a.f1 as x1, b.f1 as x2  into t5  from t1 a, t2 b  order by x1, x2;

select count(*) from t3;
select count(*) from t4;
select count(*) from t5;

[ output skipped ... ]

select count(*) from t3;
count
-----   6
(1 row)

select count(*) from t4;
count
-----   9
(1 row)

select count(*) from t5;
count
-----   6
(1 row)

This was with Postgres 6.5.1 and RedHat 6.0 on an i486. My apologies if
this is just my plain bad SQL or if it is a known issue (but I did not
find it in the history file for v. 6.5.2).

Greetings,

Joerg





Re: [SQL] SELECT DISTINCT and ORDER BY

From
Tom Lane
Date:
Joerg Eriskat <Eriskat@icf.med.uni-muenchen.de> writes:
> I'm not sure it is a bug, so I'm asking the SQL gurus.  I think
> there is a inconsistency with SELECT DISTINCT queries when used in
> combination with ORDER BY and column alias names (sorry for the rather
> long example):      

Looks like a bug to me.  This seems to be fixed in the current sources,
probably as a result of the changes I made to the representation of sort
ordering.  What is happening in 6.5.* is that the system is failing to
notice that the order-by values are the same as the ones being called
out as DISTINCT, so it generates additional columns and then sorts on
those instead of the DISTINCT ones.  6.6-to-be isn't fooled by the
aliases.

The underlying problem is still there, though: ordering by a column
not mentioned in the DISTINCT list doesn't do the right thing.
Arguably there is no right thing, and the code should raise an error,
but as of now it definitely isn't doing anything reasonable.  It would
take two sort steps to handle this: sort by the DISTINCT columns,
filter out adjacent duplicates, then sort by the ORDER BY columns.
But we only do one sort, and we use the ORDER BY columns (plus any
DISTINCT columns not already in ORDER BY).

However, there's more to this than meets the eye.  If you examine any
non-DISTINCT column after the filter step, then you are going to get a
randomly chosen one of the values associated with that set of DISTINCT
values.  So a second sort step on such a column after the DISTINCT
sort/filter would not give well-defined results.  That's why I think
maybe we should raise an error.  (I don't like "DISTINCT ON column"
either, since it is *necessarily* ill-defined.)
        regards, tom lane