Thread: SELECT DISTINCT ON ... ORDER BY ...

SELECT DISTINCT ON ... ORDER BY ...

From
Thomas Metz
Date:
I have the following problem:

Assuming the table TEST as follows:

ID     NAME
-----------------
1      Alex
2      Oliver
1      Thomas
2      Fenella


SELECT DISTINCT ON id id, name FROM test;
produces:

ID     NAME
-----------------
1      Alex
2      Oliver


SELECT DISTINCT ON id, name FROM test ORDER BY name;
produces:

ID     NAME
-----------------
1      Alex
2      Fenella
1      Thomas

I would have expected only two rows in both queries. I don't care which
names actually appear in the output as long as they are sorted, but
there should no longer be duplicate id's.

If the table is as follows:

ID     NAME
-----------------
2      Oliver
2      Alex
1      Thomas
1      Fenella


SELECT DISTINCT ON id id, name FROM test;
produces:

ID     NAME
-----------------
1      Thomas
2      Oliver


SELECT DISTINCT ON id, name FROM test ORDER BY name;
produces:

ID     NAME
-----------------
2      Alex
1      Fenella
2      Oliver
1      Thomas

What seems to happen is that from the sorted table, duplicate id's are
eliminated only if they are grouped. If there is no explicit sorting I
assume the DISTINCT performs an implicit sorting on id and then
eliminates records correctly. Is that the correct behaviour? Is there
another (simple) way of getting the results I want?

I am still using version 6.3

tm
--
Thomas Metz
GSF - National Research Center for Environment and Health
Institute of Mammalian Genetics

Re: [SQL] SELECT DISTINCT ON ... ORDER BY ...

From
"Sergei M. Suntsov"
Date:
I have the same on 6.4.2
Looks like a bug
Sincerely, Sergei

On Thu, 28 Jan 1999, Thomas Metz wrote:

> I have the following problem:
>
> Assuming the table TEST as follows:
>
> ID     NAME
> -----------------
> 1      Alex
> 2      Oliver
> 1      Thomas
> 2      Fenella
>
>
> SELECT DISTINCT ON id id, name FROM test;
> produces:
>
> ID     NAME
> -----------------
> 1      Alex
> 2      Oliver
>
>
> SELECT DISTINCT ON id, name FROM test ORDER BY name;
> produces:
>
> ID     NAME
> -----------------
> 1      Alex
> 2      Fenella
> 1      Thomas
>
> I would have expected only two rows in both queries. I don't care which
> names actually appear in the output as long as they are sorted, but
> there should no longer be duplicate id's.
>
> If the table is as follows:
>
> ID     NAME
> -----------------
> 2      Oliver
> 2      Alex
> 1      Thomas
> 1      Fenella
>
>
> SELECT DISTINCT ON id id, name FROM test;
> produces:
>
> ID     NAME
> -----------------
> 1      Thomas
> 2      Oliver
>
>
> SELECT DISTINCT ON id, name FROM test ORDER BY name;
> produces:
>
> ID     NAME
> -----------------
> 2      Alex
> 1      Fenella
> 2      Oliver
> 1      Thomas
>
> What seems to happen is that from the sorted table, duplicate id's are
> eliminated only if they are grouped. If there is no explicit sorting I
> assume the DISTINCT performs an implicit sorting on id and then
> eliminates records correctly. Is that the correct behaviour? Is there
> another (simple) way of getting the results I want?
>
> I am still using version 6.3
>
> tm
> --
> Thomas Metz
> GSF - National Research Center for Environment and Health
> Institute of Mammalian Genetics
>
>


Re: [SQL] SELECT DISTINCT ON ... ORDER BY ...

From
"Oliver Elphick"
Date:
Thomas Metz wrote:
  >I have the following problem:
  >
  >Assuming the table TEST as follows:
  >
  >ID     NAME
  >-----------------
  >1      Alex
  >2      Oliver
  >1      Thomas
  >2      Fenella
...
  >SELECT DISTINCT ON id, name FROM test ORDER BY name;
  >produces:
  >
  >ID     NAME
  >-----------------
  >1      Alex
  >2      Fenella
  >1      Thomas
  >
  >I would have expected only two rows in both queries. I don't care which
  >names actually appear in the output as long as they are sorted, but
  >there should no longer be duplicate id's.

I looked at the documentation for SELECT and found that DISTINCT ON is
mentioned but not explained; there are also some other parts of that
documentation that need expanding or correcting, so I'm doing that at
the moment.

Is there any intended difference between DISTINCT ON and GROUP BY? (I
realise from this thread that there is an actual difference!)

Am I correct in saying that DISTINCT ON is not part of SQL92?


--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "Many are the afflictions of the righteous; but the
      LORD delivereth him out of them all."
                                        Psalm 34:19



Re: SELECT DISTINCT ON ... ORDER BY ...

From
Tom Lane
Date:
Thomas Metz <tmetz@gsf.de> writes:
> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
> [doesn't work as expected]

There have been related discussions before on pg-hackers mail list;
you might care to check the list archives.  The conclusion I recall
is that it's not real clear how the combination of SELECT DISTINCT
on one column and ORDER BY on another *should* work.  Postgres'
current behavior is clearly wrong IMHO, but there isn't a unique
definition of right behavior, because it's not clear which tuples
should get selected for the sort.

This "SELECT DISTINCT ON attribute" option strikes me as even more
bogus.  Where did we get that from --- is it in the SQL92 standard?
If you SELECT DISTINCT on a subset of the attributes to be returned,
then there's no unique definition of which values get returned in the
other columns.  In Thomas' example:

> Assuming the table TEST as follows:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
> 1      Thomas
> 2      Fenella

> SELECT DISTINCT ON id id, name FROM test;
> produces:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver

There's no justifiable reason for preferring this output over
    1      Thomas
    2      Oliver
or
    1      Alex
    2      Fenella
or
    1      Thomas
    2      Fenella

Any of these are "DISTINCT ON id", but it's purely a matter of
happenstance table order and unspecified implementation choices which
one will appear.  Do we really have (or want) a statement with
inherently undefined behavior?

Anyway, to answer Thomas' question, the way SELECT DISTINCT is
implemented is that first there's a sort on the DISTINCT columns,
then there's a pass that eliminates adjacent duplicates (like the Unix
uniq(1) program).  In the current backend, doing an ORDER BY on another
column overrides the sorting on the DISTINCT columns, so when the
duplicate-eliminator runs it will fail to get rid of duplicates that
don't happen to appear consecutively in its input.  That's pretty
broken, but then the entire concept of combining these two options
doesn't seem well defined; the SELECT DISTINCT doesn't make any promises
about which tuples (with the same DISTINCT columns) it's going to pick,
therefore the result of ordering by some other column isn't clear.

If you're willing to live with poorly defined behavior, the fix
is fairly obvious: run the sort and uniq passes for the DISTINCT
columns, *then* run the sort on the ORDER BY columns --- which
will use whichever tuple the DISTINCT phase selected at random
out of each set with the same DISTINCT value.

I think the issue got put on the back burner last time in hopes that
some definition with consistent behavior would come up, but I haven't
seen any hope that there is one.

            regards, tom lane

Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
"Oliver Elphick"
Date:
Tom Lane wrote:
  >Thomas Metz <tmetz@gsf.de> writes:
  >> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
  >> [doesn't work as expected]
  >
  >There have been related discussions before on pg-hackers mail list;
  >you might care to check the list archives.  The conclusion I recall
  >is that it's not real clear how the combination of SELECT DISTINCT
  >on one column and ORDER BY on another *should* work.  Postgres'
  >current behavior is clearly wrong IMHO, but there isn't a unique
  >definition of right behavior, because it's not clear which tuples
  >should get selected for the sort.
  >
  >This "SELECT DISTINCT ON attribute" option strikes me as even more
  >bogus.  Where did we get that from --- is it in the SQL92 standard?

I looked through the standard yesterday and couldn't find it.  It doesn't
seem to be a useful extension, since it does nothing that you can't do
with GROUP BY and seems much less well defined.  For the moment I have
added a brief description to the reference documentation for SELECT.

  >If you SELECT DISTINCT on a subset of the attributes to be returned,
  >then there's no unique definition of which values get returned in the
  >other columns.  In Thomas' example:
 ...
  >Any of these are "DISTINCT ON id", but it's purely a matter of
  >happenstance table order and unspecified implementation choices which
  >one will appear.  Do we really have (or want) a statement with
  >inherently undefined behavior?

We have it; I suggest we don't want it!

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "My son, if sinners entice thee, consent thou not."
                            Proverbs 1:10



Re: [SQL] Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
"jose' soares"
Date:
 

Oliver Elphick ha scritto:

Tom Lane wrote:
  >Thomas Metz <tmetz@gsf.de> writes:
  >> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
  >> [doesn't work as expected]
  >
  >There have been related discussions before on pg-hackers mail list;
  >you might care to check the list archives.  The conclusion I recall
  >is that it's not real clear how the combination of SELECT DISTINCT
  >on one column and ORDER BY on another *should* work.  Postgres'
  >current behavior is clearly wrong IMHO, but there isn't a unique
  >definition of right behavior, because it's not clear which tuples
  >should get selected for the sort.
  >
  >This "SELECT DISTINCT ON attribute" option strikes me as even more
  >bogus.  Where did we get that from --- is it in the SQL92 standard?

I looked through the standard yesterday and couldn't find it.  It doesn't
seem to be a useful extension, since it does nothing that you can't do
with GROUP BY and seems much less well defined.  For the moment I have
added a brief description to the reference documentation for SELECT.

  >If you SELECT DISTINCT on a subset of the attributes to be returned,
  >then there's no unique definition of which values get returned in the
  >other columns.  In Thomas' example:
 ...
  >Any of these are "DISTINCT ON id", but it's purely a matter of
  >happenstance table order and unspecified implementation choices which
  >one will appear.  Do we really have (or want) a statement with
  >inherently undefined behavior?

We have it; I suggest we don't want it!

Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very interesting and I think it is something missing to Standard.
I don't know how to do the following, if we take off DISTINCT ON from PostgreSQL:

db=> select distinct cognome, nome,via from membri where cap = '41010';
cognome|nome            |via
-------+----------------+--------------------------
FIORANI|ELISABETTA      |VIA PRETI PARTIGIANI, 63
FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
FIORANI|MATTIA          |VIA PRETI PARTIGIANI, 63
FIORANI|SIMONE          |VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA         |VIA MAGNAGHI, 39
GOZZI  |MATTEO          |VIA MAGNAGHI, 39
RUSSO  |DAVIDE          |STRADA CORLETTO SUD, 194/1
RUSSO  |ELENA TERESA    |STRADA CORLETTO SUD, 194/1
RUSSO  |FORTUNATO       |STRADA CORLETTO SUD, 194/1
RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(10 rows)

db=> select distinct on cognome cognome, nome,via from membri where cap = '41010';
cognome|nome            |via
-------+----------------+--------------------------
FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA         |VIA MAGNAGHI, 39
RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(3 rows)
 

Re: [SQL] Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
"Oliver Elphick"
Date:
"jose' soares" wrote:
  >Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very
  >interesting and I think it is something missing to Standard.
  >I don't know how to do the following, if we take off DISTINCT ON from
  >PostgreSQL:
  >
  >db=> select distinct cognome, nome,via from membri where cap = '41010';
  >cognome|nome            |via
  >-------+----------------+--------------------------
  >FIORANI|ELISABETTA      |VIA PRETI PARTIGIANI, 63
  >FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
  >FIORANI|MATTIA          |VIA PRETI PARTIGIANI, 63
  >FIORANI|SIMONE          |VIA PRETI PARTIGIANI, 63
  >GOZZI  |LILIANA         |VIA MAGNAGHI, 39
  >GOZZI  |MATTEO          |VIA MAGNAGHI, 39
  >RUSSO  |DAVIDE          |STRADA CORLETTO SUD, 194/1
  >RUSSO  |ELENA TERESA    |STRADA CORLETTO SUD, 194/1
  >RUSSO  |FORTUNATO       |STRADA CORLETTO SUD, 194/1
  >RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
  >(10 rows)
  >
  >db=> select distinct on cognome cognome, nome,via from membri where cap =
  >'41010';
  >cognome|nome            |via
  >-------+----------------+--------------------------
  >FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
  >GOZZI  |LILIANA         |VIA MAGNAGHI, 39
  >RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
  >(3 rows)

This gives the same results:

junk=> select cognome, nome, via from membri where cap = '41010' group by cognome;
cognome|nome      |via
-------+----------+--------------------------
FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA   |VIA MAGNAGHI, 39
RUSSO  |DAVIDE    |STRADA CORLETTO SUD, 194/1
(3 rows)

The particular values returned for nome and via are different from yours
but the same as I get using DISTINCT ON.  Since nome and via are not
aggregated, the value returned for those columns is unpredictable and
therefore not useful.  I think that it is actually a bug that you are
able to name them at all.

In fact, if you add an aggregate column to the column list, GROUP BY does
not then allow columns that are neither grouped nor aggregated:

junk=> select  cognome, nome,via, max(age) from membri where cap = '41010' group by cognome;
ERROR:  parser: illegal use of aggregates or non-group column in target list
junk=> select  cognome, max(age) from membri where cap = '41010' group by cognome;
cognome|max
-------+---
FIORANI| 54
GOZZI  | 76
RUSSO  | 45
(3 rows)

which definitely suggests that it is a bug to allow such fields when no
aggregate is specified.

DISTINCT ON fails with an aggregate, even if no other columns are named:

junk=> select distinct on cognome cognome, max(age) from membri where cap = '41010';
ERROR:  parser: illegal use of aggregates or non-group column in target list

which makes it even less useful!

--
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "And be not conformed to this world; but be ye
      transformed by the renewing of your mind, that ye may
      prove what is that good, and acceptable, and perfect,
      will of God."             Romans 12:2



Re: [SQL] Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
"jose' soares"
Date:
 

Oliver Elphick ha scritto:

"jose' soares" wrote:
  >Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very
  >interesting and I think it is something missing to Standard.
  >I don't know how to do the following, if we take off DISTINCT ON from
  >PostgreSQL:
  >
  >db=> select distinct cognome, nome,via from membri where cap = '41010';
  >cognome|nome            |via
  >-------+----------------+--------------------------
  >FIORANI|ELISABETTA      |VIA PRETI PARTIGIANI, 63
  >FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
  >FIORANI|MATTIA          |VIA PRETI PARTIGIANI, 63
  >FIORANI|SIMONE          |VIA PRETI PARTIGIANI, 63
  >GOZZI  |LILIANA         |VIA MAGNAGHI, 39
  >GOZZI  |MATTEO          |VIA MAGNAGHI, 39
  >RUSSO  |DAVIDE          |STRADA CORLETTO SUD, 194/1
  >RUSSO  |ELENA TERESA    |STRADA CORLETTO SUD, 194/1
  >RUSSO  |FORTUNATO       |STRADA CORLETTO SUD, 194/1
  >RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
  >(10 rows)
  >
  >db=> select distinct on cognome cognome, nome,via from membri where cap =
  >'41010';
  >cognome|nome            |via
  >-------+----------------+--------------------------
  >FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
  >GOZZI  |LILIANA         |VIA MAGNAGHI, 39
  >RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
  >(3 rows)

This gives the same results:

junk=> select cognome, nome, via from membri where cap = '41010' group by cognome;
cognome|nome      |via
-------+----------+--------------------------
FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA   |VIA MAGNAGHI, 39
RUSSO  |DAVIDE    |STRADA CORLETTO SUD, 194/1

This is very interesting and useful, I thought it wasn't possible. Seems that standard allows only the "order by" column(s)
and the aggregate function(s) on target list.
I tried the same query on Informix, also on Ocelot but it gives me an error.

On Informix and Ocelot
queries like:
   select  cognome, max(age) from membri where cap = '41010' group by cognome;
are allowed.
but
queries like:
   select cognome, nome, via from membri where cap = '41010' group by cognome;
aren't allowed.

-Jose'-
 

Re: [SQL] Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
Herouth Maoz
Date:
At 17:25 +0200 on 02/02/1999, jose' soares wrote:


> This gives the same results:
>
> junk=> select cognome, nome, via from membri where cap = '41010'
> group by cognome;
> cognome|nome      |via
> -------+----------+--------------------------
> FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
> GOZZI  |LILIANA   |VIA MAGNAGHI, 39
> RUSSO  |DAVIDE    |STRADA CORLETTO SUD, 194/1
>
>  This is very interesting and useful, I thought it wasn't possible. Seems
>that standard allows only the "order by" column(s)
> and the aggregate function(s) on target list.
> I tried the same query on Informix, also on Ocelot but it gives me an error.

And with good reason, too. The above query has the same drawback as the
"select distinct on", which is: it does not fully specify which value
should be selected for the "nome" and "via" fields.

Thus, running this same query on a table that has the same data but was,
for example, filled in a different order, gives a different result. That's
bad, because order should not make a difference for output. Tables are
taken to be unordered sets.

If you want to have a representative of the "nome" and "via" fields, and it
doesn't matter which representative, then min(nome) or max(nome) should do
the trick. And this query (select cognome, min(nome), min(via)... group by
cognome) should give you the same result on all databases, no matter which
rows were inserted first.

If it was up to me, I wouldn't use the above form, and frankly, I am
surprised the Postgres allows this.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma



Re: [SQL] Re: [HACKERS] Re: SELECT DISTINCT ON ... ORDER BY ...

From
Herouth Maoz
Date:
At 17:25 +0200 on 02/02/1999, jose' soares wrote:


> This gives the same results:
>
> junk=> select cognome, nome, via from membri where cap = '41010'
> group by cognome;
> cognome|nome      |via
> -------+----------+--------------------------
> FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
> GOZZI  |LILIANA   |VIA MAGNAGHI, 39
> RUSSO  |DAVIDE    |STRADA CORLETTO SUD, 194/1
>
>  This is very interesting and useful, I thought it wasn't possible. Seems
>that standard allows only the "order by" column(s)
> and the aggregate function(s) on target list.
> I tried the same query on Informix, also on Ocelot but it gives me an error.

And with good reason, too. The above query has the same drawback as the
"select distinct on", which is: it does not fully specify which value
should be selected for the "nome" and "via" fields.

Thus, running this same query on a table that has the same data but was,
for example, filled in a different order, gives a different result. That's
bad, because order should not make a difference for output. Tables are
taken to be unordered sets.

If you want to have a representative of the "nome" and "via" fields, and it
doesn't matter which representative, then min(nome) or max(nome) should do
the trick. And this query (select cognome, min(nome), min(via)... group by
cognome) should give you the same result on all databases, no matter which
rows were inserted first.

If it was up to me, I wouldn't use the above form, and frankly, I am
surprised the Postgres allows this.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma