Thread: Re: SELECT DISTINCT ON ... ORDER BY ...
Thomas Metz <tmetz@gsf.de> writes: > SELECT DISTINCT ON id id, name FROM test ORDER BY name; > [doesn't work as expected] There have been related discussions before on pg-hackers mail list; you might care to check the list archives. The conclusion I recall is that it's not real clear how the combination of SELECT DISTINCT on one column and ORDER BY on another *should* work. Postgres' current behavior is clearly wrong IMHO, but there isn't a unique definition of right behavior, because it's not clear which tuples should get selected for the sort. This "SELECT DISTINCT ON attribute" option strikes me as even more bogus. Where did we get that from --- is it in the SQL92 standard? If you SELECT DISTINCT on a subset of the attributes to be returned, then there's no unique definition of which values get returned in the other columns. In Thomas' example: > Assuming the table TEST as follows: > ID NAME > - ----------------- > 1 Alex > 2 Oliver > 1 Thomas > 2 Fenella > SELECT DISTINCT ON id id, name FROM test; > produces: > ID NAME > - ----------------- > 1 Alex > 2 Oliver There's no justifiable reason for preferring this output over 1 Thomas 2 Oliver or 1 Alex 2 Fenella or 1 Thomas 2 Fenella Any of these are "DISTINCT ON id", but it's purely a matter of happenstance table order and unspecified implementation choices which one will appear. Do we really have (or want) a statement with inherently undefined behavior? Anyway, to answer Thomas' question, the way SELECT DISTINCT is implemented is that first there's a sort on the DISTINCT columns, then there's a pass that eliminates adjacent duplicates (like the Unix uniq(1) program). In the current backend, doing an ORDER BY on another column overrides the sorting on the DISTINCT columns, so when the duplicate-eliminator runs it will fail to get rid of duplicates that don't happen to appear consecutively in its input. That's pretty broken, but then the entire concept of combining these two options doesn't seem well defined; the SELECT DISTINCT doesn't make any promises about which tuples (with the same DISTINCT columns) it's going to pick, therefore the result of ordering by some other column isn't clear. If you're willing to live with poorly defined behavior, the fix is fairly obvious: run the sort and uniq passes for the DISTINCT columns, *then* run the sort on the ORDER BY columns --- which will use whichever tuple the DISTINCT phase selected at random out of each set with the same DISTINCT value. I think the issue got put on the back burner last time in hopes that some definition with consistent behavior would come up, but I haven't seen any hope that there is one. regards, tom lane
Tom Lane wrote: >Thomas Metz <tmetz@gsf.de> writes: >> SELECT DISTINCT ON id id, name FROM test ORDER BY name; >> [doesn't work as expected] > >There have been related discussions before on pg-hackers mail list; >you might care to check the list archives. The conclusion I recall >is that it's not real clear how the combination of SELECT DISTINCT >on one column and ORDER BY on another *should* work. Postgres' >current behavior is clearly wrong IMHO, but there isn't a unique >definition of right behavior, because it's not clear which tuples >should get selected for the sort. > >This "SELECT DISTINCT ON attribute" option strikes me as even more >bogus. Where did we get that from --- is it in the SQL92 standard? I looked through the standard yesterday and couldn't find it. It doesn't seem to be a useful extension, since it does nothing that you can't do with GROUP BY and seems much less well defined. For the moment I have added a brief description to the reference documentation for SELECT. >If you SELECT DISTINCT on a subset of the attributes to be returned, >then there's no unique definition of which values get returned in the >other columns. In Thomas' example: ... >Any of these are "DISTINCT ON id", but it's purely a matter of >happenstance table order and unspecified implementation choices which >one will appear. Do we really have (or want) a statement with >inherently undefined behavior? We have it; I suggest we don't want it! -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver PGP key from public servers; key ID 32B8FAA1 ======================================== "My son, if sinners entice thee, consent thou not." Proverbs 1:10
Oliver Elphick ha scritto:
Tom Lane wrote:Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very interesting and I think it is something missing to Standard.
>Thomas Metz <tmetz@gsf.de> writes:
>> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
>> [doesn't work as expected]
>
>There have been related discussions before on pg-hackers mail list;
>you might care to check the list archives. The conclusion I recall
>is that it's not real clear how the combination of SELECT DISTINCT
>on one column and ORDER BY on another *should* work. Postgres'
>current behavior is clearly wrong IMHO, but there isn't a unique
>definition of right behavior, because it's not clear which tuples
>should get selected for the sort.
>
>This "SELECT DISTINCT ON attribute" option strikes me as even more
>bogus. Where did we get that from --- is it in the SQL92 standard?I looked through the standard yesterday and couldn't find it. It doesn't
seem to be a useful extension, since it does nothing that you can't do
with GROUP BY and seems much less well defined. For the moment I have
added a brief description to the reference documentation for SELECT.>If you SELECT DISTINCT on a subset of the attributes to be returned,
>then there's no unique definition of which values get returned in the
>other columns. In Thomas' example:
...
>Any of these are "DISTINCT ON id", but it's purely a matter of
>happenstance table order and unspecified implementation choices which
>one will appear. Do we really have (or want) a statement with
>inherently undefined behavior?We have it; I suggest we don't want it!
I don't know how to do the following, if we take off DISTINCT ON from PostgreSQL:
db=> select distinct cognome, nome,via from membri where cap = '41010';
cognome|nome |via
-------+----------------+--------------------------
FIORANI|ELISABETTA |VIA PRETI PARTIGIANI, 63
FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
FIORANI|MATTIA |VIA PRETI PARTIGIANI, 63
FIORANI|SIMONE |VIA PRETI PARTIGIANI, 63
GOZZI |LILIANA |VIA MAGNAGHI, 39
GOZZI |MATTEO |VIA MAGNAGHI, 39
RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1
RUSSO |ELENA TERESA |STRADA CORLETTO SUD, 194/1
RUSSO |FORTUNATO |STRADA CORLETTO SUD, 194/1
RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(10 rows)
db=> select distinct on cognome cognome, nome,via from membri where cap = '41010';
cognome|nome |via
-------+----------------+--------------------------
FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
GOZZI |LILIANA |VIA MAGNAGHI, 39
RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(3 rows)
"jose' soares" wrote: >Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very >interesting and I think it is something missing to Standard. >I don't know how to do the following, if we take off DISTINCT ON from >PostgreSQL: > >db=> select distinct cognome, nome,via from membri where cap = '41010'; >cognome|nome |via >-------+----------------+-------------------------- >FIORANI|ELISABETTA |VIA PRETI PARTIGIANI, 63 >FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63 >FIORANI|MATTIA |VIA PRETI PARTIGIANI, 63 >FIORANI|SIMONE |VIA PRETI PARTIGIANI, 63 >GOZZI |LILIANA |VIA MAGNAGHI, 39 >GOZZI |MATTEO |VIA MAGNAGHI, 39 >RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 >RUSSO |ELENA TERESA |STRADA CORLETTO SUD, 194/1 >RUSSO |FORTUNATO |STRADA CORLETTO SUD, 194/1 >RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1 >(10 rows) > >db=> select distinct on cognome cognome, nome,via from membri where cap = >'41010'; >cognome|nome |via >-------+----------------+-------------------------- >FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63 >GOZZI |LILIANA |VIA MAGNAGHI, 39 >RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1 >(3 rows) This gives the same results: junk=> select cognome, nome, via from membri where cap = '41010' group by cognome; cognome|nome |via -------+----------+-------------------------- FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63 GOZZI |LILIANA |VIA MAGNAGHI, 39 RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 (3 rows) The particular values returned for nome and via are different from yours but the same as I get using DISTINCT ON. Since nome and via are not aggregated, the value returned for those columns is unpredictable and therefore not useful. I think that it is actually a bug that you are able to name them at all. In fact, if you add an aggregate column to the column list, GROUP BY does not then allow columns that are neither grouped nor aggregated: junk=> select cognome, nome,via, max(age) from membri where cap = '41010' group by cognome; ERROR: parser: illegal use of aggregates or non-group column in target list junk=> select cognome, max(age) from membri where cap = '41010' group by cognome; cognome|max -------+--- FIORANI| 54 GOZZI | 76 RUSSO | 45 (3 rows) which definitely suggests that it is a bug to allow such fields when no aggregate is specified. DISTINCT ON fails with an aggregate, even if no other columns are named: junk=> select distinct on cognome cognome, max(age) from membri where cap = '41010'; ERROR: parser: illegal use of aggregates or non-group column in target list which makes it even less useful! -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver PGP key from public servers; key ID 32B8FAA1 ======================================== "And be not conformed to this world; but be ye transformed by the renewing of your mind, that ye may prove what is that good, and acceptable, and perfect, will of God." Romans 12:2
Oliver Elphick ha scritto:
"jose' soares" wrote:This is very interesting and useful, I thought it wasn't possible. Seems that standard allows only the "order by" column(s)
>Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very
>interesting and I think it is something missing to Standard.
>I don't know how to do the following, if we take off DISTINCT ON from
>PostgreSQL:
>
>db=> select distinct cognome, nome,via from membri where cap = '41010';
>cognome|nome |via
>-------+----------------+--------------------------
>FIORANI|ELISABETTA |VIA PRETI PARTIGIANI, 63
>FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
>FIORANI|MATTIA |VIA PRETI PARTIGIANI, 63
>FIORANI|SIMONE |VIA PRETI PARTIGIANI, 63
>GOZZI |LILIANA |VIA MAGNAGHI, 39
>GOZZI |MATTEO |VIA MAGNAGHI, 39
>RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1
>RUSSO |ELENA TERESA |STRADA CORLETTO SUD, 194/1
>RUSSO |FORTUNATO |STRADA CORLETTO SUD, 194/1
>RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
>(10 rows)
>
>db=> select distinct on cognome cognome, nome,via from membri where cap =
>'41010';
>cognome|nome |via
>-------+----------------+--------------------------
>FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
>GOZZI |LILIANA |VIA MAGNAGHI, 39
>RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
>(3 rows)This gives the same results:
junk=> select cognome, nome, via from membri where cap = '41010' group by cognome;
cognome|nome |via
-------+----------+--------------------------
FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
GOZZI |LILIANA |VIA MAGNAGHI, 39
RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1
and the aggregate function(s) on target list.
I tried the same query on Informix, also on Ocelot but it gives me an error.
On Informix and Ocelot
queries like:
select cognome, max(age) from membri where cap = '41010' group by cognome;
are allowed.
but
queries like:
select cognome, nome, via from membri where cap = '41010' group by cognome;
aren't allowed.
-Jose'-
At 17:25 +0200 on 02/02/1999, jose' soares wrote: > This gives the same results: > > junk=> select cognome, nome, via from membri where cap = '41010' > group by cognome; > cognome|nome |via > -------+----------+-------------------------- > FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63 > GOZZI |LILIANA |VIA MAGNAGHI, 39 > RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 > > This is very interesting and useful, I thought it wasn't possible. Seems >that standard allows only the "order by" column(s) > and the aggregate function(s) on target list. > I tried the same query on Informix, also on Ocelot but it gives me an error. And with good reason, too. The above query has the same drawback as the "select distinct on", which is: it does not fully specify which value should be selected for the "nome" and "via" fields. Thus, running this same query on a table that has the same data but was, for example, filled in a different order, gives a different result. That's bad, because order should not make a difference for output. Tables are taken to be unordered sets. If you want to have a representative of the "nome" and "via" fields, and it doesn't matter which representative, then min(nome) or max(nome) should do the trick. And this query (select cognome, min(nome), min(via)... group by cognome) should give you the same result on all databases, no matter which rows were inserted first. If it was up to me, I wouldn't use the above form, and frankly, I am surprised the Postgres allows this. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma