Thread: sorting problem with distinct on()

sorting problem with distinct on()

From
Sky Lendar
Date:
Hi there ans thx for reading and answering this post if you can:

Let's regard an example of a table (stars) containing a code for a star (symb)
and its index (nb) in a file.


 nb |    symb    
----+------------
  0 | alTau
  1 | alTau
  2 | bePer
  3 | alSco
  4 | alLeo
  5 | alCMa
  6 | alVir
  7 | Trappist-1
  8 | SgrA*
  9 | SgrA*
 10 | GA
 11 | VC
 12 | M31
 13 | M44
 14 | alUMi
 15 | M87
 16 | alCyg
 17 | alCyg
 18 | beOri
 19 | omiCet
 20 | epTau
 21 | epCas
 22 | alAnd
 23 | alAnd
 24 | beAnd
 25 | ga-1And
 26 | ga-1And
 27 | ga-1And
 28 | ga-1And
 29 | xiAnd
 30 | upAnd
 31 | upAnd
 32 | 14And
 33 | 51And
 34 | M31
 35 | alAql
 36 | alAql
 37 | beAql
 38 | gaAql
 39 | deAql
 40 | epAql
 41 | zeAql
 42 | zeAql
 43 | etAql
 44 | thAql
 45 | ioAql
 46 | laAql
 47 | xiAql
 48 | 12Aql
 49 | alAqr
 50 | alAqr
 51 | beAqr
 52 | gaAqr
 53 | gaAqr
 54 | deAqr
 55 | epAqr
 56 | epAqr
 57 | ze-1Aqr
 58 | etAqr
 59 | etAqr
 60 | thAqr
 61 | kaAqr
 62 | laAqr
 63 | laAqr
 64 | laAqr
 65 | laAqr
 66 | nuAqr
 67 | piAqr
 68 | xiAqr
 69 | Trappist-1
 70 | alAra
 71 | muAra
 72 | alAri
 73 | beAri
 74 | beAri
 75 | gaAri
 76 | deAri
 77 | 39Ari
 78 | 41Ari
 79 | alAur
 80 | alAur
 81 | beAur
 82 | deAur
 83 | epAur
 84 | epAur
 85 | epAur
 86 | zeAur
 87 | zeAur
 88 | zeAur
 89 | zeAur
 90 | etAur
 91 | etAur
 92 | thAur
 93 | thAur
 94 | thAur
 95 | ioAur
 96 | ioAur
 97 | ioAur
 98 | alBoo
 99 | alBoo

Notice that some symbs are duplicated and I want to select only the
distinct symbs.
So, I could use

select distinct on(symb) * from stars;

I get this result:

nb | symb  
----+-------
 48 | 12Aql
 32 | 14And
 77 | 39Ari
 78 | 41Ari
 33 | 51And
 10 | GA
 34 | M31
 13 | M44
 15 | M87
  9 | SgrA*

and so on...

But when I want to reorder it by nb with:

with x as (select distinct on(symb) * from stars) select * from x order by nb;

nb |    symb    
----+------------
  1 | alTau
  2 | bePer
  3 | alSco
  4 | alLeo
  5 | alCMa
  6 | alVir
  7 | Trappist-1
  9 | SgrA*
 10 | GA
 11 | VC
 13 | M44
 14 | alUMi
 15 | M87
 16 | alCyg
 18 | beOri
 19 | omiCet
 20 | epTau
 21 | epCas
 23 | alAnd
 24 | beAnd
 28 | ga-1And
 ...

Notice that 12 is missing in the list.
Even with x as (select distinct on(symb) * from stars) select * from x
where nb = 12 order by nb;

returns nothing, whereas it should, but

with x as (select distinct on(symb) * from stars) select * from x where
symb = 'M31' order by nb;

returns

 nb | symb
----+------
 12 | M31

which is correct.

What should I do in order to get the wanted results.

This phenomenon occurs with or without index. On postgres 14.7 linux x86_64.

Thx again.


Re: sorting problem with distinct on()

From
"David G. Johnston"
Date:
On Fri, May 5, 2023 at 6:55 AM Sky Lendar <skylendar@gmail.com> wrote:
Hi there ans thx for reading and answering this post if you can:

Let's regard an example of a table (stars) containing a code for a star (symb)
and its index (nb) in a file.


 nb |    symb    
----+------------
  0 | alTau
  1 | alTau
  2 | bePer
  3 | alSco
  4 | alLeo
  5 | alCMa
  6 | alVir
  7 | Trappist-1
  8 | SgrA*
  9 | SgrA*
 10 | GA
 11 | VC
 12 | M31
 13 | M44
 14 | alUMi
 15 | M87
 16 | alCyg
 17 | alCyg
 18 | beOri
 19 | omiCet
 20 | epTau
 21 | epCas
 22 | alAnd
 23 | alAnd
 24 | beAnd
 25 | ga-1And
 26 | ga-1And
 27 | ga-1And
 28 | ga-1And
 29 | xiAnd
 30 | upAnd
 31 | upAnd
 32 | 14And
 33 | 51And
 34 | M31
 35 | alAql
 36 | alAql
 37 | beAql
 38 | gaAql
 39 | deAql
 40 | epAql
 41 | zeAql
 42 | zeAql
 43 | etAql
 44 | thAql
 45 | ioAql
 46 | laAql
 47 | xiAql
 48 | 12Aql
 49 | alAqr
 50 | alAqr
 51 | beAqr
 52 | gaAqr
 53 | gaAqr
 54 | deAqr
 55 | epAqr
 56 | epAqr
 57 | ze-1Aqr
 58 | etAqr
 59 | etAqr
 60 | thAqr
 61 | kaAqr
 62 | laAqr
 63 | laAqr
 64 | laAqr
 65 | laAqr
 66 | nuAqr
 67 | piAqr
 68 | xiAqr
 69 | Trappist-1
 70 | alAra
 71 | muAra
 72 | alAri
 73 | beAri
 74 | beAri
 75 | gaAri
 76 | deAri
 77 | 39Ari
 78 | 41Ari
 79 | alAur
 80 | alAur
 81 | beAur
 82 | deAur
 83 | epAur
 84 | epAur
 85 | epAur
 86 | zeAur
 87 | zeAur
 88 | zeAur
 89 | zeAur
 90 | etAur
 91 | etAur
 92 | thAur
 93 | thAur
 94 | thAur
 95 | ioAur
 96 | ioAur
 97 | ioAur
 98 | alBoo
 99 | alBoo

Notice that some symbs are duplicated and I want to select only the
distinct symbs.
So, I could use

select distinct on(symb) * from stars;

I get this result:

nb | symb  
----+-------
 48 | 12Aql
 32 | 14And
 77 | 39Ari
 78 | 41Ari
 33 | 51And
 10 | GA
 34 | M31
 13 | M44
 15 | M87
  9 | SgrA*

Notice that 12 is missing in the list.
Even with x as (select distinct on(symb) * from stars) select * from x
where nb = 12 order by nb;

nb = 12 is a duplicate with np = 34

Since your DISTINCT ON *subquery* doesn't specify an ordering which of those two are chosen as the representative record for M31 is non-determinstic.

If you want to ensure the lowest valued nb is chosen you need to sort the *subquery*.  The first record the DISTINCT encounters is the one selected to represent.

Sorting in the outer/main query happens after the DISTINCT and so the record is already gone.

David J.

Re: sorting problem with distinct on()

From
Sky Lendar
Date:
Thank you for the clarification.

Le ven. 5 mai 2023 à 16:16, David G. Johnston <david.g.johnston@gmail.com> a écrit :
On Fri, May 5, 2023 at 6:55 AM Sky Lendar <skylendar@gmail.com> wrote:
Hi there ans thx for reading and answering this post if you can:

Let's regard an example of a table (stars) containing a code for a star (symb)
and its index (nb) in a file.


 nb |    symb    
----+------------
  0 | alTau
  1 | alTau
  2 | bePer
  3 | alSco
  4 | alLeo
  5 | alCMa
  6 | alVir
  7 | Trappist-1
  8 | SgrA*
  9 | SgrA*
 10 | GA
 11 | VC
 12 | M31
 13 | M44
 14 | alUMi
 15 | M87
 16 | alCyg
 17 | alCyg
 18 | beOri
 19 | omiCet
 20 | epTau
 21 | epCas
 22 | alAnd
 23 | alAnd
 24 | beAnd
 25 | ga-1And
 26 | ga-1And
 27 | ga-1And
 28 | ga-1And
 29 | xiAnd
 30 | upAnd
 31 | upAnd
 32 | 14And
 33 | 51And
 34 | M31
 35 | alAql
 36 | alAql
 37 | beAql
 38 | gaAql
 39 | deAql
 40 | epAql
 41 | zeAql
 42 | zeAql
 43 | etAql
 44 | thAql
 45 | ioAql
 46 | laAql
 47 | xiAql
 48 | 12Aql
 49 | alAqr
 50 | alAqr
 51 | beAqr
 52 | gaAqr
 53 | gaAqr
 54 | deAqr
 55 | epAqr
 56 | epAqr
 57 | ze-1Aqr
 58 | etAqr
 59 | etAqr
 60 | thAqr
 61 | kaAqr
 62 | laAqr
 63 | laAqr
 64 | laAqr
 65 | laAqr
 66 | nuAqr
 67 | piAqr
 68 | xiAqr
 69 | Trappist-1
 70 | alAra
 71 | muAra
 72 | alAri
 73 | beAri
 74 | beAri
 75 | gaAri
 76 | deAri
 77 | 39Ari
 78 | 41Ari
 79 | alAur
 80 | alAur
 81 | beAur
 82 | deAur
 83 | epAur
 84 | epAur
 85 | epAur
 86 | zeAur
 87 | zeAur
 88 | zeAur
 89 | zeAur
 90 | etAur
 91 | etAur
 92 | thAur
 93 | thAur
 94 | thAur
 95 | ioAur
 96 | ioAur
 97 | ioAur
 98 | alBoo
 99 | alBoo

Notice that some symbs are duplicated and I want to select only the
distinct symbs.
So, I could use

select distinct on(symb) * from stars;

I get this result:

nb | symb  
----+-------
 48 | 12Aql
 32 | 14And
 77 | 39Ari
 78 | 41Ari
 33 | 51And
 10 | GA
 34 | M31
 13 | M44
 15 | M87
  9 | SgrA*

Notice that 12 is missing in the list.
Even with x as (select distinct on(symb) * from stars) select * from x
where nb = 12 order by nb;

nb = 12 is a duplicate with np = 34

Since your DISTINCT ON *subquery* doesn't specify an ordering which of those two are chosen as the representative record for M31 is non-determinstic.

If you want to ensure the lowest valued nb is chosen you need to sort the *subquery*.  The first record the DISTINCT encounters is the one selected to represent.

Sorting in the outer/main query happens after the DISTINCT and so the record is already gone.

David J.

Re: sorting problem with distinct on()

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Fri, May 5, 2023 at 6:55 AM Sky Lendar <skylendar@gmail.com> wrote:
>> Notice that 12 is missing in the list.

> Since your DISTINCT ON *subquery* doesn't specify an ordering which of
> those two are chosen as the representative record for M31 is
> non-determinstic.

> If you want to ensure the lowest valued nb is chosen you need to sort the
> *subquery*.

The "weather_reports" example here might help clarify this for you:

https://www.postgresql.org/docs/current/sql-select.html

            regards, tom lane