Thread: problem with distinct not distincting...

problem with distinct not distincting...

From
John Beynon
Date:
I have a pretty basic query;

select distinct on (name) name, length(name) from
drugs
where customer_id IS NOT NULL
order by name;

which I'd expect to only return me a single drug name if there are
duplicates, yet I get

name | length
==========
Roaccutane | 10
Roaccutane | 10

table encoding is UTF8...

I'm scratching my head!

Thanks,

John.


Re: problem with distinct not distincting...

From
"David Johnston"
Date:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of John Beynon
> Sent: Wednesday, October 17, 2012 6:48 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] problem with distinct not distincting...
>
> I have a pretty basic query;
>
> select distinct on (name) name, length(name) from drugs where
> customer_id IS NOT NULL order by name;
>
> which I'd expect to only return me a single drug name if there are
duplicates,
> yet I get
>
> name | length
> ==========
> Roaccutane | 10
> Roaccutane | 10
>
> table encoding is UTF8...
>
> I'm scratching my head!
>
> Thanks,
>
> John.
>

So, the following returns one record as expected on 9.0.3:

SELECT DISTINCT ON (f) f, length(l)
FROM (VALUES ('David','Johnston'),('David','Smith')) x (f, l)
ORDER BY f

Try:

SELECT name, count(*) FROM drugs where customer_id IS NOT NULL GROUP BY name

To see whether the GROUP BY logic considers the names identical.

Using "name" as a column name also sometimes has issues so maybe try giving
it an alias:

SELECT ... FROM (SELECT name AS customer_name FROM drugs WHERE ....) AS
drug_aliased ...

DISTINCT ON has uses but I try to avoid it myself.  In this specific case
the "ON" is redundant since a simple DISTINCT will give you the same
results.

You also need to provide the PostgreSQL version and possibly server
platform.

David J.




Re: problem with distinct not distincting...

From
Susan Cassidy
Date:
Are you sure that one of those entries doesn't have a trailing space?

Susan

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John Beynon
Sent: Wednesday, October 17, 2012 3:48 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] problem with distinct not distincting...

I have a pretty basic query;

select distinct on (name) name, length(name) from drugs where customer_id IS NOT NULL order by name;

which I'd expect to only return me a single drug name if there are duplicates, yet I get

name | length
==========
Roaccutane | 10
Roaccutane | 10

table encoding is UTF8...

I'm scratching my head!

Thanks,

John.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




Re: problem with distinct not distincting...

From
Susan Cassidy
Date:
I see you have the length included.

Perhaps, UTF8 characters in one and ascii in the other?

Susan

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Susan Cassidy
Sent: Wednesday, October 17, 2012 8:27 AM
To: John Beynon; pgsql-general@postgresql.org
Subject: Re: [GENERAL] problem with distinct not distincting...

Are you sure that one of those entries doesn't have a trailing space?

Susan

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John Beynon
Sent: Wednesday, October 17, 2012 3:48 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] problem with distinct not distincting...

I have a pretty basic query;

select distinct on (name) name, length(name) from drugs where customer_id IS NOT NULL order by name;

which I'd expect to only return me a single drug name if there are duplicates, yet I get

name | length
==========
Roaccutane | 10
Roaccutane | 10

table encoding is UTF8...

I'm scratching my head!

Thanks,

John.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




Re: problem with distinct not distincting...

From
Tom Lane
Date:
John Beynon <john@kyan.com> writes:
> I have a pretty basic query;
> select distinct on (name) name, length(name) from
> drugs
> where customer_id IS NOT NULL
> order by name;

> which I'd expect to only return me a single drug name if there are
> duplicates, yet I get

> name | length
> ==========
> Roaccutane | 10
> Roaccutane | 10

> table encoding is UTF8...

> I'm scratching my head!

Yeah, me too.  What PG version is this exactly?  What does EXPLAIN
show for the query?  Can you extract a self-contained test case?

            regards, tom lane


Re: problem with distinct not distincting...

From
Merlin Moncure
Date:
On Wed, Oct 17, 2012 at 5:48 AM, John Beynon <john@kyan.com> wrote:
> I have a pretty basic query;
>
> select distinct on (name) name, length(name) from
> drugs
> where customer_id IS NOT NULL
> order by name;
>
> which I'd expect to only return me a single drug name if there are
> duplicates, yet I get
>
> name | length
> ==========
> Roaccutane | 10
> Roaccutane | 10
>
> table encoding is UTF8...
>
> I'm scratching my head!

as a sanity check, try:

select distinct on (name) name, length(name), md5(name) from
drugs
where customer_id IS NOT NULL
order by name;

merlin


Re: problem with distinct not distincting...

From
John Beynon
Date:
I just managed to solve the problem infact.

The trailing 'e' character on the name was different for one row. All
my tools, (pgadmin and the source data in openoffice) showed the same
'e' character but psql showed it as different character...

Thanks for all taking the time to read / answer. It stumped me for a while!

John.

On Wed, Oct 17, 2012 at 4:01 PM, David Johnston <polobo@yahoo.com> wrote:
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
>> owner@postgresql.org] On Behalf Of John Beynon
>> Sent: Wednesday, October 17, 2012 6:48 AM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] problem with distinct not distincting...
>>
>> I have a pretty basic query;
>>
>> select distinct on (name) name, length(name) from drugs where
>> customer_id IS NOT NULL order by name;
>>
>> which I'd expect to only return me a single drug name if there are
> duplicates,
>> yet I get
>>
>> name | length
>> ==========
>> Roaccutane | 10
>> Roaccutane | 10
>>
>> table encoding is UTF8...
>>
>> I'm scratching my head!
>>
>> Thanks,
>>
>> John.
>>
>
> So, the following returns one record as expected on 9.0.3:
>
> SELECT DISTINCT ON (f) f, length(l)
> FROM (VALUES ('David','Johnston'),('David','Smith')) x (f, l)
> ORDER BY f
>
> Try:
>
> SELECT name, count(*) FROM drugs where customer_id IS NOT NULL GROUP BY name
>
> To see whether the GROUP BY logic considers the names identical.
>
> Using "name" as a column name also sometimes has issues so maybe try giving
> it an alias:
>
> SELECT ... FROM (SELECT name AS customer_name FROM drugs WHERE ....) AS
> drug_aliased ...
>
> DISTINCT ON has uses but I try to avoid it myself.  In this specific case
> the "ON" is redundant since a simple DISTINCT will give you the same
> results.
>
> You also need to provide the PostgreSQL version and possibly server
> platform.
>
> David J.
>
>



--
John Beynon
Kyanmedia Ltd.
Direct line: 01483 405210
http://kyan.com

Registered in England. Company number: 4575679
Registered Office: 171 High Street, Guildford GU1 3AJ


Re: problem with distinct not distincting...

From
Chris Angelico
Date:
On Thu, Oct 18, 2012 at 2:32 AM, John Beynon <john@kyan.com> wrote:
> I just managed to solve the problem infact.
>
> The trailing 'e' character on the name was different for one row. All
> my tools, (pgadmin and the source data in openoffice) showed the same
> 'e' character but psql showed it as different character...
>
> Thanks for all taking the time to read / answer. It stumped me for a while!

Cool! That's where the sanity check Merlin suggested comes in really
handy. You can manually key in what you think it's showing, and
compare with the two rows:

sikorsky@sikorsky:~$ psql -c "select md5('Roaccutane')"
               md5
----------------------------------
 70f818454267c719ed612f50fe563f64
(1 row)

Anything that shows a different hash is clearly different... somehow.

ChrisA