Re: identifying duplicates in table with redundancies - Mailing list pgsql-sql

From Oliveiros d'Azevedo Cristina
Subject Re: identifying duplicates in table with redundancies
Date
Msg-id 8590CAE9E5504E23AB08C4EF31EA066F@marktestcr.marktest.pt
Whole thread Raw
In response to identifying duplicates in table with redundancies  (Tarlika Elisabeth Schmitz <postgresql@numerixtechnology.de>)
Responses Re: identifying duplicates in table with redundancies  ("Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>)
List pgsql-sql
Howdy, Tarlika.

First, did you past correctly your query into your mail?
I am asking this because your query doesn't seem work for me, it returns an 
empty list :-|

Your most nested query, this one,
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student

returns this

1|"MacDonald"
2|"Smith"
3|"Smith"

Then, your next most nested query, which includes the previous one,
  -- trainer ids with appearing with different names
>        select distinct on (id) id
>        from
>        (
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student
>        ) as trainer
>        group by trainer.id
>        having count (trainer.name) > 1

will group by trainer_id and look for trainer_ids which appear more than 
once, and, as you can see from the above listing there is none...
So, I'm asking if you typed it correctly on mail, because it doesn't seem to 
work, at least not for me....

> 1) multiples trainer names for same trainer id
> 2) multiple trainer ids for same trainer name

To achieve 2) I would use this query

SELECT DISTINCT trainer_id,trainer_name
FROM (
SELECT trainer_name   -- The field you want to test for duplicates
FROM (
SELECT DISTINCT "trainer_id","trainer_name"
FROM student
) x
GROUP BY "trainer_name"    -- the field you want to test for duplicates
HAVING (COUNT(*) > 1)
) z
NATURAL JOIN student y

It will give you a list of the trainer names who have more than one trainer 
ID and the respective trainer IDS.
For your particular example data result will be
2|"Smith"
3|"Smith"

As Smith is the only trainer with records with diferent trainer IDs.
Question : Can this be what you want?

You can add ORDER BY clauses in the end as you wish to order the list in any 
way you'd like.


If you want to achieve 1) just substitute the trainer_name by trainer_id on 
the commented places.
But on the example data you provided you don't have the 1) situation, am I 
right? So it will  output an empty list.


Best,
Oliver



----- Original Message ----- 
From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
To: <pgsql-sql@postgresql.org>
Sent: Thursday, September 23, 2010 10:39 PM
Subject: [SQL] identifying duplicates in table with redundancies


>I loaded data from a spread into a interim table so I can analyze the
> quality of the data.
>
> The table contains an entry for every student (250K records) and his
> trainer. Eventually, I want to extract a unique list of trainers from
> it. But first of all I want to check for duplicates:
>
> 1) multiples trainer names for same trainer id
> 2) multiple trainer ids for same trainer name
>
> I cobbled together the SQL and it does the job but it seems rather
> convoluted. I would like to know how I can improve it.
>
> CREATE TABLE student (
> id INTEGER NOT NULL,
> name VARCHAR(256) NOT NULL,
> trainer_id INTEGER,
> trainer_name VARCHAR(256),
> );
>
> ====
> EXAMPLE DATA
>
> 22 John 1 Macdonald
> 23 Jane 1 MacDonald
> 24 Paul 1 MacDonald
> 25 Dick 2 Smith
> 26 Bill 3 Smith
> 27 Kate 3 Smith
> ====
>
> -- outputs trainer ids which appear under different names
> select trainer_id, trainer_name from
> (
>    -- different id/name combinations
>    select distinct on (trainer_name) trainer_id, trainer_name
>    from student
>    where trainer_id in
>    (
>        -- trainer ids with appearing with different names
>        select distinct on (id) id
>        from
>        (
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student
>        ) as trainer
>        group by trainer.id
>        having count (trainer.name) > 1
>    )
> ) as y
> order by trainer_id
>
> -- 
>
> Best Regards,
> Tarlika Elisabeth Schmitz
>
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql 



pgsql-sql by date:

Previous
From: James Kitambara
Date:
Subject: Re: insert into help
Next
From: Craig Ringer
Date:
Subject: Re: find and replace the string within a column