Re: identifying duplicates in table with redundancies - Mailing list pgsql-sql

From Tarlika Elisabeth Schmitz
Subject Re: identifying duplicates in table with redundancies
Date
Msg-id 20100924171226.6000b310@dick.coachhouse
Whole thread Raw
In response to Re: identifying duplicates in table with redundancies  ("Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt>)
Responses Re: identifying duplicates in table with redundancies
List pgsql-sql
Dear Oliveiros,
Thank you for taking the time to help.

On Fri, 24 Sep 2010 11:22:21 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:

>----- Original Message ----- 
>From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>To: <pgsql-sql@postgresql.org>
>Sent: Thursday, September 23, 2010 10:39 PM
>Subject: [SQL] identifying duplicates in table with redundancies
>
>
>>[...] I want to check for duplicates:
>>
>> 1) multiples trainer names for same trainer id
>> 2) multiple trainer ids for same trainer name
>>
>> I cobbled together the SQL and it does the job but it seems rather
>> convoluted. I would like to know how I can improve it.
>>
>> CREATE TABLE student (
>> id INTEGER NOT NULL,
>> name VARCHAR(256) NOT NULL,
>> trainer_id INTEGER,
>> trainer_name VARCHAR(256),
>> );
>>
>> ====
>> EXAMPLE DATA
>>
>> 22 John 1 Macdonald
>> 23 Jane 1 MacDonald
>> 24 Paul 1 MacDonald
>> 25 Dick 2 Smith
>> 26 Bill 3 Smith
>> 27 Kate 3 Smith
>> ====
>>
>> select trainer_id, trainer_name from
>> (
>>    select distinct on (trainer_name) trainer_id, trainer_name
>>    from student
>>    where trainer_id in
>>    (
>>        select distinct on (id) id
>>        from
>>        (
>>            select distinct on (trainer_id,trainer_name)
>>     trainer_id as id,
>>            trainer_name as name from student
>>        ) as trainer
>>        group by trainer.id
>>        having count (trainer.name) > 1
>>    )
>> ) as y
>> order by trainer_id


>Howdy, Tarlika.
>
>First, did you past correctly your query into your mail?
>I am asking this because your query doesn't seem work for me, it
>returns an empty list :-|
>
>Your most nested query, [...]
>returns this
>
>1|"MacDonald"
>2|"Smith"
>3|"Smith"


For me, the innermost query returns:
1|"Macdonald"
1|"MacDonald"
2|"Smith"
3|"Smith"

(note the lower/uppercase "d" in MacDonald)

The whole query returns:
1|"Macdonald"
1|"MacDonald"


>> 1) multiples trainer names for same trainer id
>> 2) multiple trainer ids for same trainer name
>
>To achieve 2) I would use this query
>
>SELECT DISTINCT trainer_id,trainer_name
>FROM (
>SELECT trainer_name   -- The field you want to test for duplicates
>FROM (
>SELECT DISTINCT "trainer_id","trainer_name"
>FROM student
>) x
>GROUP BY "trainer_name"    -- the field you want to test for duplicates
>HAVING (COUNT(*) > 1)
>) z
>NATURAL JOIN student y

I see my 2 innermost queries are the same as yours, just a bit more
wordy. I messed up at the third query, which threw up an error when I
tried to add an ORDER BY.

>It will give you a list of the trainer names who have more than one
>trainer ID and the respective trainer IDS.
>For your particular example data result will be
>2|"Smith"
>3|"Smith"

Splendid! Just what I wanted.

>As Smith is the only trainer with records with diferent trainer IDs.
>Question : Can this be what you want?

The real table has 250000 entries and quite a few dups.

>If you want to achieve 1) just substitute the trainer_name by
>trainer_id on the commented places.

1) works as well now - just had to transpose id/name.





-- 

Best Regards,
Tarlika Elisabeth Schmitz


pgsql-sql by date:

Previous
From: Leif Biberg Kristensen
Date:
Subject: Re: find and replace the string within a column
Next
From: Oliver d'Azevedo Christina
Date:
Subject: Re: identifying duplicates in table with redundancies