Thread: identifying duplicates in table with redundancies

identifying duplicates in table with redundancies

From

Tarlika Elisabeth Schmitz

Date:

23 September 2010, 19:35:40

I loaded data from a spread into a interim table so I can analyze the
quality of the data.

The table contains an entry for every student (250K records) and his
trainer. Eventually, I want to extract a unique list of trainers from
it. But first of all I want to check for duplicates:

1) multiples trainer names for same trainer id
2) multiple trainer ids for same trainer name

I cobbled together the SQL and it does the job but it seems rather
convoluted. I would like to know how I can improve it. 

CREATE TABLE student (
id INTEGER NOT NULL,
name VARCHAR(256) NOT NULL,
trainer_id INTEGER,
trainer_name VARCHAR(256),
);

====
EXAMPLE DATA

22 John 1 Macdonald
23 Jane 1 MacDonald
24 Paul 1 MacDonald
25 Dick 2 Smith
26 Bill 3 Smith
27 Kate 3 Smith
====

-- outputs trainer ids which appear under different names
select trainer_id, trainer_name from 
(   -- different id/name combinations   select distinct on (trainer_name) trainer_id, trainer_name   from student
wheretrainer_id in   (       -- trainer ids with appearing with different names       select distinct on (id) id
from       (           -- distinct trainer id-name           select distinct on (trainer_id,trainer_name)
trainer_idas id,            trainer_name as name from student       ) as trainer       group by trainer.id       having
count(trainer.name) > 1    ) 
 
) as y
order by trainer_id

-- 

Best Regards,
Tarlika Elisabeth Schmitz

Re: identifying duplicates in table with redundancies

From

"Oliveiros d'Azevedo Cristina"

Date:

24 September 2010, 07:22:31

Howdy, Tarlika.

First, did you past correctly your query into your mail?
I am asking this because your query doesn't seem work for me, it returns an 
empty list :-|

Your most nested query, this one,
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student

returns this

1|"MacDonald"
2|"Smith"
3|"Smith"

Then, your next most nested query, which includes the previous one,
  -- trainer ids with appearing with different names
>        select distinct on (id) id
>        from
>        (
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student
>        ) as trainer
>        group by trainer.id
>        having count (trainer.name) > 1

will group by trainer_id and look for trainer_ids which appear more than 
once, and, as you can see from the above listing there is none...
So, I'm asking if you typed it correctly on mail, because it doesn't seem to 
work, at least not for me....

> 1) multiples trainer names for same trainer id
> 2) multiple trainer ids for same trainer name

To achieve 2) I would use this query

SELECT DISTINCT trainer_id,trainer_name
FROM (
SELECT trainer_name   -- The field you want to test for duplicates
FROM (
SELECT DISTINCT "trainer_id","trainer_name"
FROM student
) x
GROUP BY "trainer_name"    -- the field you want to test for duplicates
HAVING (COUNT(*) > 1)
) z
NATURAL JOIN student y

It will give you a list of the trainer names who have more than one trainer 
ID and the respective trainer IDS.
For your particular example data result will be
2|"Smith"
3|"Smith"

As Smith is the only trainer with records with diferent trainer IDs.
Question : Can this be what you want?

You can add ORDER BY clauses in the end as you wish to order the list in any 
way you'd like.


If you want to achieve 1) just substitute the trainer_name by trainer_id on 
the commented places.
But on the example data you provided you don't have the 1) situation, am I 
right? So it will  output an empty list.


Best,
Oliver



----- Original Message ----- 
From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
To: <pgsql-sql@postgresql.org>
Sent: Thursday, September 23, 2010 10:39 PM
Subject: [SQL] identifying duplicates in table with redundancies


>I loaded data from a spread into a interim table so I can analyze the
> quality of the data.
>
> The table contains an entry for every student (250K records) and his
> trainer. Eventually, I want to extract a unique list of trainers from
> it. But first of all I want to check for duplicates:
>
> 1) multiples trainer names for same trainer id
> 2) multiple trainer ids for same trainer name
>
> I cobbled together the SQL and it does the job but it seems rather
> convoluted. I would like to know how I can improve it.
>
> CREATE TABLE student (
> id INTEGER NOT NULL,
> name VARCHAR(256) NOT NULL,
> trainer_id INTEGER,
> trainer_name VARCHAR(256),
> );
>
> ====
> EXAMPLE DATA
>
> 22 John 1 Macdonald
> 23 Jane 1 MacDonald
> 24 Paul 1 MacDonald
> 25 Dick 2 Smith
> 26 Bill 3 Smith
> 27 Kate 3 Smith
> ====
>
> -- outputs trainer ids which appear under different names
> select trainer_id, trainer_name from
> (
>    -- different id/name combinations
>    select distinct on (trainer_name) trainer_id, trainer_name
>    from student
>    where trainer_id in
>    (
>        -- trainer ids with appearing with different names
>        select distinct on (id) id
>        from
>        (
>            -- distinct trainer id-name
>            select distinct on (trainer_id,trainer_name)
>     trainer_id as id,
>            trainer_name as name from student
>        ) as trainer
>        group by trainer.id
>        having count (trainer.name) > 1
>    )
> ) as y
> order by trainer_id
>
> -- 
>
> Best Regards,
> Tarlika Elisabeth Schmitz
>
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

Re: identifying duplicates in table with redundancies

From

"Tarlika Elisabeth Schmitz"

Date:

24 September 2010, 13:27:51

Dear Oliveiros,
Thank you for taking the time to help.

On Fri, 24 Sep 2010 11:22:21 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:

>----- Original Message ----- 
>From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>To: <pgsql-sql@postgresql.org>
>Sent: Thursday, September 23, 2010 10:39 PM
>Subject: [SQL] identifying duplicates in table with redundancies
>
>
>>[...] I want to check for duplicates:
>>
>> 1) multiples trainer names for same trainer id
>> 2) multiple trainer ids for same trainer name
>>
>> I cobbled together the SQL and it does the job but it seems rather
>> convoluted. I would like to know how I can improve it.
>>
>> CREATE TABLE student (
>> id INTEGER NOT NULL,
>> name VARCHAR(256) NOT NULL,
>> trainer_id INTEGER,
>> trainer_name VARCHAR(256),
>> );
>>
>> ====
>> EXAMPLE DATA
>>
>> 22 John 1 Macdonald
>> 23 Jane 1 MacDonald
>> 24 Paul 1 MacDonald
>> 25 Dick 2 Smith
>> 26 Bill 3 Smith
>> 27 Kate 3 Smith
>> ====
>>
>> select trainer_id, trainer_name from
>> (
>>    select distinct on (trainer_name) trainer_id, trainer_name
>>    from student
>>    where trainer_id in
>>    (
>>        select distinct on (id) id
>>        from
>>        (
>>            select distinct on (trainer_id,trainer_name)
>>     trainer_id as id,
>>            trainer_name as name from student
>>        ) as trainer
>>        group by trainer.id
>>        having count (trainer.name) > 1
>>    )
>> ) as y
>> order by trainer_id


>Howdy, Tarlika.
>
>First, did you past correctly your query into your mail?
>I am asking this because your query doesn't seem work for me, it
>returns an empty list :-|
>
>Your most nested query, [...]
>returns this
>
>1|"MacDonald"
>2|"Smith"
>3|"Smith"


For me, the innermost query returns:
1|"Macdonald"
1|"MacDonald"
2|"Smith"
3|"Smith"

(note the lower/uppercase "d" in MacDonald)

The whole query returns:
1|"Macdonald"
1|"MacDonald"


>> 1) multiples trainer names for same trainer id
>> 2) multiple trainer ids for same trainer name
>
>To achieve 2) I would use this query
>
>SELECT DISTINCT trainer_id,trainer_name
>FROM (
>SELECT trainer_name   -- The field you want to test for duplicates
>FROM (
>SELECT DISTINCT "trainer_id","trainer_name"
>FROM student
>) x
>GROUP BY "trainer_name"    -- the field you want to test for duplicates
>HAVING (COUNT(*) > 1)
>) z
>NATURAL JOIN student y

I see my 2 innermost queries are the same as yours, just a bit more
wordy. I messed up at the third query, which threw up an error when I
tried to add an ORDER BY.

>It will give you a list of the trainer names who have more than one
>trainer ID and the respective trainer IDS.
>For your particular example data result will be
>2|"Smith"
>3|"Smith"

Splendid! Just what I wanted.

>As Smith is the only trainer with records with diferent trainer IDs.
>Question : Can this be what you want?

The real table has 250000 entries and quite a few dups.

>If you want to achieve 1) just substitute the trainer_name by
>trainer_id on the commented places.

1) works as well now - just had to transpose id/name.





-- 

Best Regards,
Tarlika Elisabeth Schmitz

Re: identifying duplicates in table with redundancies

From

Oliver d'Azevedo Christina

Date:

24 September 2010, 14:11:22

Hey,Tarlika.
I tried to reproduce your test case through a series of inserts and It
seems that the lower case "d" went unnoticed.
That explains the empty list i got as result.  My fault.  Sorry :-(
Great to hear it helped you

Best,
Oliveiros



Enviado de meu iPhone

Em 24/09/2010, às 05:12 PM, "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.d
e> escreveu:

> Dear Oliveiros,
> Thank you for taking the time to help.
>
> On Fri, 24 Sep 2010 11:22:21 +0100
> "Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:
>
>> ----- Original Message -----
>> From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>> To: <pgsql-sql@postgresql.org>
>> Sent: Thursday, September 23, 2010 10:39 PM
>> Subject: [SQL] identifying duplicates in table with redundancies
>>
>>
>>> [...] I want to check for duplicates:
>>>
>>> 1) multiples trainer names for same trainer id
>>> 2) multiple trainer ids for same trainer name
>>>
>>> I cobbled together the SQL and it does the job but it seems rather
>>> convoluted. I would like to know how I can improve it.
>>>
>>> CREATE TABLE student (
>>> id INTEGER NOT NULL,
>>> name VARCHAR(256) NOT NULL,
>>> trainer_id INTEGER,
>>> trainer_name VARCHAR(256),
>>> );
>>>
>>> ====
>>> EXAMPLE DATA
>>>
>>> 22 John 1 Macdonald
>>> 23 Jane 1 MacDonald
>>> 24 Paul 1 MacDonald
>>> 25 Dick 2 Smith
>>> 26 Bill 3 Smith
>>> 27 Kate 3 Smith
>>> ====
>>>
>>> select trainer_id, trainer_name from
>>> (
>>>   select distinct on (trainer_name) trainer_id, trainer_name
>>>   from student
>>>   where trainer_id in
>>>   (
>>>       select distinct on (id) id
>>>       from
>>>       (
>>>           select distinct on (trainer_id,trainer_name)
>>>    trainer_id as id,
>>>           trainer_name as name from student
>>>       ) as trainer
>>>       group by trainer.id
>>>       having count (trainer.name) > 1
>>>   )
>>> ) as y
>>> order by trainer_id
>
>
>> Howdy, Tarlika.
>>
>> First, did you past correctly your query into your mail?
>> I am asking this because your query doesn't seem work for me, it
>> returns an empty list :-|
>>
>> Your most nested query, [...]
>> returns this
>>
>> 1|"MacDonald"
>> 2|"Smith"
>> 3|"Smith"
>
>
> For me, the innermost query returns:
> 1|"Macdonald"
> 1|"MacDonald"
> 2|"Smith"
> 3|"Smith"
>
> (note the lower/uppercase "d" in MacDonald)
>
> The whole query returns:
> 1|"Macdonald"
> 1|"MacDonald"
>
>
>>> 1) multiples trainer names for same trainer id
>>> 2) multiple trainer ids for same trainer name
>>
>> To achieve 2) I would use this query
>>
>> SELECT DISTINCT trainer_id,trainer_name
>> FROM (
>> SELECT trainer_name   -- The field you want to test for duplicates
>> FROM (
>> SELECT DISTINCT "trainer_id","trainer_name"
>> FROM student
>> ) x
>> GROUP BY "trainer_name"    -- the field you want to test for
>> duplicates
>> HAVING (COUNT(*) > 1)
>> ) z
>> NATURAL JOIN student y
>
> I see my 2 innermost queries are the same as yours, just a bit more
> wordy. I messed up at the third query, which threw up an error when I
> tried to add an ORDER BY.
>
>> It will give you a list of the trainer names who have more than one
>> trainer ID and the respective trainer IDS.
>> For your particular example data result will be
>> 2|"Smith"
>> 3|"Smith"
>
> Splendid! Just what I wanted.
>
>> As Smith is the only trainer with records with diferent trainer IDs.
>> Question : Can this be what you want?
>
> The real table has 250000 entries and quite a few dups.
>
>> If you want to achieve 1) just substitute the trainer_name by
>> trainer_id on the commented places.
>
> 1) works as well now - just had to transpose id/name.
>
>
>
>
>
> --
>
> Best Regards,
> Tarlika Elisabeth Schmitz
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

Re: identifying duplicates in table with redundancies

From

Tarlika Elisabeth Schmitz

Date:

24 September 2010, 14:57:45

Hello Oliveiros,

On Fri, 24 Sep 2010 18:12:18 +0100
Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:

>Hey,Tarlika.
>I tried to reproduce your test case through a series of inserts and
>It seems that the lower case "d" went unnoticed.

I can assure you people up here (in Scotland) get quite upset about it!

It took me  a while to get my head round your outer query with the
NATURAL JOIN between the student table and the nested query results. I
have done table joins before but this solution would not have sprung to
mind.

-- 

Many thanks,
Tarlika Elisabeth Schmitz

Re: identifying duplicates in table with redundancies

From

Tarlika Elisabeth Schmitz

Date:

27 September 2010, 13:57:52

On Fri, 24 Sep 2010 18:12:18 +0100
Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:

>>> SELECT DISTINCT trainer_id,trainer_name
>>> FROM (
>>> SELECT trainer_name   -- The field you want to test for duplicates
>>> FROM (
>>> SELECT DISTINCT "trainer_id","trainer_name"
>>> FROM student
>>> ) x
>>> GROUP BY "trainer_name"    -- the field you want to test for  
>>> duplicates
>>> HAVING (COUNT(*) > 1)
>>> ) z
>>> NATURAL JOIN student y  



What indices would you recommend for this operation?

-- 

Best Regards,
Tarlika Elisabeth Schmitz

Re: identifying duplicates in table with redundancies

From

"Oliveiros d'Azevedo Cristina"

Date:

28 September 2010, 07:34:46

Hello again,
Tarlika.

In what concerns to indices, I 'm affraid I may not be the best person to 
advise you, my knowledge of them hardly goes beyond the most trivial cases.

I'm sure there are plenty of other people on the list who are able to give 
you better advise than me.
But, on this query in particular I would recomend an indice on trainer_name, 
as this field will be used on the join and on the group by.
For the other query, the one you get by substituting trainer_name by 
trainer_id, place an index on trainer_id.
Also, these indexes may help speed up the order by clause, if you use one.

If you have a table with lots of data you can try them around and see how 
performance varies (and don't forget there's also EXPLAIN ANALYZE)

Bear in mind, though, that this is just my 2 cents on a matter that I don't 
really master. Do not take this as an expert answer.

There are many people on this list that can help you better

Best,
Oliveiros

----- Original Message ----- 
From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
To: <pgsql-sql@postgresql.org>
Sent: Monday, September 27, 2010 5:54 PM
Subject: Re: [SQL] identifying duplicates in table with redundancies

> On Fri, 24 Sep 2010 18:12:18 +0100
> Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:
>
>>>> SELECT DISTINCT trainer_id,trainer_name
>>>> FROM (
>>>> SELECT trainer_name   -- The field you want to test for duplicates
>>>> FROM (
>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>> FROM student
>>>> ) x
>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>> duplicates
>>>> HAVING (COUNT(*) > 1)
>>>> ) z
>>>> NATURAL JOIN student y
>
>
>
> What indices would you recommend for this operation?
>
> -- 
>
> Best Regards,
> Tarlika Elisabeth Schmitz
>
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

Re: identifying duplicates in table with redundancies

From

"Tarlika Elisabeth Schmitz"

Date:

28 September 2010, 18:35:47

On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:

>----- Original Message ----- 
>From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>To: <pgsql-sql@postgresql.org>
>Sent: Monday, September 27, 2010 5:54 PM
>Subject: Re: [SQL] identifying duplicates in table with redundancies
>
>
>> On Fri, 24 Sep 2010 18:12:18 +0100
>> Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:
>>
>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>> FROM (
>>>>> SELECT trainer_name   -- The field you want to test for duplicates
>>>>> FROM (
>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>> FROM student
>>>>> ) x
>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>> duplicates
>>>>> HAVING (COUNT(*) > 1)
>>>>> ) z
>>>>> NATURAL JOIN student y
>>
>>
>>
>> What indices would you recommend for this operation?
>
>But, on this query in particular I would recomend an indice on
>trainer_name, as this field will be used on the join and on the group
>by. For the other query, the one you get by substituting trainer_name
>by trainer_id, place an index on trainer_id.
>Also, these indexes may help speed up the order by clause, if you use
>one.
>
>If you have a table with lots of data you can try them around and see
>how performance varies (and don't forget there's also EXPLAIN ANALYZE)


Strangely, these indices did not do anything.
Without, the query took about 8500ms. Same with index.

The table has 250000 records. 11000 have trainer_name = null. Only
13000 unique trainer_names.

It is not hugely important as these queries are not time-critical.
This is only a helper table, which I use to analyze the date prior to
populating the destination tables with the data.

Regards,
Tarlika

Re: identifying duplicates in table with redundancies

From

Andreas Schmitz

Date:

29 September 2010, 05:40:29

  On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
> On Tue, 28 Sep 2010 11:34:31 +0100
> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt>  wrote:
>
>> ----- Original Message -----
>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>> To:<pgsql-sql@postgresql.org>
>> Sent: Monday, September 27, 2010 5:54 PM
>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>
>>
>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com>  wrote:
>>>
>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>> FROM (
>>>>>> SELECT trainer_name   -- The field you want to test for duplicates
>>>>>> FROM (
>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>> FROM student
>>>>>> ) x
>>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>>> duplicates
>>>>>> HAVING (COUNT(*)>  1)
>>>>>> ) z
>>>>>> NATURAL JOIN student y
>>>
>>>
>>> What indices would you recommend for this operation?
>> But, on this query in particular I would recomend an indice on
>> trainer_name, as this field will be used on the join and on the group
>> by. For the other query, the one you get by substituting trainer_name
>> by trainer_id, place an index on trainer_id.
>> Also, these indexes may help speed up the order by clause, if you use
>> one.
>>
>> If you have a table with lots of data you can try them around and see
>> how performance varies (and don't forget there's also EXPLAIN ANALYZE)
>
> Strangely, these indices did not do anything.
> Without, the query took about 8500ms. Same with index.
>
> The table has 250000 records. 11000 have trainer_name = null. Only
> 13000 unique trainer_names.
>
> It is not hugely important as these queries are not time-critical.
> This is only a helper table, which I use to analyze the date prior to
> populating the destination tables with the data.
>
> Regards,
> Tarlika
>

I guess explain analyze shows up a seq scan. try avoiding to use 
distinct. use group by instead.

regards

Andreas

Re: identifying duplicates in table with redundancies

From

Tarlika Elisabeth Schmitz

Date:

29 September 2010, 08:58:10

On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz <mailinglist@longimanus.net> wrote:

>  On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt>
>> wrote:
>>
>>> ----- Original Message -----
>>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>>> To:<pgsql-sql@postgresql.org>
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>
>>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com>  wrote:
>>>>
>>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>>> FROM (
>>>>>>> SELECT trainer_name   -- The field you want to test for
>>>>>>> duplicates FROM (
>>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>>> FROM student
>>>>>>> ) x
>>>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>>>> duplicates
>>>>>>> HAVING (COUNT(*)>  1)
>>>>>>> ) z
>>>>>>> NATURAL JOIN student y
>>>>
>>>>
>>>> What indices would you recommend for this operation?
>>> But, on this query in particular I would recomend an indice on
>>> trainer_name,[...]
>> Strangely, these indices did not do anything.
>> Without, the query took about 8500ms. Same with index.
>>
>> The table has 250000 records. 11000 have trainer_name = null. Only
>> 13000 unique trainer_names.
>
>I guess explain analyze shows up a seq scan. try avoiding to use 
>distinct. use group by instead.
>
>regards, Andreas

Hallo Andreas,
I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan. Execution time 7500ms.

2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan @ 6750ms

4) I filtered out NULL trainer_ids
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which only took
1300ms!!

Please, explain (pun not intended)! How can this be. Only 11000/250000
rows have a null trainer_id.


Thanks for the GROUP BY tip!


-- 

Best Regards,
Tarlika Elisabeth Schmitz

Re: identifying duplicates in table with redundancies

From

"Oliveiros d'Azevedo Cristina"

Date:

29 September 2010, 10:00:40

>
> Hallo Andreas,
> I reduced the problem to the innermost query:
>
> 1) SELECT DISTINCT trainer_id, trainer_name FROM student
> This results in a sequential table scan. Execution time 7500ms.
>
> 2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
> scan instead, which still cost 7000ms.
>
> 3) Next, I changed from DISTINCT to GROUP BY:
> SELECT trainer_id, trainer_name FROM student
> GROUP BY trainer_id, trainer_name
> This resulted in an index scan @ 6750ms
>
> 4) I filtered out NULL trainer_ids
> WHERE trainer_id IS NOT NULL
> Amazingly, this resulted in a sequential table scan, which only took
> 1300ms!!
>
> Please, explain (pun not intended)! How can this be. Only 11000/250000
> rows have a null trainer_id.
>

That's an impressive improvement...
Personally I have no idea what caused it, specially when you say it was 
sequential :-|
Warmed caches ?

Best,
Oliver