Re: identifying duplicates in table with redundancies - Mailing list pgsql-sql

From Tarlika Elisabeth Schmitz
Subject Re: identifying duplicates in table with redundancies
Date
Msg-id 20100929124107.245b1314@dick.coachhouse
Whole thread Raw
In response to Re: identifying duplicates in table with redundancies  (Andreas Schmitz <mailinglist@longimanus.net>)
List pgsql-sql
On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz <mailinglist@longimanus.net> wrote:

>  On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt>
>> wrote:
>>
>>> ----- Original Message -----
>>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>>> To:<pgsql-sql@postgresql.org>
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>
>>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com>  wrote:
>>>>
>>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>>> FROM (
>>>>>>> SELECT trainer_name   -- The field you want to test for
>>>>>>> duplicates FROM (
>>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>>> FROM student
>>>>>>> ) x
>>>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>>>> duplicates
>>>>>>> HAVING (COUNT(*)>  1)
>>>>>>> ) z
>>>>>>> NATURAL JOIN student y
>>>>
>>>>
>>>> What indices would you recommend for this operation?
>>> But, on this query in particular I would recomend an indice on
>>> trainer_name,[...]
>> Strangely, these indices did not do anything.
>> Without, the query took about 8500ms. Same with index.
>>
>> The table has 250000 records. 11000 have trainer_name = null. Only
>> 13000 unique trainer_names.
>
>I guess explain analyze shows up a seq scan. try avoiding to use 
>distinct. use group by instead.
>
>regards, Andreas

Hallo Andreas,
I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan. Execution time 7500ms.

2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan @ 6750ms

4) I filtered out NULL trainer_ids
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which only took
1300ms!!

Please, explain (pun not intended)! How can this be. Only 11000/250000
rows have a null trainer_id.


Thanks for the GROUP BY tip!


-- 

Best Regards,
Tarlika Elisabeth Schmitz


pgsql-sql by date:

Previous
From: Andreas Schmitz
Date:
Subject: Re: identifying duplicates in table with redundancies
Next
From: "Oliveiros d'Azevedo Cristina"
Date:
Subject: Re: identifying duplicates in table with redundancies