On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz <mailinglist@longimanus.net> wrote:
> On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt>
>> wrote:
>>
>>> ----- Original Message -----
>>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>>> To:<pgsql-sql@postgresql.org>
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>
>>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote:
>>>>
>>>>>>> SELECT DISTINCT trainer_id, trainer_name
>>>>>>> FROM (
>>>>>>>     SELECT trainer_name -- the field you want to test for duplicates
>>>>>>>     FROM (
>>>>>>>         SELECT DISTINCT "trainer_id", "trainer_name"
>>>>>>>         FROM student
>>>>>>>     ) x
>>>>>>>     GROUP BY "trainer_name" -- the field you want to test for duplicates
>>>>>>>     HAVING (COUNT(*) > 1)
>>>>>>> ) z
>>>>>>> NATURAL JOIN student y
>>>>
>>>>
>>>> What indices would you recommend for this operation?
>>> But on this query in particular I would recommend an index on
>>> trainer_name, [...]
>> Strangely, these indexes did not make any difference.
>> Without them, the query took about 8500ms; with the index it was the same.
>>
>> The table has 250,000 records; 11,000 have trainer_name = NULL, and there
>> are only 13,000 unique trainer_names.
>
>I guess EXPLAIN ANALYZE shows a seq scan. Try to avoid using DISTINCT;
>use GROUP BY instead.
>
>regards, Andreas
Hello Andreas,
I reduced the problem to the innermost query:
1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan with an execution time of 7500ms.
2) I created an index on (trainer_id, trainer_name). The plan then switched to
an index scan, which still took 7000ms.
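For reference, the index was created roughly like this (the index name is just
an example):

CREATE INDEX student_trainer_idx ON student (trainer_id, trainer_name);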
3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan at 6750ms.
4) I filtered out NULL trainer_ids by adding
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which took only
1300ms!!
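So the complete query for step 4 looked roughly like this:

SELECT trainer_id, trainer_name
FROM student
WHERE trainer_id IS NOT NULL
GROUP BY trainer_id, trainer_name;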
Please explain (pun not intended)! How can this be? Only 11,000 of 250,000
rows have a NULL trainer_id.
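In case anyone wants to reproduce this, the two plans can be compared with
EXPLAIN ANALYZE along these lines:

EXPLAIN ANALYZE
SELECT trainer_id, trainer_name
FROM student
GROUP BY trainer_id, trainer_name;

EXPLAIN ANALYZE
SELECT trainer_id, trainer_name
FROM student
WHERE trainer_id IS NOT NULL
GROUP BY trainer_id, trainer_name;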
Thanks for the GROUP BY tip!
--
Best Regards,
Tarlika Elisabeth Schmitz