On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
> On Tue, 28 Sep 2010 11:34:31 +0100
> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt> wrote:
>
>> ----- Original Message -----
>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>> To:<pgsql-sql@postgresql.org>
>> Sent: Monday, September 27, 2010 5:54 PM
>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>
>>
>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote:
>>>
>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>> FROM (
>>>>>> SELECT trainer_name -- The field you want to test for duplicates
>>>>>> FROM (
>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>> FROM student
>>>>>> ) x
>>>>>> GROUP BY "trainer_name" -- the field you want to test for duplicates
>>>>>> HAVING (COUNT(*) > 1)
>>>>>> ) z
>>>>>> NATURAL JOIN student y
>>>
>>>
>>> What indices would you recommend for this operation?
>> But, on this query in particular, I would recommend an index on
>> trainer_name, as this field is used in the join and in the group
>> by. For the other query, the one you get by substituting trainer_id
>> for trainer_name, place an index on trainer_id.
>> Also, these indexes may help speed up an order by clause, if you use
>> one.
>>
>> If you have a table with lots of data, you can try them out and see
>> how performance varies (and don't forget there's also EXPLAIN ANALYZE).
>
> Strangely, these indexes did not make any difference.
> Without them, the query took about 8500 ms; with them, it was the same.
>
> The table has 250000 records; 11000 have trainer_name = null, and
> there are only 13000 unique trainer_names.
>
> It is not hugely important as these queries are not time-critical.
> This is only a helper table, which I use to analyze the data prior to
> populating the destination tables with it.
>
> Regards,
> Tarlika
>
I guess EXPLAIN ANALYZE shows a seq scan. Try to avoid using DISTINCT;
use GROUP BY instead.
regards
Andreas
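
As a rough sketch of Andreas's suggestion (assuming the student table and
columns from the query quoted above; not tested against the original data),
the DISTINCT-based query could be rewritten with GROUP BY along these lines:

    SELECT trainer_id, trainer_name            -- replaces the outer SELECT DISTINCT
    FROM (
        SELECT trainer_name                    -- the field tested for duplicates
        FROM (
            SELECT trainer_id, trainer_name
            FROM student
            GROUP BY trainer_id, trainer_name  -- replaces SELECT DISTINCT trainer_id, trainer_name
        ) x
        GROUP BY trainer_name
        HAVING COUNT(*) > 1
    ) z
    NATURAL JOIN student y
    GROUP BY trainer_id, trainer_name;

GROUP BY over the same column list returns the same set of rows as SELECT
DISTINCT; the usual reason for this advice on older PostgreSQL releases is
that GROUP BY could use hash aggregation where DISTINCT forced a sort.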
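
For completeness, a minimal sketch of the index advice from earlier in the
thread (the index names are invented for this example; as reported above,
the indexes made no measurable difference on this data set):

    CREATE INDEX student_trainer_name_idx ON student (trainer_name);
    CREATE INDEX student_trainer_id_idx   ON student (trainer_id);
    ANALYZE student;  -- refresh planner statistics before re-testing

Prefixing either version of the query with EXPLAIN ANALYZE shows whether the
planner actually uses the indexes or sticks with a sequential scan over the
250000 rows.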