Re: identifying duplicates in table with redundancies - Mailing list pgsql-sql

From Tarlika Elisabeth Schmitz
Subject Re: identifying duplicates in table with redundancies
Date
Msg-id 20100928213600.4db60ba4@dick.coachhouse
Whole thread Raw
In response to Re: identifying duplicates in table with redundancies  ("Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt>)
Responses Re: identifying duplicates in table with redundancies  (Andreas Schmitz <mailinglist@longimanus.net>)
List pgsql-sql
On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote:

>----- Original Message ----- 
>From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de>
>To: <pgsql-sql@postgresql.org>
>Sent: Monday, September 27, 2010 5:54 PM
>Subject: Re: [SQL] identifying duplicates in table with redundancies
>
>
>> On Fri, 24 Sep 2010 18:12:18 +0100
>> Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote:
>>
>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>> FROM (
>>>>> SELECT trainer_name   -- The field you want to test for duplicates
>>>>> FROM (
>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>> FROM student
>>>>> ) x
>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>> duplicates
>>>>> HAVING (COUNT(*) > 1)
>>>>> ) z
>>>>> NATURAL JOIN student y
>>
>>
>>
>> What indices would you recommend for this operation?
>
>But, on this query in particular I would recomend an indice on
>trainer_name, as this field will be used on the join and on the group
>by. For the other query, the one you get by substituting trainer_name
>by trainer_id, place an index on trainer_id.
>Also, these indexes may help speed up the order by clause, if you use
>one.
>
>If you have a table with lots of data you can try them around and see
>how performance varies (and don't forget there's also EXPLAIN ANALYZE)


Strangely, these indices did not do anything.
Without, the query took about 8500ms. Same with index.

The table has 250000 records. 11000 have trainer_name = null. Only
13000 unique trainer_names.

It is not hugely important as these queries are not time-critical.
This is only a helper table, which I use to analyze the date prior to
populating the destination tables with the data.

Regards,
Tarlika


pgsql-sql by date:

Previous
From: msi77
Date:
Subject: Re: is there a distinct function for comma lists ?
Next
From: "Thomas Andres"
Date:
Subject: Shema with template