On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
> On Tue, 28 Sep 2010 11:34:31 +0100
> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt> wrote:
>
>> ----- Original Message -----
>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de>
>> To:<pgsql-sql@postgresql.org>
>> Sent: Monday, September 27, 2010 5:54 PM
>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>
>>
>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote:
>>>
>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>> FROM (
>>>>>> SELECT trainer_name -- The field you want to test for duplicates
>>>>>> FROM (
>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>> FROM student
>>>>>> ) x
>>>>>> GROUP BY "trainer_name" -- the field you want to test for duplicates
>>>>>> HAVING (COUNT(*) > 1)
>>>>>> ) z
>>>>>> NATURAL JOIN student y
>>>
>>>
>>> What indices would you recommend for this operation?
>> But, on this query in particular, I would recommend an index on
>> trainer_name, as this field is used in the join and in the group
>> by. For the other query, the one you get by substituting trainer_id
>> for trainer_name, place an index on trainer_id.
>> Also, these indexes may help speed up an order by clause, if you use
>> one.
>>
>> If you have a table with lots of data, you can try them out and see
>> how performance varies (and don't forget there's also EXPLAIN ANALYZE).
>
> Strangely, these indexes did not make any difference.
> Without them, the query took about 8500 ms; with them, it was the same.
>
> The table has 250000 records; 11000 have trainer_name = null, and
> there are only 13000 unique trainer_names.
>
> It is not hugely important as these queries are not time-critical.
> This is only a helper table, which I use to analyze the data prior to
> populating the destination tables with it.
>
> Regards,
> Tarlika
>
I guess EXPLAIN ANALYZE shows a seq scan. Try to avoid using DISTINCT;
use GROUP BY instead.
regards
Andreas
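
As a rough sketch of Andreas's suggestion (assuming the student table and
columns from the query quoted above; not tested against the original data),
the DISTINCT-based query could be rewritten with GROUP BY along these lines:

    SELECT trainer_id, trainer_name            -- replaces the outer SELECT DISTINCT
    FROM (
        SELECT trainer_name                    -- the field tested for duplicates
        FROM (
            SELECT trainer_id, trainer_name
            FROM student
            GROUP BY trainer_id, trainer_name  -- replaces SELECT DISTINCT trainer_id, trainer_name
        ) x
        GROUP BY trainer_name
        HAVING COUNT(*) > 1
    ) z
    NATURAL JOIN student y
    GROUP BY trainer_id, trainer_name;

GROUP BY over the same column list returns the same set of rows as SELECT
DISTINCT; the usual reason for this advice on older PostgreSQL releases is
that GROUP BY could use hash aggregation where DISTINCT forced a sort.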
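
For completeness, a minimal sketch of the index advice from earlier in the
thread (the index names are invented for this example; as reported above,
the indexes made no measurable difference on this data set):

    CREATE INDEX student_trainer_name_idx ON student (trainer_name);
    CREATE INDEX student_trainer_id_idx   ON student (trainer_id);
    ANALYZE student;  -- refresh planner statistics before re-testing

Prefixing either version of the query with EXPLAIN ANALYZE shows whether the
planner actually uses the indexes or sticks with a sequential scan over the
250000 rows.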