Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) - Mailing list pgsql-general

From: Andy Colson
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date:
Msg-id: 54864EF3.2020603@squeakycode.net
In response to: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)  (Daniel Begin <jfd553@hotmail.com>)
Responses: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
List: pgsql-general
On 12/08/2014 03:59 PM, Daniel Begin wrote:
> Thanks for your answers Andy; I will keep in mind the procedure you proposed.
> About the fields required to find duplicate records, all of them are required (5-9) depending on the table.
>
> Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?
> - Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
> - Inserting distinct records into an empty table will also need a full table scan to select distinct (*) from big_table;
>
> That said, is the indexing and selection/deletion of duplicate records faster than rewriting a whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be and how/why it would affect the selection of an approach :-|
>
> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andy Colson
> Sent: December-08-14 11:39
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
>
> On 12/8/2014 10:30 AM, Andy Colson wrote:
>> On 12/7/2014 9:31 PM, Daniel Begin wrote:
>>> I have just completed the bulk upload of a large database. Some
>>> tables have billions of records and no constraints or indexes have
>>> been applied yet. About 0.1% of these records may have been
>>> duplicated during the upload and I need to remove them before applying constraints.
>>>
>>> I understand there are (at least) two approaches to get a table
>>> without duplicate records…
>>>
>>> -           Delete duplicate records from the table based on an
>>> appropriate select clause;
>>>
>>> -           Create a new table with the results from a select distinct
>>> clause, and then drop the original table.
>>>
>>> What would be the most efficient procedure in PostgreSQL to do the
>>> job considering …
>>>
>>> -           I do not know which records were duplicated;
>>>
>>> -           There are no indexes applied on tables yet;
>>>
>>> -           There is no OIDS on tables yet;
>>>
>>> -           The database is currently 1TB but I have plenty of disk
>>> space.
>>>
>>> Daniel
>>>
>>
>> How would you detect a duplicate?  Is there a single field that would be
>> duplicated?  Or do you have to test a bunch of different fields?
>>
>> If it's a single field, you could find dups in a single pass of the table with:
>>
>> create index bigtable_key on bigtable(key);
>> select key, count(*) from bigtable group by key having count(*) > 1;
>>
>> Save that list, and decide on some way of deleting the dups.
>>
>> The index might help the initial select, but it will really help the
>> re-query and delete statements.
>>
>> -Andy
>>
>>
>
> I just thought of a more generic way.
>
> 1) make a non-unique index on bigtable
> 2) make a temp table
> 3) -- copy only dups
> insert into temp
> select * from bigtable where (it's a duplicate);
>
> 4)
> delete from bigtable where key in (select key from temp);
>
> 5)
> insert into bigtable
> select distinct * from temp;
>
>
> This would minimize the amount of data you have to move around. It depends on how hard step 3 is to write. The index isn't required, but it would help both step 3 and step 4 run faster.
>
> -Andy
>
>
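
Spelled out in SQL, the quoted five-step recipe might look roughly like this. It is only a sketch: it assumes duplicates can be detected on a single key column, and bigtable, key, bigtable_key_idx and dups are placeholder names rather than anything from the actual database.

create index bigtable_key_idx on bigtable (key);   -- 1) non-unique index

create temp table dups (like bigtable);            -- 2) temp table with the same columns

-- 3) copy only the duplicated rows
insert into dups
select * from bigtable
where key in (select key from bigtable group by key having count(*) > 1);

-- 4) remove every duplicated row from the big table
delete from bigtable where key in (select key from dups);

-- 5) put a single copy of each back
insert into bigtable
select distinct * from dups;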

I was assuming you could find dups using a single field. That single field could be indexed, and the index could be used both to find the dups and to delete the extra rows.
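
As a sketch of doing that delete in place (bigtable and key as above; ctid is PostgreSQL's physical row identifier, used here only to tell otherwise identical rows apart, so this is just one possible spelling):

-- keep one arbitrary row per key value and delete the rest
delete from bigtable
where ctid in (select ctid
               from (select ctid,
                            row_number() over (partition by key) as rn
                     from bigtable) numbered
               where rn > 1);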

If you have to use all the columns ... well, I'd ignore everything I said. :-)
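
For the all-columns case, the second option from the original question (rewrite the table from a select distinct) is probably the simplest to spell out. Again only a sketch, with bigtable_dedup as a placeholder name; the swap should only happen after the row counts have been checked:

create table bigtable_dedup as
select distinct * from bigtable;

-- once the counts look right:
drop table bigtable;
alter table bigtable_dedup rename to bigtable;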

-Andy


