Thread: Duplicated records

Duplicated records

From
lucas@presserv.org
Date:
Hi.
How can I delete the duplicated records with a "DELETE FROM table WHERE..."
clause??
The problem is that I have imported data from a dBase (dbf) file; this import
did not build the constraints (unique, primary key, ...), and it is executed
regularly.
select * from table1; -- id may be primary key

- Table1 -
id | field2
0  | 'aaa'
1  | 'bbb'
2  | 'ccc'
0  | 'aaa'   <<== The data is duplicated
1  | 'bbb'
2  | 'ccc'
0  | 'aaa'
1  | 'bbb'
2  | 'ccc'
 
Is there a way to delete the duplicated data without building another table
with constraints and copying the data into it?
Something like "delete from table1 where ...???"

Thanks,
Lucas


Re: Duplicated records

From
PFC
Date:
> How can I delete the duplicated records with a "DELETE FROM table WHERE..."
> clause??
> The problem is that I have imported data from a dBase (dbf) file; this
> import did not build the constraints (unique, primary key, ...), and it is
> executed regularly.
If you have no primary key, how can you reference a record in order to
delete it?
I'd say use a temporary table... If you have complete row dupes (ie. the
entire row is duplicated) use

SELECT * FROM table GROUP BY *

(or select distinct)
 
If only the primary key is duplicated but other fields change, then you
have to decide which one you wanna keep!
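
For instance, a rough sketch of the temporary-table route for complete row
dupes (table name table1 taken from the first post; one possible shape,
not the only one):

BEGIN;
-- keep exactly one copy of each distinct row
CREATE TEMP TABLE table1_dedup AS SELECT DISTINCT * FROM table1;
-- empty the original and refill it from the deduplicated copy
DELETE FROM table1;
INSERT INTO table1 SELECT * FROM table1_dedup;
COMMIT;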


Re: [despammed] Duplicated records

From
Andreas Kretschmer
Date:
On 24.05.2005 at 17:59:31 -0300, lucas@presserv.org wrote:
> Hi.
> How can I delete the duplicated records with a "DELETE FROM table WHERE..."
> clause??

Please read http://www.gtsm.com/oscon2003/deletetid.html

It's a very good article about this problem.


Regards, Andreas
-- 
Andreas Kretschmer    (contact: see header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
===    Schollglas Unternehmensgruppe    ===


Re: Duplicated records

From
lucas@presserv.org
Date:
Hi.
Thanks for the article...
But I have read it, and the query runs very slowly...
My table has approx. 180,000 correct records, but the entire table holds
approx. 360,000 records (with the duplicates)...
I tried to execute a query to delete the duplicated records, but it ran very,
very slowly... look:

# select * from lanctos order by numos;
 numos | field1 | field2 | field3 |...
 00001 | test   | T2-2   | 2      |...
 00001 | test   | T2-2   | 2      |...
 00002 | Blabla | 0      | ABC    |...
 00002 | Blabla | 0      | ABC    |...
 00003 | Llllll | Oooooo | Rrrrrr |...
 00003 | Llllll | Oooooo | Rrrrrr |...
 
...

The records are entirely duplicated (all fields hold the same data).
Treating the "numos" field as the primary key, I executed the query:

# DELETE from lanctos where not oid=(select oid from lanctos as l2 where
l2.numos=lanctos.numos limit 1);

I have tested other queries with the EXPLAIN command to compare execution
times, and this query gave the best performance I could get... but it is
still slow. Another query is:

# DELETE from lanctos where not exists (select '1' from lanctos as l2 where
l2.numos=lanctos.numos and not l2.oid=lanctos.oid);
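
One safe way to time a candidate like this, since EXPLAIN ANALYZE really
executes the statement, is to wrap it in a transaction and roll back (a
sketch, using the first query above):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM lanctos
 WHERE NOT oid = (SELECT oid FROM lanctos AS l2
                   WHERE l2.numos = lanctos.numos LIMIT 1);
ROLLBACK;  -- the DELETE really ran; undo it after reading the timing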

Is there a way to delete those duplicated records faster??? Remember the
table has approx. 360,000 records...
Is it better to create another table and copy the data??? How should I
create it???

Thanks.


Quoting Andreas Kretschmer <akretschmer@despammed.com>:
> On 24.05.2005 at 17:59:31 -0300, lucas@presserv.org wrote:
>> Hi.
>> How can I delete the duplicated records with a "DELETE FROM table WHERE..."
>> clause??
>
> Please read http://www.gtsm.com/oscon2003/deletetid.html
>
> It's a very good article about this problem.
>



Re: Duplicated records

From
Alvaro Herrera
Date:
On Wed, May 25, 2005 at 01:58:07PM -0300, lucas@presserv.org wrote:

> The records are entirely duplicated (all fields hold the same data).
> Treating the "numos" field as the primary key, I executed the query:
> 
> # DELETE from lanctos where not oid=(select oid from lanctos as l2 where
> l2.numos=lanctos.numos limit 1);

That's because you don't have an index on the Oid column.  You may want
to try using the ctid column instead; it's the physical row position in
the table so it doesn't need to scan the whole table each time.
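
A sketch of what that could look like (one possible shape, keeping one
arbitrary survivor per numos; DISTINCT ON picks a single ctid per group):

DELETE FROM lanctos
 WHERE ctid NOT IN (SELECT DISTINCT ON (numos) ctid
                      FROM lanctos
                     ORDER BY numos);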

HTH,

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"The only difference is that Saddam would kill you on private, where the
Americans will kill you in public" (Mohammad Saleh, 39, a building contractor)


Re: [despammed] Re: Duplicated records

From
Andreas Kretschmer
Date:
On 25.05.2005 at 13:58:07 -0300, lucas@presserv.org wrote:
> Hi.
> Thanks for the article...
> But I have read it, and the query runs very slowly...
> My table has approx. 180,000 correct records, but the entire table holds
> approx. 360,000 records (with the duplicates)...

How often is this necessary?


> Is there a way to delete those duplicated records faster??? Remember the
> table has approx. 360,000 records...

I don't know, but I think you should prevent duplicated records in the
future, and do the job (delete the duplicates) now.

Btw: you wrote there is a primary key on the first column. Really?

,----[  example session  ]
| test_db=# create table blub (id int primary key, name varchar);
| NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "blub_pkey" for table "blub"
| CREATE TABLE
| test_db=# insert into blub values (1, 'x');
| INSERT 970706 1
| test_db=# insert into blub values (1, 'y');
| ERROR:  duplicate key violates unique constraint "blub_pkey"
`----

In other words: if there is a primary key on that column, you cannot
insert duplicates.


Regards, Andreas
-- 
Andreas Kretschmer    (contact: see header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
===    Schollglas Unternehmensgruppe    ===


Re: Duplicated records

From
lucas@presserv.org
Date:
Thanks....
The CTAS (CREATE TABLE AS SELECT) command works fine!!! With great performance.
I think it is the best way to correct the data... (apparently)
I didn't know about SELECT DISTINCT. I am going to read about it.
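
A tiny illustration, using the table1 data from the first post (the output
rows shown as comments are assumed):

SELECT DISTINCT * FROM table1;
-- id | field2
--  0 | 'aaa'   <<== each duplicated row now appears only once
--  1 | 'bbb'
--  2 | 'ccc'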

Thank you.

Quoting Bricklen Anderson <BAnderson@PresiNET.com>:
>> Is there a way to delete those duplicated records faster??? Remember the
>> table has approx. 360,000 records...
>> Is it better to create another table and copy the data??? How should I
>> create it???
>>
> for 180k rows (which isn't many) just do a CTAS (Create Table As Select):
>
> create table lanctos_distinct as select distinct * from lanctos;
> drop table lanctos;
> alter table lanctos_distinct rename to lanctos;
>



Re: Duplicated records

From
Andrew Hammond
Date:

lucas@presserv.org wrote:
> Thanks....
> The CTAS (CREATE TABLE AS SELECT) command works fine!!! With great performance.
> I think it is the best way to correct the data... (apparently)
> I didn't know about SELECT DISTINCT. I am going to read about it.

You might want to wrap this in some locking and throw in some
constraints to avoid dupes in the future...

BEGIN;
LOCK TABLE lanctos IN ACCESS EXCLUSIVE MODE;  -- keep writers out during the swap
CREATE TABLE lanctos_distinct AS SELECT DISTINCT * FROM lanctos;
DROP TABLE lanctos;
ALTER TABLE lanctos_distinct RENAME TO lanctos;
ALTER TABLE lanctos ALTER id SET NOT NULL;
CREATE UNIQUE INDEX lanctos_id_idx ON lanctos (id);
ALTER TABLE lanctos ADD CONSTRAINT lanctos_id_pkey PRIMARY KEY (id);
COMMIT;

As always, don't forget to ANALYZE the new table.
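
For example:

ANALYZE lanctos;  -- refresh planner statistics for the rebuilt table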

--
Andrew Hammond    416-673-4138    ahammond@ca.afilias.info
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A