Thread: Duplicated records
Hi.
How can I delete the duplicated records with a "DELETE FROM table WHERE..." clause?
The problem is that I imported the data from a dBase (.dbf) file, and the import did not build any constraints (unique, primary key, ...), and this import is executed regularly.

  select * from table1;  -- id should be the primary key

  Table1
  id | field2
   0 | 'aaa'
   1 | 'bbb'
   2 | 'ccc'
   0 | 'aaa'   <<== the data is duplicated
   1 | 'bbb'
   2 | 'ccc'
   0 | 'aaa'
   1 | 'bbb'
   2 | 'ccc'

Is there a way to delete the duplicated data without building another table with constraints and copying the data over to the new table? Something like "delete from table1 where ...???"

Thanks, Lucas
> How can I delete the duplicated records with "DELETE FROM TABLE WHERE..."
> clause??
> The problem is that I have imported data from a dBase (dbf) file, and
> this function has not built the constraints (unique, primary key, ...),
> and this function is usually executed.

If you have no primary key, how can you reference a record in order to delete it?
I'd say use a temporary table...

If you have complete row dupes (i.e. the entire row is duplicated), use

  SELECT DISTINCT * FROM table

(or GROUP BY every column, which amounts to the same thing here).

If only the primary key is duplicated but the other fields differ, then you have to decide which one you want to keep!
On 24.05.2005 at 17:59:31 -0300, lucas@presserv.org wrote:
> Hi.
> How can I delete the duplicated records with "DELETE FROM TABLE WHERE..."
> clause??

Please read http://www.gtsm.com/oscon2003/deletetid.html
It's a very good article about this problem.

Regards, Andreas
--
Andreas Kretschmer (contact: see header)
Heynitz: 035242/47212, D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
=== Schollglas Unternehmensgruppe ===
Hi.
Thanks for the article... but I have read it, and the query runs very slowly.
My table has approx. 180,000 correct records, and the entire table holds approx. 360,000 records counting the duplicates.
I tried a query to delete the duplicated records, but it ran very, very slowly. Look:

  # select * from lanctos order by numos;
  numos | field1 | field2 | field3 | ...
  00001 | test   | T2-2   | 2      | ...
  00001 | test   | T2-2   | 2      | ...
  00002 | Blabla | 0      | ABC    | ...
  00002 | Blabla | 0      | ABC    | ...
  00003 | Llllll | Oooooo | Rrrrrr | ...
  00003 | Llllll | Oooooo | Rrrrrr | ...
  ...

The records are entirely duplicated (all fields have the same data). Treating the "numos" field as the primary key, I executed this query:

  # DELETE FROM lanctos WHERE NOT oid = (SELECT oid FROM lanctos AS l2
    WHERE l2.numos = lanctos.numos LIMIT 1);

I tested other queries with the EXPLAIN command to examine the expected execution time, and this query had the best plan I got... but it is still slow. Another query:

  # DELETE FROM lanctos WHERE NOT EXISTS (SELECT '1' FROM lanctos AS l2
    WHERE l2.numos = lanctos.numos AND NOT l2.oid = lanctos.oid);

Is there a way to delete those duplicated records faster??? Remember the table has approx. 360,000 records...
Would it be better to create another table and copy the data over? How should I create it?

Thanks.

Quoting Andreas Kretschmer <akretschmer@despammed.com>:
> On 24.05.2005 at 17:59:31 -0300, lucas@presserv.org wrote:
>> Hi.
>> How can I delete the duplicated records with "DELETE FROM TABLE WHERE..."
>> clause??
>
> Please read http://www.gtsm.com/oscon2003/deletetid.html
>
> It's a very good article about this problem.
On Wed, May 25, 2005 at 01:58:07PM -0300, lucas@presserv.org wrote:
> The records are entirely duplicated (all fields have the same data).
> Treating the "numos" field as the primary key, I executed this query:
>
> # DELETE FROM lanctos WHERE NOT oid = (SELECT oid FROM lanctos AS l2
>   WHERE l2.numos = lanctos.numos LIMIT 1);

That's slow because you don't have an index on the oid column. You may want to try using the ctid column instead; it's the physical row position in the table, so it doesn't need to scan the whole table each time.

HTH,
--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"The only difference is that Saddam would kill you in private, where the
Americans will kill you in public" (Mohammad Saleh, 39, a building contractor)
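The "keep one physical row per key" pattern Alvaro describes can be illustrated outside PostgreSQL. A minimal sketch in Python with SQLite, whose hidden rowid column plays roughly the role of PostgreSQL's ctid/oid here (table and column names borrowed from the thread; this is an illustration of the pattern, not the exact query Lucas would run):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Recreate a tiny version of the duplicated lanctos table from the thread.
cur.execute("CREATE TABLE lanctos (numos TEXT, field1 TEXT)")
rows = [("00001", "test"), ("00001", "test"),
        ("00002", "Blabla"), ("00002", "Blabla")]
cur.executemany("INSERT INTO lanctos VALUES (?, ?)", rows)

# Keep the row with the lowest physical row id per key, delete the rest.
# (In PostgreSQL the analogous hidden column would be ctid, or oid back then.)
cur.execute("""
    DELETE FROM lanctos
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM lanctos GROUP BY numos)
""")
conn.commit()

print(cur.execute("SELECT numos, field1 FROM lanctos ORDER BY numos").fetchall())
# → [('00001', 'test'), ('00002', 'Blabla')]
```

Note that in PostgreSQL a NOT IN over a large subquery can itself be slow without an index, which is exactly why the thread ends up preferring the rebuild-the-table approach.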
On 25.05.2005 at 13:58:07 -0300, lucas@presserv.org wrote:
> Hi.
> Thanks for the article... but I have read it, and the query runs very slowly.
> My table has approx. 180,000 correct records, and the entire table holds
> approx. 360,000 records counting the duplicates.

How often is this necessary?

> Is there a way to delete those duplicated records faster??? Remember the
> table has approx. 360,000 records...

I don't know, but I think you should prevent duplicated records in the future, and do the job (delete the duplicates) now.

Btw.: you wrote that there is a primary key on the first column. Really?

,----[ messages translated from German ]
| test_db=# create table blub (id int primary key, name varchar);
| NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "blub_pkey" for table "blub"
| CREATE TABLE
| test_db=# insert into blub values (1, 'x');
| INSERT 970706 1
| test_db=# insert into blub values (1, 'y');
| ERROR: duplicate key violates unique constraint "blub_pkey"
`----

In other words: if there is a primary key on the first column, you cannot insert duplicates.

Regards, Andreas
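Andreas's point is easy to reproduce in any SQL engine: a primary key constraint rejects the duplicate insert at write time, so the cleanup never has to be repeated. A small sketch using Python's sqlite3 with the same blub table from his example (an illustration only; the thread's own session used PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same table as in the example: the primary key implies a unique index.
cur.execute("CREATE TABLE blub (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO blub VALUES (1, 'x')")

try:
    cur.execute("INSERT INTO blub VALUES (1, 'y')")  # duplicate key
except sqlite3.IntegrityError as e:
    # The constraint blocks the duplicate instead of silently storing it.
    print("rejected:", e)
```

This is why the import step, not the cleanup step, is where the fix belongs: with the constraint in place, re-running the dbf import cannot reintroduce duplicates.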
Thanks...
The CTAS (Create Table As Select) command works fine!!! With great performance.
I think it is the best way to correct the data... (apparently)
I didn't know about SELECT DISTINCT. I am going to read about it.
Thank you.

Quoting Bricklen Anderson <BAnderson@PresiNET.com>:
>> Is there a way to delete those duplicated records faster??? Remember the
>> table has approx. 360,000 records...
>> Would it be better to create another table and copy the data over? How
>> should I create it?
>
> For 180k rows (which isn't many) just do a CTAS (Create Table As Select):
>
>   create table lanctos_distinct as select distinct * from lanctos;
>   truncate table lanctos;
>   alter table lanctos_distinct rename to lanctos;
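The CTAS recipe can be sketched end-to-end. A minimal illustration in Python with SQLite (table name from the thread; note this sketch drops the old table before the rename, since renaming onto a still-existing table fails — the next message in the thread makes the same correction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE lanctos (numos TEXT, field1 TEXT)")
cur.executemany("INSERT INTO lanctos VALUES (?, ?)",
                [("00001", "test")] * 2 + [("00002", "Blabla")] * 2)

# CTAS: build a de-duplicated copy of the table in one pass...
cur.execute("CREATE TABLE lanctos_distinct AS SELECT DISTINCT * FROM lanctos")
# ...then swap it in for the original.
cur.execute("DROP TABLE lanctos")
cur.execute("ALTER TABLE lanctos_distinct RENAME TO lanctos")

print(cur.execute("SELECT COUNT(*) FROM lanctos").fetchone()[0])  # → 2
```

One pass over 360k rows plus a rename is far cheaper than a correlated DELETE, which is why this approach won on performance.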
lucas@presserv.org wrote:
> Thanks...
> The CTAS (Create Table As Select) command works fine!!! With great performance.
> I think it is the best way to correct the data... (apparently)
> I didn't know about SELECT DISTINCT. I am going to read about it.

You might want to wrap this in some locking, and throw in some constraints to avoid dupes in the future...

  BEGIN;
  LOCK lanctos IN ACCESS EXCLUSIVE MODE;
  CREATE TABLE lanctos_distinct AS SELECT DISTINCT * FROM lanctos;
  DROP TABLE lanctos;
  ALTER TABLE lanctos_distinct RENAME TO lanctos;
  ALTER TABLE lanctos ALTER id SET NOT NULL;
  CREATE UNIQUE INDEX lanctos_id_idx ON lanctos (id);
  ALTER TABLE lanctos ADD CONSTRAINT lanctos_id_pkey PRIMARY KEY (id);
  COMMIT;

As always, don't forget to ANALYZE the new table.

--
Andrew Hammond  416-673-4138  ahammond@ca.afilias.info
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A