Thread: How to delete duplicate rows?

How to delete duplicate rows?

From

Clodoaldo Pinto

Date:

04 February 2005, 01:07:17

This one must be obvious for most here.

I have a 170 million rows table from which I want to eliminate
duplicate "would be" keys and leave only uniques.

I found a query in http://www.jlcomp.demon.co.uk/faq/duplicates.html
for the oracle database but can't figure out how to refer to the row
id in postgresql:

delete from test where rowid not in
(select min(rowid) from test group by a,b);

How to refer to the row id? Any better way to do it?

Regards, Clodoaldo Pinto

Re: How to delete duplicate rows?

From

John Sidney-Woollett

Date:

04 February 2005, 07:36:26

Can you not use your table's primary key value instead?

If you table is created with OIDs you may be able to use those -
although I don't know if that this advisable or not since I never use
OIDs...

John Sidney-Woollett

Clodoaldo Pinto wrote:

> This one must be obvious for most here.
>
> I have a 170 million rows table from which I want to eliminate
> duplicate "would be" keys and leave only uniques.
>
> I found a query in http://www.jlcomp.demon.co.uk/faq/duplicates.html
> for the oracle database but can't figure out how to refer to the row
> id in postgresql:
>
> delete from test where rowid not in
> (select min(rowid) from test group by a,b);
>
> How to refer to the row id? Any better way to do it?
>
> Regards, Clodoaldo Pinto
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: How to delete duplicate rows?

From

John Sidney-Woollett

Date:

04 February 2005, 10:24:45

Doh, sorry - you're completely correct! Silly me...

Can you not add a serial or sequence column to the table for the
purposes of the de-dupe?

Then create an index on that column in one operation at the end and use
that in the way that you would use Oracle's rowid from the examples?

John Sidney-Woollett

Clodoaldo Pinto wrote:

> On Fri, 04 Feb 2005 07:38:26 +0000, John Sidney-Woollett
> <johnsw@wardbrook.com> wrote:
>
>>Can you not use your table's primary key value instead?
>>
>
> John, If a primary key existed there would be no duplicates.
>
> The row insertion is made in 650k rows batches 8 times a day and a
> primary key make it very slow.
>
> Found also methods 2 and 3 in
> http://www.orafaq.com/faq/Server_Utilities/SQL/faq55.htm
>
> Method 3 also relies in the row id. If no one can help I will do this:
>
> Insert the distinct rows in a temporary table. Drop the index. Insert
> into the original from the temporary.
>
> Clodoaldo

Re: How to delete duplicate rows?

From

Clodoaldo Pinto

Date:

04 February 2005, 11:11:32

<johnsw@wardbrook.com> wrote:
>
> Can you not add a serial or sequence column to the table for the
> purposes of the de-dupe?
>
> Then create an index on that column in one operation at the end and use
> that in the way that you would use Oracle's rowid from the examples?

Yes. It could work. I have a two hours window to do it. Creating the
index alone takes about one hour...

> Clodoaldo Pinto wrote:
>
> > Method 3 also relies in the row id. If no one can help I will do this:
> >
> > Insert the distinct rows in a temporary table. Drop the index. Insert
> > into the original from the temporary.
> >
How did i wrote it? The correct is:

Insert the distinct rows in a temporary table. Truncate the original.
Drop the index. Insert into the original from the temporary.  Recreate
the index.

Clodoaldo

Re: How to delete duplicate rows?

From

Bruno Wolff III

Date:

04 February 2005, 14:03:48

On Thu, Feb 03, 2005 at 23:04:57 -0200,
  Clodoaldo Pinto <clodoaldo.pinto@gmail.com> wrote:
> This one must be obvious for most here.
>
> I have a 170 million rows table from which I want to eliminate
> duplicate "would be" keys and leave only uniques.
>
> I found a query in http://www.jlcomp.demon.co.uk/faq/duplicates.html
> for the oracle database but can't figure out how to refer to the row
> id in postgresql:
>
> delete from test where rowid not in
> (select min(rowid) from test group by a,b);
>
> How to refer to the row id? Any better way to do it?

Your best bet is to use oid if the table has them. If it doesn't you
might want to use ctid. That will require creating some comparison
functions that don't exist by default. By the time you write and test
those, the select distinct method you actually used might be better.