Thread: Slow duplicate deletes

Slow duplicate deletes

From: DrYSG
I have a large table (20M records) of mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial index that is
unique, but there are duplicates in another field.

I have an 8-core computer with 12GB of memory and RAID disks.

This request has been running for 70 hours (is it safe to kill it?).

How can I make this run faster? This is a one time processing task, but it
is taking a long time.

DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);

CREATE TABLE
    metadata
    (
        "data_object.unique_id" CHARACTER(64) NOT NULL,
        size_bytes BIGINT,
        object_date TIMESTAMP(6) WITHOUT TIME ZONE,
        object_type CHARACTER VARYING(25),
        classification CHARACTER VARYING(7),
        object_source CHARACTER VARYING(50),
        object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
        clevel INTEGER,
        fsctlh CHARACTER VARYING(50),
        oname CHARACTER VARYING(40),
        description CHARACTER VARYING(80),
        horizontal_datum CHARACTER VARYING(20),
        do_location CHARACTER VARYING(200),
        elevation_ft INTEGER,
        location USER-DEFINED,  -- the dump reports only USER-DEFINED; the real column uses a custom/extension type
        idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
        bbox CHARACTER VARYING(160),
        CONSTRAINT MetaDataKey PRIMARY KEY (idx)
    );



Re: Slow duplicate deletes

From: DrYSG
Actually, I have 24GB, but my questions remain: how can I speed this up, and
can I kill the SQL command that is currently running (in pgAdmin)?



Re: Slow duplicate deletes

From: Merlin Moncure
On Mon, Mar 5, 2012 at 9:17 AM, DrYSG <ygutfreund@draper.com> wrote:
> I have a large table (20M records) of mostly short text fields. There are
> duplicates that I am trying to remove. I have a bigserial index that is
> unique, but there are duplicates in another field.
>
> I have an 8 core, 12GB memory computer with RAID disks.
>
> This request has been running for 70 hours (is it safe to kill it?).

yes...generally speaking, it's safe to kill just about any query in
postgres any time.
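If it helps, here is a minimal sketch of doing that from another session,
assuming a release where pg_stat_activity exposes pid and query (older
releases call these procpid and current_query). pgAdmin's cancel button
should do the equivalent of pg_cancel_backend.

-- find the backend running the long DELETE
SELECT pid, query FROM pg_stat_activity;

-- ask it to cancel the query; the transaction rolls back, the session stays open
SELECT pg_cancel_backend(12345);      -- substitute the pid found above

-- if the cancel is ignored, kill the whole backend
SELECT pg_terminate_backend(12345);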

> How can I make this run faster? This is a one time processing task, but it
> is taking a long time.
>
> DELETE FROM portal.metadata
> WHERE idx NOT IN
> (
>    SELECT MIN(idx)
>    FROM portal.metadata
>    GROUP BY "data_object.unique_id"
> );

compare the plan for that query (EXPLAIN) vs this one:

/* delete the rows from m1 if there is another row with a lower
idx for the same "data_object.unique_id" */
DELETE FROM portal.metadata m1
WHERE EXISTS
(
  SELECT 1 FROM portal.metadata m2
  WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);

also, if you don't already have one, consider making an index on at
least "data_object.unique_id", or possibly ("data_object.unique_id", idx).
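Something along these lines (the index name here is just illustrative):

CREATE INDEX metadata_unique_id_idx
    ON portal.metadata ("data_object.unique_id", idx);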

back up your database before running this query :-).

merlin

Re: Slow duplicate deletes

From: DrYSG
One point I might not have made clear: the reason I want to remove duplicates
is that the column "data_object.unique_id" became non-unique (someone added
duplicate rows). So I added the bigserial (idx) to uniquely identify the
rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
the rows that became duplicated.

I am going to try out some of your excellent suggestions. I will report back
on how they are working.

One idea that was given to me was the following (what do you think, Merlin?)

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Something of this ilk should be faster because it only needs to do a
sort on "data_object.unique_id" and then an insert. After you have
verified the results you can do:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
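One caveat I am not certain about: as I read the docs, DISTINCT ON keeps an
arbitrary row from each group unless the sort order pins it down, so to
reproduce the MIN(idx) behaviour the query would presumably need an ORDER BY,
something like:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") *
FROM portal.metadata
ORDER BY "data_object.unique_id", idx;   -- keeps the lowest idx in each group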



Re: Slow duplicate deletes

From: Michael Wood
On 5 March 2012 22:43, DrYSG <ygutfreund@draper.com> wrote:
> One point I might not have made clear: the reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
> the rows that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think, Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> Something of this ilk should be faster because it only needs to do a
> sort on "data_object.unique_id" and then an insert. After you have
> verified the results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;

This sounds like a good way to go, but if you have foreign keys
pointing at portal.metadata, I think you will need to drop and recreate
them after the rename.
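For example (the child table, column, and constraint names below are made up;
substitute whatever actually references metadata):

-- hypothetical referencing table
ALTER TABLE portal.holdings DROP CONSTRAINT holdings_metadata_fk;

-- ... rename the tables as above, and give the new metadata table a
-- primary key or unique constraint on idx first ...

ALTER TABLE portal.holdings
    ADD CONSTRAINT holdings_metadata_fk
    FOREIGN KEY (metadata_idx) REFERENCES portal.metadata (idx);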

--
Michael Wood <esiotrot@gmail.com>

Re: Slow duplicate deletes

From: Merlin Moncure
On Mon, Mar 5, 2012 at 2:43 PM, DrYSG <ygutfreund@draper.com> wrote:
> One point I might not have made clear: the reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
> the rows that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think, Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Sure, that will work, but as Michael noted it's not always practical to
do that. Also, if a fairly small percentage of the records have to be
deleted, an in-place delete may end up being faster anyway. Modern
Postgres is pretty smart about optimizing WHERE EXISTS, and you should
get a decent plan.
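As a sanity check before letting it run, something along these lines shows
the chosen plan without touching any rows; with the index in place you would
hope to see a hash or merge style join rather than a per-row nested loop:

EXPLAIN
DELETE FROM portal.metadata m1
WHERE EXISTS
(
  SELECT 1 FROM portal.metadata m2
  WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);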

merlin

Re: Slow duplicate deletes

From: DrYSG
The following query took only 16 hours and 2 minutes:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
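If I have it right, what remains is to swap the tables and recreate what
CREATE TABLE ... AS does not copy (the primary key, the idx default, and any
indexes), roughly:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;

-- restore the key and the default that CREATE TABLE AS did not copy
ALTER TABLE portal.metadata
    ADD CONSTRAINT MetaDataKey PRIMARY KEY (idx);
ALTER TABLE portal.metadata
    ALTER COLUMN idx SET DEFAULT nextval('portal.metadata_idx_seq'::regclass);

-- keep the sequence from being dropped along with the old table
ALTER SEQUENCE portal.metadata_idx_seq OWNED BY portal.metadata.idx;
COMMIT;

ANALYZE portal.metadata;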

Thank you.
