Thread: Slow duplicate deletes
I have a large table (20M records), mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial index that is
unique, but there are duplicates in another field.

I have an 8 core, 12GB memory computer with RAID disks.

This query has been running for 70 hours (is it safe to kill it?). How can I
make this run faster? This is a one-time processing task, but it is taking a
long time.

DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);

CREATE TABLE metadata
(
    "data_object.unique_id" CHARACTER(64) NOT NULL,
    size_bytes BIGINT,
    object_date TIMESTAMP(6) WITHOUT TIME ZONE,
    object_type CHARACTER VARYING(25),
    classification CHARACTER VARYING(7),
    object_source CHARACTER VARYING(50),
    object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
    clevel INTEGER,
    fsctlh CHARACTER VARYING(50),
    oname CHARACTER VARYING(40),
    description CHARACTER VARYING(80),
    horizontal_datum CHARACTER VARYING(20),
    do_location CHARACTER VARYING(200),
    elevation_ft INTEGER,
    location USER-DEFINED,
    idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
    bbox CHARACTER VARYING(160),
    CONSTRAINT MetaDataKey PRIMARY KEY (idx)
)
Actually, I have 24GB, but my question remains: how can I speed this up? And
can I kill the current SQL command (it is running in pgAdmin)?
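For reference, a long-running DELETE can also be cancelled from another
session; the statement simply rolls back and no rows are changed. A minimal
sketch (on PostgreSQL 9.2 and later the pg_stat_activity columns are pid and
query; on older releases they are procpid and current_query, as used here):

-- Find the backend running the long DELETE:
SELECT procpid, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'DELETE FROM portal.metadata%';

-- Cancel just that statement; its transaction rolls back:
SELECT pg_cancel_backend(procpid)
FROM pg_stat_activity
WHERE current_query LIKE 'DELETE FROM portal.metadata%';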
On Mon, Mar 5, 2012 at 9:17 AM, DrYSG <ygutfreund@draper.com> wrote:
> I have a large table (20M records), mostly short text fields. There are
> duplicates that I am trying to remove. I have a bigserial index that is
> unique, but there are duplicates in another field.
>
> I have an 8 core, 12GB memory computer with RAID disks.
>
> This query has been running for 70 hours (is it safe to kill it?).

Yes -- generally speaking, it's safe to kill just about any query in
postgres at any time.

> How can I make this run faster? This is a one-time processing task, but it
> is taking a long time.
>
> DELETE FROM portal.metadata
> WHERE idx NOT IN
> (
>     SELECT MIN(idx)
>     FROM portal.metadata
>     GROUP BY "data_object.unique_id"
> );

Compare the plan for that query (EXPLAIN) with the plan for this one:

/* delete the records from m1 if there is another record with a lower idx
   for the same unique_id */
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1
    FROM portal.metadata m2
    WHERE m2."data_object.unique_id" = m1."data_object.unique_id"
      AND m2.idx < m1.idx
);

Also, if you don't already have one, consider making an index on at least
"data_object.unique_id", or possibly ("data_object.unique_id", idx).

Back up your database before running this query :-).

merlin
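To make the plan comparison concrete, one possible sequence looks like this
(a sketch only: the index name is illustrative, and the quoted column name is
taken from the schema above):

CREATE INDEX metadata_uid_idx
    ON portal.metadata ("data_object.unique_id", idx);

ANALYZE portal.metadata;

-- EXPLAIN without ANALYZE only plans the statement; nothing is deleted.
EXPLAIN
DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);

EXPLAIN
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1
    FROM portal.metadata m2
    WHERE m2."data_object.unique_id" = m1."data_object.unique_id"
      AND m2.idx < m1.idx
);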
One point I might not have made clear. The reason I want to remove duplicates
is that the column "data_object.unique_id" became non-unique (someone added
duplicate rows). So I added the bigserial column (idx) to uniquely identify
the rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
the rows that became duplicated.

I am going to try out some of your excellent suggestions. I will report back
on how they are working.

One idea that was given to me was the following (what do you think, Merlin?):

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

This, or something of this ilk, should be faster because it only needs to do
a sort on "data_object.unique_id" and then an insert. After you have verified
the results you can do:

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
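Before swapping the tables, a sanity check along these lines may be worth
running (a sketch; the column name is taken from the schema above). In the
new table the row count and the distinct-id count should be equal, and both
should match the distinct-id count of the old table:

SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT "data_object.unique_id") AS distinct_ids
FROM portal.new_metadata;

SELECT COUNT(DISTINCT "data_object.unique_id") AS distinct_ids
FROM portal.metadata;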
On 5 March 2012 22:43, DrYSG <ygutfreund@draper.com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify
> the rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
> the rows that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think, Merlin?):
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> This, or something of this ilk, should be faster because it only needs to do
> a sort on "data_object.unique_id" and then an insert. After you have verified
> the results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;

This sounds like a good way to go, but if you have foreign keys pointing at
portal.metadata I think you will need to drop and recreate them again after
the rename.

--
Michael Wood <esiotrot@gmail.com>
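Putting the swap and the cleanup together, one possible shape is the
following (a sketch under the schema shown earlier; the child table and
foreign-key names in the comments are hypothetical, purely for illustration):

BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;

-- CREATE TABLE ... AS copies data only, not constraints, defaults, or
-- indexes, so re-add them on the renamed table. A new constraint name is
-- used because the old primary-key index still holds the name metadatakey.
ALTER TABLE portal.metadata
    ADD CONSTRAINT metadata_pkey PRIMARY KEY (idx);
ALTER TABLE portal.metadata
    ALTER COLUMN idx SET DEFAULT nextval('portal.metadata_idx_seq'::regclass);
ALTER SEQUENCE portal.metadata_idx_seq OWNED BY portal.metadata.idx;

-- Any foreign keys that referenced the old table must be dropped and
-- recreated against the new one, e.g. (hypothetical names):
-- ALTER TABLE portal.some_child DROP CONSTRAINT some_child_metadata_fkey;
-- ALTER TABLE portal.some_child
--     ADD CONSTRAINT some_child_metadata_fkey
--     FOREIGN KEY (metadata_idx) REFERENCES portal.metadata (idx);
COMMIT;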
On Mon, Mar 5, 2012 at 2:43 PM, DrYSG <ygutfreund@draper.com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify
> the rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of
> the rows that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think, Merlin?):
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Sure, that will work, but as Michael noted it's not always practical to do
that. Also, if only a fairly small percentage of the records have to be
deleted, an in-place delete may end up being faster anyway. Modern postgres
is pretty smart at optimizing 'where exists', and you should get a decent
plan.

merlin
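To see which situation applies, a quick count of how many rows the delete
would actually remove can help (a sketch, using the column name from the
schema above):

SELECT COUNT(*) AS total_rows,
       COUNT(*) - COUNT(DISTINCT "data_object.unique_id") AS rows_to_delete
FROM portal.metadata;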
The following query took only 16 hours 2 minutes:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;

Thank you.