Thread: Delete performance
Hello, i have a table with about 250m records from which i want to delete thoose not contained in other table. I used this SQL query: DELETE FROM data_structures_items WHERE id_data_structure NOT IN ( SELECT id_structure FROM data_structures); This ran for 24hours and in the mean time it read more than 20TB of disk (then i run out of patience). Since the actual table is only about 16GB in size (the whole database is ~50GB) the only explanation i can think of is that the result of the sub-query is not cached and is read again for each of 2.43655e+008 rows of data_structures_items table. Also i don't quite get why is data_individual_structures_pkey having ~52MB when the actual data contained is only ~343kb. Autovacuum is on and before doing this query i also ran vacum full analyze. Please tell me there is some logic in this and i am doing something wrong. SELECT relname, reltuples, relpages FROM pg_class ; relname reltuples relpages data_structures 85820 2002 data_structures_id_individual_state 85820 6526 data_structures_id_structure_seq 1 1 data_structures_items 2.43655e+008 2030460 data_structures_items_depth 2.43655e+008 675971 data_structures_items_id_data_structure 2.43655e+008 668184 data_structures_items_id_data_structure_item_seq 1 1 data_structures_items_left 2.43655e+008 676334 data_structures_items_pkey 2.43655e+008 668074 data_individual_structures_pkey 85820 6526 CREATE TABLE data_structures_items ( id_data_structure_item serial NOT NULL, id_data_structure integer NOT NULL, "text" character varying(255) NOT NULL, lft integer NOT NULL, rght integer NOT NULL, depth integer NOT NULL, description character varying(255), CONSTRAINT data_structures_items_pkey PRIMARY KEY (id_data_structure_item) ) WITH (OIDS=TRUE); CREATE INDEX data_structures_items_depth ON data_structures_items USING btree (depth); CREATE INDEX data_structures_items_id_data_structure ON data_structures_items USING btree (id_data_structure); CREATE INDEX data_structures_items_left ON data_structures_items USING btree (lft); CREATE TABLE data_structures ( id_structure serial NOT NULL, id_individual_state integer NOT NULL, "text" text, "timestamp" timestamp(0) without time zone DEFAULT now(), CONSTRAINT data_individual_structures_pkey PRIMARY KEY (id_structure), CONSTRAINT data_structures_fk FOREIGN KEY (id_individual_state) REFERENCES data_individual_states (id_individual_state) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION ) WITH (OIDS=TRUE); ALTER TABLE data_structures OWNER TO sanae; CREATE INDEX data_structures_id_individual_state ON data_structures USING btree (id_individual_state);
Jana wrote: > > Hello, > i have a table with about 250m records from which i want to delete > thoose not contained in other table. I used this SQL query: > DELETE FROM data_structures_items WHERE id_data_structure NOT IN ( > SELECT id_structure FROM data_structures); DELETE FROM data_structures_items, data_structures WHERE data_structures_items.id_data_structure = data_structures.id_data_structure AND data_structures_items.id_data_structure IS NULL;
Frank Bax wrote: > Jana wrote: >> >> Hello, >> i have a table with about 250m records from which i want to delete >> thoose not contained in other table. I used this SQL query: >> DELETE FROM data_structures_items WHERE id_data_structure NOT IN ( >> SELECT id_structure FROM data_structures); > > > DELETE FROM data_structures_items, data_structures WHERE > data_structures_items.id_data_structure = > data_structures.id_data_structure AND > data_structures_items.id_data_structure IS NULL; > Sorry; that should probably be: DELETE FROM data_structures_items, data_structures WHERE data_structures_items.id_data_structure = data_structures.id_data_structure AND data_structures.id_data_structure IS NULL;
On Sun, 14 Jun 2009 18:12:50 +0200, Frank Bax <fbax@sympatico.ca> wrote: > Jana wrote: >> Hello, >> i have a table with about 250m records from which i want to delete >> thoose not contained in other table. I used this SQL query: >> DELETE FROM data_structures_items WHERE id_data_structure NOT IN ( >> SELECT id_structure FROM data_structures); > > > DELETE FROM data_structures_items, data_structures WHERE > data_structures_items.id_data_structure = > data_structures.id_data_structure AND > data_structures_items.id_data_structure IS NULL; > Thanks for answer, this however is not a valid syntax (at least according to manual, and my 8.3) version. What could be done is DELETE FROM data_structures_items USING data_structures WHERE data_structures_items.id_data_structure=data_structures.id_structure AND something but the problem is in that "something". I cannot write data_structures_items.id_data_structure = data_structures.id_data_structure AND data_structures_items.id_data_structure IS NULL; because it a) doesn't make sense (column can't be null and equal to something at the same time), b) doesn't select what i want (rows whoose id_data_structure is NOT in the data_structures table, i'm pretty sure it is a number ). I can't join tables in DELETE command, and i can't think of a way doing it with WHERE Regards, Jana
I've had to do this quite a bit, and here's how I usually go about it: DELETE FROM data_structures_items WHERE NOT EXISTS (SELECT 1 FROM data_structures WHERE id_structure=id_data_structure LIMIT 1); Even when the item in the subquery is a primary key, I find it semantically informative to always use "LIMIT 1". Looking closer at your structure, this looks a lot like a problem I have had in the past with almost the exact same scenario. What I ended up doing beyond the above suggested query was to create a temporary table. Since you're lucky enough to be working with a primary key in the large table (I wasn't), you can copy all of the IDs into temp table. Then delete from the temp table anything that appears in your data_structures IDs. I'm not sure if it's necessary, but you could also create an index on the temp table. So your query becomes: DELETE FROM data_structures_items WHERE EXISTS (SELECT 1 FROM tmp_not_in_data_structures WHERE id_structure=id_data_structure LIMIT 1); This should help if the set of IDs to delete is smaller than the entire set of IDs. It works for me, but I can't be sure it'll help you - my main goal was to minimize the condition checking overhead for each row of the big table. Also, I'm not sure if it's valid/possible/better, but instead of EXISTS, "IS NOT NULL" might work here as well. If your indexes are rather large (and therefore may not fit into memory), you might consider partitioning. Along with the above solution, I use partitions and operate on them in parallel from a Python client. Cheers, Phillip 6/15 Jana <jana.vasseru@gmail.com>: > On Sun, 14 Jun 2009 18:12:50 +0200, Frank Bax <fbax@sympatico.ca> wrote: > >> Jana wrote: >>> >>> Hello, >>> i have a table with about 250m records from which i want to delete thoose >>> not contained in other table. I used this SQL query: >>> DELETE FROM data_structures_items WHERE id_data_structure NOT IN ( >>> SELECT id_structure FROM data_structures); >> >> >> DELETE FROM data_structures_items, data_structures WHERE >> data_structures_items.id_data_structure = data_structures.id_data_structure >> AND data_structures_items.id_data_structure IS NULL; >> > > Thanks for answer, this however is not a valid syntax (at least according to > manual, and my 8.3) > version. What could be done is > > DELETE FROM data_structures_items USING data_structures > WHERE > data_structures_items.id_data_structure=data_structures.id_structure > AND something > > but the problem is in that "something". I cannot write > data_structures_items.id_data_structure = > data_structures.id_data_structure AND > data_structures_items.id_data_structure IS NULL; > > because it a) doesn't make sense (column can't be null and equal to > something at the same time), b) doesn't select what i want (rows whoose > id_data_structure is NOT in the data_structures table, i'm pretty sure it > is a number ). I can't join tables in DELETE command, and i can't think of a > way doing it with WHERE > > Regards, > Jana > > -- > Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-novice >
Phillip Sitbon <phillip@sitbon.net> writes: > I've had to do this quite a bit, and here's how I usually go about it: > DELETE FROM data_structures_items WHERE NOT EXISTS > (SELECT 1 FROM data_structures WHERE id_structure=id_data_structure LIMIT 1); > Even when the item in the subquery is a primary key, I find it > semantically informative to always use "LIMIT 1". EXISTS implies LIMIT 1 for its subquery; there's no value for either comprehension or performance in adding that. I'd recommend leaving it off, because it makes your query syntax nonstandard for no benefit. regards, tom lane