Thread: Delete performance

Delete performance

From
Jana
Date:
Hello,
i have a table with about 250m records from which i want to delete thoose
not contained in other table. I used this SQL query:
DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
SELECT id_structure FROM data_structures);

This ran for 24hours and in the mean time it read more than 20TB of disk
(then i run out of patience). Since the actual table is only about 16GB in
size (the whole database is ~50GB) the only explanation i can think of is
that the result of the sub-query is not cached and is  read again for each
of 2.43655e+008 rows of data_structures_items table. Also i don't quite
get why is data_individual_structures_pkey having ~52MB when the actual
data contained is only ~343kb. Autovacuum is on and before doing this
query i also ran vacum full analyze. Please tell me there is some logic in
this and i am doing something wrong.


SELECT relname, reltuples, relpages FROM pg_class ;
relname    reltuples    relpages
data_structures    85820    2002
data_structures_id_individual_state    85820    6526
data_structures_id_structure_seq    1    1
data_structures_items    2.43655e+008    2030460
data_structures_items_depth    2.43655e+008    675971
data_structures_items_id_data_structure    2.43655e+008    668184
data_structures_items_id_data_structure_item_seq    1    1
data_structures_items_left    2.43655e+008    676334
data_structures_items_pkey    2.43655e+008    668074
data_individual_structures_pkey 85820    6526

CREATE TABLE data_structures_items
(
   id_data_structure_item serial NOT NULL,
   id_data_structure integer NOT NULL,
   "text" character varying(255) NOT NULL,
   lft integer NOT NULL,
   rght integer NOT NULL,
   depth integer NOT NULL,
   description character varying(255),
   CONSTRAINT data_structures_items_pkey PRIMARY KEY
(id_data_structure_item)
)
WITH (OIDS=TRUE);

CREATE INDEX data_structures_items_depth
   ON data_structures_items
   USING btree
   (depth);

CREATE INDEX data_structures_items_id_data_structure
   ON data_structures_items
   USING btree
   (id_data_structure);

CREATE INDEX data_structures_items_left
   ON data_structures_items
   USING btree
   (lft);


CREATE TABLE data_structures
(
   id_structure serial NOT NULL,
   id_individual_state integer NOT NULL,
   "text" text,
   "timestamp" timestamp(0) without time zone DEFAULT now(),
   CONSTRAINT data_individual_structures_pkey PRIMARY KEY (id_structure),
   CONSTRAINT data_structures_fk FOREIGN KEY (id_individual_state)
       REFERENCES data_individual_states (id_individual_state) MATCH SIMPLE
       ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (OIDS=TRUE);
ALTER TABLE data_structures OWNER TO sanae;

CREATE INDEX data_structures_id_individual_state
   ON data_structures
   USING btree
   (id_individual_state);

Re: Delete performance

From
Frank Bax
Date:
Jana wrote:
>
> Hello,
> i have a table with about 250m records from which i want to delete
> thoose not contained in other table. I used this SQL query:
> DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
> SELECT id_structure FROM data_structures);


DELETE FROM data_structures_items, data_structures WHERE
data_structures_items.id_data_structure =
data_structures.id_data_structure AND
data_structures_items.id_data_structure IS NULL;

Re: Delete performance

From
Frank Bax
Date:
Frank Bax wrote:
> Jana wrote:
>>
>> Hello,
>> i have a table with about 250m records from which i want to delete
>> thoose not contained in other table. I used this SQL query:
>> DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
>> SELECT id_structure FROM data_structures);
>
>
> DELETE FROM data_structures_items, data_structures WHERE
> data_structures_items.id_data_structure =
> data_structures.id_data_structure AND
> data_structures_items.id_data_structure IS NULL;
>

Sorry; that should probably be:

DELETE FROM data_structures_items, data_structures WHERE
data_structures_items.id_data_structure =
data_structures.id_data_structure AND
data_structures.id_data_structure IS NULL;

Re: Delete performance

From
Jana
Date:
On Sun, 14 Jun 2009 18:12:50 +0200, Frank Bax <fbax@sympatico.ca> wrote:

> Jana wrote:
>>  Hello,
>> i have a table with about 250m records from which i want to delete
>> thoose not contained in other table. I used this SQL query:
>> DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
>> SELECT id_structure FROM data_structures);
>
>
> DELETE FROM data_structures_items, data_structures WHERE
> data_structures_items.id_data_structure =
> data_structures.id_data_structure AND
> data_structures_items.id_data_structure IS NULL;
>

Thanks for answer, this however is not a valid syntax (at least according
to manual, and my 8.3)
version. What could be done is

DELETE FROM data_structures_items USING data_structures
WHERE
data_structures_items.id_data_structure=data_structures.id_structure
AND something

but the problem is in that "something". I cannot write
data_structures_items.id_data_structure =
data_structures.id_data_structure AND
data_structures_items.id_data_structure IS NULL;

because it a) doesn't make sense (column can't be null and equal to
something at the same time), b) doesn't select what i want (rows whoose
id_data_structure is NOT in the data_structures table, i'm pretty sure it
is a number ). I can't join tables in DELETE command, and i can't think of
a way doing it with WHERE

Regards,
  Jana

Re: Delete performance

From
Phillip Sitbon
Date:
I've had to do this quite a bit, and here's how I usually go about it:

DELETE FROM data_structures_items WHERE NOT EXISTS
  (SELECT 1 FROM data_structures WHERE id_structure=id_data_structure LIMIT 1);

Even when the item in the subquery is a primary key, I find it
semantically informative to always use "LIMIT 1".

Looking closer at your structure, this looks a lot like a problem I
have had in the past with almost the exact same scenario.  What I
ended up doing beyond the above suggested query was to create a
temporary table. Since you're lucky enough to be working with a
primary key in the large table (I wasn't), you can copy all of the IDs
into temp table. Then delete from the temp table anything that appears
in your data_structures IDs. I'm not sure if it's necessary, but you
could also create an index on the temp table.

So your query becomes:

DELETE FROM data_structures_items WHERE EXISTS
  (SELECT 1 FROM tmp_not_in_data_structures WHERE
id_structure=id_data_structure LIMIT 1);

This should help if the set of IDs to delete is smaller than the
entire set of IDs. It works for me, but I can't be sure it'll help you
- my main goal was to minimize the condition checking overhead for
each row of the big table. Also, I'm not sure if it's
valid/possible/better, but instead of EXISTS, "IS NOT NULL" might work
here as well.

If your indexes are rather large (and therefore may not fit into
memory), you might consider partitioning. Along with the above
solution, I use partitions and operate on them in parallel from a
Python client.

Cheers,

  Phillip

6/15 Jana <jana.vasseru@gmail.com>:
> On Sun, 14 Jun 2009 18:12:50 +0200, Frank Bax <fbax@sympatico.ca> wrote:
>
>> Jana wrote:
>>>
>>>  Hello,
>>> i have a table with about 250m records from which i want to delete thoose
>>> not contained in other table. I used this SQL query:
>>> DELETE FROM data_structures_items WHERE id_data_structure NOT IN (
>>> SELECT id_structure FROM data_structures);
>>
>>
>> DELETE FROM data_structures_items, data_structures WHERE
>> data_structures_items.id_data_structure = data_structures.id_data_structure
>> AND data_structures_items.id_data_structure IS NULL;
>>
>
> Thanks for answer, this however is not a valid syntax (at least according to
> manual, and my 8.3)
> version. What could be done is
>
> DELETE FROM data_structures_items USING data_structures
> WHERE
> data_structures_items.id_data_structure=data_structures.id_structure
> AND something
>
> but the problem is in that "something". I cannot write
> data_structures_items.id_data_structure =
> data_structures.id_data_structure AND
> data_structures_items.id_data_structure IS NULL;
>
> because it a) doesn't make sense (column can't be null and equal to
> something at the same time), b) doesn't select what i want (rows whoose
> id_data_structure is NOT in the data_structures table, i'm pretty sure it
> is a number ). I can't join tables in DELETE command, and i can't think of a
> way doing it with WHERE
>
> Regards,
>  Jana
>
> --
> Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-novice
>

Re: Delete performance

From
Tom Lane
Date:
Phillip Sitbon <phillip@sitbon.net> writes:
> I've had to do this quite a bit, and here's how I usually go about it:

> DELETE FROM data_structures_items WHERE NOT EXISTS
>   (SELECT 1 FROM data_structures WHERE id_structure=id_data_structure LIMIT 1);

> Even when the item in the subquery is a primary key, I find it
> semantically informative to always use "LIMIT 1".

EXISTS implies LIMIT 1 for its subquery; there's no value for either
comprehension or performance in adding that.  I'd recommend leaving
it off, because it makes your query syntax nonstandard for no benefit.

            regards, tom lane