Thread: Best way to delete big amount of records from big table

Best way to delete big amount of records from big table

From: Ekaterina Amez
Hello list,

I'm trying to clean up a database with millions of records of
useless-but-don't-remove-just-in-case data. The database has all its tables
in the public schema, so I've created a new schema "old_data" to move all
this data into. I have several tables with around 20 million records that
I've managed to clean up relatively quickly without special effort (no need
to drop indexes or constraints). What I did with these tables was easy
because they will end up completely empty (I only have to keep the tables
themselves), so I just insert the data into old_data.new_table and then
truncate the original with CASCADE.
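Roughly like this for each such table (some_table is a placeholder name, not
one of the real tables):

-- old_data.some_table was created beforehand with the same structure
INSERT INTO old_data.some_table SELECT * FROM public.some_table;
-- then empty the original, including anything that depends on it
TRUNCATE TABLE public.some_table CASCADE;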

But I'm also cleaning tables with 150 million records where I'm going to
remove about 60% of the existing data, and after a few tests I'm not sure
what the best approach is, as all of them seem to take a similar time to
run. These tables come in groups of four with a master, detail,
master_history, detail_history structure. None of the tables has a primary
key, foreign key or any other constraint; there is only the sequence used
for what should be the PK column, though that column is not defined as the PK.

I've decided to delete from the last one in chunks (10 days of data per
chunk, but it could be any other quantity), so I've created a function.
I've tested it with indexes (on master_hist for filtering data and on
detail_hist for the FK and PK), without indexes, and after analyzing the
table, and no matter what I always end up with more or less the same
execution time. I can afford the time it is taking to run, but I'd like to
know if there is a better way to do this. I'm testing on version 9.2, BUT
the production server is 8.4 (legacy application, supposedly on at least
9.2, but we recently discovered it was 8.4; an upgrade is planned, but not
now). Config parameters are the default ones.

Table definitions:

CREATE TABLE master (
   id serial NOT NULL,
   device_id int4 NOT NULL,
   col1 int4 NULL DEFAULT 0,
   data_date bpchar(17) NULL, -- field to filter data
   data_file_date bpchar(14) NULL
); -- 9 of 20 records to be removed

CREATE TABLE detail (
   id serial NOT NULL,
   parent_id int4 NOT NULL,
   col1 float8 NULL,
   col2 int4 NOT NULL
); -- 2304 of 5120 records to be removed

CREATE TABLE master_history (
   id serial NOT NULL,
   device_id int4 NOT NULL,
   col1 int4 NULL DEFAULT 0,
   data_date bpchar(17) NULL, -- field to filter data
   data_file_date bpchar(14) NULL
); -- 355687 of 586999 records to be removed

CREATE TABLE detail_history (
   id serial NOT NULL,
   parent_id int4 NOT NULL,
   col1 float8 NULL,
   col2 int4 NOT NULL
); -- 91055872 of 150271744 records to be removed


And the function:

CREATE or replace FUNCTION delete_test() RETURNS integer AS $$
DECLARE
    _begin_date date;
    _end_date date := '2019-08-01';
    _begin_exec timestamp := clock_timestamp();
    _end_exec timestamp;
    _begin_exec_partial timestamp;
    _end_exec_partial timestamp;
    _time double precision;
    _num_regs integer;
BEGIN
    -- Walk forward from 2018-05-01 in 10-day steps until _end_date is reached.
    for _begin_date in (select '2018-05-01'::date + s.a * '10 days'::interval
                          from (select generate_series(0,1000) as a) as s)
    loop
        if (_begin_date > _end_date) then
            raise log 'STOP!!!!!';
            exit;
        end if;
        raise log 'Date %', _begin_date;
        _begin_exec_partial := clock_timestamp();
        -- Delete the detail rows whose master row falls inside the 10-day window.
        delete from public.detail_history t1
          where exists
            (select 1 from public.master_history t2
              where t2.id = t1.parent_id
                and t2.data_date >= rpad(to_char(_begin_date, 'YYYYMMDD'), 17, '0')
                and t2.data_date < rpad(to_char((_begin_date + interval '10 days'), 'YYYYMMDD'), 17, '0'));
        GET DIAGNOSTICS _num_regs = ROW_COUNT;
        _end_exec_partial := clock_timestamp();
        _time := 1000 * ( extract(epoch from _end_exec_partial) - extract(epoch from _begin_exec_partial) );
        raise log 'Records removed % in % ms', _num_regs, _time;
    end loop;

    _end_exec := clock_timestamp();
    _time := 1000 * ( extract(epoch from _end_exec) - extract(epoch from _begin_exec) );
    raise log 'Total time: %', _time;
    return 0;
END;
$$ LANGUAGE plpgsql;


The DELETE execution plan on 8.4 is:

test_eka=# explain delete from public.detail_hist t1
test_eka-#   where exists
test_eka-#     (select 1 from public.master_hist t2
test_eka(#       where t2.id = t1.parent_id
test_eka(#         and t2.data_date >= '20180501000000000000000'
test_eka(#         and t2.data_date < '20190101000000000000000');
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=33431.46..5890182.88 rows=156649104 width=6)
   Hash Cond: (t1.parent_id = t2.id)
   ->  Seq Scan on detail_hist t1  (cost=0.00..2564256.04 rows=156649104 width=10)
   ->  Hash  (cost=30922.13..30922.13 rows=152906 width=4)
         ->  Unique  (cost=30157.60..30922.13 rows=152906 width=4)
               ->  Sort  (cost=30157.60..30539.87 rows=152906 width=4)
                     Sort Key: t2.id
                     ->  Seq Scan on master_hist t2  (cost=0.00..14897.65 rows=152906 width=4)
                           Filter: ((data_date >= '20180501000000000000000'::bpchar) AND (data_date < '20190101000000000000000'::bpchar))


After creating the PK and FK (with an index on the FK column):

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=26678.41..5883424.77 rows=156648960 width=6)
   Hash Cond: (t1.id_param_espec_este = t2.id_param_espec_este_historico)
   ->  Seq Scan on param_espec_este_datos_historico_tbl t1  (cost=0.00..2564254.60 rows=156648960 width=10)
   ->  Hash  (cost=24169.09..24169.09 rows=152906 width=4)
         ->  Unique  (cost=23404.56..24169.09 rows=152906 width=4)
               ->  Sort  (cost=23404.56..23786.82 rows=152906 width=4)
                     Sort Key: t2.id_param_espec_este_historico
                     ->  Index Scan using fecha_gps_pe_este_hist_idx on param_espec_este_historico_tbl t2  (cost=0.00..8144.60 rows=152906 width=4)
                           Index Cond: ((fecha_gps_parametros_espectrales >= '20180501000000000000000'::bpchar) AND (fecha_gps_parametros_espectrales < '20190101000000000000000'::bpchar))


Any ideas are welcome.

Kind regards,

Ekaterina.





Re: Best way to delete big amount of records from big table

From: Michael Lewis
If you can afford the time, I am not sure the reason for the question. Just run it and be done with it, yes?

A couple of thoughts-
1) That is a big big transaction if you are doing all the cleanup in a single function call. Will this be a production system that is still online for this archiving? Having a plpgsql function that encapsulates the work seems fine, but I would limit the work to a month at a time or something and call the function repeatedly. Get the min month where records exist still, delete everything matching that, return. Rinse, repeat.
2) If you are deleting/moving most of the table (91 of 150 million), consider moving only the records you are keeping to a new table, renaming old table, and renaming new table back to original name. Then you can do what you want to shift the data in the old table and delete it.
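
A minimal sketch of option 2 applied to the biggest table in the thread (the
cutoff value and the _new/_old names are illustrative; indexes, column
defaults and the sequence would have to be recreated on the new table):

-- Copy only the rows to keep into a fresh table.
CREATE TABLE public.detail_history_new AS
    SELECT t1.*
      FROM public.detail_history t1
     WHERE EXISTS (SELECT 1
                     FROM public.master_history t2
                    WHERE t2.id = t1.parent_id
                      AND t2.data_date >= '20190801000000000000000');  -- keep range, cutoff illustrative

-- Swap the tables; the old data stays around under the _old name.
ALTER TABLE public.detail_history RENAME TO detail_history_old;
ALTER TABLE public.detail_history_new RENAME TO detail_history;

-- detail_history_old can then be archived into old_data and dropped at leisure.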

Re: Best way to delete big amount of records from big table

From: Laurenz Albe
On Fri, 2020-03-27 at 15:13 +0100, Ekaterina Amez wrote:
> I'm trying to clean up a database with millions of records of 
> useless-but-don't-remove-just-in-case data. [...]
> 
> But I'm also cleaning tables with 150 million records where I'm going to
> remove about 60% of the existing data, and after a few tests I'm not sure
> what the best approach is, as all of them seem to take a similar time to
> run. These tables come in groups of four with a master, detail,
> master_history, detail_history structure. None of the tables has a primary
> key, foreign key or any other constraint; there is only the sequence used
> for what should be the PK column, though that column is not defined as the PK.

You should define primary and foreign keys if you can, but I guess
I don't have to tell you that.

> I've decided to delete from the last one in chunks (10 days of data per
> chunk, but it could be any other quantity), so I've created a function.
> I've tested it with indexes (on master_hist for filtering data and on
> detail_hist for the FK and PK), without indexes, and after analyzing the
> table, and no matter what I always end up with more or less the same
> execution time. I can afford the time it is taking to run, but I'd like to
> know if there is a better way to do this.

There is no need to delete in batches unless you have a need to keep
transactions short (danger of deadlock because the data are still
modified, or you cannot afford to block autovacuum that long).

If you can drop the indexes while you do it (downtime), go for it.
Perhaps there is a way to use partial indexes that exclude all the
data that you have to delete, then work could go on as normal.
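
As a rough illustration of the partial-index idea (the cutoff value is made
up from the dates used elsewhere in the thread), an index that only covers
the rows being kept still serves normal queries but never has to be
maintained for the rows being deleted:

CREATE INDEX master_history_current_idx
    ON public.master_history (data_date)
 WHERE data_date >= '20190801000000000000000';  -- only the rows that will be kept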

> I'm testing on version 9.2, BUT the production server is 8.4 (legacy
> application, supposedly on at least 9.2, but we recently discovered it was
> 8.4; an upgrade is planned, but not now). Config parameters are the
> default ones.

Now that is a seriously bad idea.  You should test on the same version
as you have running in production.  And you should insist on an upgrade.
People who insist on running ancient software often insist on ancient
hardware as well, and both are a good way to get data corruption.
If the system blows up, they are going to blame you.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Best way to delete big amount of records from big table

From: Ekaterina Amez
Sorry, I sent my response only to you; I'm sending it again to the group in a minute...

On Fri, Mar 27, 2020 at 15:41, Michael Lewis (<mlewis@entrata.com>) wrote:
> If you can afford the time, I am not sure the reason for the question. Just run it and be done with it, yes?
>
> A couple of thoughts-
> 1) That is a big big transaction if you are doing all the cleanup in a single function call. Will this be a production system that is still online for this archiving? Having a plpgsql function that encapsulates the work seems fine, but I would limit the work to a month at a time or something and call the function repeatedly. Get the min month where records exist still, delete everything matching that, return. Rinse, repeat.
> 2) If you are deleting/moving most of the table (91 of 150 million), consider moving only the records you are keeping to a new table, renaming old table, and renaming new table back to original name. Then you can do what you want to shift the data in the old table and delete it.

Re: Best way to delete big amount of records from big table

From: Ekaterina Amez
Hi Michael,

On Fri, Mar 27, 2020 at 15:41, Michael Lewis (<mlewis@entrata.com>) wrote:
> If you can afford the time, I am not sure the reason for the question. Just run it and be done with it, yes?

I've been working with other RDBMSs all my life and I'm quite new to the PG world, and I'm learning to do things as I need them, so I'm trying to learn to do them the right way :D
Also, from what I'm seeing in other projects, this is going to be a problem in most of them (if it isn't already), and I'm going to be the one who solves it, so again I'm on the path of learning to do this kind of thing the right way.
 

> A couple of thoughts-
> 1) That is a big big transaction if you are doing all the cleanup in a single function call. Will this be a production system that is still online for this archiving? Having a plpgsql function that encapsulates the work seems fine, but I would limit the work to a month at a time or something and call the function repeatedly. Get the min month where records exist still, delete everything matching that, return. Rinse, repeat.

OK, the function provided is just a first approach. I was planning to add parameters to make the dates more flexible.
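Something along these lines, for example (the function name and signature are just an illustration; the DELETE is the same one used in the original function):

CREATE OR REPLACE FUNCTION delete_chunk(_from date, _to date) RETURNS integer AS $$
DECLARE
    _num_regs integer;
BEGIN
    -- Delete the detail rows whose master row falls inside [_from, _to).
    DELETE FROM public.detail_history t1
     WHERE EXISTS (SELECT 1
                     FROM public.master_history t2
                    WHERE t2.id = t1.parent_id
                      AND t2.data_date >= rpad(to_char(_from, 'YYYYMMDD'), 17, '0')
                      AND t2.data_date <  rpad(to_char(_to, 'YYYYMMDD'), 17, '0'));
    GET DIAGNOSTICS _num_regs = ROW_COUNT;
    RETURN _num_regs;
END;
$$ LANGUAGE plpgsql;

-- e.g. one 10-day chunk per call, each call in its own transaction:
-- SELECT delete_chunk('2018-05-01', '2018-05-11');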

> 2) If you are deleting/moving most of the table (91 of 150 million), consider moving only the records you are keeping to a new table, renaming old table, and renaming new table back to original name. Then you can do what you want to shift the data in the old table and delete it.

I was aware of this solution, but I've read it's not side-effect free. As my tables don't have any kind of PK or FK, only the sequences for the serial columns, would this be a safe way to do what I want?

Re: Best way to delete big amount of records from big table

From: Rick Otten


On Fri, Mar 27, 2020 at 10:14 AM Ekaterina Amez <ekaterina.amez@zunibal.com> wrote:

> there is a better way to do this. I'm testing on version 9.2, BUT the
> production server is 8.4 (legacy application, supposedly on at least 9.2,
> but we recently discovered it was 8.4; an upgrade is planned, but not now).
> Config parameters are the default ones.

PostgreSQL 8.4 came out in 2009 and hit EOL in 2014.   PostgreSQL 9.2 hit EOL in 2017.


Re: Best way to delete big amount of records from big table

From: Justin Pryzby
On Fri, Mar 27, 2020 at 08:41:04AM -0600, Michael Lewis wrote:
> 2) If you are deleting/moving most of the table (91 of 150 million),
> consider moving only the records you are keeping to a new table, renaming
> old table, and renaming new table back to original name. Then you can do
> what you want to shift the data in the old table and delete it.

You could also make the old table a child of (inherit from) the new table.
That lets you move the rows and remove them as separate steps.
Partitioning (with legacy inheritance or the new, integrated way available in
Postgres 10) allows DROPping the oldest tables rather than DELETEing from one
gigantic table.
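
Sketched out, assuming the keep-and-rename approach above already produced a
new detail_history and a detail_history_old (the names are illustrative):

-- Attach the old table as a child: queries on detail_history still see the
-- old rows through inheritance.
ALTER TABLE public.detail_history_old INHERIT public.detail_history;

-- When the old rows are no longer needed, detach and drop them in one cheap
-- step instead of DELETEing them row by row.
ALTER TABLE public.detail_history_old NO INHERIT public.detail_history;
DROP TABLE public.detail_history_old;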

You should consider somehow cleaning up the old table after you DELETE from it,
maybe using vacuum full (which requires a long exclusive lock) or pg_repack
(which briefly acquires an exclusive lock).
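
For example, just the plain commands (pg_repack is a separate extension and
is not assumed to be installed here):

-- Reclaims the space left behind by the deleted rows; holds an
-- ACCESS EXCLUSIVE lock for the whole run, so schedule a window for it.
VACUUM FULL VERBOSE public.detail_history;
-- Refresh planner statistics afterwards.
ANALYZE public.detail_history;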

-- 
Justin



Re: Best way to delete big amount of records from big table

From: Ekaterina Amez
Hi Laurenz,

On Fri, Mar 27, 2020 at 15:46, Laurenz Albe (<laurenz.albe@cybertec.at>) wrote:
> On Fri, 2020-03-27 at 15:13 +0100, Ekaterina Amez wrote:
> > I'm trying to clean up a database with millions of records of
> > useless-but-don't-remove-just-in-case data. [...]
> >
> > But I'm also cleaning tables with 150 million records where I'm going to
> > remove about 60% of the existing data, and after a few tests I'm not sure
> > what the best approach is, as all of them seem to take a similar time to
> > run. These tables come in groups of four with a master, detail,
> > master_history, detail_history structure. None of the tables has a primary
> > key, foreign key or any other constraint; there is only the sequence used
> > for what should be the PK column, though that column is not defined as the PK.
>
> You should define primary and foreign keys if you can, but I guess
> I don't have to tell you that.

I know about DB design ;)
This master-detail-master_hist-detail_hist structure is repeated all over the DB, and the other groups of tables are properly created with their PK, FK, UNIQUE and index definitions... I don't know why these ones weren't created the same way.
Excuse me if this is a silly question, but I've read (or understood) that it's better to remove constraints to improve delete performance... does this apply to indexes only, or also to PKs and FKs?


> > I've decided to delete from the last one in chunks (10 days of data per
> > chunk, but it could be any other quantity), so I've created a function.
> > I've tested it with indexes (on master_hist for filtering data and on
> > detail_hist for the FK and PK), without indexes, and after analyzing the
> > table, and no matter what I always end up with more or less the same
> > execution time. I can afford the time it is taking to run, but I'd like
> > to know if there is a better way to do this.
>
> There is no need to delete in batches unless you have a need to keep
> transactions short (danger of deadlock because the data are still
> modified, or you cannot afford to block autovacuum that long).

I prefer doing it in batches because I know there are other processes accessing this table and I can't guarantee they won't change any data.


> If you can drop the indexes while you do it (downtime), go for it.
> Perhaps there is a way to use partial indexes that exclude all the
> data that you have to delete, then work could go on as normal.

As I said, these particular tables don't have any indexes at all. I'll give the partial index suggestion a try, thanks.


> > I'm testing on version 9.2, BUT the production server is 8.4 (legacy
> > application, supposedly on at least 9.2, but we recently discovered it
> > was 8.4; an upgrade is planned, but not now). Config parameters are the
> > default ones.
>
> Now that is a seriously bad idea.  You should test on the same version
> as you have running in production.  And you should insist on an upgrade.
> People who insist on running ancient software often insist on ancient
> hardware as well, and both are a good way to get data corruption.
> If the system blows up, they are going to blame you.

Believe me, I'm totally aware of all of this. The upgrade is planned to happen after I clean up the database. I'm the one who discovered that the production server is so old; it looked like no one knew it before. In the time I've been working here I've upgraded two servers.




Regards,
Ekaterina

Re: Best way to delete big amount of records from big table

From: Laurenz Albe
On Fri, 2020-03-27 at 16:15 +0100, Ekaterina Amez wrote:
> > You should define primary and foreign keys if you can, but I guess
> > I don't have to tell you that.
> 
> Excuse me if this is a silly question, but I've read (or understood) that it's better
> to remove constraints to improve delete performance... does this apply to indexes only, or also to PKs and FKs?

I meant, add constraints *after* you are done cleaning up.
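
For the tables from the first post, that would look roughly like this once
the cleanup is finished (constraint and index names are illustrative; these
statements take strong locks while they validate, so run them in a quiet
moment):

ALTER TABLE public.master_history
    ADD CONSTRAINT master_history_pkey PRIMARY KEY (id);
ALTER TABLE public.detail_history
    ADD CONSTRAINT detail_history_pkey PRIMARY KEY (id);

-- Index the referencing column so the FK check and future deletes are fast.
CREATE INDEX detail_history_parent_id_idx ON public.detail_history (parent_id);

ALTER TABLE public.detail_history
    ADD CONSTRAINT detail_history_parent_id_fkey
    FOREIGN KEY (parent_id) REFERENCES public.master_history (id);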

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com