Best way to delete big amount of records from big table - Mailing list pgsql-performance

From Ekaterina Amez
Subject Best way to delete big amount of records from big table
Hello list,

I'm trying to clean up a database with millions of records of
useless-but-don't-remove-just-in-case data. This database has all of its
tables in the public schema, so I've created a new schema, "old_data", to
move all of this data into. I have several tables with 20 million records
or so that I've managed to clean up relatively quickly without any special
effort (no need to drop indexes or constraints). These tables were easy
because they are going to be emptied completely (I have to keep the tables
themselves), so I only have to insert the data into old_data.new_table and
then truncate ... cascade, roughly as shown below.
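For reference, the pattern is roughly this (the table name is illustrative):

-- copy the rows aside, then empty the original table
CREATE TABLE old_data.some_table AS
    SELECT * FROM public.some_table;

TRUNCATE TABLE public.some_table CASCADE;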

But I'm also cleaning tables with 150 million records, where I'm going to
remove about 60% of the existing data, and after a few tests I'm not sure
what the best approach is, as they all seem to take a similar time to run.
These tables are grouped in sets of four: master, detail, master_history
and detail_history. None of the tables has a primary key, a foreign key
or any other constraint; the only related object is the sequence that
feeds what should be the PK column, though that column is not actually
defined as a PK.

I've decided to delete from the last one (detail_history) in chunks
(10 days of data per chunk, though it could be any other quantity), so
I've created a function. I've tested it with indexes (on master_history
for filtering the data and on detail_history for the FK and PK columns,
roughly as sketched below), without indexes, and after analyzing the
tables, and no matter what I always end up with more or less the same
execution time. I can afford the time it's taking to run, but I'd like
to know if there is a better way to do this. I'm testing on version 9.2,
BUT the production server is 8.4 (legacy application; it was supposed to
be on at least 9.2, but we recently discovered it was 8.4; an upgrade is
planned, but not for now). Config parameters are the defaults.
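The indexes I tried look roughly like this (index names are illustrative):

-- supports the data_date range filter on the master history table
CREATE INDEX master_history_data_date_idx ON master_history (data_date);

-- supports the join from detail_history back to master_history
CREATE INDEX detail_history_parent_id_idx ON detail_history (parent_id);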

Table definition:

CREATE TABLE master (
   id serial NOT NULL,            -- serial: integer fed by a sequence, no PK
   device_id int4 NOT NULL,
   col1 int4 NULL DEFAULT 0,
   data_date bpchar(17) NULL,     -- field used to filter data
   data_file_date bpchar(14) NULL
); -- 9 of 20 records to be removed

CREATE TABLE detail (
   id serial NOT NULL,
   parent_id int4 NOT NULL,
   col1 float8 NULL,
   col2 int4 NOT NULL
); -- 2304 of 5120 records to be removed

CREATE TABLE master_history (
   id serial NOT NULL,
   device_id int4 NOT NULL,
   col1 int4 NULL DEFAULT 0,
   data_date bpchar(17) NULL,     -- field used to filter data
   data_file_date bpchar(14) NULL
); -- 355687 of 586999 records to be removed

CREATE TABLE detail_history (
   id serial NOT NULL,
   parent_id int4 NOT NULL,
   col1 float8 NULL,
   col2 int4 NOT NULL
); -- 91055872 of 150271744 records to be removed


And the function:

CREATE OR REPLACE FUNCTION delete_test() RETURNS integer AS $$
DECLARE
    _begin_date date;
    _end_date date := '2019-08-01';
    _begin_exec timestamp := clock_timestamp();
    _end_exec timestamp;
    _begin_exec_partial timestamp;
    _end_exec_partial timestamp;
    _time double precision;
    _num_regs integer;
BEGIN
    -- Walk from 2018-05-01 in 10-day steps until _end_date is reached.
    for _begin_date in (select '2018-05-01'::date + s.a * '10 days'::interval
                          from (select generate_series(0, 1000) as a) as s)
    loop
        if (_begin_date > _end_date) then
            raise log 'STOP!!!!!';
            exit;
        end if;
        raise log 'Date %', _begin_date;
        _begin_exec_partial := clock_timestamp();
        -- Delete the detail rows whose master row falls inside this 10-day chunk.
        delete from public.detail_history t1
          where exists
            (select 1 from public.master_history t2
              where t2.id = t1.parent_id
                and t2.data_date >= rpad(to_char(_begin_date, 'YYYYMMDD'), 17, '0')
                and t2.data_date < rpad(to_char((_begin_date + interval '10 days'), 'YYYYMMDD'), 17, '0'));
        GET DIAGNOSTICS _num_regs = ROW_COUNT;
        _end_exec_partial := clock_timestamp();
        _time := 1000 * ( extract(epoch from _end_exec_partial) - extract(epoch from _begin_exec_partial) );
        raise log 'Records removed % in % ms', _num_regs, _time;
    end loop;

    _end_exec := clock_timestamp();
    _time := 1000 * ( extract(epoch from _end_exec) - extract(epoch from _begin_exec) );
    raise log 'Total time: %', _time;
    return 0;
END;
$$ LANGUAGE plpgsql;
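I run it with a plain SELECT. Note that RAISE LOG output goes to the server
log by default; to see the timings in psql, lower client_min_messages first:

SET client_min_messages = log;  -- show LOG-level messages in the client
SELECT delete_test();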


The execution plan for the delete on 8.4 is:

test_eka=# explain delete from public.detail_hist t1
test_eka-#   where exists
test_eka-#     (select 1 from public.master_hist t2
test_eka(#       where t2.id = t1.parent_id
test_eka(#         and t2.data_date >= '20180501000000000000000'
test_eka(#         and t2.data_date < '20190101000000000000000');
                                    QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=33431.46..5890182.88 rows=156649104 width=6)
   Hash Cond: (t1.parent_id = t2.id)
   ->  Seq Scan on detail_hist t1  (cost=0.00..2564256.04 rows=156649104 width=10)
   ->  Hash  (cost=30922.13..30922.13 rows=152906 width=4)
         ->  Unique  (cost=30157.60..30922.13 rows=152906 width=4)
               ->  Sort  (cost=30157.60..30539.87 rows=152906 width=4)
                     Sort Key: t2.id
                     ->  Seq Scan on master_hist t2  (cost=0.00..14897.65 rows=152906 width=4)
                           Filter: ((data_date >= '20180501000000000000000'::bpchar) AND (data_date < '20190101000000000000000'::bpchar))

After creating the PK and FK (plus an index on the FK column, like the parent_id index sketched earlier), with DDL roughly as follows (constraint names are illustrative):
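ALTER TABLE master_history
    ADD CONSTRAINT master_history_pk PRIMARY KEY (id);

ALTER TABLE detail_history
    ADD CONSTRAINT detail_history_parent_fk
    FOREIGN KEY (parent_id) REFERENCES master_history (id);

the plan becomes: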

                                    QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=26678.41..5883424.77 rows=156648960 width=6)
   Hash Cond: (t1.id_param_espec_este = t2.id_param_espec_este_historico)
   ->  Seq Scan on param_espec_este_datos_historico_tbl t1  (cost=0.00..2564254.60 rows=156648960 width=10)
   ->  Hash  (cost=24169.09..24169.09 rows=152906 width=4)
         ->  Unique  (cost=23404.56..24169.09 rows=152906 width=4)
               ->  Sort  (cost=23404.56..23786.82 rows=152906 width=4)
                     Sort Key: t2.id_param_espec_este_historico
                     ->  Index Scan using fecha_gps_pe_este_hist_idx on param_espec_este_historico_tbl t2  (cost=0.00..8144.60 rows=152906 width=4)
                           Index Cond: ((fecha_gps_parametros_espectrales >= '20180501000000000000000'::bpchar) AND (fecha_gps_parametros_espectrales < '20190101000000000000000'::bpchar))


Any ideas are welcome.

Kind regards,

Ekaterina.
