Thread: BUG #18166: 100 Gb 18000000 records table update

BUG #18166: 100 Gb 18000000 records table update

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18166
Logged by:          Ruslan Ganeev
Email address:      ruslan.ganeev@list.ru
PostgreSQL version: 12.6
Operating system:   CentOS 8
Description:

Greetings,
Could you help us with the following?

We have a table named «svjul» that contains 18 million records and occupies
100 GB of disk space.

The table «svjul» contains 3 columns:
-  ID (Type: Integer) – the primary key
-  enddate (Type: Date)
-  Data (Type: JSONB). This column contains the fields «OGRN» and «DataVip».
The «OGRN» value is not unique: several records may contain the same OGRN.

We tried to write a script that sets enddate = '2022-12-31' for all records
whose «DataVip» value is not the maximal one; for the other records the
script sets enddate = null.
The problem is that the script runs for 6 hours, and most of that time is
spent rebuilding the indexes.

Would you be so kind as to tell us whether there is any way to update the
records without rebuilding the indexes, or whether there is some programming
technique for solving this kind of problem?

Sorry for my English, I’m not a native speaker

Best regards

--Our script--

-- 0 minutes: temporary table for the key fields extracted from the JSONB column
drop table if exists tempsvjul1;
create table tempsvjul1 (
    id bigint not null,
    code text not null,
    datapostuch text,
    datavip text,
    constraint tempsvjul1_pkey primary key (id)
);

-- 30 minutes: pull OGRN, DataPostUch and DataVip out of every row of svjul
insert into tempsvjul1 (id, code, datapostuch, datavip)
select id,
       data ->> 'OGRN',
       data -> 'SvUchetNO' ->> 'DataPostUch',
       data ->> 'DataVip'
from svjul;

-- 1 minute: index supporting the DISTINCT ON query below
create index idx_deduplicate_tempsvjul1_btree_codedpostdvipid
    on tempsvjul1 using btree (code, datapostuch desc, datavip desc, id desc);

-- 0 minutes
drop table if exists tempsvjul2;
create table tempsvjul2 (
    id bigint not null,
    constraint tempsvjul2_pkey primary key (id)
);

-- 2 minutes: keep one id per OGRN (code), namely the row that sorts first
-- by datapostuch desc, datavip desc, id desc
insert into tempsvjul2 (id)
select distinct on (code) id
from tempsvjul1
order by code, datapostuch desc, datavip desc, id desc;

-- 4 minutes without indexes, 6 hours with indexes
update svjul set
    code = coalesce((select tempsvjul1.code
                     from tempsvjul1
                     where tempsvjul1.id = svjul.id), '???'),
    enddate = case when exists (select 1 from tempsvjul2
                                where tempsvjul2.id = svjul.id)
                   then null
                   else '2023-10-05'::date
              end;

-- 0 minutes
drop table tempsvjul1;
drop table tempsvjul2;


Re: BUG #18166: 100 Gb 18000000 records table update

From: Tom Lane
Date: Saturday, 21 October 2023, 04:22 +03:00
PG Bug reporting form <noreply@postgresql.org> writes:
> We tried to write a script that sets enddate = '2022-12-31' for all
> records whose «DataVip» value is not the maximal one; for the other
> records the script sets enddate = null.
> The problem is that the script runs for 6 hours, and most of that time
> is spent rebuilding the indexes.

This is not a bug.  However ... a common workaround for bulk updates
like that is to drop all the table's indexes and then recreate them
afterwards.  It's often quicker than doing row-by-row index updates.

            regards, tom lane
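
A minimal sketch of that workaround, assuming a hypothetical secondary index
idx_svjul_data_ogrn on svjul (the real index names would have to be looked up
first, e.g. in pg_indexes); only the enddate part of the UPDATE from the
script above is shown:

-- Drop the secondary indexes before the bulk update; the primary key stays.
drop index if exists idx_svjul_data_ogrn;   -- hypothetical index name

-- The whole-table update now only has to maintain the primary key index.
update svjul set
    enddate = case when exists (select 1 from tempsvjul2
                                where tempsvjul2.id = svjul.id)
                   then null
                   else '2023-10-05'::date
              end;

-- Recreate the dropped index afterwards in a single sequential build.
create index idx_svjul_data_ogrn on svjul using btree ((data ->> 'OGRN'));

Rebuilding an index once after the update is usually much faster than
maintaining its entries row by row across 18 million updated rows; the
trade-off is that queries cannot use the dropped index while the update and
the rebuild are running.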



Re[2]: BUG #18166: 100 Gb 18000000 records table update

From: Ruslan Ganeev

Thank you very much

--
Sent from Mail.ru for Android
