Thread: UPDATE on 20 Million Records Transaction or not?
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20 million records and have been researching best practices. I will remove all indexes and foreign keys prior to the update; however, I am not sure whether I should use a transaction or not.
My thought process is that a transaction would be easier to recover from if something fails; however, it would take more time to write to the WAL in a transaction.
Would it make sense to make a backup of the table and then execute the update without a transaction? How would you guys do it?
Thanks,
Jason Ralph
On 6/23/20 6:32 AM, Jason Ralph wrote:
> Hello List,
>
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
>
> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.

Unless this is an UNLOGGED table WALs will be written.

> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?

You could break it down into multiple transactions if there is a way to specify ranges of records.

--
Adrian Klaver
adrian.klaver@aklaver.com
Thanks Adrian,

> You could break it down into multiple transactions if there is a way to specify ranges of records.

Say I couldn't break it up, would it be faster in or out of the transaction?

Jason Ralph
> Thanks Adrian,
>
>> You could break it down into multiple transactions if there is a way to specify ranges of records.
>
> Say I couldn't break it up, would it be faster in or out of the transaction?

It depends whether you have concurrent transactions using the table. If you do, I think it would be better to split the update between smaller transactions.
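As a rough sketch of the range idea (illustration only: it assumes a hypothetical table big_table with an integer primary key id and a made-up column some_col; in autocommit mode each statement is its own transaction):

UPDATE big_table SET some_col = 'new value'
 WHERE some_col = 'old value' AND id >= 1       AND id < 1000000;
UPDATE big_table SET some_col = 'new value'
 WHERE some_col = 'old value' AND id >= 1000000 AND id < 2000000;
-- ...and so on, one slice per statement, until the highest id is covered.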
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
Are you updating every row in the table?
Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Would it make sense to make a back up of the table then execute update without a transaction?
Always make a backup.
How would you guys do it?
It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
Angular momentum makes the world go 'round.
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>Are you updating every row in the table?
No, I am using an update like so: UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07'
DB=# select count(*) from members where regdate = '2020-07-07';
count
----------
17333090
(1 row)
>Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Just checked, regdate is not indexed, so I will leave them in place.
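One quick way to double-check that (a sketch; members and regdate are the names used in this thread):

-- Any index on members whose definition mentions regdate; zero rows
-- returned means the column is not indexed.
SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'members'
   AND indexdef ILIKE '%regdate%';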
Would it make sense to make a back up of the table then execute update without a transaction?
>Always make a backup.
Agreed
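For a quick safety net before the update (a sketch; the backup names are just examples), either copy the table in place or dump only that table:

-- In-database copy; note it roughly doubles the disk space used by the table.
CREATE TABLE members_backup AS TABLE members;

-- Or, from the shell, a compressed single-table dump:
-- pg_dump -Fc -t members DB > members_before_update.dump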
How would you guys do it?
>It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
Please see above, thanks
Jason Ralph
From: Ron <ronljohnsonjr@gmail.com>
Sent: Tuesday, June 23, 2020 10:57 AM
To: pgsql-general@lists.postgresql.org
Subject: Re: UPDATE on 20 Million Records Transaction or not?
On 6/23/20 8:32 AM, Jason Ralph wrote:
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
Are you updating every row in the table?
Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Would it make sense to make a back up of the table then execute update without a transaction?
Always make a backup.
How would you guys do it?
It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
--
Angular momentum makes the world go 'round.
I'd make a copy of the table, and test how long the various methods take.
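Something along those lines could be as simple as (a sketch; members_copy is a throwaway name):

\timing on
CREATE TABLE members_copy AS TABLE members;
UPDATE members_copy SET regdate='2038-01-18' WHERE regdate='2020-07-07';
-- Repeat on a fresh copy for each variant (single statement vs. batches)
-- and compare the reported times.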
> Just update them and be done with it. Do the work in batches if it doesn't matter that concurrent accesses to the table might see some rows that have old value and some new. Given you are on PG11 and can commit within a function, you could adapt something like this to just run until done. Oh, if you have not tuned the autovacuum process much, depending on the total number of rows in your table, it could be good to manually vacuum once or twice during this process so that space is reused.
> *the below was shared by Andreas Krogh on one of these mailing lists about 6 months back.
>do $_$
>declare
> num_rows bigint;
>begin
> loop
> delete from YourTable where id in
> (select id from YourTable where id < 500 limit 100);
> commit;
> get diagnostics num_rows = row_count;
> raise notice 'deleted % rows', num_rows;
> exit when num_rows = 0;
> end loop;
> end;$_$;
Thanks for all the suggestions. I really like the function; I will test this. I have autovacuum fully tuned for this table, so it should be good.
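If a manual pass does turn out to be useful partway through, it would be something like (table name from this thread):

-- Reclaims the dead row versions left behind by already-updated batches
-- so later batches can reuse the space.
VACUUM (VERBOSE) members;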
Jason Ralph
From: Michael Lewis <mlewis@entrata.com>
Sent: Tuesday, June 23, 2020 1:43 PM
To: Ganesh Korde <ganeshakorde@gmail.com>
Cc: Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: UPDATE on 20 Million Records Transaction or not?
>Are you updating every row in the table?
No I am using an update like so: UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07'
DB=# select count(*) from members where regdate = '2020-07-07';
count
----------
17333090
(1 row)
Just update them and be done with it. Do the work in batches if it doesn't matter that concurrent accesses to the table might see some rows that have old value and some new. Given you are on PG11 and can commit within a function, you could adapt something like this to just run until done. Oh, if you have not tuned the autovacuum process much, depending on the total number of rows in your table, it could be good to manually vacuum once or twice during this process so that space is reused.
*the below was shared by Andreas Krogh on one of these mailing lists about 6 months back.
do $_$
declare
    num_rows bigint;
begin
    loop
        delete from YourTable where id in
            (select id from YourTable where id < 500 limit 100);
        commit;
        get diagnostics num_rows = row_count;
        raise notice 'deleted % rows', num_rows;
        exit when num_rows = 0;
    end loop;
end;$_$;
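Adapted to the regdate update discussed above (a sketch only: it assumes members has a primary key column id, the batch size is arbitrary, and note that without an index on regdate every pass re-scans the table, so the batches trade speed for smaller transactions):

do $_$
declare
    num_rows bigint;
begin
    loop
        update members set regdate = '2038-01-18'
         where id in (select id from members
                       where regdate = '2020-07-07' limit 10000);
        get diagnostics num_rows = row_count;
        commit;
        raise notice 'updated % rows', num_rows;
        exit when num_rows = 0;
    end loop;
end;$_$;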
Jason Ralph <jralph@affinitysolutions.com> writes:
> Hello List,
>
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>
> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?

This is really the sort of thing you need to test in a development environment. There are just too many unknowns to provide a definitive answer. You need to run this with minimal 'adjustments' in a dev scenario to get a baseline and then decide what, if anything, you need to do.

Just because you have 20M records, it doesn't mean that you have to do anything 'special'. A lot will depend on how many of that 20M need to be updated, the size of the records and fields being updated, how the updated records are spread through the table, what other processes are updating or querying the table during the operation, which indexes are used and which are unnecessary or just update overhead etc, etc. Trying to optimise this operation without knowing where the bottlenecks are is extremely difficult. Unless you know that every one of the 20M records will be updated, I wouldn't rush into removing all indexes. I would definitely look at breaking up the update into smaller transactions.

As to the question about whether to use a transaction or not, it really depends on your data and how easily you can recover from a partial transaction. Assume the update is going to be interrupted before completion. If you can recover from that situation easily, then not using a full transaction with commit/rollback may be acceptable. Otherwise, I would avoid it.

At any rate, baseline first as you may be surprised. I was recently working on an application using PG 9.6 which had numerous web clients and a backend process which ran multiple times a day and which updated millions of rows each run in a table with over 100M records. The full update would take around 20 - 30 min, but consisted of multiple transactions. The data was part of a weather information system used for predictive modelling. The data consisted of rainfall, max/min temp, humidity and vapour pressure for 9am and 3pm and solar radiation for grid points 5km apart covering all of Australia. The table had records for every day for the last 4+ years and as yet, does not use partitioning. This is a lot of records, but each record is quite small. While at some point we will need to look at additional optimisations like partitioning, the performance is currently within acceptable limits. The only optimisation we have done is basic db tuning. When I did this tuning, performance saw a significant improvement, but none of it was specific to this large table or the update process.

The key point is that a large number of records doesn't necessarily mean that an update will be slow (or more specifically, too slow for acceptable performance, whatever that is). Assuming you will need to take lots of special action may be premature - you may need to do none or only a couple of things. Get a baseline first and you will know how big of an issue you have. You will also be able to determine if what you try has any benefit.

--
Tim Cross
Tim Cross <theophilusx@gmail.com> writes:
> Jason Ralph <jralph@affinitysolutions.com> writes:
>> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
>> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>>
>> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?
>
> Just because you have 20M records, it doesn't mean that you have to do
> anything 'special'.

Indeed. That's really not a lot of data, on any modern machine. Just for experiment, I tried updating a table of 20M rows on my development machine (no powerhouse by any serious standard, plus it's running a debug build which is likely 30% or more slower than production):

regression=# \timing
Timing is on.
regression=# create table foo as select x, 'what' as w from generate_series(1,20000000) x;
SELECT 20000000
Time: 23573.818 ms (00:23.574)
regression=# update foo set w = 'other';
UPDATE 20000000
Time: 60878.418 ms (01:00.878)

OK, that's cheating, because I had no indexes, so ...

regression=# create index on foo(w);
CREATE INDEX
Time: 11774.094 ms (00:11.774)
regression=# update foo set w = x::text;
UPDATE 20000000
Time: 186425.606 ms (03:06.426)

That may still be cheating, because I had no foreign-key constraints or other possibly-expensive features. And you might have lots of indexes not just one. Still, this doesn't seem to be territory where you need to spend days researching the best alternative. I'm not even convinced that it's worth your time to remove and rebuild indexes.

But the real reason I'm commenting is that the original question seems to embed a whole lot of misunderstanding about how Postgres works. There is no such thing as updating "outside a transaction". You can use an explicit transaction (ie BEGIN + COMMIT) if you'd like to preserve the option to roll back after the UPDATE command proper, but that will make absolutely no difference to performance of the UPDATE. Likewise, there is no such thing as not writing WAL (unless you're using an UNLOGGED table, which you should only ever do for throwaway data). And also, there's unlikely to be any performance benefit from carving up the update into smaller pieces. If you're seriously short of disk space you might need to split up the update (because just doing it in one go will about double the table's size-on-disk) and vacuum between the stages to reclaim disk space. But that will take more time not less ... and again, on any modern machine, such steps shouldn't be necessary for only 20M rows.

			regards, tom lane
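For what it's worth, the explicit-transaction form Tom describes is simply (a sketch; names from this thread):

BEGIN;
UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07';
-- Sanity-check before making it permanent, e.g.:
SELECT count(*) FROM members WHERE regdate = '2020-07-07';  -- expect 0
COMMIT;   -- or ROLLBACK; to undo the whole update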