Thread: UPDATE on 20 Million Records Transaction or not?
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20 million records and have been researching best practices. I will remove all indexes and foreign keys prior to the update; however, I am not sure whether I should use a transaction or not.
My thought process is that a transaction would be easier to recover from if something fails; however, it would take more time to write to the WAL in a transaction.
Would it make sense to make a backup of the table and then execute the update without a transaction? How would you guys do it?
Thanks,
Jason Ralph
On 6/23/20 6:32 AM, Jason Ralph wrote:
> Hello List,
>
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
>
> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.

Unless this is an UNLOGGED table WALs will be written.

> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?

You could break it down into multiple transactions if there is a way to specify ranges of records.

--
Adrian Klaver
adrian.klaver@aklaver.com
Thanks Adrian,

> You could break it down into multiple transactions if there is a way to specify ranges of records.

Say I couldn't break it up, would it be faster in or out of the transaction?

Jason Ralph
> Thanks Adrian,
>
>> You could break it down into multiple transactions if there is a way to specify ranges of records.
>
> Say I couldn't break it up, would it be faster in or out of the transaction?

It depends whether you have concurrent transactions using the table. If you do, I think it would be better to split the update between smaller transactions.
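As a rough sketch of the range idea (illustration only: it assumes a hypothetical table big_table with an integer primary key id and a made-up column some_col; in autocommit mode each statement is its own transaction):

UPDATE big_table SET some_col = 'new value'
 WHERE some_col = 'old value' AND id >= 1       AND id < 1000000;
UPDATE big_table SET some_col = 'new value'
 WHERE some_col = 'old value' AND id >= 1000000 AND id < 2000000;
-- ...and so on, one slice per statement, until the highest id is covered.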
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
Are you updating every row in the table?
Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Would it make sense to make a back up of the table then execute update without a transaction?
Always make a backup.
How would you guys do it?
It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
Angular momentum makes the world go 'round.
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>Are you updating every row in the table?
No, I am using an update like so: UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07'
DB=# select count(*) from members where regdate = '2020-07-07';
count
----------
17333090
(1 row)
>Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Just checked, regdate is not indexed, so I will leave them in place.
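One quick way to double-check that (a sketch; members and regdate are the names used in this thread):

-- Any index on members whose definition mentions regdate; zero rows
-- returned means the column is not indexed.
SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'members'
   AND indexdef ILIKE '%regdate%';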
Would it make sense to make a back up of the table then execute update without a transaction?
>Always make a backup.
Agreed
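For a quick safety net before the update (a sketch; the backup names are just examples), either copy the table in place or dump only that table:

-- In-database copy; note it roughly doubles the disk space used by the table.
CREATE TABLE members_backup AS TABLE members;

-- Or, from the shell, a compressed single-table dump:
-- pg_dump -Fc -t members DB > members_before_update.dump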
How would you guys do it?
>It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
Please see above, thanks
Jason Ralph
From: Ron <ronljohnsonjr@gmail.com>
Sent: Tuesday, June 23, 2020 10:57 AM
To: pgsql-general@lists.postgresql.org
Subject: Re: UPDATE on 20 Million Records Transaction or not?
On 6/23/20 8:32 AM, Jason Ralph wrote:
Hello List,
PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
Are you updating every row in the table?
Are you updating indexed fields? (If not, then leave the indexes and FKs, since they won't be touched.)
Would it make sense to make a back up of the table then execute update without a transaction?
Always make a backup.
How would you guys do it?
It depends on what percentage of the rows are being updated, which columns are being updated and how big the records are.
--
Angular momentum makes the world go 'round.
I'd make a copy of the table, and test how long the various methods take.
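Something along those lines could be as simple as (a sketch; members_copy is a throwaway name):

\timing on
CREATE TABLE members_copy AS TABLE members;
UPDATE members_copy SET regdate='2038-01-18' WHERE regdate='2020-07-07';
-- Repeat on a fresh copy for each variant (single statement vs. batches)
-- and compare the reported times.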
> Just update them and be done with it. Do the work in batches if it doesn't matter that concurrent accesses to the table might see some rows that have old value and some new. Given you are on PG11 and can commit within a function, you could adapt something like this to just run until done. Oh, if you have not tuned the autovacuum process much, depending on the total number of rows in your table, it could be good to manually vacuum once or twice during this process so that space is reused.
> *the below was shared by Andreas Krogh on one of these mailing lists about 6 months back.
>do $_$
>declare
> num_rows bigint;
>begin
> loop
> delete from YourTable where id in
> (select id from YourTable where id < 500 limit 100);
> commit;
> get diagnostics num_rows = row_count;
> raise notice 'deleted % rows', num_rows;
> exit when num_rows = 0;
> end loop;
> end;$_$;
Thanks for all the suggestions. I really like the function; I will test this. I have autovacuum fully tuned for this table, so it should be good.
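If a manual pass does turn out to be useful partway through, it would be something like (table name from this thread):

-- Reclaims the dead row versions left behind by already-updated batches
-- so later batches can reuse the space.
VACUUM (VERBOSE) members;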
Jason Ralph
From: Michael Lewis <mlewis@entrata.com>
Sent: Tuesday, June 23, 2020 1:43 PM
To: Ganesh Korde <ganeshakorde@gmail.com>
Cc: Ron <ronljohnsonjr@gmail.com>; pgsql-general@lists.postgresql.org
Subject: Re: UPDATE on 20 Million Records Transaction or not?
>Are you updating every row in the table?
No I am using an update like so: UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07'
DB=# select count(*) from members where regdate = '2020-07-07';
count
----------
17333090
(1 row)
Just update them and be done with it. Do the work in batches if it doesn't matter that concurrent accesses to the table might see some rows that have old value and some new. Given you are on PG11 and can commit within a function, you could adapt something like this to just run until done. Oh, if you have not tuned the autovacuum process much, depending on the total number of rows in your table, it could be good to manually vacuum once or twice during this process so that space is reused.
*the below was shared by Andreas Krogh on one of these mailing lists about 6 months back.
do $_$
declare
    num_rows bigint;
begin
    loop
        delete from YourTable where id in
            (select id from YourTable where id < 500 limit 100);
        commit;
        get diagnostics num_rows = row_count;
        raise notice 'deleted % rows', num_rows;
        exit when num_rows = 0;
    end loop;
end;$_$;
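Adapted to the regdate update discussed above (a sketch only: it assumes members has a primary key column id, the batch size is arbitrary, and note that without an index on regdate every pass re-scans the table, so the batches trade speed for smaller transactions):

do $_$
declare
    num_rows bigint;
begin
    loop
        update members set regdate = '2038-01-18'
         where id in (select id from members
                       where regdate = '2020-07-07' limit 10000);
        get diagnostics num_rows = row_count;
        commit;
        raise notice 'updated % rows', num_rows;
        exit when num_rows = 0;
    end loop;
end;$_$;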
Jason Ralph <jralph@affinitysolutions.com> writes:
> Hello List,
>
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>
> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?

This is really the sort of thing you need to test in a development environment. There are just too many unknowns to provide a definitive answer. You need to run this with minimal 'adjustments' in a dev scenario to get a baseline and then decide what, if anything, you need to do.

Just because you have 20M records, it doesn't mean that you have to do anything 'special'. A lot will depend on how many of that 20M need to be updated, the size of the records and fields being updated, how the updated records are spread through the table, what other processes are updating or querying the table during the operation, which indexes are used and which are unnecessary or just update overhead etc, etc. Trying to optimise this operation without knowing where the bottlenecks are is extremely difficult. Unless you know that every one of the 20M records will be updated, I wouldn't rush into removing all indexes. I would definitely look at breaking up the update into smaller transactions.

As to the question about whether to use a transaction or not, it really depends on your data and how easily you can recover from a partial transaction. Assume the update is going to be interrupted before completion. If you can recover from that situation easily, then not using a full transaction with commit/rollback may be acceptable. Otherwise, I would avoid it.

At any rate, baseline first as you may be surprised. I was recently working on an application using PG 9.6 which had numerous web clients and a backend process which ran multiple times a day and which updated millions of rows each run in a table with over 100M records. The full update would take around 20 - 30 min, but consisted of multiple transactions. The data was part of a weather information system used for predictive modelling. The data consisted of rainfall, max/min temp, humidity and vapour pressure for 9am and 3pm and solar radiation for grid points 5km apart covering all of Australia. The table had records for every day for the last 4+ years and as yet, does not use partitioning. This is a lot of records, but each record is quite small. While at some point we will need to look at additional optimisations like partitioning, the performance is currently within acceptable limits. The only optimisation we have done is basic db tuning. When I did this tuning, performance saw a significant improvement, but none of it was specific to this large table or the update process.

The key point is that a large number of records doesn't necessarily mean that an update will be slow (or more specifically, too slow for acceptable performance, whatever that is). Assuming you will need to take lots of special action may be premature - you may need to do none or only a couple of things. Get a baseline first and you will know how big of an issue you have. You will also be able to determine if what you try has any benefit.

--
Tim Cross
Tim Cross <theophilusx@gmail.com> writes:
> Jason Ralph <jralph@affinitysolutions.com> writes:
>> I am planning an update on a table with 20Million records, I have been researching the best practices. I will remove all indexes and foreign keys prior to the update, however I am not sure if I should use a transaction or not.
>> My thought process is that a transaction would be easier to recover if something fails, however it would take more time to write to the WAL log in a transaction.
>>
>> Would it make sense to make a back up of the table then execute update without a transaction? How would you guys do it?
>
> Just because you have 20M records, it doesn't mean that you have to do
> anything 'special'.

Indeed. That's really not a lot of data, on any modern machine. Just for experiment, I tried updating a table of 20M rows on my development machine (no powerhouse by any serious standard, plus it's running a debug build which is likely 30% or more slower than production):

regression=# \timing
Timing is on.
regression=# create table foo as select x, 'what' as w from generate_series(1,20000000) x;
SELECT 20000000
Time: 23573.818 ms (00:23.574)
regression=# update foo set w = 'other';
UPDATE 20000000
Time: 60878.418 ms (01:00.878)

OK, that's cheating, because I had no indexes, so ...

regression=# create index on foo(w);
CREATE INDEX
Time: 11774.094 ms (00:11.774)
regression=# update foo set w = x::text;
UPDATE 20000000
Time: 186425.606 ms (03:06.426)

That may still be cheating, because I had no foreign-key constraints or other possibly-expensive features. And you might have lots of indexes not just one. Still, this doesn't seem to be territory where you need to spend days researching the best alternative. I'm not even convinced that it's worth your time to remove and rebuild indexes.

But the real reason I'm commenting is that the original question seems to embed a whole lot of misunderstanding about how Postgres works. There is no such thing as updating "outside a transaction". You can use an explicit transaction (ie BEGIN + COMMIT) if you'd like to preserve the option to roll back after the UPDATE command proper, but that will make absolutely no difference to performance of the UPDATE. Likewise, there is no such thing as not writing WAL (unless you're using an UNLOGGED table, which you should only ever do for throwaway data). And also, there's unlikely to be any performance benefit from carving up the update into smaller pieces. If you're seriously short of disk space you might need to split up the update (because just doing it in one go will about double the table's size-on-disk) and vacuum between the stages to reclaim disk space. But that will take more time not less ... and again, on any modern machine, such steps shouldn't be necessary for only 20M rows.

			regards, tom lane
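For what it's worth, the explicit-transaction form Tom describes is simply (a sketch; names from this thread):

BEGIN;
UPDATE members SET regdate='2038-01-18' WHERE regdate='2020-07-07';
-- Sanity-check before making it permanent, e.g.:
SELECT count(*) FROM members WHERE regdate = '2020-07-07';  -- expect 0
COMMIT;   -- or ROLLBACK; to undo the whole update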