Thread: UPDATE many records
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn't be too bad (there are a number of logical segregations I can apply), I'm just wondering if there is a more automatic option.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On 1/6/20 10:36 AM, Israel Brewster wrote:
> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update
> (one bad value in row 45 million, lost connection, etc), all the work
> that has been done already will be lost due to the transactional nature
> of such commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?

Pretty sure: UPDATE table_name SET changed_field=new_value(); is seen as a single statement and is all or none. If you want to go row by row you will need to have the statement run on a row-by-row basis, or maybe in batches.

> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn't be too bad (there are a number
> of logical segregations I can apply), I'm just wondering if there is a
> more automatic option.

This is a good time to ask: what Postgres version? I am still working out the recent parallel query system additions. Not sure if it applies to UPDATE or not.

--
Adrian Klaver
adrian.klaver@aklaver.com
On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.graf@gmail.com> wrote:

There are several ways to actually do this.

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions; one drawback is that it can not parallelize from inside the procedure.

If it is an older version, then a Python script (or other scripting language) that iterates over the data, say 10 to 20K rows at a time, will do what you want:

for i in list of IDs
begin;
UPDATE table_name SET changed_field=new_value() where ID @> int4range(i, i+10000);
commit;

To create parallel processes, a simple Python script (or other scripting language) can be used to create many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard disk IO bound, not process bound, which is where parallelization helps.

On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). [...]

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
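For what it's worth, a minimal sketch of that PostgreSQL 11+ procedure approach might look like the following; table_name, changed_field and new_value() are the names from the thread, while an integer primary key column named id and the batch size are assumptions for illustration:

CREATE PROCEDURE update_in_batches(batch_size integer DEFAULT 10000)
LANGUAGE plpgsql
AS $$
DECLARE
    current_id bigint;
    max_id     bigint;
BEGIN
    SELECT min(id), max(id) INTO current_id, max_id FROM table_name;
    WHILE current_id <= max_id LOOP
        UPDATE table_name
           SET changed_field = new_value()
         WHERE id >= current_id
           AND id <  current_id + batch_size;
        COMMIT;  -- each batch is durable on its own; a failure only loses the current batch
        current_id := current_id + batch_size;
    END LOOP;
END;
$$;

-- run from psql; CALL must be issued outside an explicit transaction block
CALL update_in_batches(10000);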
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
> Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). [...]
> 1) Is there any way to set the command such that each row change is committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn't be too bad (there are a number of logical segregations I can apply), I'm just wondering if there is a more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);

then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another. In that case, more like...

create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding to their assigned batch number.
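A rough sketch of what one worker's loop could look like under that scheme, assuming just_keys is created as a regular (non-temporary) table so that all worker connections can see it, and PostgreSQL 11+ so each chunk can be committed from inside a procedure; pk_column, big_historical_table, changed_field and new_value() are the names used above:

CREATE PROCEDURE process_worker(worker_id integer, chunk_size integer DEFAULT 1000)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_done bigint;
BEGIN
    LOOP
        WITH chunk AS (
            -- next set of unprocessed keys assigned to this worker
            SELECT pk_column
              FROM just_keys
             WHERE processed IS NOT TRUE
               AND batch_id = worker_id
             LIMIT chunk_size
        ), upd AS (
            UPDATE big_historical_table t
               SET changed_field = new_value()
              FROM chunk c
             WHERE t.pk_column = c.pk_column
        )
        UPDATE just_keys jk
           SET processed = true
          FROM chunk c
         WHERE jk.pk_column = c.pk_column;

        GET DIAGNOSTICS rows_done = ROW_COUNT;
        EXIT WHEN rows_done = 0;
        COMMIT;  -- keep each chunk's work even if a later chunk fails
    END LOOP;
END;
$$;

-- each of the 8 sessions then runs its own: CALL process_worker(1); ... CALL process_worker(8);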
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
Good information. I did forget to mention that I am using PostgreSQL 11.5. I also was not aware of the distinction between PROCEDURE and FUNCTION, so I guess I used the wrong terminology there when stating that new_value is the result of a stored procedure. It's actually a function.

So would your suggestion then be to create a procedure that loops through the records, calculating and committing each one (or, as in your older Postgres example, batches of 10k to 20k)?

Good point on the HD I/O bound vs processor bound, but wouldn't that depend on how complicated the actual update is? Still, there is a good chance you are correct in that statement, so that aspect is probably not worth spending too much time on.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.graf@gmail.com> wrote:

There are several ways to actually do this. [...]

On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). [...]

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
> On 6 Jan 2020, at 21:15, Israel Brewster <ijbrewster@alaska.edu> wrote:
>
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:
>>
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
>> [...]
>>
>> Yeah, I'd be inclined to do this in batches.

I think you're overcomplicating the matter.

I'd just do it as a single update in one transaction. It's only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.

If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you're looking at a one-off situation that won't require more than a few workarounds - after which you can just run the update again.

Ad 1). No harm has been done, it's a single transaction that rolled back.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:
> [...] then loop repeatedly along the lines...
>
> create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column from iteration);
> drop table iteration;
> [...]
> The individual iterations then look for values in just_keys corresponding to their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info.
I must emphasize: This estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc). (I suspect there's a correlation between table size (business value) and number of indices.)

On Jan 6, 2020, at 1:29 PM, Alban Hertroys <haramrae@gmail.com> wrote:

I think you're overcomplicating the matter.
I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.
If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.
Ad 1). No harm has been done, it’s a single transaction that rolled back.
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
On Jan 6, 2020, at 11:38 AM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
> [...]
> Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info.

As for parallelism, if you have really powerful disk, lots of disks on disk array, it may help. Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need to get stopped to do this update? Is the newest version of the app still generating data that needs the rewriting? Sure hope not…
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
On Jan 6, 2020, at 11:40 AM, Rob Sargent <robjsargent@gmail.com> wrote:

I must emphasize: This estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc). (I suspect there's a correlation between table size (business value) and number of indices.)

On Jan 6, 2020, at 1:29 PM, Alban Hertroys <haramrae@gmail.com> wrote:

I think you're overcomplicating the matter.
I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.
Ad 1). No harm has been done, it’s a single transaction that rolled back.
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.
On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I'm thinking it might be worth it to do a "quick" test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I'm assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how you identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
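On that last point, a quick way to time a test batch without keeping its changes is to wrap the EXPLAIN ANALYZE in a transaction and roll it back; the id range below is just a placeholder for whatever predicate selects the test rows:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table_name
   SET changed_field = new_value()
 WHERE id BETWEEN 1 AND 1000;   -- placeholder predicate for the test batch
ROLLBACK;                       -- the timing is reported, but no row changes are kept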
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
Just out of curiosity, what kind of trigger are you using, a row level trigger or a statement level trigger? If you are using a row level trigger, see if you can achieve your requirements using a statement level trigger instead. I’m relatively new to Postgres, so there could be some limit that I’m not aware of, but my understanding is that you have access to the old and new values of the updated rows in the after statement trigger. It would likely be much more performant to do your operation once after the statement is done rather than firing a trigger on every changed row.
Regards,
Mark Z.
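If the per-row work can be expressed against the set of changed rows, a statement-level AFTER trigger with a transition table (available since PostgreSQL 10) might look roughly like this; summary_table, station_id and needs_refresh are purely hypothetical names for illustration:

CREATE FUNCTION after_bulk_update()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    -- changed_rows holds every row as it looks after the UPDATE statement
    UPDATE summary_table s
       SET needs_refresh = true
      FROM changed_rows c
     WHERE s.station_id = c.station_id;
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$;

CREATE TRIGGER table_name_stmt_update
AFTER UPDATE ON table_name
REFERENCING NEW TABLE AS changed_rows
FOR EACH STATEMENT
EXECUTE FUNCTION after_bulk_update();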
From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Monday, January 6, 2020 1:24 PM
To: Michael Lewis <mlewis@entrata.com>
Cc: Rob Sargent <robjsargent@gmail.com>; Alban Hertroys <haramrae@gmail.com>; Christopher Browne <cbbrowne@gmail.com>; pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: UPDATE many records
On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:
I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.
I think that depends on how you identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.
One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)
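If that route were taken, the mechanics might look roughly like this; some_row_trigger and recalc_changed_field() are placeholders for the real trigger name and for a function extracted from its body:

ALTER TABLE table_name DISABLE TRIGGER some_row_trigger;   -- placeholder trigger name

-- call the extracted logic directly instead of relying on the per-row trigger
UPDATE table_name
   SET changed_field = recalc_changed_field(id);           -- hypothetical function

ALTER TABLE table_name ENABLE TRIGGER some_row_trigger;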
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:
> [...] You can always run explain analyze on the update and rollback rather than commit.

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what's doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I'm willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:

What was the HD wait time? What tool is being used to monitor the server resources?

It appears, based on this information, that there is a lot more going on than a simple UPDATE command.

Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables.

Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours. [...]

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:
> What was the HD wait time? What tool is being used to monitor the server resources?

No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It's also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.

> It appears, based on this information, that there is a lot more going on than a simple UPDATE command.

Depending on your definition of "simple update" of course, very true. As I stated in the original message, the actual update value is the result of a function. The psql command is a simple update, but the function does a bit of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms of stuff per record, which of course is most of the runtime:

rotate_tilt_data: time=6560.401 calls=9299

It is entirely possible that the function could be optimized to run more quickly.

> Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables.

One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I'd just have to be careful about how I applied the update, so rows get processed with the correct values. I'm not convinced it would be worth it though - might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.

> Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn't make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145

On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. [...]

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
You don’t tell us if other users will be concurrently changing any of the records involved. If you could guarantee that the table won’t be changed, you might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, dropping table_old, and finally renaming table_new. Given the way Postgres handles updates, I would think that might perform significantly better. Even if you did the work in batches (create a new table, insert/select from the old table, drop, rename), that could well be better. Especially if you re-create the indexes after all the data is moved.
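A minimal sketch of that approach, assuming exclusive access to the table during the rebuild and reusing the table_name/changed_field/new_value() names from the thread; id and other_column stand in for the real column list, and the index/constraint names are placeholders:

BEGIN;
-- build the replacement table with the corrected values in a single pass
CREATE TABLE table_name_new AS
SELECT id,
       other_column,                  -- carry the unchanged columns over as-is
       new_value() AS changed_field   -- recompute only the changed one
  FROM table_name;

-- recreate indexes/constraints after the data is in place
ALTER TABLE table_name_new ADD PRIMARY KEY (id);
CREATE INDEX table_name_new_other_idx ON table_name_new (other_column);

DROP TABLE table_name;
ALTER TABLE table_name_new RENAME TO table_name;
COMMIT;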
From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Monday, January 6, 2020 10:36 AM
To: pgsql-general@lists.postgresql.org
Subject: UPDATE many records
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).
Given that each row update is completely independent of any other row, I have the following questions:
1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
cell: 907-328-9145
On Jan 7, 2020, at 10:09 AM, Mark Zellers <markz@adaptiveinsights.com> wrote:

You don't tell us if other users will be concurrently changing any of the records involved. If you could guarantee that the table won't be changed, you might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, dropping table_old, and finally renaming table_new. Given the way Postgres handles updates, I would think that might perform significantly better. Even if you did the work in batches (create a new table, insert/select from the old table, drop, rename), that could well be better. Especially if you re-create the indexes after all the data is moved.

From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Monday, January 6, 2020 10:36 AM
To: pgsql-general@lists.postgresql.org
Subject: UPDATE many records

Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). [...]

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
> One potential issue I just thought of with this approach: disk space.
> Will I be doubling the amount of space used while both tables exist?
> If so, that would prevent this from working - I don't have that much
> space available at the moment.

The original update you planned would do that, too.

You probably need to just do the update in batches and vacuum the table between batches.
> On Jan 7, 2020, at 11:56 AM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
>
> On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space.
>> Will I be doubling the amount of space used while both tables exist?
>> If so, that would prevent this from working - I don't have that much
>> space available at the moment.
>
> The original update you planned would do that, too.
>
> You probably need to just do the update in batches and vacuum the table
> between batches.

Really? Why? With the update I am only changing data - I'm not adding any additional data, so the total size should stay the same, right? I'm obviously missing something… :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On 1/7/20 12:47 PM, Israel Brewster wrote:
> One potential issue I just thought of with this approach: disk space.
> Will I be doubling the amount of space used while both tables exist? If
> so, that would prevent this from working - I don't have that much space
> available at the moment.

It will definitely increase the disk space by at least the data in the new table. How much relative to the old table is going to depend on how aggressive the AUTOVACUUM/VACUUM is.

A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against the existing table, with a change to have it put the new field value in change_table keyed to id/PK. Probably do this in batches.

3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where existing_table.pk = change_table.id;

--
Adrian Klaver
adrian.klaver@aklaver.com
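A sketch of that staged approach in SQL, assuming an integer primary key named id on existing_table and reusing the new_value() function from the thread; the changed_fld type and the batch ranges are illustrative:

-- 1) side table to hold the precomputed values (the column type here is illustrative)
CREATE TABLE change_table (id int PRIMARY KEY, changed_fld double precision);

-- 2) run the expensive function against a plain SELECT, in batches keyed on the PK
INSERT INTO change_table (id, changed_fld)
SELECT id, new_value()
  FROM existing_table
 WHERE id BETWEEN 1 AND 100000;      -- repeat with successive ranges

-- 3) the final UPDATE then only copies precomputed values
UPDATE existing_table e
   SET changed_field = c.changed_fld
  FROM change_table c
 WHERE e.id = c.id;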
On 1/7/20 12:58 PM, Israel Brewster wrote:
>> On Jan 7, 2020, at 11:56 AM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
>> [...]
>> You probably need to just do the update in batches and vacuum the table
>> between batches.
>
> Really? Why? With the update I am only changing data - I'm not adding any additional data, so the total size should stay the same, right? I'm obviously missing something… :-)

https://www.postgresql.org/docs/12/sql-vacuum.html

"VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables."

--
Adrian Klaver
adrian.klaver@aklaver.com
> On Jan 7, 2020, at 12:01 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> [...]
> 3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where existing_table.pk = change_table.id;

Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value, not having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of "tuples that are obsoleted by an update", which is possible.
On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
> Really? Why? With the update I am only changing data - I'm not adding
> any additional data, so the total size should stay the same, right?
> I'm obviously missing something… :-)

PostgreSQL keeps the old row until it gets vacuumed, as it needs to be visible to other transactions. Not only that, but every index record gets updated to point to the location of the new data row too (excluding HOT), and those old index blocks also need to get vacuumed. And none of those rows can get removed until your update finishes.

I know this isn't universally true with HOT and fillfactor etc., but with an update this big I think it's safe to say most of the space will get doubled.

Plus you'll get a ton of write-ahead logs.
> On Jan 7, 2020, at 12:15 PM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
> [...]
> Plus you'll get a ton of write-ahead logs.

Gotcha. Batches with VACUUM it is! Thanks for the info.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
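For reference, the per-batch pattern as plain psql input might look like the following; VACUUM cannot run inside a transaction block, so it sits between the batch transactions, and the id ranges are placeholders:

BEGIN;
UPDATE existing_table e
   SET changed_field = c.changed_fld
  FROM change_table c
 WHERE e.id = c.id
   AND e.id BETWEEN 1 AND 1000000;   -- placeholder range for one batch
COMMIT;

VACUUM existing_table;               -- reclaim the dead tuples before the next batch

BEGIN;
UPDATE existing_table e
   SET changed_field = c.changed_fld
  FROM change_table c
 WHERE e.id = c.id
   AND e.id BETWEEN 1000001 AND 2000000;
COMMIT;

VACUUM existing_table;
-- ...and so on for the remaining ranges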
On 1/7/20 1:10 PM, Israel Brewster wrote:
> Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value, not having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of "tuples that are obsoleted by an update", which is possible.

You are not. For a more thorough explanation see:

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS

How much space do you have to work with?

To get an idea of the disk space currently used by a table see:

https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/7/20 1:43 PM, Israel Brewster wrote:
>> On Jan 7, 2020, at 12:21 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>> [...]
>> How much space do you have to work with?
>>
>> To get an idea of the disk space currently used by a table see:
>>
>> https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
>
> Oh, ok, I guess I was being overly paranoid on this front. Those
> functions would indicate that the table is only 7.5 GB, with another
> 8.7 GB of indexes, for a total of around 16 GB. So not a problem after
> all - I have around 100 GB available.
>
> Of course, that now leaves me with the mystery of where my other 500 GB
> of disk space is going, since it is apparently NOT going to my DB as I
> had assumed, but solving that can wait.

Assuming you are on some form of Linux:

sudo du -h -d 1 /

Then you can drill down into the output of the above.

> Thanks again for all the good information and suggestions!

--
Adrian Klaver
adrian.klaver@aklaver.com
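[Editorial note: the table-versus-index split Israel quotes above can be read with the admin functions from the doc link; a sketch, again with 'table_name' as a placeholder.]

select pg_size_pretty(pg_table_size('table_name'))          as table_size,
       pg_size_pretty(pg_indexes_size('table_name'))        as index_size,
       pg_size_pretty(pg_total_relation_size('table_name')) as total_size;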
> On Jan 7, 2020, at 12:57 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> [...]
> Assuming you are on some form of Linux:
>
> sudo du -h -d 1 /
>
> Then you can drill down into the output of the above.

Yep. Done it many times to discover a runaway log file or the like. Just mildly amusing that solving one problem leads to another I need to take care of as well… But at least the select into a new table should work nicely. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Jan 8, 2020, at 7:52 AM, stan <stanb@panix.com> wrote:

On Tue, Jan 07, 2020 at 12:20:12PM -0900, Israel Brewster wrote:
> > On Jan 7, 2020, at 12:15 PM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
> > [...]
> > Plus you'll get a ton of write-ahead logs.
>
> Gotcha. Batches with VACUUM it is! Thanks for the info.

I'd love to see you report on how this went.

--
"They that would give up essential liberty for temporary safety deserve
neither liberty nor safety."
-- Benjamin Franklin