Re: UPDATE many records - Mailing list pgsql-general

From: Israel Brewster
Subject: Re: UPDATE many records
Date:
Msg-id: 57C76375-D3CC-4FF5-A214-C45BC9DF996E@alaska.edu
In response to: Re: UPDATE many records (Justin <zzzzz.graf@gmail.com>)
List: pgsql-general

On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:

What was the HD wait time? What tool is being used to monitor the server resources?

No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It’s also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.

It appears, based on this information, that there is a lot more going on than a simple UPDATE command

Depending on your definition of “simple update”, of course, very true. As I stated in the original message, the actual update value is the result of a function. The SQL command is a simple UPDATE, but the function does a fair amount of work (primarily trigonometry). According to EXPLAIN ANALYZE, that works out to about 0.7 ms per record, which of course is most of the runtime. It is entirely possible that the function could be optimized to run more quickly.


Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables

One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I’d just have to be careful about how I applied the update, so rows get processed with the correct values. I’m not convinced it would be worth it though - it might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.
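
Just to make the idea concrete, a rough sketch of what that could look like - every name and the calculation itself are placeholders, since the real schema isn't shown in this thread:

-- Hypothetical: the same math the trigger does, but the values its SELECT
-- currently fetches from the lookup table are passed in as parameters.
CREATE OR REPLACE FUNCTION calc_value_direct(
    lat       double precision,
    lon       double precision,
    station_x double precision,   -- values the trigger would normally look up
    station_y double precision
) RETURNS double precision
LANGUAGE sql IMMUTABLE AS $$
    SELECT degrees(atan2(lat - station_y, lon - station_x));  -- placeholder trigonometry
$$;

-- Then apply it one group of rows at a time, so each group gets the values
-- the lookup would have returned for it:
UPDATE readings
   SET value = calc_value_direct(lat, lon, 61.217, -149.863)
 WHERE station_id = 42;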

Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming a CPU-bound update. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.
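
For completeness, the sort of parallel run I have in mind is nothing fancy - just non-overlapping chunks of the key space, each UPDATE issued from its own connection (names and ranges below are made up):

-- Each statement run concurrently from its own psql session,
-- one non-overlapping id range per available core:
UPDATE readings SET value = calc_value(lat, lon, station_id)
 WHERE id >= 1 AND id < 12500000;          -- session 1

UPDATE readings SET value = calc_value(lat, lon, station_id)
 WHERE id >= 12500000 AND id < 25000000;   -- session 2

-- ...and so on through the remaining ranges.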

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how you identify and limit the update to those 1,000 records. If it is using a primary key with specific keys in an array, probably close to a linear increase, because the WHERE clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run EXPLAIN ANALYZE on the update and rollback rather than commit.
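
A minimal sketch of that rollback-based test, with placeholder names - EXPLAIN ANALYZE really executes the UPDATE (triggers included), and the ROLLBACK just discards the changes while keeping the timings:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE readings
   SET value = calc_value(lat, lon, station_id)
 WHERE id BETWEEN 1 AND 1000;   -- or: WHERE id = ANY (ARRAY[...])
ROLLBACK;   -- nothing is committed, but the measured times are real
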
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.
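
(Back-of-the-envelope: 7 s / 9,299 rows ≈ 0.75 ms per row, and 16.5 s / 20,819 rows ≈ 0.79 ms per row; at ~0.75 ms per row, 50,000,000 rows comes to roughly 37,500 s, or a bit over 10 hours.)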

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than directly is of little consequence, but I’m willing to be corrected on that :-)
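
A minimal sketch of that approach, with placeholder names for the table, trigger, and function. One caveat: disabling the trigger this way affects every writer to the table, not just this session, so it assumes nothing else needs the trigger during the run:

-- Temporarily stop the trigger from firing, do the bulk update by calling
-- the (hypothetical) extracted function directly, then re-enable it.
ALTER TABLE readings DISABLE TRIGGER readings_recalc_trg;

UPDATE readings
   SET value = calc_value(lat, lon, station_id);   -- same work the trigger would do

ALTER TABLE readings ENABLE TRIGGER readings_recalc_trg;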

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

