Thread: exceptionally large UPDATE
I have to make a large UPDATE to a DB. The largest UPDATE involves a table that has triggers and a GIN index on a computed tsvector. The table is 1.5M records with about 15 fields of different types. I have roughly 2.5-3GB of RAM dedicated to Postgres.

The UPDATE queries are simple; a few of them use joins, and they mainly consist of updating records from temporary tables that have a structure very similar to the target.

These updates are rare, so I can afford to tune postgresql.conf just for this large update and then return to a more balanced configuration. I can even afford to be the only user of the DB, so responsiveness of the application using the DB is not an issue. Duration of the update is.

Anything I can tune in postgresql.conf to speed up the UPDATE?

I'm increasing maintenance_work_mem to 180MB just before recreating the GIN index. Should it be more?

The update should be monolithic and it is inside a single transaction. Since I can afford to be the only user of the DB for a while, is there anything I can tune to take advantage of it? What else could I change to speed up the update?

The triggers recreate the tsvector. One of the components of the tsvector is taken from a join table.

I'll surely drop the GIN index and recreate it when everything is over. I'm not sure it's a good idea to drop the triggers, since I'll have to update the tsvector later and I suspect this will cause twice the disk I/O.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
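[Editorial note: the workflow described above can be sketched roughly as follows. Table, column, and index names (items, staging_items, tsv, items_tsv_gin) are hypothetical, not taken from the thread.]

```sql
-- Drop the GIN index first so the bulk UPDATE doesn't maintain it row by row
DROP INDEX items_tsv_gin;

BEGIN;

-- Bulk update from a temporary/staging table with a structure similar
-- to the target; the triggers keep recomputing the tsvector column
UPDATE items AS i
SET    price = s.price,
       descr = s.descr
FROM   staging_items AS s
WHERE  i.item_id = s.item_id;

COMMIT;

-- Give the index rebuild plenty of memory, for this session only
SET maintenance_work_mem = '1GB';
CREATE INDEX items_tsv_gin ON items USING gin (tsv);
ANALYZE items;
```

The per-session SET avoids touching postgresql.conf at all; it reverts when the connection closes.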
Ivan Sergio Borgonovo wrote:
> I have to make a large UPDATE to a DB.
> The largest UPDATE involves a table that has triggers and a GIN
> index on a computed tsvector.
> The table is 1.5M records with about 15 fields of different types.
> [snip]
> The update should be monolithic and it is inside a single
> transaction.

Is there an inherent value in a single transaction for such an update? By that I mean: do all the updates actually pertain to a single event? Nice as it is to get a clean slate if a single record has a problem, it's also nice when N-1 of N batches succeed in a realistic amount of time and you're left hunting for the problematic record in one Nth of the records.
Corollary: if you can afford to be the only user for a while, perhaps you can afford to reload from a dump if you need to get back to ground zero.
On Wed, Oct 27, 2010 at 10:26 PM, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
> I'm increasing maintenance_work_mem to 180MB just before recreating
> the gin index. Should it be more?

You can do this on a per-connection basis; no need to alter the config file. At the psql prompt (or via your script) just execute

  SET maintenance_work_mem = '180MB';

If you've got the RAM, just use more of it. I'd suspect your server has plenty of it, so use it! When I reindex, I often give it 1 or 2 GB. If you can fit the whole table into that much space, you're going to go really, really fast.

Also, if you are going to update that many rows, you may want to increase your checkpoint_segments. Increasing that helps a *lot* when you're loading big data, so I would expect updating big data may also be helped. I suppose it depends on how wide your rows are. 1.5 million rows is really not all that big unless you have lots and lots of text columns.
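[Editorial note: the per-session syntax needs single quotes around the value; double quotes, as in the original post, would be parsed as a column identifier and raise an error. The usual round trip looks like this:]

```sql
-- Applies only to the current session; no server restart or reload needed
SET maintenance_work_mem = '1GB';
-- equivalent spelling:
SET maintenance_work_mem TO '1GB';

-- Check the effective value
SHOW maintenance_work_mem;

-- Revert to the value inherited from postgresql.conf
RESET maintenance_work_mem;
```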
On Thu, 28 Oct 2010 08:58:34 -0400 Vick Khera <vivek@khera.org> wrote:
> You can do this on a per-connection basis; no need to alter the
> config file. At the psql prompt (or via your script) just execute
> the query
> [snip]
> 1.5 million rows is really not all that big unless you have
> lots and lots of text columns.

Actually, I'm pretty happy with the performance of the DB under normal circumstances. I never investigated too much whether I could squeeze more out of it. But when I have to deal with such "huge" updates, the performance is painful.

You made me start to wonder if I could improve performance even under normal load. But right now I have to take care of this huge update (well, "huge" is just relative to the performance I'm getting right now).
The things I've touched compared to the stock configuration were:

  max_connections = 100
  shared_buffers = 240MB
  work_mem = 42MB
  maintenance_work_mem = 180MB   # generally it is 40MB
  # these were touched as per autovacuum's suggestion
  max_fsm_pages = 437616
  max_fsm_relations = 1200
  checkpoint_segments = 35
  random_page_cost = 3.0
  default_statistics_target = 30
  log_min_duration_statement = 1000

The box is also running Apache; the total average length of text for each row should be around 1KB on the largest table.

What I'm planning to do is:

  max_connections = 5
  shared_buffers = 240MB
  work_mem = 90MB
  maintenance_work_mem = 1GB
  max_fsm_pages = 437616
  max_fsm_relations = 1200
  checkpoint_segments = 70
  default_statistics_target = 30
  #log_min_duration_statement = 1000

Any improvement?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On Thu, Oct 28, 2010 at 1:06 PM, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
> What I'm planning to do is:
>
>   max_connections = 5
>   shared_buffers = 240MB
>   work_mem = 90MB
>   maintenance_work_mem = 1GB
>   max_fsm_pages = 437616
>   max_fsm_relations = 1200
>   checkpoint_segments = 70
>   default_statistics_target = 30
>   #log_min_duration_statement = 1000

default_statistics_target = 100 is the new default for newer Postgres, and with good reason... try that.

If you boost your checkpoint_segments, also twiddle checkpoint_timeout (increase it) and checkpoint_completion_target (something like 0.8 would be good, depending on how fast your disks are) to try to smooth out your I/O, i.e., keep it from bursting at checkpoint time. Is 5 connections really enough for you?

And like I said before, you can set work_mem and/or maintenance_work_mem on a per-connection basis as needed, so for your big update you can increase those values just during that work without affecting the rest of the system.
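[Editorial note: in postgresql.conf, the checkpoint-smoothing settings suggested above would look something like the fragment below. The values are illustrative for a bulk-update window, not a recommendation from the thread.]

```
# postgresql.conf -- illustrative values for a bulk-update window
checkpoint_segments          = 70     # allow more WAL between checkpoints
checkpoint_timeout           = 15min  # checkpoint less often
checkpoint_completion_target = 0.8    # spread checkpoint writes over 80% of the interval
```

The completion target trades burstiness for a longer, gentler write phase; slower disks generally want the checkpoint I/O spread out rather than concentrated.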
On Fri, 29 Oct 2010 10:21:14 -0400 Vick Khera <vivek@khera.org> wrote:
> default_statistics_target = 100 is the new default for newer
> postgres, and with good reason... try that.
>
> if you boost your checkpoint_segments, also twiddle
> checkpoint_timeout (increase it) and checkpoint_completion_target
> [snip]
> Is 5 connections really enough for you?

No, 5 is too few. OK... this is what I ended up with:

  max_connections = 100
  shared_buffers = 240MB
  work_mem = 90MB
  maintenance_work_mem = 1GB
  max_fsm_pages = 437616
  max_fsm_relations = 1200
  default_statistics_target = 100
  checkpoint_segments = 70
  checkpoint_timeout = 10min
  checkpoint_completion_target = 0.6   # not very fast drives, in RAID5
  #log_min_duration_statement = 1000
  random_page_cost = 3.0

I tested this on RAID10 SATA, 8GB RAM and 2x quad-core Xeons:

  - updating 227985 records out of roughly 1.4M took 197744.374 ms
  - recreating the GIN index took 313962.162 ms
  - commit took 7699.595 ms
  - vacuum analyse took 188261.481 ms

The total update took around 13 min. I've just heard that a similar update on a slower box (RAID1 SAS, 4GB, 2x dual-core Xeons) running MS SQL took over 30 min. Considering that far fewer PK/FK constraints and actions were defined on the MS SQL DB, things now look much better for Postgres. Furthermore, PostgreSQL full text search kicks the MS SQL box's ass even on the slowest box of all (RAID5 SATA, 4GB, 2x single-core HT Xeons, over 6 years old).
I'll take note of performance on the slower box as soon as I have more large updates; still, I'm looking at how to make it faster.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it