Thread: exceptionally large UPDATE

exceptionally large UPDATE

From
Ivan Sergio Borgonovo
Date:
I have to make a large UPDATE to a DB.
The largest UPDATE involves a table that has triggers and a GIN
index on a computed tsvector.
The table is 1.5M records with about 15 fields of different types.
I have roughly 2.5-3 GB of RAM dedicated to postgres.

The UPDATE queries are simple; a few of them use joins, and they
mainly consist of updating records from temporary tables that have a
structure very similar to the target's.

These updates are rare, so I can afford to tune postgresql just for
this large update and then return to a more balanced configuration.
I can even afford to be the only user of the DB, so responsiveness of
the application using the DB is not an issue. Duration of the update
is.

Anything I can tune in postgresql.conf to speed up the UPDATE?

I'm increasing maintenance_work_mem to 180MB just before recreating
the gin index. Should it be more?
The update should be monolithic, and it runs inside a single
transaction. Since I can afford to be the only user of the DB for a
while, is there anything I can tune to take advantage of that?
What else could I change to speed up the update?

The triggers recreate the tsvector. One of the components of the
tsvector is taken from a join table.

I'll surely drop the GIN index and recreate it when everything is
over.
I'm not sure it's a good idea to drop the triggers, since I'd
have to update the tsvector later, and I suspect this would cause
twice the disk IO.
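The drop-and-recreate workflow described above might be sketched like this (the table, column, and index names are placeholders, not the actual schema):

```sql
-- Sketch only: items, items_tmp, items_tsv_idx and the columns are
-- hypothetical names standing in for the real schema.
BEGIN;

-- Drop the GIN index so the bulk UPDATE doesn't maintain it row by row.
DROP INDEX items_tsv_idx;

-- Bulk update from the similarly-structured temporary table.
UPDATE items i
SET    title = t.title,
       body  = t.body
FROM   items_tmp t
WHERE  i.item_id = t.item_id;

COMMIT;

-- Rebuild the GIN index with generous memory, then refresh statistics.
SET maintenance_work_mem = '1GB';
CREATE INDEX items_tsv_idx ON items USING gin (tsv);
ANALYZE items;
```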

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: exceptionally large UPDATE

From
Rob Sargent
Date:

Ivan Sergio Borgonovo wrote:
> I have to make a large UPDATE to a DB.
> The largest UPDATE involves a table that has triggers and a GIN
> index on a computed tsvector.
> The table is 1.5M records with about 15 fields of different types.
> I have roughly 2.5-3 GB of RAM dedicated to postgres.
>
> The UPDATE queries are simple; a few of them use joins, and they
> mainly consist of updating records from temporary tables that have a
> structure very similar to the target's.
>
> These updates are rare, so I can afford to tune postgresql just for
> this large update and then return to a more balanced configuration.
> I can even afford to be the only user of the DB so responsiveness of
> the application using the DB is not an issue. Duration of the update
> is.
>
> Anything I can tune in postgresql.conf to speed up the UPDATE?
>
> I'm increasing maintenance_work_mem to 180MB just before recreating
> the gin index. Should it be more?
> The update should be monolithic, and it runs inside a single
> transaction. Since I can afford to be the only user of the DB for a
> while, is there anything I can tune to take advantage of that?
> What else could I change to speed up the update?
>
> The triggers recreate the tsvector. One of the components of the
> tsvector is taken from a join table.
>
> I'll surely drop the GIN index and recreate it when everything is
> over.
> I'm not sure it's a good idea to drop the triggers, since I'd
> have to update the tsvector later, and I suspect this would cause
> twice the disk IO.
>
> thanks
>
>

Is there an inherent value in a single transaction for such an update?
By that I mean: do all the updates actually pertain to a single event?
Nice as it is to get a clean slate if a single record has a problem,
it's also nice when N-1 of N batches succeed in a realistic amount of
time and you're left hunting for the problematic record in one Nth of
the records.

Corollary: if you can afford to be the only user for a while perhaps you
can afford to reload from dump if you need to get back to ground zero.
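The batching idea above might look like this (table and key names are hypothetical, and the id ranges are illustrative):

```sql
-- Hypothetical batched variant: each id range is its own transaction,
-- so one bad record rolls back only its batch, not the whole update.
BEGIN;
UPDATE items i
SET    title = t.title,
       body  = t.body
FROM   items_tmp t
WHERE  i.item_id = t.item_id
AND    i.item_id BETWEEN 1 AND 100000;   -- batch 1
COMMIT;
-- ...repeat with the next range (100001-200000, and so on).
```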



Re: exceptionally large UPDATE

From
Vick Khera
Date:
On Wed, Oct 27, 2010 at 10:26 PM, Ivan Sergio Borgonovo
<mail@webthatworks.it> wrote:
> I'm increasing maintenance_work_mem to 180MB just before recreating
> the gin index. Should it be more?
>

You can do this on a per-connection basis; no need to alter the config
file.  At the psql prompt (or via your script) just execute the query

SET maintenance_work_mem = '180MB';

If you've got the RAM, just use more of it.  I'd suspect your server
has plenty of it, so use it!  When I reindex, I often give it 1 or 2
GB.  If you can fit the whole table into that much space, you're going
to go really really fast.

Also, if you are going to update that many rows you may want to
increase your checkpoint_segments.  Increasing that helps a *lot* when
you're loading big data, so I would expect updating big data may also
be helped.  I suppose it depends on how wide your rows are.  1.5
Million rows is really not all that big unless you have lots and lots
of text columns.
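For a one-off bulk operation, the advice above might translate into a temporary postgresql.conf fragment like this (the values are illustrative, not tuned recommendations for any particular box):

```sql
-- Illustrative bulk-load overrides in postgresql.conf; revert them
-- after the big UPDATE.
--   maintenance_work_mem = 1GB   -- or set per session with SET
--   checkpoint_segments = 64     -- more WAL between checkpoints, so
--                                -- fewer checkpoint storms while writing
```

Note that checkpoint_segments cannot be set per session; changing it requires editing the config file and reloading the server.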

Re: exceptionally large UPDATE

From
Ivan Sergio Borgonovo
Date:
On Thu, 28 Oct 2010 08:58:34 -0400
Vick Khera <vivek@khera.org> wrote:

> On Wed, Oct 27, 2010 at 10:26 PM, Ivan Sergio Borgonovo
> <mail@webthatworks.it> wrote:
> > I'm increasing maintenance_work_mem to 180MB just before
> > recreating the gin index. Should it be more?
> >
>
> You can do this on a per-connection basis; no need to alter the
> config file.  At the psql prompt (or via your script) just execute
> the query
>
> SET maintenance_work_mem = '180MB';

> If you've got the RAM, just use more of it.  I'd suspect your server
> has plenty of it, so use it!  When I reindex, I often give it 1 or
> 2 GB.  If you can fit the whole table into that much space, you're
> going to go really really fast.

> Also, if you are going to update that many rows you may want to
> increase your checkpoint_segments.  Increasing that helps a *lot*
> when you're loading big data, so I would expect updating big data
> may also be helped.  I suppose it depends on how wide your rows
> are.  1.5 Million rows is really not all that big unless you have
> lots and lots of text columns.

Actually, I'm pretty happy with the performance of the DB under
normal circumstances. I never investigated too much whether I could
squeeze more out of it. But when I have to deal with such "huge"
updates the performance is painful.

You made me start to wonder whether I could improve performance even
under normal load.

But right now I have to take care of this huge update (well, "huge"
only relative to the performance I'm getting right now).

The things I've touched compared to the stock configuration were:
max_connections = 100
shared_buffers = 240MB
work_mem = 42MB
maintenance_work_mem = 180MB #(generally it is 40MB)
# these were touched as of autovacuum suggestion
max_fsm_pages = 437616
max_fsm_relations = 1200
checkpoint_segments = 35
random_page_cost = 3.0
default_statistics_target = 30
log_min_duration_statement = 1000

The box is running apache; the total average length of text per row
should be around 1 KB on the largest table.

What I'm planning to do is:
max_connections = 5
shared_buffers = 240MB
work_mem = 90MB
maintenance_work_mem = 1GB
max_fsm_pages = 437616
max_fsm_relations = 1200
checkpoint_segments = 70
default_statistics_target = 30
#log_min_duration_statement = 1000

Any improvement?


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: exceptionally large UPDATE

From
Vick Khera
Date:
On Thu, Oct 28, 2010 at 1:06 PM, Ivan Sergio Borgonovo
<mail@webthatworks.it> wrote:
> What I'm planning to do is:
> max_connections = 5
> shared_buffers = 240MB
> work_mem = 90MB
> maintenance_work_mem = 1GB
> max_fsm_pages = 437616
> max_fsm_relations = 1200
> checkpoint_segments = 70
> default_statistics_target = 30
> #log_min_duration_statement = 1000
>

default_statistics_target = 100 is the new default in newer
postgres, and with good reason... try that.

If you boost your checkpoint_segments, also increase
checkpoint_timeout and raise checkpoint_completion_target
(something like 0.8 would be good, depending on how fast your disks
are) to try to smooth out your I/O (i.e., keep it from bursting
at checkpoint timeout).  Is 5 connections really enough for you?

And like I said before, you can set the work_mem and/or
maintenance_work_mem on a per-connection basis as needed, so for your
big update you can increase those values just during that work without
affecting the rest of the system.
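The per-connection approach above could look like this at the top of the update script (the values are illustrative):

```sql
-- Per-session overrides: they apply only to this connection and
-- revert automatically when the session ends.
SET work_mem = '256MB';            -- helps the UPDATE's joins and sorts
SET maintenance_work_mem = '1GB';  -- helps the later CREATE INDEX

-- ...run the big UPDATE and the index rebuild here...

RESET work_mem;
RESET maintenance_work_mem;
```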

Re: exceptionally large UPDATE

From
Ivan Sergio Borgonovo
Date:
On Fri, 29 Oct 2010 10:21:14 -0400
Vick Khera <vivek@khera.org> wrote:

> On Thu, Oct 28, 2010 at 1:06 PM, Ivan Sergio Borgonovo
> <mail@webthatworks.it> wrote:
> > What I'm planning to do is:
> > max_connections = 5
> > shared_buffers = 240MB
> > work_mem = 90MB
> > maintenance_work_mem = 1GB
> > max_fsm_pages = 437616
> > max_fsm_relations = 1200
> > checkpoint_segments = 70
> > default_statistics_target = 30
> > #log_min_duration_statement = 1000

> default_statistics_target = 100 is the new "default" for newer
> postgres, and with good reason... try that.
>
> if you boost your checkpoint_segments, also twiddle the
> checkpoint_timeout (increase it) and checkpoint_completion_target
> (something like 0.8 would be good, depending on how fast your disks
> are) values to try to smooth out your I/O (ie, keep it from
> bursting at checkpoint timeout).  Is 5 connections really enough
> for you?

No, 5 is too few.
OK... this is what I ended up with:

max_connections = 100

shared_buffers = 240MB
work_mem = 90MB
maintenance_work_mem = 1GB

max_fsm_pages = 437616
max_fsm_relations = 1200

default_statistics_target = 100

checkpoint_segments = 70
checkpoint_timeout = 10min
checkpoint_completion_target = 0.6 #(not very fast drives in raid5)
#log_min_duration_statement = 1000

random_page_cost = 3.0


I tested this on a box with RAID10 SATA, 8 GB RAM, and two quad-core Xeons:

- updating 227985 records out of roughly 1.4M took 197744.374 ms
- recreating the GIN index took 313962.162 ms
- the commit took 7699.595 ms
- VACUUM ANALYZE took 188261.481 ms

The total update took around 13 minutes.
I've just heard that a similar update on a slower box (RAID1 SAS,
4 GB, 2x2-core Xeon) running MS SQL took over 30 minutes.
Considering far fewer PKs/FKs, constraints, and actions were defined
on the MS SQL DB, things now look much better for postgres.

Furthermore, postgresql full text search beats the MS SQL box hands
down, even on the slowest box of all (RAID5 SATA, 4 GB, 2x1-core HT
Xeon, over 6 years old).

I'll keep taking note of performance on the slower box as soon as I
have more large updates; meanwhile, I'm still looking for ways to
make it faster.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it