Thread: Mass updates on a large table

Mass updates on a large table

From: "Mark Steben"

Good afternoon,

I am attempting an update on two new date field columns on a 17 million row table.  Every row gets updated.

The update statement is a simple one:

  UPDATE EMAILRCPTS SET ID = ID

 And the update of the new date fields themselves occurs as the result of a before trigger.
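
(For context, a minimal sketch of the setup being described, assuming the
two new columns are plain date columns; the column names here are guesses,
not from the thread:

  ALTER TABLE EMAILRCPTS ADD COLUMN latest_date1 date;
  ALTER TABLE EMAILRCPTS ADD COLUMN latest_date2 date;

The no-op SET ID = ID then serves only as a vehicle to fire the BEFORE
trigger once per row.)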

 

The update took 3 days, 10 hours to complete on the testing box.  I have already adjusted the CHECKPOINT_SEGMENTS parameter up by 3X to minimize the impact of checkpoints.  The SHARED_BUFFERS parameter has been bumped up to 140000 on a 20meg RAM box.

There are about 9 indexes on this table, although none of them reference the date fields, so since there are no inserts I don’t think they would have an impact on the update.  (I’ve been wrong before, though.)

 

Would an update statement referencing the date fields work faster than a trigger?  Do you have any other suggestions to speed this up?

We simply cannot afford to have this table down for 3+ days during a production update.  The production box is a 32meg RAM box.
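
(For concreteness, the trigger-free alternative being asked about would be
a single direct statement along these lines; the column names are guesses,
not from the thread:

  UPDATE EMAILRCPTS
     SET latest_date1 = COALESCE(date_a, date_b, date_c);

This skips the per-row trigger call, which is usually a measurable saving
when the trigger is a PL/pgSQL function.)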

 

We are at Postgres 7.4.5.

 

Thanks for your time and brainpower

 

Mark Steben

Senior Database Administrator
@utoRevenue™
A Dominion Enterprises Company
480 Pleasant Street
Suite B200
Lee, MA 01238
413-243-4800 Home Office 
413-243-4809 Corporate Fax

msteben@autorevenue.com

Visit our new website at
www.autorevenue.com

 


Re: Mass updates on a large table

From: "Scott Marlowe"
On 8/10/07, Mark Steben <msteben@autorevenue.com> wrote:
>
> Good afternoon,
>
> I am attempting an update on two new date field columns on a 17 million row
> table.  Every row gets updated.
>
> The update statement is a simple one:
>
>   UPDATE EMAILRCPTS SET ID = ID
>
>  And the update of the new date fields themselves occurs as the result of a
> before trigger.
>
> The update took 3 days, 10 hours to complete on the testing box.

That's quite a while for only 17 million rows.  Are these rows
particularly wide?

Is it possible to do it by groups with a vacuum in between each group?
That would keep the bloat down.
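
(A minimal sketch of that batched approach, assuming ID is an indexed
integer key; the batch size is arbitrary:

  UPDATE EMAILRCPTS SET ID = ID WHERE ID >= 0       AND ID < 1000000;
  VACUUM EMAILRCPTS;
  UPDATE EMAILRCPTS SET ID = ID WHERE ID >= 1000000 AND ID < 2000000;
  VACUUM EMAILRCPTS;
  -- ... and so on up through the highest ID.

Each VACUUM marks the dead row versions from the previous batch as
reusable, so the table doesn't keep growing.)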

You don't mention your vacuuming strategy.  That might affect performance here.

Also, are there any FKs to / from this table?

> ... to minimize the impact of checkpoints.  The SHARED_BUFFERS parameter
> has been bumped up to 140000 on a 20meg RAM box.

I assume you meant 20Gig box.

Under 7.4, larger shared_buffers may not be a good thing.  That's a
very large shared buffer setting for 7.4 to handle.

> There are about 9 indexes on this table although none of them reference the
> date fields so since there are no inserts I don't think they would have an
> impact on the update  (I've been wrong before though)

Doesn't matter.  PostgreSQL's implementation of MVCC means that each
update results in a new row, and therefore each index has to be
updated for each row updated.
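
(A quick way to watch this happen is through the system column ctid, which
gives a row version's physical location; the ID value is illustrative:

  SELECT ctid FROM EMAILRCPTS WHERE ID = 1;    -- e.g. (0,1)
  UPDATE EMAILRCPTS SET ID = ID WHERE ID = 1;
  SELECT ctid FROM EMAILRCPTS WHERE ID = 1;    -- a new tuple, e.g. (0,2)

Each of the 9 indexes gets a new entry pointing at the new tuple, for
every one of the 17 million rows.)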

> Would an update statement referencing the date fields work faster than a
> trigger?

Possibly.

>  Do you have any other suggestions to speed this up?
> We are at Postgres 7.4.5.

Upgrade to a modern version?  7.4 is getting old fast, and 7.4.5 has a
LOT of bugs that have been fixed in later versions.  The branch is up to
7.4.17 now, so you're missing a LOT of updates even within the branch
you're on.  But upgrading to 8.2.4 would definitely be a help.

> We simply cannot afford this table to be down for 3+ days during a
> production update.  The production box is a 32meg RAM box.

I would question the business process that requires an entire 17
million row table be updated.

Also, posting your schema and your triggers might help a bit as well.

Re: Mass updates on a large table

From: Tom Lane
"Mark Steben" <msteben@autorevenue.com> writes:
> I am attempting an update on two new date field columns on a 17 million row
> table.  Every row gets updated.
> The update statement is a simple one:
>   UPDATE EMAILRCPTS SET ID = ID
>  And the update of the new date fields themselves occurs as the result of a
> before trigger.

When you don't show us the trigger, it's hard to make any useful
comment ... but 60 rows/second (17 million rows in 3 days 10 hours) seems
slow enough to suspect that the trigger is very inefficient.

> There are about 9 indexes on this table although none of them reference the
> date fields so since there are no inserts I don't think they would have an
> impact on the update  (I've been wrong before though)

And you are again ... but still, it's awfully slow.

> We are at Postgres 7.4.5.

That's a big problem right there.  One thing I can tell you is that it
is sheer folly to set shared_buffers so high on PG 7.x.  It wasn't till
about 8.1 that we had buffer management algorithms that were good enough
for lots of buffers.  Dial it down to something under 100K buffers.  And
think about an upgrade.  At the very least you should be on a far more
current 7.4.x release.
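
(In postgresql.conf terms, the change would look something like this, with
the exact value illustrative so long as it lands well under 100K:

  shared_buffers = 50000    # counted in 8 kB buffers on 7.4; was 140000

A server restart is needed for shared_buffers to take effect.)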

            regards, tom lane

Re: Mass updates on a large table

From: "Scott Marlowe"
On 8/10/07, Mark Steben <msteben@autorevenue.com> wrote:
> Hi Scott, thanks for the response.

> 4.  The business reason for the update: we are replacing within a query
> a COALESCE function that gave us the most recent date between 3 dates.
> We are instead doing the COALESCE as part of a new update trigger,
> populating the new date fields with the most recent dates and
> referencing the new dates.

So this is a one-time data migration.  OK, then I'm not as worried
about the business case being a bit crazy.
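
(For reference, a BEFORE trigger along those lines might look like the
sketch below; every name in it is a guess, since the actual trigger was
never posted.  The quoted function body is the pre-8.0 syntax:

  CREATE FUNCTION set_latest_dates() RETURNS trigger AS '
  BEGIN
      -- COALESCE returns the first non-null of the listed dates;
      -- the column names are illustrative.
      NEW.latest_date1 := COALESCE(NEW.date_a, NEW.date_b, NEW.date_c);
      RETURN NEW;
  END;
  ' LANGUAGE plpgsql;

  CREATE TRIGGER emailrcpts_set_dates
      BEFORE UPDATE ON EMAILRCPTS
      FOR EACH ROW EXECUTE PROCEDURE set_latest_dates();

Anything much heavier than that single assignment would help explain the
~60 rows/second mentioned earlier in the thread.)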

> 5.        We are planning an upgrade to 8.2.4 in the near future – hopefully
> in a couple months.

Cool.  Start testing now.  You'd be surprised how many test cases
you'll find where your laptop running 8.2 will outrun your server
running 7.x...

Nothing too out of the ordinary in your schema, but those
multi-dimensional arrays might be a bit expensive, especially for
updates.
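
(As noted above, each UPDATE writes a complete new row version, so wide
columns make every one of those 17 million writes bigger.  A column
declared along these lines, purely as an illustration since the posted
schema isn't shown in this excerpt:

  samples integer[][]

gets carried along into every new row version even though the trigger
never touches it.)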