Home > mailing lists

Re: Optimizing bulk update performance - Mailing list pgsql-general

From	Yang Zhang
Subject	Re: Optimizing bulk update performance
Date	April 27, 2013 01:35:56
Msg-id	CAKxBDU8diA6tOLUT5sWaR_r0sD+1=oxrv6Oq=UKbfJmWVY5ADQ@mail.gmail.com Whole thread
In response to	Re: Optimizing bulk update performance (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Responses	Re: Optimizing bulk update performance
List	pgsql-general

Tree view

We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS
volumes, but prelim. benchmarks suggest this won't get us enough of a
boost as much as possibly refactoring the way we're executing these
bulk updates in our application.)

On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
> On 27/04/13 12:14, Yang Zhang wrote:
>
> It currently takes up to 24h for us to run a large set of UPDATE
> statements on a database, which are of the form:
>
>     UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
>     id = constid
>
> (We're just overwriting fields of objects identified by ID.)
>
> The tables have handfuls of indices each and no foreign key constraints.
> No COMMIT is made till the end.
>
> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
> baseline we should reasonably target.
>
> Short of producing a custom program that somehow reconstructs a dataset
> for Postgresql to re-import, is there anything we can do to bring the
> bulk UPDATE performance closer to that of the import?  (This is an area
> that we believe log-structured merge trees handle well, but we're
> wondering if there's anything we can do within Postgresql.)
>
> Some ideas:
>
> - dropping all non-ID indices and rebuilding afterward?
> - increasing checkpoint_segments, but does this actually help sustained
>   long-term throughput?
> - using the techniques mentioned here?  (Load new data as table, then
>   "merge in" old data where ID is not found in new data)
>
> <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>
>
> Basically there's a bunch of things to try and we're not sure what the
> most effective are or if we're overlooking other things.  We'll be
> spending the next few days experimenting, but we thought we'd ask here
> as well.
>
> Thanks.
>
>
> People will need to know your version of Postgres & which Operating System
> etc. plus details of CPU RAM, and Disks... AS well as what changes you have
> made to postgresql.conf...
>
> I would be inclined to DROP all indexes and reCREATE them later.
>
> Updating a row might lead to new row being added in a new disk page, so I
> suspect that updates will hit every index associated with the table with the
> (possible exception of partial indexes).
>
> Running too many updates in one transaction, may mean that Postgres may need
> to use disk work files.
>
> Depending on RAM etc, it may pay to increase some variables tat affect how
> Postgres uses RAM, some of these are per session.
>
>
> Cheers,
> Gavin



--
Yang Zhang
http://yz.mit.edu/

pgsql-general by date:

From: Gavin Flower
Date: 27 April 2013, 00:27:45
Subject: Re: Optimizing bulk update performance

From: Yang Zhang
Date: 27 April 2013, 01:45:00
Subject: Basic question on recovery and disk snapshotting

Re: Optimizing bulk update performance - Mailing list pgsql-general

Previous

Next