Re: Optimizing bulk update performance - Mailing list pgsql-general
From | Gavin Flower |
---|---|
Subject | Re: Optimizing bulk update performance |
Date | |
Msg-id | 517B3163.9020708@archidevsys.co.nz |
In response to | Re: Optimizing bulk update performance (Yang Zhang <yanghatespam@gmail.com>) |
Responses | Re: Optimizing bulk update performance |
List | pgsql-general |
Please do not top post; the convention on this list is to add replies at the end, apart from comments interspersed to make use of the appropriate context!

On 27/04/13 13:35, Yang Zhang wrote:
> We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
> m1.xlarge instances, which have:
>
> 15 GiB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 64-bit platform
>
> (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
> volumes, but prelim. benchmarks suggest this won't get us enough of a
> boost as much as possibly refactoring the way we're executing these
> bulk updates in our application.)
>
> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
> <GavinFlower@archidevsys.co.nz> wrote:
>> On 27/04/13 12:14, Yang Zhang wrote:
>>
>> It currently takes up to 24h for us to run a large set of UPDATE
>> statements on a database, which are of the form:
>>
>> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>> id = constid
>>
>> (We're just overwriting fields of objects identified by ID.)
>>
>> The tables have handfuls of indices each and no foreign key constraints.
>> No COMMIT is made till the end.
>>
>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>> baseline we should reasonably target.
>>
>> Short of producing a custom program that somehow reconstructs a dataset
>> for Postgresql to re-import, is there anything we can do to bring the
>> bulk UPDATE performance closer to that of the import? (This is an area
>> that we believe log-structured merge trees handle well, but we're
>> wondering if there's anything we can do within Postgresql.)
>>
>> Some ideas:
>>
>> - dropping all non-ID indices and rebuilding afterward?
>> - increasing checkpoint_segments, but does this actually help sustained
>> long-term throughput?
>> - using the techniques mentioned here? (Load new data as table, then
>> "merge in" old data where ID is not found in new data)
>>
>> <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>
>>
>> Basically there's a bunch of things to try and we're not sure what the
>> most effective are or if we're overlooking other things. We'll be
>> spending the next few days experimenting, but we thought we'd ask here
>> as well.
>>
>> Thanks.
>>
>>
>> People will need to know your version of Postgres and which Operating System
>> etc., plus details of CPU, RAM, and disks, as well as what changes you have
>> made to postgresql.conf...
>>
>> I would be inclined to DROP all indexes and reCREATE them later.
>>
>> Updating a row might lead to the new row being added in a new disk page, so I
>> suspect that updates will hit every index associated with the table (with the
>> possible exception of partial indexes).
>>
>> Running too many updates in one transaction may mean that Postgres needs
>> to use disk work files.
>>
>> Depending on RAM etc., it may pay to increase some variables that affect how
>> Postgres uses RAM; some of these are per session.
>>
>>
>> Cheers,
>> Gavin
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
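For reference, a minimal sketch of the approach discussed above (stage the new values in a scratch table, drop the non-ID indexes, apply one set-based UPDATE, then recreate the indexes). The table, column, index, and file names are hypothetical placeholders, not anything from the original thread, so treat this as an illustration under those assumptions rather than a tested recipe:

-- Hypothetical schema: mytable(id, field1, field2); index and file names are placeholders.
BEGIN;

-- Stage the new values. UNLOGGED (available from 9.1) skips WAL for the
-- scratch data; a TEMP table would work as well.
CREATE UNLOGGED TABLE mytable_updates (LIKE mytable);

-- COPY ... FROM a server-side file needs superuser; from psql, \copy
-- does the same from the client side.
COPY mytable_updates (id, field1, field2) FROM '/tmp/updates.csv' WITH (FORMAT csv);

-- Drop the non-ID indexes before the bulk update, as suggested above.
DROP INDEX IF EXISTS mytable_field1_idx;
DROP INDEX IF EXISTS mytable_field2_idx;

-- One set-based UPDATE instead of millions of single-row statements.
UPDATE mytable AS t
   SET field1 = u.field1,
       field2 = u.field2
  FROM mytable_updates AS u
 WHERE t.id = u.id;

-- Rebuild the indexes once, after all rows are in place.
CREATE INDEX mytable_field1_idx ON mytable (field1);
CREATE INDEX mytable_field2_idx ON mytable (field2);

DROP TABLE mytable_updates;
COMMIT;

Keeping everything in a single transaction matches the original workload's "no COMMIT till the end"; running the COPY and the UPDATE in separate transactions would also be reasonable if lock hold times matter.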