Re: Optimizing bulk update performance - Mailing list pgsql-general

From: Gavin Flower
Subject: Re: Optimizing bulk update performance
Msg-id: 517B3163.9020708@archidevsys.co.nz
In response to: Re: Optimizing bulk update performance (Yang Zhang <yanghatespam@gmail.com>)
Responses: Re: Optimizing bulk update performance
List: pgsql-general

Please do not top post; the convention on this list is to add replies at
the end, apart from comments interspersed to make use of the appropriate
context!

On 27/04/13 13:35, Yang Zhang wrote:
> We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
> m1.xlarge instances, which have:
>
> 15 GiB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 64-bit platform
>
> (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
> volumes, but preliminary benchmarks suggest this won't give us as much
> of a boost as refactoring the way we're executing these bulk updates
> in our application.)
>
> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
> <GavinFlower@archidevsys.co.nz> wrote:
>> On 27/04/13 12:14, Yang Zhang wrote:
>>
>> It currently takes up to 24h for us to run a large set of UPDATE
>> statements on a database, which are of the form:
>>
>>      UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
>>      id = constid
>>
>> (We're just overwriting fields of objects identified by ID.)
>>
>> The tables each have a handful of indices and no foreign key constraints.
>> No COMMIT is made till the end.
>>
>> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
>> baseline we should reasonably target.
>>
>> Short of producing a custom program that somehow reconstructs a dataset
>> for Postgresql to re-import, is there anything we can do to bring the
>> bulk UPDATE performance closer to that of the import?  (This is an area
>> that we believe log-structured merge trees handle well, but we're
>> wondering if there's anything we can do within Postgresql.)
>>
>> Some ideas:
>>
>> - dropping all non-ID indices and rebuilding afterward?
>> - increasing checkpoint_segments, but does this actually help sustained
>>    long-term throughput?
>> - using the techniques mentioned here?  (Load the new data as a table,
>>    then "merge in" old data where the ID is not found in the new data;
>>    see the sketch below.)
>>
>> <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>
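
A minimal sketch of the staging-table idea, assuming a hypothetical target
table "mytable" keyed on "id" with illustrative columns "field1"/"field2"
and a tab-separated input file (none of these names come from the posts
above, and this is only one variant of the technique in the linked message):

    BEGIN;

    -- Load the new values into a temporary staging table.
    CREATE TEMP TABLE staging (
        id     integer PRIMARY KEY,
        field1 text,
        field2 text
    );

    -- Server-side COPY; use psql's \copy if the file lives on the client.
    COPY staging (id, field1, field2) FROM '/tmp/new_values.tsv';

    -- One set-based UPDATE instead of many single-row statements.
    UPDATE mytable AS t
       SET field1 = s.field1,
           field2 = s.field2
      FROM staging AS s
     WHERE t.id = s.id;

    COMMIT;
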
>>
>> Basically there are a bunch of things to try, and we're not sure which
>> are the most effective or whether we're overlooking other options.  We'll
>> be spending the next few days experimenting, but we thought we'd ask here
>> as well.
>>
>> Thanks.
>>
>>
>> People will need to know your version of Postgres and which operating
>> system, etc., plus details of CPU, RAM, and disks, as well as what changes
>> you have made to postgresql.conf...
>>
>> I would be inclined to DROP all indexes and reCREATE them later.
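
For example, a sketch only, with hypothetical table and index names (the
original post does not name them), assuming the non-ID indexes can be
rebuilt once the bulk update has finished:

    -- Before the bulk update: drop the secondary indexes, keep the one on id.
    DROP INDEX IF EXISTS mytable_field1_idx;
    DROP INDEX IF EXISTS mytable_field2_idx;

    -- ... run the bulk UPDATE statements here ...

    -- Afterwards: rebuild them.  CONCURRENTLY avoids blocking other sessions,
    -- at the cost of a slower build, and cannot run inside a transaction.
    CREATE INDEX CONCURRENTLY mytable_field1_idx ON mytable (field1);
    CREATE INDEX CONCURRENTLY mytable_field2_idx ON mytable (field2);
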
>>
>> Updating a row might lead to a new row version being written to a new disk
>> page, so I suspect that updates will hit every index associated with the
>> table (with the possible exception of partial indexes).
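
The main exception is a HOT update, which PostgreSQL can use when no indexed
column changes and the new row version fits on the same page; in that case
the indexes are left untouched.  You can check how often that happens via the
statistics views; a sketch, using a hypothetical table name:

    SELECT relname, n_tup_upd, n_tup_hot_upd
      FROM pg_stat_user_tables
     WHERE relname = 'mytable';
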
>>
>> Running too many updates in one transaction may mean that Postgres needs to
>> spill to disk work files.
>>
>> Depending on RAM etc., it may pay to increase some of the settings that
>> affect how Postgres uses memory; some of these can be set per session.
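
For instance, a sketch only; the values are illustrative and would need to be
tuned against the 15 GiB of RAM and the workload described above, and
checkpoint_segments (mentioned earlier in the thread) only applies to
pre-9.5 servers such as the 9.1.9 in question:

    -- Per-session settings for the connection running the bulk update:
    SET work_mem = '256MB';              -- more memory for sorts/hashes
    SET maintenance_work_mem = '1GB';    -- speeds up the later index rebuilds

    -- In postgresql.conf (takes effect after a reload):
    --   checkpoint_segments = 64
    --   checkpoint_completion_target = 0.9
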
>>
>>
>> Cheers,
>> Gavin
>
>
> --
> Yang Zhang
> http://yz.mit.edu/


