Re: Optimizing bulk update performance - Mailing list pgsql-general
From | Gavin Flower |
---|---|
Subject | Re: Optimizing bulk update performance |
Date | |
Msg-id | 517B3163.9020708@archidevsys.co.nz |
In response to | Re: Optimizing bulk update performance (Yang Zhang <yanghatespam@gmail.com>) |
Responses | Re: Optimizing bulk update performance |
List | pgsql-general |
Please do not top post; the convention on this list is to add replies at the end, apart from comments interspersed to make use of the appropriate context!

On 27/04/13 13:35, Yang Zhang wrote:
> We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
> m1.xlarge instances, which have:
>
> 15 GiB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 64-bit platform
>
> (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
> volumes, but prelim. benchmarks suggest this won't get us enough of a
> boost as much as possibly refactoring the way we're executing these
> bulk updates in our application.)
>
> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
> <GavinFlower@archidevsys.co.nz> wrote:
>> On 27/04/13 12:14, Yang Zhang wrote:
>>
>> It currently takes up to 24h for us to run a large set of UPDATE
>> statements on a database, which are of the form:
>>
>> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>> id = constid
>>
>> (We're just overwriting fields of objects identified by ID.)
>>
>> The tables have handfuls of indices each and no foreign key constraints.
>> No COMMIT is made till the end.
>>
>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>> baseline we should reasonably target.
>>
>> Short of producing a custom program that somehow reconstructs a dataset
>> for Postgresql to re-import, is there anything we can do to bring the
>> bulk UPDATE performance closer to that of the import? (This is an area
>> that we believe log-structured merge trees handle well, but we're
>> wondering if there's anything we can do within Postgresql.)
>>
>> Some ideas:
>>
>> - dropping all non-ID indices and rebuilding afterward?
>> - increasing checkpoint_segments, but does this actually help sustained
>> long-term throughput?
>> - using the techniques mentioned here? (Load new data as table, then
>> "merge in" old data where ID is not found in new data)
>>
>> <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>
>>
>> Basically there's a bunch of things to try and we're not sure what the
>> most effective are or if we're overlooking other things. We'll be
>> spending the next few days experimenting, but we thought we'd ask here
>> as well.
>>
>> Thanks.
>>
>>
>> People will need to know your version of Postgres and which Operating System
>> etc., plus details of CPU, RAM, and disks, as well as what changes you have
>> made to postgresql.conf...
>>
>> I would be inclined to DROP all indexes and reCREATE them later.
>>
>> Updating a row might lead to the new row being added in a new disk page, so I
>> suspect that updates will hit every index associated with the table (with the
>> possible exception of partial indexes).
>>
>> Running too many updates in one transaction may mean that Postgres needs
>> to use disk work files.
>>
>> Depending on RAM etc., it may pay to increase some variables that affect how
>> Postgres uses RAM; some of these are per session.
>>
>>
>> Cheers,
>> Gavin
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
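For reference, a minimal sketch of the approach discussed above (stage the new values in a scratch table, drop the non-ID indexes, apply one set-based UPDATE, then recreate the indexes). The table, column, index, and file names are hypothetical placeholders, not anything from the original thread, so treat this as an illustration under those assumptions rather than a tested recipe:

-- Hypothetical schema: mytable(id, field1, field2); index and file names are placeholders.
BEGIN;

-- Stage the new values. UNLOGGED (available from 9.1) skips WAL for the
-- scratch data; a TEMP table would work as well.
CREATE UNLOGGED TABLE mytable_updates (LIKE mytable);

-- COPY ... FROM a server-side file needs superuser; from psql, \copy
-- does the same from the client side.
COPY mytable_updates (id, field1, field2) FROM '/tmp/updates.csv' WITH (FORMAT csv);

-- Drop the non-ID indexes before the bulk update, as suggested above.
DROP INDEX IF EXISTS mytable_field1_idx;
DROP INDEX IF EXISTS mytable_field2_idx;

-- One set-based UPDATE instead of millions of single-row statements.
UPDATE mytable AS t
   SET field1 = u.field1,
       field2 = u.field2
  FROM mytable_updates AS u
 WHERE t.id = u.id;

-- Rebuild the indexes once, after all rows are in place.
CREATE INDEX mytable_field1_idx ON mytable (field1);
CREATE INDEX mytable_field2_idx ON mytable (field2);

DROP TABLE mytable_updates;
COMMIT;

Keeping everything in a single transaction matches the original workload's "no COMMIT till the end"; running the COPY and the UPDATE in separate transactions would also be reasonable if lock hold times matter.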