Re: Optimizing bulk update performance - Mailing list pgsql-general

From: Yang Zhang
Subject: Re: Optimizing bulk update performance
Date:
Msg-id: CAKxBDU9vEbUoDYN7Pwe2SMrkJiifd5w3rtHXLqTuBkRiyJ=KZw@mail.gmail.com
In response to: Re: Optimizing bulk update performance (Yang Zhang <yanghatespam@gmail.com>)
List: pgsql-general
On Sat, Apr 27, 2013 at 12:24 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yang Zhang <yanghatespam@gmail.com> writes:
>>> It currently takes up to 24h for us to run a large set of UPDATE
>>> statements on a database, which are of the form:
>>
>>>     UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
>>>     id = constid
>>
>>> (We're just overwriting fields of objects identified by ID.)
>>
>> Forgive the obvious question, but you do have an index on "id", right?
>> Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
>
> Totally valid question.  That is the primary key with its own index.
> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
> Each individual query runs reasonably quickly (we can run several
> dozen such statements per second).
>
>>
>>> The tables have handfuls of indices each and no foreign key constraints.
>>
>> How much is a "handful"?
>
> The table with the largest volume of updates (our bottleneck) has four indexes:
>
>     "account_pkey" PRIMARY KEY, btree (id)
>     "account_createddate" btree (createddate)
>     "account_id_prefix" btree (id text_pattern_ops)
>     "account_recordtypeid" btree (recordtypeid)
>
>>
>>> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
>>> baseline we should reasonably target.
>>
>> Well, maybe.  You didn't say what percentage of the DB you're updating.
>
> It can be 10-50% of rows changed - a large portion.
>
>>
>> But the thing that comes to mind here is that you're probably incurring
>> a network round trip for each row, and maybe a query-planning round as
>> well, so you really can't expect that this is going to be anywhere near
>> as efficient as a bulk load operation.  You could presumably get rid of
>> the planner overhead by using a prepared statement.  Cutting the network
>> overhead is going to require a bit more ingenuity --- could you move
>> some logic into a stored procedure, perhaps, so that one command from
>> the client is sufficient to update multiple rows?
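
For reference, my rough understanding of the prepared-statement route (just
a sketch in psycopg2; "account" and the field names below are placeholders
standing in for our real schema):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # hypothetical connection parameters
    cur = conn.cursor()

    # Parse and plan the UPDATE once on the server ...
    cur.execute("""
        PREPARE upd(text, text, text) AS
        UPDATE account SET field1 = $1, field2 = $2 WHERE id = $3
    """)

    # ... then only ship parameters for each row.  This removes per-statement
    # planning, but still costs one network round trip per row.
    for field1, field2, pk in rows_to_update:   # rows_to_update: hypothetical iterable
        cur.execute("EXECUTE upd(%s, %s, %s)", (field1, field2, pk))

    conn.commit()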
>
> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>
> If we ship many UPDATE statements per call to our DB API's execution
> function (we're using Python's psycopg2 if that matters, but I think
> that just binds libpq), would that avoid the network round trip per
> statement?
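
(What I'm picturing is roughly the sketch below: mogrify each statement
client-side and send the whole batch in a single execute(), which should
mean a single round trip per batch.  Again, the names are placeholders.)

    # Build a batch of UPDATE statements client-side and send them in one call.
    # cur.mogrify() performs the same escaping execute() would; the joined
    # string reaches the server as a single query message.
    stmts = [
        cur.mogrify(
            "UPDATE account SET field1 = %s, field2 = %s WHERE id = %s",
            (field1, field2, pk),
        ).decode()
        for field1, field2, pk in batch      # batch: hypothetical list of tuples
    ]
    cur.execute("; ".join(stmts))
    conn.commit()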
>
> If not, what if we use anonymous procedures (DO) to run multiple
> UPDATE statements?
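
(Something along these lines is what I mean; the constants are obviously
placeholders, and in practice the block body would be generated and escaped
client-side:)

    # A single DO block bundles many UPDATEs into one statement, so the whole
    # batch is shipped and parsed once.
    cur.execute("""
        DO $$
        BEGIN
            UPDATE account SET field1 = 'v1', field2 = 'v2' WHERE id = 'id1';
            UPDATE account SET field1 = 'v3', field2 = 'v4' WHERE id = 'id2';
            -- ... more generated UPDATEs ...
        END
        $$
    """)
    conn.commit()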
>
> Finally, we could use the technique highlighted in my third bullet:
> use COPY (or at least a multi-valued INSERT) to load the new data,
> then merge it with the old.  Would that be the most direct route to
> maximum performance?
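
(Concretely: COPY the new values into a temporary table, then merge with a
single set-based UPDATE ... FROM, roughly as sketched below; the CSV file
and column names are placeholders.)

    # Stage the new values with COPY, then merge them in one set-based UPDATE.
    cur.execute("""
        CREATE TEMP TABLE account_updates
            (id text, field1 text, field2 text) ON COMMIT DROP
    """)
    with open("updates.csv") as f:           # hypothetical CSV of id,field1,field2
        cur.copy_expert(
            "COPY account_updates (id, field1, field2) FROM STDIN WITH CSV", f)
    cur.execute("""
        UPDATE account AS a
           SET field1 = u.field1,
               field2 = u.field2
          FROM account_updates AS u
         WHERE a.id = u.id
    """)
    conn.commit()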
>
> In any case, I assume dropping and rebuilding the indexes is important
> here, yes?  But what about raising checkpoint_segments - does this
> actually help sustained throughput?

(I ask because I'm wondering whether raising checkpoint_segments simply
postpones inevitable work, or whether accumulating a larger batch of
changes between checkpoints really does improve sustained throughput.)
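
(For the index part of the question, what I have in mind is roughly the
following: drop the secondary indexes, run the bulk update or merge, then
rebuild them once at the end.  The primary key stays, since the updates and
the merge join on id.  The index definitions are taken from the table above;
"account" as the table name is my assumption from "account_pkey".)

    # Drop the secondary indexes so they aren't maintained row by row during
    # the bulk update, then rebuild them once afterwards.  account_pkey is
    # kept because the UPDATEs (or the merge) look rows up by id.
    for name in ("account_createddate", "account_id_prefix", "account_recordtypeid"):
        cur.execute("DROP INDEX IF EXISTS " + name)

    # ... run the bulk update / COPY-and-merge here ...

    cur.execute("CREATE INDEX account_createddate ON account (createddate)")
    cur.execute("CREATE INDEX account_id_prefix ON account (id text_pattern_ops)")
    cur.execute("CREATE INDEX account_recordtypeid ON account (recordtypeid)")
    conn.commit()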

>
>>
>>                         regards, tom lane
>
>
> --
> Yang Zhang
> http://yz.mit.edu/



--
Yang Zhang
http://yz.mit.edu/

