Re: Optimizing bulk update performance - Mailing list pgsql-general

From Yang Zhang
Subject Re: Optimizing bulk update performance
Date
Msg-id CAKxBDU94Z5qowUBAkbxbb2N4GCfJn7+TGsq6fm5e4fgj8QRkeA@mail.gmail.com
In response to Re: Optimizing bulk update performance  (Yang Zhang <yanghatespam@gmail.com>)
List pgsql-general
On Sat, Apr 27, 2013 at 2:54 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@gmail.com> wrote:
>> Hi,
>>
>> If dataset for update is large...
>>
>> Maybe best would be:
>>
>> From the client machine, instead of sending UPDATE statements with data,
>> export the data to a file ready for the COPY command
>> Transfer the file to the server where PG is running
>> Make a pl/pgsql function which:
>>
>> Creates a temp table
>> Copies the file into the temp table
>>
>> Updates the original table with the values in the temp table
>>
>> UPDATE foo
>> SET col1 = bar.col1
>> FROM bar
>> WHERE foo.id = bar.id
>>
>> You don't need to do delete/insert if you have just update commands...
>>
>> From the client, when the file is transferred, call your import function
>> on the server
>>
>> Optionally you can run VACUUM ANALYZE after the bulk operation...
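>>
>> Putting those steps together, roughly (the table, column and file names
>> here are just placeholders):
>>
>> BEGIN;
>>
>> -- staging table with the same structure as the target table
>> CREATE TEMP TABLE bar (LIKE foo) ON COMMIT DROP;
>>
>> -- server-side COPY from the transferred file
>> COPY bar FROM '/path/on/server/foo_updates.csv' WITH (FORMAT csv);
>>
>> -- merge the new values into the original table
>> UPDATE foo
>> SET col1 = bar.col1,
>>     col2 = bar.col2
>> FROM bar
>> WHERE foo.id = bar.id;
>>
>> COMMIT;
>>
>> VACUUM ANALYZE foo;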
>
> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> scattered around) to mark rows as dead (per MVCC)?  I guess it comes
> down to: will PG be smart enough to mark dead rows in largely
> sequential scans (rather than, say, jumping around in whatever order
> rows from foo are yielded by the above join)?

(This then raises the question: how might I see this seemingly
substantial performance implication, one way or the other, in, say,
EXPLAIN output or something like that?)
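
(For instance, I imagine something like the following, wrapped in a
transaction that we then roll back, would at least show how many buffers
the UPDATE touches - though I'm not sure it answers the ordering
question:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;
ROLLBACK;
)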

>
> In other words, when considering the alternative of:
>
> CREATE TABLE newfoo AS
> SELECT * FROM bar
> UNION
> SELECT * FROM foo
> WHERE id NOT IN (SELECT id FROM bar);
>
> Wouldn't this alternative be faster?
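>
> (Assuming we'd then recreate the indexes on newfoo and swap it in for
> foo, i.e. roughly:
>
> ALTER TABLE newfoo ADD PRIMARY KEY (id);
> -- ...recreate the other secondary indexes...
> DROP TABLE foo;
> ALTER TABLE newfoo RENAME TO foo;
> )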
>
>>
>> Kind regards,
>>
>> Misa
>>
>>
>>
>> On Saturday, April 27, 2013, Yang Zhang wrote:
>>>
>>> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> > Yang Zhang <yanghatespam@gmail.com> writes:
>>> >> It currently takes up to 24h for us to run a large set of UPDATE
>>> >> statements on a database, which are of the form:
>>> >
>>> >>     UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
>>> >>     id = constid
>>> >
>>> >> (We're just overwriting fields of objects identified by ID.)
>>> >
>>> > Forgive the obvious question, but you do have an index on "id", right?
>>> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
>>>
>>> Totally valid question.  That is the primary key with its own index.
>>> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
>>> Each individual query runs reasonably quickly (we can run several
>>> dozen such statements per second).
>>>
>>> >
>>> >> The tables have handfuls of indices each and no foreign key
>>> >> constraints.
>>> >
>>> > How much is a "handful"?
>>>
>>> The table with the largest volume of updates (our bottleneck) has four
>>> indexes:
>>>
>>>     "account_pkey" PRIMARY KEY, btree (id)
>>>     "account_createddate" btree (createddate)
>>>     "account_id_prefix" btree (id text_pattern_ops)
>>>     "account_recordtypeid" btree (recordtypeid)
>>>
>>> >
>>> >> It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
>>> >> baseline we should reasonably target.
>>> >
>>> > Well, maybe.  You didn't say what percentage of the DB you're updating.
>>>
>>> It can be 10-50% of rows changed - a large portion.
>>>
>>> >
>>> > But the thing that comes to mind here is that you're probably incurring
>>> > a network round trip for each row, and maybe a query-planning round as
>>> > well, so you really can't expect that this is going to be anywhere near
>>> > as efficient as a bulk load operation.  You could presumably get rid of
>>> > the planner overhead by using a prepared statement.  Cutting the network
>>> > overhead is going to require a bit more ingenuity --- could you move
>>> > some logic into a stored procedure, perhaps, so that one command from
>>> > the client is sufficient to update multiple rows?
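>>> >
>>> > (A prepared statement here might look roughly like this - the table
>>> > name and parameter types are just for illustration:
>>> >
>>> > PREPARE upd(text, text, int) AS
>>> >   UPDATE mytable SET field1 = $1, field2 = $2 WHERE id = $3;
>>> >
>>> > EXECUTE upd('a', 'b', 1);
>>> > EXECUTE upd('c', 'd', 2);
>>> > )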
>>>
>>> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>>>
>>> If we ship many UPDATE statements per call to our DB API's execution
>>> function (we're using Python's psycopg2 if that matters, but I think
>>> that just binds libpq), would that avoid the network round trip per
>>> statement?
>>>
>>> If not, what if we use anonymous procedures (DO) to run multiple
>>> UPDATE statements?
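>>>
>>> (I.e., something like this per batch - the values are just
>>> placeholders:
>>>
>>> DO $$
>>> BEGIN
>>>   UPDATE mytable SET field1 = 'a', field2 = 'b' WHERE id = 1;
>>>   UPDATE mytable SET field1 = 'c', field2 = 'd' WHERE id = 2;
>>>   -- ...
>>> END
>>> $$;
>>> )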
>>>
>>> Finally, we could use the technique highlighted in my third bullet and
>>> use COPY (or at least multiple-value INSERT), then merging the new
>>> data with the old.  Would that be the most direct route to maximum
>>> performance?
>>>
>>> In any case, I assume deleting and rebuilding indexes is important
>>> here, yes?  But what about raising checkpoint_segments - does this
>>> actually help sustained throughput?
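>>>
>>> (By deleting and rebuilding indexes I mean, for each secondary index -
>>> assuming the table is named account - something like:
>>>
>>> DROP INDEX account_createddate;
>>> -- ... run the bulk update ...
>>> CREATE INDEX account_createddate ON account (createddate);
>>>
>>> and for checkpoints, bumping checkpoint_segments in postgresql.conf to
>>> something like 32 for the duration of the load.)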
>>>
>>> >
>>> >                         regards, tom lane
>>>
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/
>>>
>>>
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/



--
Yang Zhang
http://yz.mit.edu/

