Re: is single row update improved with function - Mailing list pgsql-general

From: Rob Sargent
Subject: Re: is single row update improved with function
Date:
Msg-id: E9E1FEFB-D29B-4B1D-B574-A05877ABC13D@gmail.com
In response to: Re: is single row update improved with function (David Rowley <david.rowley@2ndquadrant.com>)
List: pgsql-general
> On Dec 31, 2017, at 4:31 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 1 January 2018 at 12:06, Rob Sargent <robjsargent@gmail.com> wrote:
>> I must update 3M of 100M records, with tuple specific modifications.  I can generate the necessary sql, but I’m
>> wondering if files of simple update statements affecting a single row are more effective than files of a function
>> call doing the same update given the necessary values, including where clause restrictions?  The query plan set by
>> the first should be decent for the remainder.
>>
>> Alternatively, would a bulk load into a table of replacement values and join info be the fastest way?
>
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
>
Sorry, I didn’t address the question about the table's usage.  Currently we’re in a data loading phase and this table
is almost complete.  Thereafter (post vacuum analyze) it will be 99.99% read-only.  The remainder will be updated in
much the same fashion as described earlier (set events_x = events_x + increment).

This table will be analyzed in a couple of ways, mainly determining significance thresholds across various slices, each
of which examines roughly one tenth of the records.
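
For concreteness, the function-call variant I was weighing would look roughly like this; the function name, argument
names, and the id key are just illustrative, and the real where clause carries the tuple-specific restrictions:

create or replace function bump_events(p_id bigint, p_increment integer)
returns void
language plpgsql as $$
begin
  -- single-row update, keyed on what I'm assuming here is the primary key
  update events
     set events_x = events_x + p_increment
   where id = p_id;
end;
$$;

-- the generated file would then just be ~3M lines of
select bump_events(12345, 7);

The hope being that the plan for the update inside the function is cached and reused across calls within a session,
rather than parsing and planning each generated UPDATE statement separately.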

>> Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions
>> (350 rows per tx) if less is more in this world.
>
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
>
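
Good point. If I go the bulk-load route, one round per batch with a VACUUM between rounds would look something like
the following; the staging table, its columns, and the batch_no grouping are hypothetical names for illustration:

-- load the ~3M replacement rows once, e.g. with COPY
-- copy staging (event_id, increment, batch_no) from '/path/to/replacements.csv' with (format csv);

begin;
update events e
   set events_x = e.events_x + s.increment
  from staging s
 where e.id = s.event_id
   and s.batch_no = 1;    -- ~7500 rows per batch
commit;

vacuum events;    -- VACUUM cannot run inside a transaction block
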
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
>
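
A quick sanity check I can run for one of the 393 batches, to see whether it gets an index scan on events rather than
a seq scan (same hypothetical staging table as above):

explain (costs off)
update events e
   set events_x = e.events_x + s.increment
  from staging s
 where e.id = s.event_id
   and s.batch_no = 1;

If that still shows a seq scan on events for every batch, the single whole-table join is probably the better bet, as
you say.
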
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
>
> --
> David Rowley                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
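
Agreed on benchmarking. Something along these lines should let me time one representative batch each way without
applying the changes, again using the same hypothetical staging setup:

begin;
explain (analyze, buffers)
update events e
   set events_x = e.events_x + s.increment
  from staging s
 where e.id = s.event_id
   and s.batch_no = 1;
rollback;   -- discard the test run's changes

and then compare that against a file of the generated single-row statements (or function calls) covering the same
batch.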


