Thread: Optimizing bulk update performance
It currently takes up to 24h for us to run a large set of UPDATE statements on a database, which are of the form:

UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints. No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB. This seems like a baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset for Postgresql to re-import, is there anything we can do to bring the bulk UPDATE performance closer to that of the import? (This is an area that we believe log-structured merge trees handle well, but we're wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained long-term throughput?
- using the techniques mentioned here? (Load new data as a table, then "merge in" old data where the ID is not found in the new data - a rough sketch follows after this message) <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>

Basically there's a bunch of things to try and we're not sure what the most effective are or if we're overlooking other things. We'll be spending the next few days experimenting, but we thought we'd ask here as well.

Thanks.
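For concreteness, a minimal sketch of the third idea (stage the new values, then apply them in one set-based pass), assuming a hypothetical target table account(id, field1, field2) and a CSV of replacement rows; the names, the file path, and the CSV format are illustrative assumptions, not the poster's actual schema:

BEGIN;
-- Stage the incoming rows (psql's \copy is the client-side equivalent of COPY).
CREATE TEMP TABLE account_new (LIKE account);
COPY account_new (id, field1, field2) FROM '/tmp/account_new.csv' WITH CSV;

-- Overwrite matching rows in place with one set-based statement.
UPDATE account
SET field1 = n.field1,
    field2 = n.field2
FROM account_new n
WHERE account.id = n.id;
COMMIT;

Whether this in-place variant or a full table rebuild wins depends heavily on the fraction of rows touched and on how many indexes have to be maintained or recreated.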
On 27/04/13 12:14, Yang Zhang wrote:
People will need to know your version of Postgres & which Operating System etc., plus details of CPU, RAM, and disks... as well as what changes you have made to postgresql.conf...
I would be inclined to DROP all indexes and reCREATE them later.
Updating a row might lead to the new row version being added in a new disk page, so I suspect that updates will hit every index associated with the table (with the possible exception of partial indexes).
Running too many updates in one transaction may mean that Postgres needs to use disk work files.
Depending on RAM etc., it may pay to increase some variables that affect how Postgres uses RAM; some of these are per session.
Cheers,
Gavin
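A rough sketch of what Gavin's two suggestions might look like in practice; the index and table names and the memory values are placeholders rather than tuned recommendations:

-- Per-session memory knobs (both can be raised for just the bulk-update session).
SET work_mem = '256MB';              -- sorts/hashes in large set-based statements
SET maintenance_work_mem = '1GB';    -- speeds up CREATE INDEX when rebuilding

-- Drop secondary indexes before the bulk update, rebuild them afterwards.
DROP INDEX IF EXISTS some_secondary_idx;
-- ... run the bulk UPDATEs ...
CREATE INDEX some_secondary_idx ON some_table (some_column);

Capturing the original index definitions first (e.g. with pg_dump --schema-only) makes the rebuild step mechanical.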
We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS volumes, but prelim. benchmarks suggest this won't give us as much of a boost as possibly refactoring the way we're executing these bulk updates in our application.)
Please do not top post; the convention on these lists is to add stuff at the end, apart from comments interspersed to make use of appropriate context!
On Fri, Apr 26, 2013 at 7:01 PM, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
> Please do not top post; the convention on these lists is to add stuff at the
> end, apart from comments interspersed to make use of appropriate context!

Noted, thanks. Anyway, any performance hints are greatly appreciated.
Yang Zhang <yanghatespam@gmail.com> writes:
> It currently takes up to 24h for us to run a large set of UPDATE
> statements on a database, which are of the form:
>
> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
> id = constid
>
> (We're just overwriting fields of objects identified by ID.)

Forgive the obvious question, but you do have an index on "id", right? Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

> The tables have handfuls of indices each and no foreign key constraints.

How much is a "handful"?

> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
> baseline we should reasonably target.

Well, maybe. You didn't say what percentage of the DB you're updating.

But the thing that comes to mind here is that you're probably incurring a network round trip for each row, and maybe a query-planning round as well, so you really can't expect that this is going to be anywhere near as efficient as a bulk load operation. You could presumably get rid of the planner overhead by using a prepared statement. Cutting the network overhead is going to require a bit more ingenuity --- could you move some logic into a stored procedure, perhaps, so that one command from the client is sufficient to update multiple rows?

			regards, tom lane
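To illustrate Tom's prepared-statement point: planning the UPDATE once and re-executing it removes the per-statement planning overhead, though each EXECUTE is still its own round trip. A minimal sketch, with the table, column, and parameter names as placeholders for the real schema:

PREPARE upd(text, text, text) AS
  UPDATE account SET field1 = $2, field2 = $3 WHERE id = $1;

-- Re-executed many times; only parameter values are sent each time.
EXECUTE upd('constid1', 'constant1a', 'constant2a');
EXECUTE upd('constid2', 'constant1b', 'constant2b');

DEALLOCATE upd;

Cutting the round trips themselves is the second half of Tom's suggestion, e.g. a server-side function or a staging table.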
On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Forgive the obvious question, but you do have an index on "id", right?
> Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

Totally valid question. That is the primary key with its own index. Yes, we verified that EXPLAIN says it just uses a simple index scan. Each individual query runs reasonably quickly (we can run several dozen such statements per second).

> How much is a "handful"?

The table with the largest volume of updates (our bottleneck) has four indexes:

"account_pkey" PRIMARY KEY, btree (id)
"account_createddate" btree (createddate)
"account_id_prefix" btree (id text_pattern_ops)
"account_recordtypeid" btree (recordtypeid)

> Well, maybe. You didn't say what percentage of the DB you're updating.

It can be 10-50% of rows changed - a large portion.

> But the thing that comes to mind here is that you're probably incurring
> a network round trip for each row, and maybe a query-planning round as
> well, so you really can't expect that this is going to be anywhere near
> as efficient as a bulk load operation. You could presumably get rid of
> the planner overhead by using a prepared statement. Cutting the network
> overhead is going to require a bit more ingenuity --- could you move
> some logic into a stored procedure, perhaps, so that one command from
> the client is sufficient to update multiple rows?

You're right, we're only sequentially issuing (unprepared) UPDATEs.

If we ship many UPDATE statements per call to our DB API's execution function (we're using Python's psycopg2 if that matters, but I think that just binds libpq), would that avoid the network round trip per statement?

If not, what if we use anonymous procedures (DO) to run multiple UPDATE statements?

Finally, we could use the technique highlighted in my third bullet and use COPY (or at least multiple-value INSERT), then merge the new data with the old. Would that be the most direct route to maximum performance?

In any case, I assume deleting and rebuilding indexes is important here, yes? But what about raising checkpoint_segments - does this actually help sustained throughput?
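On the DO-block question, a sketch of what that could look like, assuming the new values have first been loaded into a hypothetical staging table tmp(id, field1, field2); one command from the client then applies many updates server-side, avoiding a round trip per row:

DO $$
DECLARE
  r record;
BEGIN
  -- A single UPDATE inside a loop, so its plan can be reused across iterations.
  FOR r IN SELECT id, field1, field2 FROM tmp LOOP
    UPDATE account
    SET field1 = r.field1,
        field2 = r.field2
    WHERE id = r.id;
  END LOOP;
END;
$$;

A single set-based UPDATE ... FROM tmp join would normally beat this row-at-a-time loop, so the loop form is mainly interesting when per-row logic is unavoidable.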
On Sat, Apr 27, 2013 at 12:24 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> In any case, I assume deleting and rebuilding indexes is important
> here, yes? But what about raising checkpoint_segments - does this
> actually help sustained throughput?

(I ask because I'm wondering if raising checkpoint_segments simply postpones inevitable work, or if collecting a larger amount of changes really does dramatically improve throughput somehow.)
Hi,

If the dataset for the update is large...

Maybe the best approach would be:
From the client machine, instead of sending update statements with data - export the data to a file ready for the COPY command
Transfer the file to the server where pg is running
Make a pgsql function which:

Create temp table
Copy to temp from the file

Update original table with values in temp
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id
You don't need to do delete/insert - if you have just update commands...
From the client, when the file is transferred - call your import function on the server
Optionally you can run VACUUM ANALYZE after the bulk operation...
Kind regards,
Misa
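Pulling Misa's steps together, a sketch of what the server-side import function could look like; the function name, the file path, and the col1/col2 column names are assumptions standing in for the real schema:

CREATE OR REPLACE FUNCTION import_account_updates(path text) RETURNS void AS $$
BEGIN
  -- Staging table shaped like the target; dropped automatically at commit.
  CREATE TEMP TABLE tmp (LIKE account) ON COMMIT DROP;

  -- Server-side COPY of the transferred file into the staging table.
  EXECUTE format('COPY tmp FROM %L WITH CSV', path);

  -- One set-based update joining staging rows to the target.
  UPDATE account a
  SET col1 = t.col1,
      col2 = t.col2
  FROM tmp t
  WHERE a.id = t.id;
END;
$$ LANGUAGE plpgsql;

-- Once the file is on the server:
SELECT import_account_updates('/tmp/account_updates.csv');
VACUUM ANALYZE account;

Note that COPY from a server-side file requires superuser rights; psql's \copy or COPY FROM STDIN are the client-side alternatives.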
On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@gmail.com> wrote:
> Update original table with values in temp
>
> UPDATE foo
> SET col1 = bar.col1
> FROM bar
> WHERE foo.id = bar.id
>
> You don't need to do delete/insert - if you have just update commands...

But wouldn't a bulk UPDATE touch many existing pages (say, 20% scattered around) to mark rows as dead (per MVCC)? I guess it comes down to: will PG be smart enough to mark dead rows in largely sequential scans (rather than, say, jumping around in whatever order rows from foo are yielded by the above join)?

In other words, when considering the alternative of:

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION
SELECT * FROM foo
WHERE id NOT IN (SELECT id FROM bar);

Wouldn't this alternative be faster?
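On the NOT IN formulation specifically: when bar is large, an anti-join form of the same rebuild is often easier for the planner to execute well. A hedged, roughly equivalent rewrite (assuming id is unique and non-null in both tables), using the same hypothetical foo/bar names:

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION ALL
SELECT f.*
FROM foo f
WHERE NOT EXISTS (SELECT 1 FROM bar b WHERE b.id = f.id);

UNION ALL is used because the NOT EXISTS branch already excludes the overlapping ids, so the duplicate-eliminating sort that plain UNION performs is unnecessary.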
On Sat, Apr 27, 2013 at 2:54 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> scattered around) to mark rows as dead (per MVCC)? I guess it comes
> down to: will PG be smart enough to mark dead rows in largely
> sequential scans (rather than, say, jumping around in whatever order
> rows from foo are yielded by the above join)?

(This then begs the question - how might I see this seemingly substantial performance implication, one way or the other, in say EXPLAIN output or something like that?)
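One way to look at exactly that, assuming the same hypothetical foo/bar tables: EXPLAIN (ANALYZE, BUFFERS) on the set-based UPDATE shows whether the planner chose a sequential scan or scattered index fetches, and how many pages were read and dirtied (the BUFFERS option exists on the 9.1 release mentioned above):

BEGIN;
-- ANALYZE really executes the UPDATE, hence the ROLLBACK below.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;
ROLLBACK;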
I don't know - you can test :)
In the whole solution it is just one command different - so it's easy to test and compare...
To me it doesn't sound faster... It sounds like more operations are needed than should be done...
And it produces more problems... i.e. what about table foo? What if another table references foo, etc...
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.simic@gmail.com> wrote:
> I don't know - you can test :)

I probably will, but I do have a huge stack of such experiments to run by now, and it's always tricky / takes care to get benchmarks right, avoid disk caches, etc. Certainly I think it would be helpful (or at least hopefully not harmful) to ask here to see if anyone might just know. That's what brought me to this list. :)

> In the whole solution it is just one command different - so it's easy to test and compare...
>
> To me it doesn't sound faster... It sounds like more operations are needed than should be done...
>
> And it produces more problems... i.e. what about table foo? What if another table references foo, etc...

Yep, I guess more specifically I was just thinking of dumping to a temp table:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

The question I have remaining is whether the bulk UPDATE will be able to update many rows efficiently (smartly order them to do largely sequential scans) - if so, I imagine it would be faster than the above.
Well
About the best approach with large datasets - rarely is there an "always true" best principle...
You will always see there are a few ways - which one is best, only testing confirms - it depends on many things like hardware, OS, etc... Sometimes it even depends on the dataset for the update...
"
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
"
The above doesn't make sense to me...
I would do:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo where 1=2;
COPY tmp FROM 'pathtofile';
UPDATE foo
SET col1 = tmp.col1,
    ...
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id
That's in case I know I need just updates... If my dataset has a mix of updates and potentially new rows,
instead of the UPDATE command I would do:
INSERT INTO foo
SELECT * FROM tmp;
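For the mixed case, inserting every staging row as-is would collide with ids that already exist in foo (on its primary key), so the usual pattern is to update the matches and insert only the genuinely new rows. A sketch with the same hypothetical foo/tmp names, where col1/col2 stand in for the full column list:

UPDATE foo
SET col1 = t.col1,
    col2 = t.col2
FROM tmp t
WHERE foo.id = t.id;

INSERT INTO foo
SELECT t.*
FROM tmp t
WHERE NOT EXISTS (SELECT 1 FROM foo f WHERE f.id = t.id);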
Yang Zhang <yanghatespam@gmail.com> writes:
> You're right, we're only sequentially issuing (unprepared) UPDATEs.

You definitely want to fix both parts of that, then.

> If we ship many UPDATE statements per call to our DB API's execution
> function (we're using Python's psycopg2 if that matters, but I think
> that just binds libpq), would that avoid the network round trip per
> statement?

Possibly, not sure how psycopg2 handles that.

> If not, what if we use anonymous procedures (DO) to run multiple
> UPDATE statements?

I don't think an anonymous procedure as such would result in any plan caching, at least not unless you could write it to have a single UPDATE in a loop.

> Finally, we could use the technique highlighted in my third bullet and
> use COPY (or at least multiple-value INSERT), then merge the new
> data with the old. Would that be the most direct route to maximum
> performance?

It might help, you'd need to try it.

> In any case, I assume deleting and rebuilding indexes is important
> here, yes? But what about raising checkpoint_segments - does this
> actually help sustained throughput?

If you're updating as much as 50% of the table, and you don't need the indexes for other purposes meanwhile, dropping and rebuilding them would be worth trying. Also, you definitely want checkpoint_segments large enough so that checkpoints are at least a few minutes apart. Excess checkpoints do represent a sustained drag on performance because they mean a greater volume of disk writes.

			regards, tom lane
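A sketch of the checkpoint-related settings Tom is referring to, as postgresql.conf entries for the 9.1 series; the values are illustrative starting points rather than tuned recommendations, the aim simply being to space checkpoints at least a few minutes apart during the bulk run:

# 9.1-era names; checkpoint_segments was later replaced by max_wal_size (9.5+).
checkpoint_segments = 64            # default is 3; allow far more WAL between checkpoints
checkpoint_completion_target = 0.9  # spread checkpoint I/O over more of the interval
checkpoint_timeout = 15min          # upper bound on the time between checkpoints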
On 2013-04-27, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@gmail.com> wrote:
>> Optionally you can run VACUUM ANALYZE after the bulk operation...
>
> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> scattered around) to mark rows as dead (per MVCC)? I guess it comes
> down to: will PG be smart enough to mark dead rows in largely
> sequential scans (rather than, say, jumping around in whatever order
> rows from foo are yielded by the above join)?

A plpgsql FOR-IN-query loop isn't going to be that smart; it's a procedural language and does things procedurally. If you want to do set operations, use SQL. This:

UPDATE existing-table SET .... FROM temp-table WHERE join-condition;

will likely get you a sequential scan over the existing table and should be reasonably performant as long as temp-table is small enough to fit in memory.

--
⚂⚃ 100% natural