Thread: Savepoints in transactions for speed?
I need to delete about 1.5 million records from a table and reload it in one transaction. The usual advice when loading with inserts seems to be group them into transactions of around 1k records. Committing at that point would leave the table in an inconsistent state. Would issuing a savepoint every 1k or so records negate whatever downside there is to keeping a transaction open for all 1.5 million records, or just add more overhead?
The data to reload the table is coming from a Perl DBI connection to a different database (not PostgreSQL) so I'm not sure the COPY alternative applies here.
Any suggestions are welcome.
Mike
On 27/11/12 22:04, Mike Blackwell wrote: > I need to delete about 1.5 million records from a table and reload it > in one transaction. > The data to reload the table is coming from a Perl DBI connection to a > different database (not PostgreSQL) so I'm not sure the COPY > alternative applies here. No reason why it shouldn't. https://metacpan.org/module/DBD::Pg#COPY-support -- Richard Huxton Archonet Ltd
On Nov 27, 2012, at 2:04 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote: > I need to delete about 1.5 million records from a table and reload it in one transaction. The usual advice when loadingwith inserts seems to be group them into transactions of around 1k records. Committing at that point would leavethe table in an inconsistent state. I'd probably just do the whole thing in one transaction. Do you have specific reasons you want to avoid a long transaction, or just relying on rules of thumb? Postgresql isn't goingto run out of resources doing a big transaction, in the way some other databases will. Long running transactions will interfere with vacuuming, but inserting a couple of million rows shouldn't take that long. > Would issuing a savepoint every 1k or so records negate whatever downside there is to keeping a transaction open for all1.5 million records, or just add more overhead? Savepoints are going to increase overhead and have no effect on the length of the transaction. If you want to catch errorsand not have to redo the entire transaction, they're great, but that's about it. > The data to reload the table is coming from a Perl DBI connection to a different database (not PostgreSQL) so I'm not surethe COPY alternative applies here. COPY works nicely from perl: $dbh->do("COPY foo FROM STDIN"); $dbh->pg_putcopydata("foo\tbar\tbaz\n"); $dbh->pg_putcopyend(); The details are in DBD::Pg. I use this a lot for doing big-ish (tens of millions of rows) bulk inserts. It's not as fastas you can get, but it's probably as fast as you can get with perl. Cheers, Steve
Mike,
Is there anything that the 1.5 million rows have in common that would allow you to use partitions? if so, you could load the new data into a partition at your leisure, start a transaction, alter the partition table with the old data to no longer inherit from the parent, alter the new partition table to
inherit from the parent, commit, then drop the old table. This operation would be very fast, the users probably won't even notice.
Bob Lunney
I need to delete about 1.5 million records from a table and reload it in one transaction. The usual advice when loading with inserts seems to be group them into transactions of around 1k records. Committing at that point would leave the table in an inconsistent state. Would issuing a savepoint every 1k or so records negate whatever downside there is to keeping a transaction open for all 1.5 million records, or just add more overhead?The data to reload the table is coming from a Perl DBI connection to a different database (not PostgreSQL) so I'm not sure the COPY alternative applies here.Any suggestions are welcome.
Mike
Steve Atkins wrote:
> Postgresql isn't going to run out of resources doing a big transaction, in the way some other databases will.
I thought I had read something at one point about keeping the transaction size on the order of a couple thousand because there were issues when it got larger. As that apparently is not an issue I went ahead and tried the DELETE and COPY in a transaction. The load time is quite reasonable this way.
Thanks!
Mike
On Tue, Nov 27, 2012 at 10:08 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote: > > > Postgresql isn't going to run out of resources doing a big transaction, in the way some other databases will. > > I thought I had read something at one point about keeping the transaction size on the order of a couple thousand becausethere were issues when it got larger. As that apparently is not an issue I went ahead and tried the DELETE and COPYin a transaction. The load time is quite reasonable this way. Updates, are faster if batched, if your business logic allows it, because it creates less bloat and creates more opportunities for with HOT updates. I don't think it applies to inserts, though, and I haven't heard it either. In any case, if your business logic doesn't allow it (and your case seems to suggest it), there's no point in worrying.
On Tue, Nov 27, 2012 at 6:26 PM, Steve Atkins <steve@blighty.com> wrote:
I do this as well - insert a few million rows into a table using the DBI::Pg copy interface. It works well.
I'd probably just do the whole thing in one transaction.
On Nov 27, 2012, at 2:04 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> I need to delete about 1.5 million records from a table and reload it in one transaction. The usual advice when loading with inserts seems to be group them into transactions of around 1k records. Committing at that point would leave the table in an inconsistent state.
Do you have specific reasons you want to avoid a long transaction, or just relying on rules of thumb? Postgresql isn't going to run out of resources doing a big transaction, in the way some other databases will.
Long running transactions will interfere with vacuuming, but inserting a couple of million rows shouldn't take that long.Savepoints are going to increase overhead and have no effect on the length of the transaction. If you want to catch errors and not have to redo the entire transaction, they're great, but that's about it.
> Would issuing a savepoint every 1k or so records negate whatever downside there is to keeping a transaction open for all 1.5 million records, or just add more overhead?COPY works nicely from perl:
> The data to reload the table is coming from a Perl DBI connection to a different database (not PostgreSQL) so I'm not sure the COPY alternative applies here.
$dbh->do("COPY foo FROM STDIN");
$dbh->pg_putcopydata("foo\tbar\tbaz\n");
$dbh->pg_putcopyend();
The details are in DBD::Pg. I use this a lot for doing big-ish (tens of millions of rows) bulk inserts. It's not as fast as you can get, but it's probably as fast as you can get with perl.
Cheers,
Steve
I ended up batching the copies so that each COPY statement only does a few hundred thousand at a time, but it's all one transaction.
The batching was necessary because of an idiosyncrasy of COPY in Pg 8.1: each COPY statement's contents was buffered in a malloc'd space, and if there were several million rows buffered up, the allocated virtual memory could get quite large - as in several GB. It plus the buffer pool sometimes exceeded the amount of RAM I had available at that time (several years ago), with bad effects on performance.
This may have been fixed since then, or maybe RAM's gotten big enough that it's not a problem.
Dan Franklin
Savepoint are not created for performance. If you have one very long running transactions that fails in the end, it will all be rolled back. So be pretty sure about your dataquality or use safepoints.
On Tue, Nov 27, 2012 at 7:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Updates, are faster if batched, if your business logic allows it,because it creates less bloat and creates more opportunities for with
HOT updates. I don't think it applies to inserts, though, and I
haven't heard it either.
Ah. That must have been what I'd half-remembered. Thanks for the clarification.
Mike
Commitmarks are written to disk after each transaction. So transactionsize has impact on performance.
Date: Wed, 28 Nov 2012 09:18:20 -0600
Subject: Re: [PERFORM] Savepoints in transactions for speed?
From: mike.blackwell@rrd.com
To: klaussfreire@gmail.com
CC: pgsql-performance@postgresql.org
On Tue, Nov 27, 2012 at 7:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Updates, are faster if batched, if your business logic allows it,because it creates less bloat and creates more opportunities for with
HOT updates. I don't think it applies to inserts, though, and I
haven't heard it either.
Ah. That must have been what I'd half-remembered. Thanks for the clarification.
Mike
On Tue, 2012-11-27 at 22:16 -0300, Claudio Freire wrote: > Updates, are faster if batched, if your business logic allows it, > because it creates less bloat and creates more opportunities for with > HOT updates. I don't think it applies to inserts, though, and I > haven't heard it either. Huge updates (e.g. UPDATE with no WHERE clause) are less likely to benefit from HOT. HOT has two main optimizations: 1. Remove dead tuples faster without waiting for VACUUM -- this only works if the transaction that updated/deleted the tuple actually finished (otherwise the tuple can't be removed yet), so it only benefits the *next* update to come along. But if it's one big update, then VACUUM is probably just as good at cleaning up the space. 2. Doesn't make new index entries for the new tuple; reuses the old index entries -- this only works if the update is on the same page, but large updates tend to fill pages up (because of the buildup of dead tuples) and force new to go to new pages. HOT is essentially designed for lots of small updates, which didn't perform well before PG 8.3. Batching of inserts/updates/deletes has a big benefit over separate transactions, but only up to a point, after which it levels off. I'm not sure exactly when that point is, but after that, the downsides of keeping a transaction open (like inability to remove the previous version of an updated tuple) take over. Regards, Jeff Davis
On Tue, 2012-11-27 at 16:04 -0600, Mike Blackwell wrote: > I need to delete about 1.5 million records from a table and reload it > in one transaction. The usual advice when loading with inserts seems > to be group them into transactions of around 1k records. Committing > at that point would leave the table in an inconsistent state. Would > issuing a savepoint every 1k or so records negate whatever downside > there is to keeping a transaction open for all 1.5 million records, or > just add more overhead? A large transaction isn't really a big problem for postgres, and 1.5M records should be processed quickly anyway. The main problem with a long-running delete or update transaction is that the dead tuples (deleted tuples or the old version of an updated tuple) can't be removed until the transaction finishes. That can cause temporary "bloat", but 1.5M records shouldn't be noticeable. Adding subtransactions into the mix won't help, but probably won't hurt, either. The transaction will still run just as long, and you still can't delete the tuples ahead of time (unless you abort a subtransaction). If you *do* use subtransactions, make sure to release them as quickly as you create them (don't just use ROLLBACK TO, that still leaves the savepoint there); having 1500 open subtransactions might cause performance problems elsewhere. Regards, Jeff Davis
On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > The main problem with a long-running delete or update transaction is > that the dead tuples (deleted tuples or the old version of an updated > tuple) can't be removed until the transaction finishes. That can cause > temporary "bloat", but 1.5M records shouldn't be noticeable. Not really that fast if you have indices (and who doesn't have a PK or two). I've never been able to update (update) 2M rows in one transaction in reasonable times (read: less than several hours) without dropping indices. Doing it in batches is way faster if you can't drop the indices, and if you can leverage HOT updates.
On Wed, Nov 28, 2012 at 9:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis <pgsql@j-davis.com> wrote:Not really that fast if you have indices (and who doesn't have a PK or two).
>
> The main problem with a long-running delete or update transaction is
> that the dead tuples (deleted tuples or the old version of an updated
> tuple) can't be removed until the transaction finishes. That can cause
> temporary "bloat", but 1.5M records shouldn't be noticeable.
I've never been able to update (update) 2M rows in one transaction in
reasonable times (read: less than several hours) without dropping
indices. Doing it in batches is way faster if you can't drop the
indices, and if you can leverage HOT updates.
What I'm trying at this point is:
BEGIN;
DROP INDEX -- only one unique index exists
DELETE FROM table;
COPY table FROM STDIN;
COMMIT;
CREATE INDEX CONCURRENTLY;
Do I understand correctly that DROP/CREATE index are not transactional, and thus the index will disappear immediately for other transactions? Am I better off in that case moving the DROP INDEX outside the transaction?
The access pattern for the table is such that I can afford the occasional stray hit without an index during the reload time. It's been pretty quick using the above.
Mike
On Thu, Nov 29, 2012 at 9:38 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote: > On Wed, Nov 28, 2012 at 9:48 PM, Claudio Freire <klaussfreire@gmail.com> > wrote: >> >> On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> > >> > The main problem with a long-running delete or update transaction is >> > that the dead tuples (deleted tuples or the old version of an updated >> > tuple) can't be removed until the transaction finishes. That can cause >> > temporary "bloat", but 1.5M records shouldn't be noticeable. >> >> Not really that fast if you have indices (and who doesn't have a PK or >> two). >> >> I've never been able to update (update) 2M rows in one transaction in >> reasonable times (read: less than several hours) without dropping >> indices. Doing it in batches is way faster if you can't drop the >> indices, and if you can leverage HOT updates. > > > What I'm trying at this point is: > > BEGIN; > DROP INDEX -- only one unique index exists > DELETE FROM table; > COPY table FROM STDIN; > COMMIT; > CREATE INDEX CONCURRENTLY; > > Do I understand correctly that DROP/CREATE index are not transactional, and > thus the index will disappear immediately for other transactions? Am I > better off in that case moving the DROP INDEX outside the transaction? > > The access pattern for the table is such that I can afford the occasional > stray hit without an index during the reload time. It's been pretty quick > using the above. Drop / create index ARE transactional, like most other things in postgresql (only drop / create database and drop / create tablespace are non-transactional). Your current sequence will result in the table you are dropping the index on being locked for other transactions until commit or rollback. Run two psql sessions and test it to see.
Ah. So it does. Testing with two psql sessions locks as you said, and moving the DROP INDEX to a separate transaction give the results I was looking for.
Thanks,
Mike
Mike
__________________________________________________________________________________
Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com
On Thu, Nov 29, 2012 at 10:54 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Drop / create index ARE transactional, like most other things inOn Thu, Nov 29, 2012 at 9:38 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> On Wed, Nov 28, 2012 at 9:48 PM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> On Wed, Nov 28, 2012 at 8:28 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> >
>> > The main problem with a long-running delete or update transaction is
>> > that the dead tuples (deleted tuples or the old version of an updated
>> > tuple) can't be removed until the transaction finishes. That can cause
>> > temporary "bloat", but 1.5M records shouldn't be noticeable.
>>
>> Not really that fast if you have indices (and who doesn't have a PK or
>> two).
>>
>> I've never been able to update (update) 2M rows in one transaction in
>> reasonable times (read: less than several hours) without dropping
>> indices. Doing it in batches is way faster if you can't drop the
>> indices, and if you can leverage HOT updates.
>
>
> What I'm trying at this point is:
>
> BEGIN;
> DROP INDEX -- only one unique index exists
> DELETE FROM table;
> COPY table FROM STDIN;
> COMMIT;
> CREATE INDEX CONCURRENTLY;
>
> Do I understand correctly that DROP/CREATE index are not transactional, and
> thus the index will disappear immediately for other transactions? Am I
> better off in that case moving the DROP INDEX outside the transaction?
>
> The access pattern for the table is such that I can afford the occasional
> stray hit without an index during the reload time. It's been pretty quick
> using the above.
postgresql (only drop / create database and drop / create tablespace
are non-transactional). Your current sequence will result in the
table you are dropping the index on being locked for other
transactions until commit or rollback. Run two psql sessions and test
it to see.
On Thu, Nov 29, 2012 at 8:38 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote: > > What I'm trying at this point is: > > BEGIN; > DROP INDEX -- only one unique index exists > DELETE FROM table; > COPY table FROM STDIN; > COMMIT; > CREATE INDEX CONCURRENTLY; > > Do I understand correctly that DROP/CREATE index are not transactional, and > thus the index will disappear immediately for other transactions? The DROP is transactional. But the way it works here is that the index is access exclusively locked when the DROP is encountered (and so is the table) so any other transaction will block on it, even though the index is still there. (Transactionality does not inherently demand this behavior, it is just the way PG implements it. For example, it could take a weaker lock at the time DROP is encountered and then escalate it to exclusive only during the commit processing. But that would greatly expand the risk of deadlock, and would certainly be more complicated to code.) > Am I > better off in that case moving the DROP INDEX outside the transaction? > > The access pattern for the table is such that I can afford the occasional > stray hit without an index during the reload time. If you don't mind queries doing doing full table scans, and not having the benefit of the unique constraint, for that period, then yes you should move the drop index into a separate transaction. But If you do keep the drop index inside the transaction, then you would probably be better off using truncate rather than delete, and rebuild the index non-concurrently and move that inside the transaction as well. Cheers, Jeff
On Thu, Nov 29, 2012 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Hmm.... From the 9.2 manual it seems that might not work out so well:
But If you do keep the drop index inside the transaction, then you
would probably be better off using truncate rather than delete, and
rebuild the index non-concurrently and move that inside the
transaction as well.
TRUNCATE is not MVCC-safe (see Chapter 13 for general information about MVCC). After truncation, the table will appear empty to all concurrent transactions, even if they are using a snapshot taken before the truncation occurred.
It looks like other transactions could find an empty table while it was being reloaded under that approach.
On Thu, 2012-11-29 at 00:48 -0300, Claudio Freire wrote: > Not really that fast if you have indices (and who doesn't have a PK or two). > > I've never been able to update (update) 2M rows in one transaction in > reasonable times (read: less than several hours) without dropping > indices. Doing it in batches is way faster if you can't drop the > indices, and if you can leverage HOT updates. I tried a quick test with 2M tuples and 3 indexes over int8, numeric, and text (generated data). There was also an unindexed bytea column. Using my laptop, a full update of the int8 column (which is indexed, forcing cold updates) took less than 4 minutes. I'm sure there are other issues with real-world workloads, and I know that it's wasteful compared to something that can make use of HOT updates. But unless there is something I'm missing, it's not really worth the effort to batch if that is the size of the update. Regards, Jeff Davis
On Thu, Nov 29, 2012 at 3:32 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2012-11-29 at 00:48 -0300, Claudio Freire wrote: >> Not really that fast if you have indices (and who doesn't have a PK or two). >> >> I've never been able to update (update) 2M rows in one transaction in >> reasonable times (read: less than several hours) without dropping >> indices. Doing it in batches is way faster if you can't drop the >> indices, and if you can leverage HOT updates. > > I tried a quick test with 2M tuples and 3 indexes over int8, numeric, > and text (generated data). There was also an unindexed bytea column. > Using my laptop, a full update of the int8 column (which is indexed, > forcing cold updates) took less than 4 minutes. > > I'm sure there are other issues with real-world workloads, and I know > that it's wasteful compared to something that can make use of HOT > updates. But unless there is something I'm missing, it's not really > worth the effort to batch if that is the size of the update. On a pre-production database I have (that is currently idle), on a server with 4G RAM and a single SATA disk (probably similar to your laptop in that sense more or less, possibly more TPS since the HD rpm is 7k and your laptop probably is 5k), it's been running for half an hour and is still running (and I don't expect it to finish today if past experience is to be believed). The database sees somewhat real test workloads from time to time, so it's probably a good example of a live database (sans the concurrent load). The table is probably a lot wider than yours, having many columns, some text typed, and many indices too. Updating one indexed int4 like so: begin; update users set country_id = 1 where id < (328973625/2); rollback; (the where condition returns about 2M rows out of a ~5M total) There is quite a few foreign key constraints that I expect are interfering as well. I could try dropping them, just not on this database. The schema: CREATE TABLE users ( id integer NOT NULL, about_me character varying(500), birth_date timestamp without time zone, confirmed boolean, creation_date timestamp without time zone, email character varying(255), first_name character varying(255), image_link character varying(255), is_native_location boolean, is_panelist boolean, last_name character varying(255), privacy bigint NOT NULL, sex integer, username character varying(255), city_id integer, country_id integer, state_id integer, last_access_to_inbox timestamp without time zone, profile_background_color character varying(255), profile_background_image_link character varying(255), url character varying(255), notifications bigint, last_activity_date timestamp without time zone, site_country_id integer, show_welcome_message boolean NOT NULL, invited boolean, partner_id integer, panelist_update bigint, unregistered boolean DEFAULT false, import_state integer, show_alerts_since timestamp without time zone, super_user boolean DEFAULT false, survey_id integer, site_id smallint NOT NULL, panelist_percentage smallint NOT NULL DEFAULT 0, reason integer, unregistered_date timestamp without time zone, is_panelist_update_date timestamp without time zone, confirmation_update_date timestamp without time zone, no_panelist_reason integer, facebook_connect_status smallint, CONSTRAINT user_pkey PRIMARY KEY (id ), CONSTRAINT fk36ebcb26f1a196 FOREIGN KEY (site_country_id) REFERENCES countries (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION, CONSTRAINT fk_city_id FOREIGN KEY (city_id) REFERENCES cities (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION, CONSTRAINT fk_country_id FOREIGN KEY (country_id) REFERENCES countries (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION, CONSTRAINT fk_state_id FOREIGN KEY (state_id) REFERENCES states (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION, CONSTRAINT fk_survey_id FOREIGN KEY (survey_id) REFERENCES surveys (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION, CONSTRAINT users_id_check CHECK (id >= 0) ) Indices on: (creation_date ); (panelist_update ) WHERE panelist_update IS NOT NULL; (birth_date ); (email COLLATE pg_catalog."default" ); (lower(email::text) COLLATE pg_catalog."default" ); (partner_id , creation_date ); (site_id , country_id , city_id ); (site_id , country_id , state_id ); (username COLLATE pg_catalog."default" );
On Thu, Nov 29, 2012 at 10:14 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote: > > > > On Thu, Nov 29, 2012 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> >> But If you do keep the drop index inside the transaction, then you >> would probably be better off using truncate rather than delete, and >> rebuild the index non-concurrently and move that inside the >> transaction as well. >> > > > Hmm.... From the 9.2 manual it seems that might not work out so well: > > TRUNCATE is not MVCC-safe (see Chapter 13 for general information about > MVCC). After truncation, the table will appear empty to all concurrent > transactions, even if they are using a snapshot taken before the truncation > occurred. > > It looks like other transactions could find an empty table while it was > being reloaded under that approach. They would block during the load, it is just after the load that they would see the table as empty. I thought that that would only be a problem for repeatable read or higher, but a test shows that read committed has that problem as well. But yeah, that could definitely be a problem with that method. Cheers, Jeff
On Thu, Nov 29, 2012 at 11:58 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Nov 29, 2012 at 3:32 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> >> I tried a quick test with 2M tuples and 3 indexes over int8, numeric, >> and text (generated data). There was also an unindexed bytea column. >> Using my laptop, a full update of the int8 column (which is indexed, >> forcing cold updates) took less than 4 minutes. >> >> I'm sure there are other issues with real-world workloads, and I know >> that it's wasteful compared to something that can make use of HOT >> updates. But unless there is something I'm missing, it's not really >> worth the effort to batch if that is the size of the update. > > On a pre-production database I have (that is currently idle), on a > server with 4G RAM and a single SATA disk (probably similar to your > laptop in that sense more or less, possibly more TPS since the HD rpm > is 7k and your laptop probably is 5k), it's been running for half an > hour and is still running (and I don't expect it to finish today if > past experience is to be believed). So probably Jeff Davis's indexes fit in RAM (or the part that can be dirtied without causing thrashing), and yours do not. But, does batching them up help at all? I doubt it does. Cheers, Jeff