Thread: INSERT performance deteriorates quickly during a large import
INSERT performance deteriorates quickly during a large import
From
"Krasimir Hristozov \(InterMedia Ltd\)"
Date:
We need to import data from a relatively large MySQL database into an existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL and INSERTs in PostgreSQL. A part of the import involves moving about 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The problem is that the insert performance inevitably deteriorates as the number of inserts increases.

We tried different approaches:

* selecting only parts of the source table data based on a certain condition
* selecting all of the source data

coupled with either of these:

* inserting without explicit transactions
* inserting all the data in a single transaction
* inserting the data in partial transactions of about 100, 1000, 10000, 100000 inserts each

While there were performance benefits in some of the cases (selecting all the data and inserting in transaction packets of about 1000 each being the fastest), the problem was that it still deteriorated as the import progressed.

We tried removing all foreign keys and indices from the Postgres table; this too gained performance, but it deteriorated as well.

The latest (and best performing) test we did was under the following conditions:

* 11851 pre-existing records in the destination table
* the table was vacuumed just before the import
* all foreign keys and indices were removed from the destination table
* selected all of the data from the source table at once
* inserted in transactions of 1000 inserts each

We displayed time statistics on each 100 inserts. The process started at about 1 second per 100 inserts, which worked out to an estimate of about 4 hours for the entire process. 14 hours later it had imported about a quarter of the data (a bit more than 330000 records), and 100 inserts now took nearly 40 seconds.

We tested reading from MySQL alone, without inserting the data in Postgres. All records were read in about a minute and a half, so MySQL performance is not a part of the issue. The PHP script selects the MySQL data, fetches rows sequentially, occasionally performs a couple of selects against PostgreSQL data (which is cached in a PHP array to reduce the DB operations; no more than 80000 array elements, integer keys, integer data), and inserts into PostgreSQL. The algorithm seems to be linear in nature, and the performance deterioration most probably doesn't have to do with the PHP code.

Has anyone had an issue like this, and perhaps a suggestion for a possible cause and solution? Is it common for PostgreSQL to grow so slow as the amount of data in the tables increases? If so, is it just the insert operation or all kinds of queries? Isn't 300000 records too low a threshold for such performance deterioration?

Here are some technical details that might be helpful:

* PHP, MySQL and PostgreSQL all work on the same server, sharing the same memory and hard drive.
* the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor 3000+ (2GHz K8 class CPU) with 1GB RAM
* the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL 4.1.22, PostgreSQL 8.1.8
* postgresql.conf variables other than defaults are: max_connections = 40, shared_buffers = 1000 (this is the default)
* we have also tried these on another server with Red Hat Enterprise Linux ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2x Dual Core AMD Opteron(tm) Processor 270 (4x2GHz logical CPUs) with 2GB RAM
* both servers run in x86_64 mode, PostgreSQL footprint in memory stays relatively small, CPU usage maxes out on import, there is no resource starvation in any way

DDL statement for the creation of the PostgreSQL table in question:

CREATE TABLE "public"."sp_thread_replies" (
  "id" SERIAL,
  "thread_id" INTEGER NOT NULL,
  "body" TEXT NOT NULL,
  "ts_added" INTEGER DEFAULT 0 NOT NULL,
  "user_id" INTEGER NOT NULL,
  "thread_offset" INTEGER DEFAULT 0,
  "approved" SMALLINT DEFAULT 1,
  "title" TEXT,
  "deleted" SMALLINT DEFAULT 0,
  "edit_reason" VARCHAR(255),
  "edit_user_id" INTEGER,
  "edit_time" INTEGER,
  CONSTRAINT "sp_thread_replies_pkey" PRIMARY KEY ("id"),
  CONSTRAINT "sp_thread_replies_threads_fk" FOREIGN KEY ("thread_id")
    REFERENCES "public"."sp_threads"("id")
    ON DELETE CASCADE
    ON UPDATE NO ACTION
    NOT DEFERRABLE,
  CONSTRAINT "sp_thread_replies_users_fk" FOREIGN KEY ("user_id")
    REFERENCES "public"."sp_users"("id")
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
    NOT DEFERRABLE
) WITH OIDS;

The table is a part of a custom forum engine. It stores all thread posts. It's most often queried with SELECTs and INSERTs, less often with UPDATEs, and records are deleted quite seldom in normal operation of the application (though we may delete records manually from the console from time to time).
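For reference, a minimal sketch of the batched-transaction INSERT pattern described in the message above. The source query, column subset and connection parameters are hypothetical; the actual import script is not shown in this thread.

<?php
// Sketch of the "transactions of 1000 inserts" approach described above.
// Connection parameters, source query and column subset are hypothetical.
$my = mysql_connect('localhost', 'mysql_user', 'mysql_pass');
mysql_select_db('source_db', $my);
$pg = pg_connect('dbname=forum user=pg_user');

$src = mysql_query('SELECT id, thread_id, body, ts_added, user_id FROM old_posts', $my);

$n = 0;
pg_query($pg, 'BEGIN');
while ($row = mysql_fetch_assoc($src)) {
    pg_query_params($pg,
        'INSERT INTO sp_thread_replies (id, thread_id, body, ts_added, user_id)
         VALUES ($1, $2, $3, $4, $5)',
        array($row['id'], $row['thread_id'], $row['body'],
              $row['ts_added'], $row['user_id']));
    if (++$n % 1000 == 0) {        // commit every 1000 rows, then open a new transaction
        pg_query($pg, 'COMMIT');
        pg_query($pg, 'BEGIN');
    }
}
pg_query($pg, 'COMMIT');
// Note: inserting explicit ids into a SERIAL column means the sequence
// should be resynchronized afterwards with setval() (not shown here).
?>

Even with batching like this, every row still pays per-statement parse and network overhead, which is what the COPY suggestion below avoids.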
Try one of these:

a) don't use INSERT statements, use COPY instead

b) from time to time run ANALYZE on the table (say, one ANALYZE every 1000 inserts)

c) create the table without constraints (primary / foreign keys in this case), import all the data, and then create the constraints

The (b) and (c) may be combined, i.e. import without constraints and analyze from time to time. I'd probably try (a) first, anyway.

Try to gather some more statistics - is the problem related to CPU or I/O? Use 'dstat', for example - this might give you a hint in case the advice above doesn't help.

Tomas
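To make option (a) concrete, here is a rough sketch of streaming rows through COPY from the PHP script, using the pgsql extension's raw COPY interface. The connection strings, source query and column list are hypothetical, and NULL handling is omitted.

<?php
// Sketch of option (a): stream rows through COPY instead of per-row INSERTs.
// Connection strings, source query and column list are hypothetical.
$my = mysql_connect('localhost', 'mysql_user', 'mysql_pass');
mysql_select_db('source_db', $my);
$pg = pg_connect('dbname=forum user=pg_user');

$src = mysql_query('SELECT id, thread_id, body, ts_added, user_id FROM old_posts', $my);

pg_query($pg, 'COPY sp_thread_replies (id, thread_id, body, ts_added, user_id) FROM STDIN');
while ($r = mysql_fetch_row($src)) {
    // COPY text format is tab-separated, one row per line; backslashes,
    // tabs, newlines and carriage returns in the body column must be escaped.
    $r[2] = addcslashes($r[2], "\\\t\n\r");
    pg_put_line($pg, implode("\t", $r) . "\n");
}
pg_put_line($pg, "\\.\n");   // end-of-data marker for COPY
pg_end_copy($pg);
?>

The whole COPY runs as a single statement, so there is no per-row parse or transaction overhead; the trade-off is that one malformed row aborts the entire load.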
"Krasimir Hristozov \(InterMedia Ltd\)" <krasi@imedia-dev.com> writes: > We need to import data from a relatively large MySQL database into an > existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL > and INSERTs in PostgreSQL. A part of the import involves moving about > 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The > problem is that the insert performance inevitably deteriorates as the number > of inserts increases. Are you *certain* you've gotten rid of all the indexes and foreign keys? A simple insert ought to be pretty much constant-time in Postgres, so it seems to me that you've missed something. It also seems possible that you are wrong to disregard PHP as a possible source of the problem. Have you tried watching the PHP and PG backend processes with "top" (or similar tool) to see who's consuming CPU time and/or memory space? regards, tom lane
In response to "Krasimir Hristozov \(InterMedia Ltd\)" <krasi@imedia-dev.com>: > We need to import data from a relatively large MySQL database into an > existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL > and INSERTs in PostgreSQL. A part of the import involves moving about > 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The > problem is that the insert performance inevitably deteriorates as the number > of inserts increases. > > We tried different approaches: > > * selecting only parts of the source table data based on a certain > condition > * selecting all of the source data > > coupled with either of these: > > * inserting without explicit transactions > * inserting all the data in a single transaction > * inserting the data in partial transactions of about 100, 1000, 10000, > 100000 inserts each > > While there were performance benefits in some of the cases (selecting all > the data and inserting in transaction packets of about 1000 each being the > fastest), the problem was that it still deteriorated as the import > progressed. > > We tried removing all foreign keys and indices from the postgres table, > still gained performance, but it deteriorated as well. > > The latest (and best performing) test we did was under the following > conditions: > > * 11851 pre-existing records in the destination table > * the table was vacuumed just before the import > * all foreign keys and indices were removed from the destination table > * selected all of the data from the source table at once > * inserted in transactions of 1000 inserts each > > We displayed time statistics on each 100 inserts. The process started at > about 1 second per 100 inserts. This estimated to about 4 hours for the > entire process. 14 hours later it had imported about a quarter of the data > (a bit more than 330000 records), and 100 inserts now took nearly 40 > seconds. > > We tested reading from MySQL alone, without inserting the data in Postgres. > All records were read in about a minute and a half, so MySQL performance is > not a part of the issue. The PHP script selects the MySQL data, fetches rows > sequentially, occasionally performs a couple of selects against PostgreSQL > data (which is cached in a PHP array to reduce the DB operations; no more > than 80000 array elements, integer keys, integer data), and inserts into > PostgreSQL. The algorithm seems to be linear in nature and perfomance > deterioration most probably doesn't have to do with the PHP code. > > Has anyone had an issue like this, and perhaps a suggestion for a possible > cause and solution? Is it common for PostgreSQL to grow so slow as the > amount of data in the tables increases? If so, is it just the insert > operation or all kinds of queries? Isn't 300000 records too low a threshold > for such performance deterioration? > > Here are some technical details, that might be helpful: > > * PHP, MySQL and PostgreSQL all work on the same server, sharing the same > memory and hard drive. This makes it very difficult to blame PostgreSQL. If the insert process is CPU bound, and PHP is using a ton of CPU, then PG will be starved. You kinda contradict yourself, saying PG is not starved, then saying that the CPU is maxed out. In any event, having all three on one machine will make it more fun to isolate where the actual bottleneck is. > * the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor > 3000+ (2GHz K8 class CPU) with 1GB RAM This is another problem. 
5.3 has the worst performance of any version of FreeBSD I've ever used. Even downgrading to 4.11 (not recommended) would produce a performance improvement, but you you should get this system to 5.5 (at least) or 6.2 (preferable). > * the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL > 4.1.22, PostgreSQL 8.1.8 > * postgresql.conf variables other than defaults are: max_connections = 40, > shared_buffers = 1000 (this is the default) 1000 shared_buffers is pretty low for any real work. While your tables aren't huge, they're big enough to warrant more shared_buffers. However, you've only got 1G of RAM on this system to share between two DB servers, which is going to constrain you a good bit. There are other settings important to insert performance that you haven't mentioned. checkpoint_segments and the like, for example. You also don't describe your disk subsystem, it's entirely possible you've filled up the cache on the disk controllers (which is why it looked initially fast) and now are hitting up against the max speed the disks can do. With only 1G of RAM, it's possible that MySQL is reading, PostgreSQL is writing, and PHP is swapping. It doesn't take a lot of disk contention to flush performance down the toilet. See what iostat says. Even better, use top in "m" mode (hit m after top start) to see how much IO each process is using (I believe that was added in 5.X, but it may only be available in 6.X versions of FreeBSD's top) > * we have also tried these on another server with Red Hat Enterprise Linux > ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2xDual Core AMD Opteron(tm) > Processor 270 (4x2GHz logical CPUs) with 2GB RAM While 2G is better, that's still not a lot of RAM if you're trying to run 2 DB servers and a web server on a single system. If you haven't tuned PG to take advantage of it, then it won't help much anyway. Additionally, you've neglected to mention the disk subsystem on this machine as well. Is it running cheapo SATA drives because the price/gig is right? > * both servers run in x86_64 mode, PostgreSQL footprint in memory stays > relatively small, Of course it does, because you've told it not to use more than 8M of RAM. > CPU usage maxes out on import, there is no resource > starvation in any way You do realize that you're contradicting yourself here, right? The advice provided by others is good as well, so I won't repeat it. -- Bill Moran http://www.potentialtech.com
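As a rough illustration of the kind of tuning being discussed here, the values below are hypothetical starting points for a 1-2 GB machine, not settings recommended anywhere in this thread. If memory serves, PostgreSQL 8.1 takes these as raw numbers (buffer pages / kB) rather than memory-unit strings.

# postgresql.conf - illustrative values only
shared_buffers = 16384          # 8 kB pages, roughly 128 MB instead of the ~8 MB default
checkpoint_segments = 16        # fewer, larger checkpoints during the bulk load
maintenance_work_mem = 131072   # in kB, roughly 128 MB; speeds up index/constraint rebuild after the load

After the import these can be dialed back, or left in place if the machine has memory to spare.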
tv@fuzzy.cz (Tomáš Vondra) writes:
> Try one of these:
>
> a) don't use INSERT statements, use COPY instead
>
> b) from time to time run ANALYZE on the table (say, one ANALYZE every
> 1000 inserts)
>
> c) create the table without constraints (primary / foreign keys in this
> case), import all the data, and then create the constraints
>
> The (b) and (c) may be combined, i.e. import without constraints and
> analyze from time to time. I'd probably try (a) first, anyway.
>
> Try to gather some more statistics - is the problem related to CPU or
> I/O? Use 'dstat', for example - this might give you a hint in case the
> advice above doesn't help.

I agree with all but b).

- If you use COPY, that copies data "in bulk", which is *way* faster than
  submitting individual INSERT statements that must each be parsed. So I
  certainly agree with a).

- There are two prime reasons to expect the data load to slow down:

  1. Because adding entries to the index gets more expensive the larger the
     table gets;

  2. Because searching through foreign key constraints tends to get more
     expensive as the target table grows.

  Those point to doing c).

If you put off evaluating indices and foreign key constraints until all of
the data is loaded, there should be no need to run ANALYZE during the COPY
process. And there should be no reason for loading data to get more costly
as the size of the table increases.
--
let name="cbbrowne" and tld="linuxfinances.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/advocacy.html
Rules of the Evil Overlord #116. "If I capture the hero's starship, I will
keep it in the landing bay with the ramp down, only a few token guards on
duty and a ton of explosives set to go off as soon as it clears the
blast-range." <http://www.eviloverlord.com/>
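To make option (c) concrete, a sketch of dropping and re-creating the constraints around the load. The constraint and table names come from the CREATE TABLE statement quoted earlier in the thread; the connection string is hypothetical, and any extra indexes would be handled the same way.

<?php
// Sketch of option (c): drop constraints, bulk-load, then re-create them.
// Constraint and table names are taken from the DDL above; the connection
// string is hypothetical.
$pg = pg_connect('dbname=forum user=pg_user');

pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_threads_fk');
pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_users_fk');
pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_pkey');

// ... run the COPY-based load here (see the earlier sketch) ...

// Re-create the primary key and foreign keys; each is checked once over
// the whole table instead of once per inserted row.
pg_query($pg, 'ALTER TABLE sp_thread_replies
                 ADD CONSTRAINT sp_thread_replies_pkey PRIMARY KEY (id)');
pg_query($pg, 'ALTER TABLE sp_thread_replies
                 ADD CONSTRAINT sp_thread_replies_threads_fk FOREIGN KEY (thread_id)
                 REFERENCES sp_threads (id) ON DELETE CASCADE');
pg_query($pg, 'ALTER TABLE sp_thread_replies
                 ADD CONSTRAINT sp_thread_replies_users_fk FOREIGN KEY (user_id)
                 REFERENCES sp_users (id)');

// Refresh planner statistics once, after the load is complete.
pg_query($pg, 'ANALYZE sp_thread_replies');
?>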
Re: INSERT performance deteriorates quickly during a large import
From
"Márcio Geovani Jasinski"
Date:
Hello Krasimir,
You got a lot of good advice above, and I would like to add another one:
d) Make sure your PHP code is not recursive. As you said, the memory is stable, so I think your method is iterative.
A recursive method would certainly add a little time to each insert while using more memory.
But an iterative method must be written correctly so that it is called just once; maybe your code is running much more often than it needs to.
Pay attention to Tomas's advice, and after that (I agree with Chris) "there should be no reason for loading data to get more costly as
the size of the table increases" - please check your code.
I did some experiments a long time ago with 40000 records containing a lot of BLOBs, using PHP code to SELECT/INSERT from Postgres to Postgres. The time wasn't constant, but it wasn't as bad as in your case. (And I didn't follow Tomas's a, b and c advice.)
Good Luck
--
Márcio Geovani Jasinski
Re: INSERT performance deteriorates quickly during a large import
From
"Krasimir Hristozov \(InterMedia Ltd\)"
Date:
Thanks to all who responded. Using COPY instead of INSERT really solved the problem - the whole process took about 1h 20min on an indexed table, with constraints (which is close to our initial expectations). We're performing some additional tests now. I'll post some more observations when finished.
>> Try one of these:
>>
>> a) don't use INSERT statements, use COPY instead
>>
>> b) from time to time run ANALYZE on the table (say, one ANALYZE every
>> 1000 inserts)
>>
>> c) create the table without constraints (primary / foreign keys in this
>> case), import all the data, and then create the constraints
>>
>> The (b) and (c) may be combined, i.e. import without constraints and
>> analyze from time to time. I'd probably try (a) first, anyway.
>> [...]
>
> I agree with all but b).

You're right - this combination (no constraints or indices and using ANALYZE at the same time) won't help. The original list of advice was a little bit different, but I've changed it and haven't checked the following paragraphs ...

Tomas
Re: INSERT performance deteriorates quickly during a large import
From
Julio Cesar Sánchez González
Date:
On Thu, 2007-11-08 at 13:01 -0500, Tom Lane wrote:
> "Krasimir Hristozov (InterMedia Ltd)" <krasi@imedia-dev.com> writes:
> > We need to import data from a relatively large MySQL database into an
> > existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
> > and INSERTs in PostgreSQL. [...]
>
> Are you *certain* you've gotten rid of all the indexes and foreign keys?
> A simple insert ought to be pretty much constant-time in Postgres, so it
> seems to me that you've missed something.
>
> It also seems possible that you are wrong to disregard PHP as a possible
> source of the problem. Have you tried watching the PHP and PG backend
> processes with "top" (or similar tool) to see who's consuming CPU time
> and/or memory space?
>
> regards, tom lane

Hi all,

Why not try Perl DBI instead? It may work better.

--
Regards,
Julio Cesar Sánchez González.

--
Now I am become Death, the destroyer of worlds. I am the Death that takes
everything away, the source of the things to come.

www.sistemasyconectividad.com.mx
http://darkavngr.blogspot.com/