INSERT performance deteriorates quickly during a large import

From
"Krasimir Hristozov \(InterMedia Ltd\)"
Date:
We need to import data from a relatively large MySQL database into an
existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
and INSERTs in PostgreSQL. A part of the import involves moving about
1,300,000 records from one MySQL table to one of our PostgreSQL tables. The
problem is that the insert performance inevitably deteriorates as the number
of inserts increases.

We tried different approaches:

 * selecting only parts of the source table data based on a certain
condition
 * selecting all of the source data

coupled with either of these:

 * inserting without explicit transactions
 * inserting all the data in a single transaction
 * inserting the data in partial transactions of about 100, 1000, 10000,
100000 inserts each

While there were performance benefits in some of the cases (selecting all
the data and inserting in transaction batches of about 1000 each was the
fastest), the problem was that performance still deteriorated as the import
progressed.
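
For reference, the batched variant looks roughly like this (a simplified
sketch, not our exact script: $pg stands for an already-open pgsql
connection, $rows for the data pulled from MySQL, and the column list is
trimmed to a few of the columns from the DDL further below):

  <?php
  // Simplified sketch of the batched-INSERT approach described above.
  $batchSize = 1000;
  $count     = 0;

  pg_query($pg, 'BEGIN');
  foreach ($rows as $row) {
      pg_query($pg, sprintf(
          "INSERT INTO sp_thread_replies (thread_id, body, ts_added, user_id) " .
          "VALUES (%d, '%s', %d, %d)",
          (int)$row['thread_id'],
          pg_escape_string($pg, $row['body']),
          (int)$row['ts_added'],
          (int)$row['user_id']
      ));

      if (++$count % $batchSize === 0) {
          pg_query($pg, 'COMMIT');   // close the current batch of 1000
          pg_query($pg, 'BEGIN');    // and start the next one
      }
  }
  pg_query($pg, 'COMMIT');
  ?>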

We also tried removing all foreign keys and indices from the PostgreSQL
table; that gained some performance too, but it still deteriorated over time.

The latest (and best performing) test we did was under the following
conditions:

 * 11851 pre-existing records in the destination table
 * the table was vacuumed just before the import
 * all foreign keys and indices were removed from the destination table
 * selected all of the data from the source table at once
 * inserted in transactions of 1000 inserts each

We logged timing statistics for every 100 inserts. The process started at
about 1 second per 100 inserts, which extrapolated to about 4 hours for the
entire process. 14 hours later it had imported about a quarter of the data
(a bit more than 330000 records), and 100 inserts now took nearly 40
seconds.

We tested reading from MySQL alone, without inserting the data in Postgres.
All records were read in about a minute and a half, so MySQL performance is
not a part of the issue. The PHP script selects the MySQL data, fetches rows
sequentially, occasionally performs a couple of selects against PostgreSQL
data (which is cached in a PHP array to reduce the DB operations; no more
than 80000 array elements, integer keys, integer data), and inserts into
PostgreSQL. The algorithm is linear in nature, so the performance
deterioration most probably is not caused by the PHP code.
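
The lookup cache itself is nothing elaborate - roughly the following
(again a sketch: the real queries differ, and the legacy_id column is made
up here just to illustrate mapping an old MySQL id to a PostgreSQL id):

  <?php
  // Sketch of the PHP-array lookup cache mentioned above. Each key hits
  // PostgreSQL at most once and is then served from a plain array
  // (integer keys, integer values, well under 80000 entries).
  function lookup_user_id($pg, $oldUserId)
  {
      static $cache = array();

      if (!isset($cache[$oldUserId])) {
          $res = pg_query_params(
              $pg,
              'SELECT id FROM sp_users WHERE legacy_id = $1',  // legacy_id is illustrative
              array((int)$oldUserId)
          );
          $row = pg_fetch_row($res);
          $cache[$oldUserId] = $row ? (int)$row[0] : 0;
      }

      return $cache[$oldUserId];
  }
  ?>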

Has anyone had an issue like this, and perhaps a suggestion for a possible
cause and solution? Is it common for PostgreSQL to slow down this much as the
amount of data in the tables increases? If so, is it just the insert
operation or all kinds of queries? Isn't 300000 records too low a threshold
for such performance deterioration?

Here are some technical details that might be helpful:

 * PHP, MySQL and PostgreSQL all work on the same server, sharing the same
memory and hard drive.
 * the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor
3000+ (2GHz K8 class CPU) with 1GB RAM
 * the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL
4.1.22, PostgreSQL 8.1.8
 * postgresql.conf variables other than defaults are: max_connections = 40,
shared_buffers = 1000 (this is the default)
 * we have also tried these on another server with Red Hat Enterprise Linux
ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2xDual Core AMD Opteron(tm)
Processor 270 (4x2GHz logical CPUs) with 2GB RAM
 * both servers run in x86_64 mode, PostgreSQL footprint in memory stays
relatively small, CPU usage maxes out on import, there is no resource
starvation in any way

DDL statement for the creation of the PostgreSQL table in question:

 CREATE TABLE "public"."sp_thread_replies" (
   "id" SERIAL,
   "thread_id" INTEGER NOT NULL,
   "body" TEXT NOT NULL,
   "ts_added" INTEGER DEFAULT 0 NOT NULL,
   "user_id" INTEGER NOT NULL,
   "thread_offset" INTEGER DEFAULT 0,
   "approved" SMALLINT DEFAULT 1,
   "title" TEXT,
   "deleted" SMALLINT DEFAULT 0,
   "edit_reason" VARCHAR(255),
   "edit_user_id" INTEGER,
   "edit_time" INTEGER,
   CONSTRAINT "sp_thread_replies_pkey" PRIMARY KEY("id"),
   CONSTRAINT "sp_thread_replies_threads_fk" FOREIGN KEY ("thread_id")
     REFERENCES "public"."sp_threads"("id")
     ON DELETE CASCADE
     ON UPDATE NO ACTION
     NOT DEFERRABLE,
   CONSTRAINT "sp_thread_replies_users_fk" FOREIGN KEY ("user_id")
     REFERENCES "public"."sp_users"("id")
     ON DELETE NO ACTION
     ON UPDATE NO ACTION
     NOT DEFERRABLE
 ) WITH OIDS;

The table is a part of a custom forum engine. It stores all thread posts.
It's most often queried with SELECTs and INSERTs, less often with UPDATEs,
and records are deleted quite seldom in normal operation of the application
(though we may delete records manually from the console from time to time).


Re: INSERT performance deteriorates quickly during a large import

From
Tomáš Vondra
Date:
Try one of these:

a) don't use INSERT statements, use a COPY instead

b) from time to time, run ANALYZE on the table (say, one ANALYZE per
    1000 inserts)

c) create the table without constraints (primary / foreign keys in this
    case), import all the data, and then create the constraints

Options (b) and (c) may be combined, i.e. import without constraints and
analyze from time to time. I'd probably try (a) first, anyway.
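
From PHP you can stream the COPY data over the existing connection,
without an intermediate file - roughly like this (an untested sketch: the
column list is shortened, $pg is the open connection, $rows the data read
from MySQL, and the escaping only covers the characters special to COPY's
text format):

  <?php
  // Rough sketch of (a): one COPY instead of per-row INSERTs, fed through
  // pg_put_line()/pg_end_copy() from the pgsql extension.
  function copy_escape($v)
  {
      // escape backslash, tab, newline and CR (backslash must come first)
      return str_replace(
          array("\\",   "\t",  "\n",  "\r"),
          array("\\\\", "\\t", "\\n", "\\r"),
          $v
      );
  }

  pg_query($pg, 'COPY sp_thread_replies (thread_id, body, ts_added, user_id) FROM STDIN');

  foreach ($rows as $row) {
      pg_put_line($pg, sprintf("%d\t%s\t%d\t%d\n",
          (int)$row['thread_id'],
          copy_escape($row['body']),
          (int)$row['ts_added'],
          (int)$row['user_id']
      ));
  }

  pg_put_line($pg, "\\.\n");   // end-of-data marker
  pg_end_copy($pg);
  ?>

A single COPY (or one per chunk of a few hundred thousand rows) avoids
parsing and planning 1.3 million separate statements.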

Try to gather some more statistics - is the problem related to CPU or
I/O? Use 'dstat', for example - this might give you a hint in case
the advice above doesn't help.

Tomas


Re: INSERT performance deteriorates quickly during a large import

From
Tom Lane
Date:
"Krasimir Hristozov \(InterMedia Ltd\)" <krasi@imedia-dev.com> writes:
> We need to import data from a relatively large MySQL database into an
> existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
> and INSERTs in PostgreSQL. A part of the import involves moving about
> 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The
> problem is that the insert performance inevitably deteriorates as the number
> of inserts increases.

Are you *certain* you've gotten rid of all the indexes and foreign keys?
A simple insert ought to be pretty much constant-time in Postgres, so it
seems to me that you've missed something.
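
For instance, a couple of catalog queries (rough sketch, $pg being the
connection) would show anything still attached to the table:

  <?php
  // List whatever indexes and constraints are still on the destination table.
  $checks = array(
      "SELECT indexname FROM pg_indexes
        WHERE schemaname = 'public' AND tablename = 'sp_thread_replies'",
      "SELECT conname, contype FROM pg_constraint
        WHERE conrelid = 'public.sp_thread_replies'::regclass",
  );

  foreach ($checks as $sql) {
      $res = pg_query($pg, $sql);
      while ($row = pg_fetch_assoc($res)) {
          print_r($row);
      }
  }
  ?>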

It also seems possible that you are wrong to disregard PHP as a possible
source of the problem.  Have you tried watching the PHP and PG backend
processes with "top" (or similar tool) to see who's consuming CPU time
and/or memory space?

            regards, tom lane

Re: INSERT performance deteriorates quickly during a large import

From
Bill Moran
Date:
In response to "Krasimir Hristozov (InterMedia Ltd)" <krasi@imedia-dev.com>:

> We need to import data from a relatively large MySQL database into an
> existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
> and INSERTs in PostgreSQL. A part of the import involves moving about
> 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The
> problem is that the insert performance inevitably deteriorates as the number
> of inserts increases.
>
> [...]
>
> Here are some technical details that might be helpful:
>
>  * PHP, MySQL and PostgreSQL all work on the same server, sharing the same
> memory and hard drive.

This makes it very difficult to blame PostgreSQL.  If the insert process
is CPU bound, and PHP is using a ton of CPU, then PG will be starved.
You kinda contradict yourself, saying PG is not starved, then saying
that the CPU is maxed out.  In any event, having all three on one machine
will make it more fun to isolate where the actual bottleneck is.

>  * the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor
> 3000+ (2GHz K8 class CPU) with 1GB RAM

This is another problem.  5.3 has the worst performance of any version
of FreeBSD I've ever used.  Even downgrading to 4.11 (not recommended)
would produce a performance improvement, but you should get this
system to 5.5 (at least) or 6.2 (preferably).

>  * the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL
> 4.1.22, PostgreSQL 8.1.8
>  * postgresql.conf variables other than defaults are: max_connections = 40,
> shared_buffers = 1000 (this is the default)

1000 shared_buffers is pretty low for any real work.  While your tables
aren't huge, they're big enough to warrant more shared_buffers.  However,
you've only got 1G of RAM on this system to share between two DB servers,
which is going to constrain you a good bit.

There are other settings important to insert performance that you haven't
mentioned: checkpoint_segments and the like, for example.  You also don't
describe your disk subsystem; it's entirely possible you've filled up the
cache on the disk controllers (which is why it looked fast initially) and
are now hitting up against the max speed the disks can do.  With only 1G
of RAM, it's possible that MySQL is reading, PostgreSQL is writing, and
PHP is swapping.  It doesn't take a lot of disk contention to flush
performance down the toilet.  See what iostat says.  Even better, use
top in "m" mode (hit 'm' after top starts) to see how much IO each process
is using (I believe that was added in 5.x, but it may only be available
in 6.x versions of FreeBSD's top).

>  * we have also tried these on another server with Red Hat Enterprise Linux
> ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2xDual Core AMD Opteron(tm)
> Processor 270 (4x2GHz logical CPUs) with 2GB RAM

While 2G is better, that's still not a lot of RAM if you're trying to run
2 DB servers and a web server on a single system.  If you haven't tuned
PG to take advantage of it, then it won't help much anyway.  Additionally,
you've neglected to mention the disk subsystem on this machine as well.
Is it running cheapo SATA drives because the price/gig is right?

>  * both servers run in x86_64 mode, PostgreSQL footprint in memory stays
> relatively small,

Of course it does, because you've told it not to use more than 8M of RAM.

> CPU usage maxes out on import, there is no resource
> starvation in any way

You do realize that you're contradicting yourself here, right?

The advice provided by others is good as well, so I won't repeat it.

--
Bill Moran
http://www.potentialtech.com

Re: INSERT performance deteriorates quickly during a large import

From
Chris Browne
Date:
tv@fuzzy.cz (Tomáš Vondra) writes:
> Try one of these:
>
> a) don't use INSERT statements, use a COPY instead
>
> b) from time to time, run ANALYZE on the table (say, one ANALYZE per
>    1000 inserts)
>
> c) create the table without constraints (primary / foreign keys in this
>    case), import all the data, and then create the constraints
>
> Options (b) and (c) may be combined, i.e. import without constraints and
> analyze from time to time. I'd probably try (a) first, anyway.
>
> Try to gather some more statistics - is the problem related to CPU or
> I/O? Use 'dstat', for example - this might give you a hint in case
> the advice above doesn't help.

I agree with all but b).

- If you use COPY, that copies data "in bulk" which is *way* faster
  than submitting individual INSERT statements that must each be parsed.

  So I certainly agree with a).

- There are two prime reasons to expect the data load to slow down:

  1. Because adding entries to the index gets more expensive the
     larger the table gets;

  2. Because searching through foreign key constraints tends to get
     more expensive as the target table grows.

  Those point to doing c).

If you put off evaluating indices and foreign key constraints until
all of the data is loaded, there should be no need to run ANALYZE
during the COPY process.

And there should be no reason for loading data to get more costly as
the size of the table increases.
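
Concretely, c) looks something like this around the bulk load (sketch
only; the constraint names come from the CREATE TABLE statement quoted
earlier in the thread, and error handling is omitted):

  <?php
  // Drop the constraints, bulk load, then put them back: one pass per
  // constraint at the end instead of one check per inserted row.

  // before the load
  pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_threads_fk');
  pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_users_fk');
  pg_query($pg, 'ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_pkey');

  // ... bulk load here, ideally via COPY ...

  // after the load
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_pkey PRIMARY KEY (id)');
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_threads_fk FOREIGN KEY (thread_id)
                   REFERENCES sp_threads (id) ON DELETE CASCADE');
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                   ADD CONSTRAINT sp_thread_replies_users_fk FOREIGN KEY (user_id)
                   REFERENCES sp_users (id)');
  pg_query($pg, 'ANALYZE sp_thread_replies');  // refresh planner stats once, at the end
  ?>
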
--
let name="cbbrowne" and tld="linuxfinances.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/advocacy.html
Rules of the Evil Overlord #116.  "If I capture the hero's starship, I
will keep it in  the landing bay with the ramp down,  only a few token
guards on  duty and a ton  of explosives set to  go off as  soon as it
clears the blast-range." <http://www.eviloverlord.com/>

Re: INSERT performance deteriorates quickly during a large import

From
"Márcio Geovani Jasinski"
Date:
Hello Krasimir,

You got a lot of good advice above, and I would like to add one more:

d) Make sure your PHP code is not recursive. As you said, memory usage is stable, so I assume your method is iterative.
A recursive method would add a little extra time and memory to each insert.
But an iterative method must be written so that it runs just once - maybe your code is doing much more work than it needs to.

Pay attention to Tomas's advice, and after that (I agree with Chris) "there should be no reason for loading data to get more costly as
the size of the table increases" - please check your code.

I did some experiments a long time ago with about 40000 records containing a lot of BLOBs, using PHP code that SELECTed from and INSERTed into Postgres (Postgres to Postgres). The time wasn't constant, but it wasn't as bad as in your case (and I didn't follow Tomas's advice a, b, and c).

Good Luck
--
Márcio Geovani Jasinski

Re: INSERT performance deteriorates quickly during a large import

From
"Krasimir Hristozov \(InterMedia Ltd\)"
Date:
Thanks to all who responded. Using COPY instead of INSERT really solved the problem - the whole process took about 1h 20min on an indexed table, with constraints (which is close to our initial expectations). We're performing some additional tests now. I'll post some more observations when finished.

Re: INSERT performance deteriorates quickly during a large import

From
Tomas Vondra
Date:
>> Try one of these:
>>
>> a) don't use INSERT statements, use a COPY instead
>>
>> b) from time to time, run ANALYZE on the table (say, one ANALYZE per
>>    1000 inserts)
>>
>> c) create the table without constraints (primary / foreign keys in this
>>    case), import all the data, and then create the constraints
>>
>> Options (b) and (c) may be combined, i.e. import without constraints and
>> analyze from time to time. I'd probably try (a) first, anyway.
>>
>> Try to gather some more statistics - is the problem related to CPU or
>> I/O? Use 'dstat', for example - this might give you a hint in case
>> the advice above doesn't help.
>
> I agree with all but b).

You're right - this combination (no constraints or indices and using
ANALYZE at the same time) won't help. The original list of advice was a
little different, but I've changed it and didn't re-check the
following paragraphs ...

Tomas

Re: INSERT performance deteriorates quickly during a large import

From
Julio Cesar Sánchez González
Date:

Hi all,

Why not try Perl DBI instead? It may perform better.

--
Regards,

Julio Cesar Sánchez González.

--
Now I am become Death, the destroyer of worlds.
I am the Death that takes everything away, the source of things to come.

www.sistemasyconectividad.com.mx        http://darkavngr.blogspot.com/