INSERT performance deteriorates quickly during a large import - Mailing list pgsql-general

From Krasimir Hristozov (InterMedia Ltd)
Subject INSERT performance deteriorates quickly during a large import
Msg-id 02b101c82221$4f59ced0$0400000a@imediadev.com
Responses Re: INSERT performance deteriorates quickly during a large import  (Tomáš Vondra <tv@fuzzy.cz>)
Re: INSERT performance deteriorates quickly during a large import  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: INSERT performance deteriorates quickly during a large import  (Bill Moran <wmoran@potentialtech.com>)
List pgsql-general
We need to import data from a relatively large MySQL database into an
existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
and INSERTs into PostgreSQL. Part of the import involves moving about
1,300,000 records from one MySQL table to one of our PostgreSQL tables. The
problem is that the insert performance inevitably deteriorates as the number
of inserts increases.

We tried different approaches:

 * selecting only parts of the source table data based on a certain
condition
 * selecting all of the source data

coupled with either of these:

 * inserting without explicit transactions
 * inserting all the data in a single transaction
 * inserting the data in partial transactions of about 100, 1000, 10000,
100000 inserts each

While there were performance benefits in some of the cases (selecting all
the data and inserting in transaction packets of about 1000 each being the
fastest), the problem was that performance still deteriorated as the import
progressed.
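
To make the batching concrete, here is a simplified sketch of the loop (not
the actual import script; $mysql and $pg stand for open connections,
'source_table' is a placeholder for the real MySQL table, and the column
list is taken from the DDL further below):

  <?php
  // Simplified sketch only -- not the actual import script. $mysql and $pg
  // are assumed to be open connections; 'source_table' is a placeholder for
  // the real MySQL table name; the destination columns come from the DDL
  // shown further below.
  $res = mysql_query('SELECT * FROM source_table', $mysql);

  $batchSize = 1000;
  $count = 0;
  pg_query($pg, 'BEGIN');
  while ($row = mysql_fetch_assoc($res)) {
      pg_query_params($pg,
          'INSERT INTO sp_thread_replies (thread_id, body, ts_added, user_id)
           VALUES ($1, $2, $3, $4)',
          array($row['thread_id'], $row['body'],
                $row['ts_added'], $row['user_id']));
      if (++$count % $batchSize == 0) {   // commit every 1000 rows...
          pg_query($pg, 'COMMIT');
          pg_query($pg, 'BEGIN');         // ...and open the next packet
      }
  }
  pg_query($pg, 'COMMIT');                // commit the final partial packet
  ?>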

We also tried removing all foreign keys and indices from the Postgres table;
this gained some performance too, but it still deteriorated over the course
of the import.

The latest (and best performing) test we did was under the following
conditions:

 * 11851 pre-existing records in the destination table
 * the table was vacuumed just before the import
 * all foreign keys and indices were removed from the destination table
(see the sketch after this list)
 * selected all of the data from the source table at once
 * inserted in transactions of 1000 inserts each
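
For reference, the constraint removal amounts to roughly the following (a
sketch only, not the exact commands we ran; $pg is an open connection, the
constraint names are the ones from the DDL further below, and dropping the
primary key constraint also drops its underlying index):

  <?php
  // Sketch of the preparation step only. $pg is an open connection; the
  // constraint names come from the table DDL shown further below. Dropping
  // the primary key constraint also removes its underlying index.
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                 DROP CONSTRAINT sp_thread_replies_threads_fk');
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                 DROP CONSTRAINT sp_thread_replies_users_fk');
  pg_query($pg, 'ALTER TABLE sp_thread_replies
                 DROP CONSTRAINT sp_thread_replies_pkey');
  ?>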

We displayed time statistics for every 100 inserts. The process started at
about 1 second per 100 inserts, which extrapolates to roughly 4 hours for
the entire import. Fourteen hours later it had imported only about a quarter
of the data (a bit more than 330000 records), and 100 inserts now took
nearly 40 seconds.

We tested reading from MySQL alone, without inserting the data into Postgres.
All records were read in about a minute and a half, so MySQL performance is
not part of the issue. The PHP script selects the MySQL data, fetches rows
sequentially, occasionally performs a couple of selects against PostgreSQL
data (which is cached in a PHP array to reduce the DB operations; no more
than 80000 array elements, integer keys, integer data), and inserts into
PostgreSQL. The algorithm is linear in nature, so the performance
deterioration most probably has nothing to do with the PHP code.
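
The caching mentioned above works along these lines (again a simplified
sketch rather than the actual code; the sp_users query and its legacy_id
column are purely illustrative placeholders, not the real schema):

  <?php
  // Sketch of the lookup caching described above: occasional SELECTs against
  // PostgreSQL are memoized in a plain PHP array (integer keys, integer
  // values, no more than ~80000 entries). The query and the legacy_id column
  // are illustrative placeholders only.
  $cache = array();

  function cached_lookup($pg, $key, &$cache)
  {
      if (isset($cache[$key])) {
          return $cache[$key];          // answered from the array, no DB call
      }
      $res = pg_query_params($pg,
          'SELECT id FROM sp_users WHERE legacy_id = $1', array($key));
      $row = pg_fetch_row($res);
      $cache[$key] = $row ? (int) $row[0] : null;
      return $cache[$key];
  }
  ?>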

Has anyone had an issue like this, and perhaps a suggestion for a possible
cause and solution? Is it common for PostgreSQL to slow down this much as
the amount of data in a table increases? If so, does it affect just the
INSERT operation or all kinds of queries? Isn't 300000 records too low a
threshold for such performance deterioration?

Here are some technical details that might be helpful:

 * PHP, MySQL and PostgreSQL all work on the same server, sharing the same
memory and hard drive.
 * the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor
3000+ (2GHz K8 class CPU) with 1GB RAM
 * the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL
4.1.22, PostgreSQL 8.1.8
 * the only postgresql.conf variable changed from the defaults is
max_connections = 40; shared_buffers = 1000 remains at the default
 * we have also tried these on another server with Red Hat Enterprise Linux
ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2xDual Core AMD Opteron(tm)
Processor 270 (4x2GHz logical CPUs) with 2GB RAM
 * both servers run in x86_64 mode; the PostgreSQL memory footprint stays
relatively small, CPU usage maxes out during the import, and there is no
sign of resource starvation

DDL statement for the creation of the PostgreSQL table in question:

 CREATE TABLE "public"."sp_thread_replies" (
   "id" SERIAL,
   "thread_id" INTEGER NOT NULL,
   "body" TEXT NOT NULL,
   "ts_added" INTEGER DEFAULT 0 NOT NULL,
   "user_id" INTEGER NOT NULL,
   "thread_offset" INTEGER DEFAULT 0,
   "approved" SMALLINT DEFAULT 1,
   "title" TEXT,
   "deleted" SMALLINT DEFAULT 0,
   "edit_reason" VARCHAR(255),
   "edit_user_id" INTEGER,
   "edit_time" INTEGER,
   CONSTRAINT "sp_thread_replies_pkey" PRIMARY KEY("id"),
   CONSTRAINT "sp_thread_replies_threads_fk" FOREIGN KEY ("thread_id")
     REFERENCES "public"."sp_threads"("id")
     ON DELETE CASCADE
     ON UPDATE NO ACTION
     NOT DEFERRABLE,
   CONSTRAINT "sp_thread_replies_users_fk" FOREIGN KEY ("user_id")
     REFERENCES "public"."sp_users"("id")
     ON DELETE NO ACTION
     ON UPDATE NO ACTION
     NOT DEFERRABLE
 ) WITH OIDS;

The table is a part of a custom forum engine. It stores all thread posts.
It is most often accessed with SELECTs and INSERTs, less often with UPDATEs,
and records are deleted quite seldom in normal operation of the application
(though we may delete records manually from the console from time to time).

