Thread: bulk insert performance problem

bulk insert performance problem

From
"Christian Bourque"
Date:
Hi,

I have a performance problem with a script that does massive bulk
inserts into 6 tables. When the script starts, performance is really
good, but it degrades minute after minute and the whole run takes
almost a day to finish!

I have tried almost everything suggested on this list: changed our external
RAID array from RAID 5 to RAID 10, tweaked postgresql.conf to the best
of my knowledge, moved pg_xlog to a different array, and dropped the
tables before running the script. But the performance gain was
negligible even after all these changes...

IMHO the hardware that we use should be up to the task: Dell PowerEdge
6850, 4 x 3.0GHz dual-core Xeon, 8GB RAM, 3 x 300GB SAS 10K in RAID 5
for / and 6 x 300GB SAS 10K in RAID 10 (MD1000) for the PG data; the data
filesystem is ext3 mounted with noatime and data=writeback. Running on
openSUSE 10.3 with PostgreSQL 8.2.7. The server is dedicated to
PostgreSQL...

We tested the same script and schema with Oracle 10g on the same
machine and it took only 2.5h to complete!

What I don't understand is that with Oracle the performance always seems
consistent, but with PG it deteriorates over time...

Any ideas? Are there any other improvements I could make?

Thanks

Christian

Re: bulk insert performance problem

From
Craig Ringer
Date:
Christian Bourque wrote:
> Hi,
>
> I have a performance problem with a script that does massive bulk
> inserts into 6 tables. When the script starts, performance is really
> good, but it degrades minute after minute and the whole run takes
> almost a day to finish!
>
Would I be correct in guessing that there are foreign key relationships
between those tables, and that there are significant numbers of indexes
in use?

The foreign key checking costs will go up as the tables grow, and AFAIK
the indexes get a bit more expensive to maintain too.

If possible you should probably drop your foreign key relationships and
drop your indexes, insert your data, then re-create the indexes and
foreign keys. The foreign keys will be rechecked when you recreate them,
and it's *vastly* faster to do it that way. Similarly, building an index
from scratch is quite a bit faster than progressively adding to it. Of
course, dropping the indices is only useful if you aren't querying the
tables as you build them.
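
For example, a minimal sketch of that drop/load/recreate cycle, using a
hypothetical "orders" table that references "customers" (all table, column,
index and constraint names here are invented for illustration):

-- Drop the constraint and index before the load (names are hypothetical).
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
DROP INDEX orders_customer_id_idx;

-- ... bulk load into customers and orders here ...

-- Rebuilding from scratch is much cheaper than maintaining them row by row;
-- the ADD CONSTRAINT step rechecks all the loaded rows in one pass.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
ANALYZE orders;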

Also, if you're loading data using stored procedures you should avoid
the use of exception blocks. I had some major problems with my bulk data
conversion code due to overuse of exception blocks creating large
numbers of subtransactions behind the scenes and slowing everything to a
crawl.
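
For illustration, a minimal sketch of the kind of per-row exception handling
to avoid, with invented table and column names ("staging", "target", "id",
"val"):

-- Each BEGIN ... EXCEPTION block opens a subtransaction for every row,
-- which adds up quickly on a large load.
CREATE OR REPLACE FUNCTION slow_load() RETURNS void AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM staging LOOP
        BEGIN
            INSERT INTO target (id, val) VALUES (r.id, r.val);
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- skip duplicates
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- A set-based alternative avoids the per-row subtransactions entirely:
-- INSERT INTO target (id, val)
--     SELECT s.id, s.val FROM staging s
--     WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);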

--
Craig Ringer

Re: bulk insert performance problem

From
Chris
Date:
Craig Ringer wrote:
> Christian Bourque wrote:
>> Hi,
>>
>> I have a performance problem with a script that does massive bulk
>> inserts into 6 tables. When the script starts, performance is really
>> good, but it degrades minute after minute and the whole run takes
>> almost a day to finish!
>>
> Would I be correct in guessing that there are foreign key relationships
> between those tables, and that there are significant numbers of indexes
> in use?
>
> The foreign key checking costs will go up as the tables grow, and AFAIK
> the indexes get a bit more expensive to maintain too.
>
> If possible you should probably drop your foreign key relationships and
> drop your indexes, insert your data, then re-create the indexes and
> foreign keys. The foreign keys will be rechecked when you recreate them,
> and it's *vastly* faster to do it that way. Similarly, building an index
> from scratch is quite a bit faster than progressively adding to it. Of
> course, dropping the indices is only useful if you aren't querying the
> tables as you build them.

If you are, run ANALYZE periodically during the import, e.g. every 10,000
rows. Then your foreign key checks should be a bit faster.

The other suggestion would be to do block commits:

begin;
do stuff for 5000 rows;
commit;

repeat until finished.
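
As a concrete sketch of that loop (table and column names invented):

-- Hypothetical table; repeat this block until the input is exhausted.
BEGIN;
INSERT INTO my_table (id, val) VALUES (1, 'a');
INSERT INTO my_table (id, val) VALUES (2, 'b');
-- ... roughly 5000 rows per transaction ...
COMMIT;
-- every 10,000 rows or so, refresh planner statistics:
ANALYZE my_table;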

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: bulk insert performance problem

From
"bitaoxiao"
Date:
I use 10,000 rows per batch, and the rows contain big blobs.
 
 
2008-04-08

bitaoxiao

From: Chris
Sent: 2008-04-08 11:35:57
To: Christian Bourque
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] bulk insert performance problem

Re: bulk insert performance problem

From
hubert depesz lubaczewski
Date:
On Mon, Apr 07, 2008 at 11:01:18PM -0400, Christian Bourque wrote:
> I have a performance problem with a script that does massive bulk
> inserts into 6 tables. When the script starts, performance is really
> good, but it degrades minute after minute and the whole run takes
> almost a day to finish!

how do you do this bulk insert?

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: bulk insert performance problem

From
Mark Stosberg
Date:
Christian Bourque wrote:
>
> Any idea? Is there any other improvements I could do?

Are you using the "COPY" syntax in the import script or individual
insert statements? Using COPY will always be *much* faster.

I believe COPY always appends to tables rather than replacing their
contents, so you can combine this technique with splitting the task into
multiple COPY statements. That has never been necessary in my case, though:
switching from INSERTs to a COPY statement always provided the huge
performance improvement I needed.
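
For example, assuming a hypothetical "customers" table and a tab-separated
data file (both names invented):

-- Server-side COPY reads a file that is visible to the PostgreSQL server:
COPY customers (id, name, email) FROM '/tmp/customers.tsv';

-- From a client-side script, psql's \copy does the same over the connection:
-- \copy customers (id, name, email) from 'customers.tsv'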

It's easy to confuse "pg_dump -d" with "psql -d" ...it's too bad they
mean very different things.

For pg_dump, "-d" causes INSERT statements to be generated instead of a
COPY statement. It's a mistake I have made in the past, because I expected
it to work like "psql -d", where "-d" means "database name".

I suppose the safe thing to do is to avoid using "-d" altogether!
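
To illustrate the difference (database name invented; the flags are as they
behaved on pg_dump of that era):

# pg_dump's -d dumps data as INSERT statements -- slow to restore
pg_dump -d mydb > mydb_inserts.sql
# the default output uses COPY and restores much faster
pg_dump mydb > mydb_copy.sql
# for psql, -d simply names the database to connect to
psql -d mydb -f mydb_copy.sql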


    Mark

Re: bulk insert performance problem

From
Matthew
Date:
On Tue, 8 Apr 2008, Mark Stosberg wrote:
>> Any idea? Is there any other improvements I could do?
>
> Are you using the "COPY" syntax in the import script or individual insert
> statements? Using COPY will always be *much* faster.

PostgreSQL (latest versions at least) has an optimisation that applies if
you create a table in the same transaction as you load data into it. So, if
you have a database dump, load it in using psql -1, which wraps the entire
operation in a single transaction. Of course, a COPY dump will load a lot
faster than an INSERT dump.
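
A minimal sketch of what that looks like when done by hand (table name and
file path invented):

-- When a table is created in the same transaction that loads it, PostgreSQL
-- can skip WAL-logging the loaded data (as long as WAL archiving is off).
BEGIN;
CREATE TABLE big_import (id integer, payload text);
COPY big_import FROM '/tmp/big_import.tsv';
COMMIT;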

Matthew

--
What goes up must come down. Ask any system administrator.

Re: bulk insert performance problem

From
PFC
Date:
> I have a performance problem with a script that does massive bulk
> inserts into 6 tables. When the script starts, performance is really
> good, but it degrades minute after minute and the whole run takes
> almost a day to finish!

    Looks like foreign key checks slow you down.

    - Batch INSERTs in transactions (1000-10000 per transaction).
    - Run ANALYZE once in a while so the FK checks use indexes.
    - Are there any DELETEs in your script which might hit non-indexed
      REFERENCES... columns and cascade?
    - Do you really need to check FKs on the fly while inserting?
      i.e. do you handle FK violations, or is your data already consistent?
      If the data is already consistent, load it without any constraints (and
      without any indexes), and add the indexes and foreign key constraints
      after the loading is finished.
    - Use COPY instead of INSERT.

    If you use your script to process data, perhaps you could import the raw,
unprocessed data into a table (with COPY) and process it with SQL. This is
usually much faster than doing a zillion inserts.
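
For example, a rough sketch with invented names: load the raw rows into an
unindexed staging table with COPY, then transform them in one set-based
statement:

-- A staging table with no constraints or indexes loads as fast as possible.
CREATE TABLE staging_orders (customer_ref text, amount text, ordered_at text);
COPY staging_orders FROM '/tmp/orders_raw.tsv';

-- One set-based pass does the processing instead of a zillion per-row inserts.
INSERT INTO orders (customer_id, amount, ordered_at)
SELECT c.id, s.amount::numeric, s.ordered_at::timestamp
FROM staging_orders s
JOIN customers c ON c.ref = s.customer_ref;

ANALYZE orders;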