Big question on insert performance/using COPY FROM - Mailing list pgsql-performance

From Morgan Kita
Subject Big question on insert performance/using COPY FROM
Date
Msg-id 08B420FF5BF7BC42A064212C2EB768801C10BE@neutron.verseon.com
Whole thread Raw
Responses Re: Big question on insert performance/using COPY FROM
List pgsql-performance
Hi,

I am currently trying to speed up the insertion of bulk loads to my database. I have fiddled with all of the parameters
thatI have seen suggested(aka checkpoint_segments, checkpoint_timeout, maintinence_work_mem, and shared buffers) with
nosuccess. I even turned off fysnc with no effect so I am pretty sure the biggest problem is that the DB is CPU limited
atthe moment because of the rather weak machine that postmaster is running on(Athlon 2400+ xp with 512 RAM), but that
willchange in the future so I am trying to get performance increases that don't involve changing the machine at the
moment.

I am currently inserting into the database through lipqxx's C++ interface. I am using prepared statements that perform
regularinserts. I would like to use COPY FROM since I have read so much about its increased performance with respect to
INSERT,but I am not sure how to use it in my case. So let me give you an idea on how the tables are laid out.  

The real DB has more tables, but lets say for the sake of argument I have 3 tables; TB1, TB2, TB3. Lets say that TB1
hasa primary key PK1 and a unique identifier column(type text) UK1 that has an index on it. TB2 then has a PK2, a
UK2(typetext) of its own with an index, and a foreign key FK2 that points to TB1's PK1. TB3 has a PK3 and a FK3 that
pointsto FK2.  
TB1             TB2                               TB3
--------------     -------------------------------     ----------------------
PK1, UK1     PK2, UK2, FK2(PK1)     PK3, FK3(PK2)

Now in lipqxx I am parsing an input list of objects that are then written to these tables. Each object may produce one
rowin TB1, one row in TB2, and one row in TB3. The UK1 and UK2 indentifiers are used to prevent duplicate entries for
TB1and TB2 respectively. I know COPY FROM obeys these unique checks; however, my problem is the FKs. So lets say I try
toinsert a row into TB1. If it is unique on UK1 then it inserts a new row with some new primary key int4 identifier and
ifit is a duplicate then no insert is done but the already existing row's primary key identifier is returned. This
identifier(duplicateor not) is used when populating TB2's row as the FK2 identifier. The row that is to be inserted
intoTB2 needs the primary key indentifier from the result of the attempted insert into TB1. Similarily the insert into
TB3needs the result of the pk indentifier of the attempted insert into TB2. Once that is done then I move on to parsing
thenext object for insertion into the 3 tables. 

So lets say I want to insert a list of objects using COPY FROM... whats the way to do it? How can I at the very least
geta list of the primary keys of TB1(newly inserted rows or from already existings row) returned from the COPY FROM
insertinto TB1 so I can use them for the COPY FROM insert into TB2 and so on? Is there a better way to do this? 

P.S. I am going to setup autovacuum for these bulk loads. My question though is why for bulkloads is VACUUM useful? I
understandthat it frees up dead rows as a result of UPDATE and such, but where are the dead rows created from plain
INSERTS?

Thanks,
Morgan

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Swapping
Next
From: Tobias Brox
Date:
Subject: Re: Swapping