Re: how to handle very large data object efficiently - Mailing list pgsql-jdbc

From Mark Lewis
Subject Re: how to handle very large data object efficiently
Date
Msg-id 1185468142.19184.48.camel@archimedes
In response to how to handle very large data object efficiently  (Xiaoyu <zhangxiaoyu912@gmail.com>)
List pgsql-jdbc
On Thu, 2007-07-26 at 15:54 +0000, Xiaoyu wrote:
> Hi, folks:
> I am new to JDBC programming, so if the question is naive, please
> forgive me.
> question description:
> 1. store the data in a text file into database, the format of the text
> file is similar to:
> ......
> 218596813 235940555 4387359 3 386658 4000 4 4
> 218597197 235940333 4366832 17 388842 5000 5 5
> 218597485 235940805 4374620 8 386226 4000 4 4
> ......
> each record in the database corresponds to a line of the file, and
> each column corresponds to a number on that line.
>
> 2. the files are huge: there are normally about 9,000,000 lines in
> each file. My current program reads the file line by line, parses each
> line, and stores the elements in the database. However, because the
> file is so large, it can take days to load a single file, and there
> are 50 similar files that need to be processed.
>
> 3. Can anyone suggest something better than reading line by line? Is
> there a way to work with blocks of data (i.e. several lines at a time)
> in JDBC? Many thanks

Basic FAQ for large data imports:

0. Make sure Postgres is set up correctly, including a sane vacuuming
policy.  Depending on your workload you may want to play with the
checkpoint settings, work_mem settings, auto-vacuum, etc.

1. Use a recent version of Postgres (8.2.4 preferably).

2. Use a few big transactions instead of many small transactions.  In
JDBC this means setAutoCommit(false) and batching your inserts; see the
first sketch after this list.

3. If possible, do the import first and add any relevant
constraints/indexes later.

4. Use the COPY method.  This isn't supported in the official driver
yet, but look through the mailing list archives and you'll find the
patch to enable it.  A sketch of what the call looks like follows this
list.

5. Depending on your server hardware, it can be a big win to split the
import across multiple threads.  Try it and see; there's a sketch of
this below as well.
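
To make point 2 concrete, here is a minimal sketch in plain JDBC.  It
assumes a hypothetical table called "samples" with eight bigint columns
matching your file layout; the table name, connection URL, credentials
and batch size are placeholders, so adjust them to your setup:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkInsert {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
        conn.setAutoCommit(false);               // one big transaction, not one per row

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?, ?)");

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\\s+");   // whitespace-separated numbers
            if (fields.length < 8) continue;               // skip malformed lines
            for (int i = 0; i < 8; i++) {
                ps.setLong(i + 1, Long.parseLong(fields[i]));
            }
            ps.addBatch();
            if (++count % 10000 == 0) {
                ps.executeBatch();               // push rows to the server in chunks
            }
        }
        ps.executeBatch();                       // flush the final partial batch
        conn.commit();                           // commit once at the end
        in.close();
        ps.close();
        conn.close();
    }
}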
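
For point 4, the exact API depends on which COPY patch you apply.  The
sketch below assumes a CopyManager-style interface (a getCopyAPI() method
on the PostgreSQL connection and a copyIn() call that reads from a
java.io.Reader), which is roughly what the patches in the archives expose;
treat the class and method names as assumptions and adapt them to whatever
your patched driver actually provides.  It also assumes the fields in your
file are separated by single spaces:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyImport {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");

        // Unwrap the PostgreSQL-specific connection to reach the COPY API
        // (an assumption; your patched driver may expose this differently).
        CopyManager copy = ((PGConnection) conn).getCopyAPI();

        // COPY streams the whole file through a single server-side command,
        // which avoids per-row INSERT overhead entirely.  The default text
        // format expects tabs, so declare the space delimiter explicitly.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        long rows = copy.copyIn("COPY samples FROM STDIN WITH DELIMITER ' '", in);
        System.out.println("Imported " + rows + " rows");

        in.close();
        conn.close();
    }
}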
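
And for point 5, one way to fan the 50 files out over a small thread
pool, with each worker using its own connection.  The per-file import
body is left as a stub; drop in the batch or COPY logic from the
sketches above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelImport {
    public static void main(String[] args) throws Exception {
        // One worker per CPU core is a reasonable starting point; measure and adjust.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (final String file : args) {         // one data file per task
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Connection conn = DriverManager.getConnection(
                                "jdbc:postgresql://localhost/mydb", "user", "password");
                        conn.setAutoCommit(false);
                        // ... import `file` here using the batch or COPY approach ...
                        conn.commit();
                        conn.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();                         // no more tasks will be submitted
        pool.awaitTermination(7, TimeUnit.DAYS); // wait for the workers to finish
    }
}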

If you do all of the above and have a relatively recent database server,
you should be able to import simple data like the above at a rate of at
least several thousand rows per second.  If it takes more than 30 minutes
to import one of your 9M row files, then you're doing something wrong.

-- Mark Lewis
