Thread: Poor addBatch performance. Why dosn't it use copy ?

Poor addBatch performance. Why dosn't it use copy ?

From
femski
Date:
Folks !

I took this discussion in Postgres performance  list and came out with
conclusion that its a
client side JDBC issue - so I am psting it here.

I have a batch application that writes approx. 4 million rows into a narrow
(2 column) table. I am using JDBC addBatch/ExecuteBatch with auto commit
turned off. Batch size is 1000. So far I am seeing Postgres take
roughly five times (280 sec) the time it takes to do this in the Oracle
(60). This is on a Linux
server with Xeon woodcrest 5310 process. Plenty of memory. I have played
with many parameters on
the server side and they seem to have little effect - I am sure Postgres is
a very capable server and its
not a database server issue. Someone mentioned:

"I actually went and looked at the JDBC api and realized 'addBatch' means to
run multiple stmts at once, not batch
inserting.  femski, your best bet is to lobby the JDBC folks to build
support for 'copy' into the driver for faster bulk loads (or help out in
that regard). "

Based on other responses  I am convinced this is indeed the problem and I
think its a pretty serious limitation.

Why doesn't the Postgres JDBC driver use "copy" for faster bulk insert ?
What is the best way to speedup
do bulk insert at this time or in near future (I was to use standard JDBC
API) ?

Thank you,

-Sanjay
--
View this message in context:
http://www.nabble.com/Poor-addBatch-performance.-Why-dosn%27t-it-use-copy---tf3616055.html#a10099075
Sent from the PostgreSQL - jdbc mailing list archive at Nabble.com.


Re: Poor addBatch performance. Why dosn't it use copy ?

From
Giuseppe Sacco
Date:
[send post, since the first one was rejected by MTA]

Hi femski,

Il giorno ven, 20/04/2007 alle 04.39 -0700, femski ha scritto:
[...]
> Why doesn't the Postgres JDBC driver use "copy" for faster bulk insert ?
> What is the best way to speedup
> do bulk insert at this time or in near future (I was to use standard JDBC
> API) ?

I use addBatch() very often when, but I use it with statement that work
on different tables. I think it would really be difficult to check if
the statements are referring the same table and the same field.

Why don't you write all record in one insert statement like "INSERT INTO
table (fields) VALUES (...,...), (...,...), (...,..);" I think both
oracle and postgres 8.2 should accept this syntax.

Bye,
Giuseppe

Re: Poor addBatch performance. Why dosn't it use copy ?

From
"Ido M. Tamir"
Date:
On Friday 20 April 2007 13:39, femski wrote:
> Folks !
>

> "I actually went and looked at the JDBC api and realized 'addBatch' means
> to run multiple stmts at once, not batch
> inserting.  femski, your best bet is to lobby the JDBC folks to build
> support for 'copy' into the driver for faster bulk loads (or help out in
> that regard). "
>
> Based on other responses  I am convinced this is indeed the problem and I
> think its a pretty serious limitation.
>
> Why doesn't the Postgres JDBC driver use "copy" for faster bulk insert ?
> What is the best way to speedup
> do bulk insert at this time or in near future (I was to use standard JDBC
> API) ?

I would also love to have copy support (again) in the driver, but
did you actually test that it would speed up your inserts?
You could save your 4m rows into a txt file and then
something like

cat file.out | psql -d db -c "copy table from stdin delimiter '\t';"
and then you would know if copy would really speed your
inserts up.

best wishes
ido




Re: Poor addBatch performance. Why dosn't it use copy ?

From
Kris Jurka
Date:

On Fri, 20 Apr 2007, femski wrote:

> I took this discussion in Postgres performance  list and came out with
> conclusion that its a
> client side JDBC issue - so I am psting it here.
>
> I have a batch application that writes approx. 4 million rows into a narrow
> (2 column) table. I am using JDBC addBatch/ExecuteBatch with auto commit
> turned off. Batch size is 1000. So far I am seeing Postgres take
> roughly five times (280 sec) the time it takes to do this in the Oracle
> (60). This is on a Linux
> server with Xeon woodcrest 5310 process. Plenty of memory. I have played
> with many parameters on
> the server side and they seem to have little effect - I am sure Postgres is
> a very capable server and its
> not a database server issue. Someone mentioned:
>
> "I actually went and looked at the JDBC api and realized 'addBatch' means to
> run multiple stmts at once, not batch
> inserting.  femski, your best bet is to lobby the JDBC folks to build
> support for 'copy' into the driver for faster bulk loads (or help out in
> that regard). "

This comment is inaccurate.  addBatch is indeed useful for batch inserts.
Consider the attached test case which inserts a million rows into a single
column table using prepared statements with and without batches.  I get:

No batch: 148.92
With batch: 48.008

So using batch execution is about three times faster.

Additionaly gains are possible by using multiple threads and multiple
connections in your client to parallelize the work.  This should provide
linear scaling to the number of cores/io bandwidth you have available.
Perhaps oracle is able to do this behind the scenes on the server, but
postgresql does not.

> Why doesn't the Postgres JDBC driver use "copy" for faster bulk insert ?
> What is the best way to speedup
> do bulk insert at this time or in near future (I was to use standard JDBC
> API) ?
>

Automatically converting from insert to copy is possible in only limited
circumstances that can be difficult to detect.  For example copy
cannot handle function calls in data, so you couldn't convert this
to copy: INSERT INTO t VALUES (now(), ?).  What if t is actually a view
with an ON INSERT rule?  Copy doesn't work with views.  So some
complicated parsing and analysis of the query is required.

Sure it's possible, but it's non-trivial and no one has done the work.

Kris Jurka


Attachment

Re: Poor addBatch performance. Why dosn't it use copy ?

From
Kris Jurka
Date:

On Fri, 20 Apr 2007, Giuseppe Sacco wrote:

> I use addBatch() very often when, but I use it with statement that work
> on different tables. I think it would really be difficult to check if
> the statements are referring the same table and the same field.

Well Statement.addBatch and PreparedStatement.addBatch are different
beasts.  It wouldn't be worth it to try and parse each Statement, but
parsing the PreparedStatement would be possible.

Kris Jurka