Thread: Poor addBatch performance. Why doesn't it use copy?
Folks!

I took this discussion to the Postgres performance list and came to the conclusion that it is a client-side JDBC issue, so I am posting it here.

I have a batch application that writes approximately 4 million rows into a narrow (2-column) table. I am using JDBC addBatch/executeBatch with auto commit turned off and a batch size of 1000. So far I am seeing Postgres take roughly five times as long (280 sec) as Oracle (60 sec). This is on a Linux server with a Xeon Woodcrest 5310 processor and plenty of memory. I have played with many parameters on the server side and they seem to have little effect. I am sure Postgres is a very capable server and this is not a database server issue. Someone mentioned:

"I actually went and looked at the JDBC API and realized 'addBatch' means to run multiple statements at once, not batch inserting. femski, your best bet is to lobby the JDBC folks to build support for 'copy' into the driver for faster bulk loads (or help out in that regard)."

Based on other responses I am convinced this is indeed the problem, and I think it is a pretty serious limitation.

Why doesn't the Postgres JDBC driver use "copy" for faster bulk inserts? What is the best way to speed up bulk inserts now or in the near future (I want to use the standard JDBC API)?

Thank you,

-Sanjay
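[For reference, a minimal sketch of the kind of batch loop described above. The table name, column names, and connection details are invented for illustration; it simply shows a single PreparedStatement used with addBatch/executeBatch, autocommit off, and a batch size of 1000.]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders.
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        con.setAutoCommit(false);

        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO narrow_table (id, val) VALUES (?, ?)");

        final int batchSize = 1000;
        final int totalRows = 4000000;
        for (int i = 0; i < totalRows; i++) {
            ps.setInt(1, i);
            ps.setInt(2, i * 2);
            ps.addBatch();
            if ((i + 1) % batchSize == 0) {
                ps.executeBatch();   // send the accumulated batch to the server
            }
        }
        ps.executeBatch();           // flush any remaining rows
        con.commit();

        ps.close();
        con.close();
    }
}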
[Resending, since the first post was rejected by the MTA]

Hi femski,

On Fri, 20 Apr 2007 at 04:39 -0700, femski wrote:
[...]
> Why doesn't the Postgres JDBC driver use "copy" for faster bulk inserts?
> What is the best way to speed up bulk inserts now or in the near future
> (I want to use the standard JDBC API)?

I use addBatch() very often, but I use it with statements that work on different tables. I think it would really be difficult to check whether the statements refer to the same table and the same fields.

Why don't you write all the records in one insert statement, like "INSERT INTO table (fields) VALUES (...,...), (...,...), (...,...);"? I think both Oracle and Postgres 8.2 should accept this syntax.

Bye,
Giuseppe
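[A rough sketch of Giuseppe's suggestion. The table and column names are invented, and it assumes an already-open java.sql.Connection named con; PostgreSQL 8.2 and later accept a multi-row VALUES list like this.]

// Build one INSERT carrying many rows' worth of parameters.
StringBuilder sql = new StringBuilder("INSERT INTO narrow_table (id, val) VALUES ");
int rowsPerStatement = 100;
for (int i = 0; i < rowsPerStatement; i++) {
    sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
}
PreparedStatement ps = con.prepareStatement(sql.toString());
int param = 1;
for (int i = 0; i < rowsPerStatement; i++) {
    ps.setInt(param++, i);       // id
    ps.setInt(param++, i * 2);   // val
}
ps.executeUpdate();              // inserts all rowsPerStatement rows in one statement
ps.close();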
On Friday 20 April 2007 13:39, femski wrote:
> Folks!
>
> "I actually went and looked at the JDBC API and realized 'addBatch' means
> to run multiple statements at once, not batch inserting. femski, your best
> bet is to lobby the JDBC folks to build support for 'copy' into the driver
> for faster bulk loads (or help out in that regard)."
>
> Based on other responses I am convinced this is indeed the problem, and I
> think it is a pretty serious limitation.
>
> Why doesn't the Postgres JDBC driver use "copy" for faster bulk inserts?
> What is the best way to speed up bulk inserts now or in the near future
> (I want to use the standard JDBC API)?

I would also love to have copy support (again) in the driver, but did you actually test that it would speed up your inserts? You could save your 4 million rows into a text file and then run something like

cat file.out | psql -d db -c "copy table from stdin delimiter '\t';"

Then you would know whether copy would really speed your inserts up.

best wishes

ido
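[If you want to try that comparison, one way to produce the input file from Java is to write the rows tab-separated, one per line, which matches the delimiter used in the psql command above. The column values and file name here are just placeholders.]

import java.io.BufferedWriter;
import java.io.FileWriter;

public class DumpRows {
    public static void main(String[] args) throws Exception {
        // Writes 4 million tab-delimited rows to file.out for a COPY test.
        BufferedWriter out = new BufferedWriter(new FileWriter("file.out"));
        for (int i = 0; i < 4000000; i++) {
            out.write(i + "\t" + (i * 2) + "\n");
        }
        out.close();
    }
}

[Timing the COPY load against the JDBC batch run gives a rough upper bound on what a copy-based driver could gain.]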
On Fri, 20 Apr 2007, femski wrote:

> I took this discussion to the Postgres performance list and came to the
> conclusion that it is a client-side JDBC issue, so I am posting it here.
>
> I have a batch application that writes approximately 4 million rows into a
> narrow (2-column) table. I am using JDBC addBatch/executeBatch with auto
> commit turned off and a batch size of 1000. So far I am seeing Postgres
> take roughly five times as long (280 sec) as Oracle (60 sec). This is on a
> Linux server with a Xeon Woodcrest 5310 processor and plenty of memory. I
> have played with many parameters on the server side and they seem to have
> little effect. I am sure Postgres is a very capable server and this is not
> a database server issue. Someone mentioned:
>
> "I actually went and looked at the JDBC API and realized 'addBatch' means
> to run multiple statements at once, not batch inserting. femski, your best
> bet is to lobby the JDBC folks to build support for 'copy' into the driver
> for faster bulk loads (or help out in that regard)."

This comment is inaccurate. addBatch is indeed useful for batch inserts. Consider the attached test case, which inserts a million rows into a single-column table using prepared statements with and without batches. I get:

No batch: 148.92
With batch: 48.008

So using batch execution is about three times faster. Additional gains are possible by using multiple threads and multiple connections in your client to parallelize the work. This should provide linear scaling up to the number of cores / the I/O bandwidth you have available. Perhaps Oracle is able to do this behind the scenes on the server, but PostgreSQL does not.

> Why doesn't the Postgres JDBC driver use "copy" for faster bulk inserts?
> What is the best way to speed up bulk inserts now or in the near future
> (I want to use the standard JDBC API)?

Automatically converting an insert to a copy is possible only in limited circumstances that can be difficult to detect. For example, copy cannot handle function calls in the data, so you couldn't convert this to copy: INSERT INTO t VALUES (now(), ?). What if t is actually a view with an ON INSERT rule? Copy doesn't work with views. So some complicated parsing and analysis of the query is required. Sure, it's possible, but it's non-trivial and no one has done the work.

Kris Jurka
Attachment
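[The attachment itself is not reproduced in this archive. As a hypothetical reconstruction (not Kris's actual file; table name, batch size, and connection details are assumed), a test of this shape inserts a million rows with and without addBatch and prints the elapsed time for each:]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchTest {

    static final int ROWS = 1000000;

    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        con.setAutoCommit(false);
        con.createStatement().execute("CREATE TEMP TABLE batchtest (a int)");

        System.out.println("No batch: " + run(con, false));
        System.out.println("With batch: " + run(con, true));

        con.close();
    }

    static double run(Connection con, boolean useBatch) throws Exception {
        PreparedStatement ps = con.prepareStatement("INSERT INTO batchtest VALUES (?)");
        long start = System.currentTimeMillis();
        for (int i = 0; i < ROWS; i++) {
            ps.setInt(1, i);
            if (useBatch) {
                ps.addBatch();
                if ((i + 1) % 1000 == 0) {
                    ps.executeBatch();  // send 1000 rows in one round trip
                }
            } else {
                ps.executeUpdate();     // one network round trip per row
            }
        }
        if (useBatch) {
            ps.executeBatch();          // flush the final partial batch
        }
        con.commit();
        ps.close();
        return (System.currentTimeMillis() - start) / 1000.0;
    }
}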
On Fri, 20 Apr 2007, Giuseppe Sacco wrote:

> I use addBatch() very often, but I use it with statements that work on
> different tables. I think it would really be difficult to check whether
> the statements refer to the same table and the same fields.

Well, Statement.addBatch and PreparedStatement.addBatch are different beasts. It wouldn't be worth it to try to parse each Statement, but parsing the PreparedStatement would be possible.

Kris Jurka
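[To illustrate the distinction (the SQL and the open Connection con are assumed for the example): Statement.addBatch queues arbitrary, possibly unrelated SQL strings, while PreparedStatement.addBatch queues parameter sets for one fixed statement, which is the only case where an automatic rewrite to COPY could even be contemplated.]

// Statement batching: each entry can be a completely different SQL string,
// so there is no single statement that could be rewritten as a COPY.
Statement st = con.createStatement();
st.addBatch("INSERT INTO t1 VALUES (1)");
st.addBatch("UPDATE t2 SET x = x + 1");
st.addBatch("DELETE FROM t3 WHERE id = 7");
st.executeBatch();

// PreparedStatement batching: one parsed statement, many parameter sets.
// Here the target table and columns are known up front.
PreparedStatement ps = con.prepareStatement("INSERT INTO t1 VALUES (?)");
for (int i = 0; i < 10; i++) {
    ps.setInt(1, i);
    ps.addBatch();
}
ps.executeBatch();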