Thread: Inserting large BLOBs via JDBC - OutOfMemoryError
(Please excuse me if I'm double-posting; I assume my first mail did not go through because I was using the wrong mail account.)

Hi all,

I am facing a problem when writing BLOBs via JDBC. Small BLOBs work, but with BLOBs beyond a certain size my code throws a java.lang.OutOfMemoryError. The file I tried has about 2-3 MB.

Environment:
- PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96
- Mandrake 8.2
- 192 MB RAM
- BLOB column created as datatype "bytea" (maybe that is the problem)
- code posted below (BlobPostgresTest.java)
- console output posted below
- java command to run the program:
  java -verbose:gc -Xmx128m -classpath .:pgjdbc2.jar BlobPostgresTest
- java version as reported by "java -version":
  java version "1.4.0_01"
  Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_01-b03)
  Java HotSpot(TM) Client VM (build 1.4.0_01-b03, mixed mode)

To me it seems that the implementation of setBinaryStream() could be improved. Details follow below. Let me know what you think.

holger

Console output (including GC activity):
=======================================

java -verbose:gc -Xmx128m -classpath .:pgjdbc2.jar BlobPostgresTest
BLOB/PostgreSQL Demo started
[GC 511K->230K(1984K), 0.0163708 secs]
preparing statement
setting binary stream
[GC 402K->273K(1984K), 0.0092325 secs]
[Full GC 273K->273K(1984K), 0.0771294 secs]
[GC 4351K->4058K(4908K), 0.0046329 secs]
[Full GC 4058K->3770K(4908K), 0.0831070 secs]
[GC 4922K->4922K(6860K), 0.0009556 secs]
[Full GC 4922K->4346K(6860K), 0.0890570 secs]
[GC 6650K->6650K(7820K), 0.0011110 secs]
[Full GC 6650K->5496K(7820K), 0.1420860 secs]
[GC 10104K->10104K(14480K), 0.0012827 secs]
[Full GC 10104K->7800K(14480K), 0.1236744 secs]
[GC 17016K->17016K(24084K), 0.0015421 secs]
[Full GC 17016K->12408K(24084K), 0.1688843 secs]
[GC 30840K->30840K(43224K), 0.0020800 secs]
[Full GC 30840K->21624K(43224K), 0.2547274 secs]
[GC 76920K->74616K(81500K), 0.0041685 secs]
[Full GC 74616K->49272K(81500K), 0.5688448 secs]
[GC 67704K->67704K(88332K), 0.0033407 secs]
[Full GC 67704K->58488K(88332K), 0.2558231 secs]
executing update
[GC 95352K->95352K(104844K), 0.0932741 secs]
[Full GC 95352K->40056K(104844K), 0.9644251 secs]
[GC 69245K->69245K(104844K), 0.0036631 secs]
[Full GC 69245K->69245K(104844K), 0.0814962 secs]
[Full GC 69245K->66324K(129728K), 1.1439123 secs]
Exception in thread "main" java.lang.OutOfMemoryError

As you can see, the program dies after setBinaryStream() is executed. The memory allocated exceeds 128 MB (the whole thing fails with a max. heap of 192 MB as well). My assumption is that setBinaryStream() is allocating a huge amount of memory.

Looking into the PostgreSQL source code, I find:

  org.postgresql.jdbc1.AbstractJdbc1Statement#setBinaryStream
  org.postgresql.util.PGbytea#toPGString

Mainly toPGString seems to be programmed inefficiently in terms of memory consumption, e.g. the string buffer allocation:

public static String toPGString(byte[] p_buf) throws SQLException
{
    if (p_buf == null)
        return null;
    StringBuffer l_strbuf = new StringBuffer();

should use a sensible initial size:

public static String toPGString(byte[] p_buf) throws SQLException
{
    if (p_buf == null)
        return null;
    StringBuffer l_strbuf = new StringBuffer(p_buf.length);

or even

    StringBuffer l_strbuf = new StringBuffer(p_buf.length * 2);

because of special characters, to avoid repeated re-allocation of the internal char[] array in the StringBuffer.
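(For illustration only: a minimal sketch of what a pre-sized toPGString might look like, assuming the \OOO octal escaping for bytea discussed further down this thread. This is not the actual driver code, and the exact set of bytes that need escaping is an assumption here.)

// Sketch, not driver code: pre-size the buffer so it never grows from
// the default capacity of 16. Assumed escaping: backslash is doubled,
// non-printable bytes become \OOO octal escapes.
public static String toPGString(byte[] p_buf) throws SQLException
{
    if (p_buf == null)
        return null;
    // at least one char per input byte; the buffer may still grow,
    // but only for data with many escaped bytes
    StringBuffer l_strbuf = new StringBuffer(p_buf.length);
    for (int i = 0; i < p_buf.length; i++)
    {
        int b = p_buf[i] & 0xFF;  // byte as unsigned value
        if (b == '\\')
        {
            l_strbuf.append("\\\\");  // escape the escape character
        }
        else if (b < 32 || b > 126)
        {
            // non-printable byte -> three-digit octal escape \OOO
            l_strbuf.append('\\');
            l_strbuf.append((char) ('0' + ((b >> 6) & 3)));
            l_strbuf.append((char) ('0' + ((b >> 3) & 7)));
            l_strbuf.append((char) ('0' + (b & 7)));
        }
        else
        {
            l_strbuf.append((char) b);  // printable 7-bit ASCII as-is
        }
    }
    return l_strbuf.toString();
}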
BlobPostgresTest.java
=====================

import java.sql.*;
import java.io.*;

/*
drop table blobdemo;
drop sequence blobdemo_id_seq;
create table blobdemo(
    id serial not null primary key,
    name varchar(50),
    content bytea);
*/
public final class BlobPostgresTest
{
    private final static String NAME = "TEST";
    //private final static String FILE_NAME = "/tmp/blob/2mb.xxx";
    private final static String FILE_NAME = "BlobPostgresTest.java";

    public final static void main(String[] args) throws Throwable
    {
        Connection con = null;
        Statement statement = null;
        PreparedStatement insertStatement = null;
        ResultSet rs = null;
        File file = null;
        FileInputStream fis = null;
        BufferedInputStream bis = null;
        try
        {
            System.out.println("BLOB/PostgreSQL Demo started");
            Class.forName("org.postgresql.Driver");
            con = DriverManager.getConnection
                ("jdbc:postgresql://localhost/template1", "postgres", "");
            con.setAutoCommit(true);
            statement = con.createStatement();

            // cleanup
            statement.executeUpdate("delete from blobdemo");

            // file
            file = new File(FILE_NAME);
            fis = new FileInputStream(file);
            bis = new BufferedInputStream(fis);

            // insert one record
            System.out.println("preparing statement");
            insertStatement = con.prepareStatement
                ("insert into blobdemo ( name, content ) values ( ?, ? )");
            insertStatement.setString(1, NAME);
            System.out.println("setting binary stream");
            insertStatement.setBinaryStream(2, bis, (int)file.length());
            System.out.println("executing update");
            insertStatement.executeUpdate();

            // retrieve
            rs = statement.executeQuery
                ("select id, name, content from blobdemo");
            while (rs.next())
            {
                System.out.println("id=" + rs.getObject(1));
                System.out.println("name=" + rs.getObject(2));
                byte[] bytes = rs.getBytes(3);
                String content = new String(bytes);
                //System.out.println(content);
                System.out.println("retrieved " + bytes.length + " bytes");
            }
        }
        finally
        {
            if (rs != null) rs.close();
            if (statement != null) statement.close();
            if (insertStatement != null) insertStatement.close();
            if (con != null) con.close();
            if (fis != null) fis.close();
            if (bis != null) bis.close();
        }
        System.out.println("BLOB/PostgreSQL Demo complete");
    }
}
Holger,

There certainly are areas where the driver can be improved, and this is probably one of them. But like many things, the solution isn't as easy as it first appears.

hhaag@gmx.de wrote:

>Mainly toPGString seems to be programmed inefficiently in terms of memory
>consumption, e.g. the string buffer allocation:
>
> public static String toPGString(byte[] p_buf) throws SQLException
> {
>     if (p_buf == null)
>         return null;
>     StringBuffer l_strbuf = new StringBuffer();
>
>should use a sensible initial size:
>
> public static String toPGString(byte[] p_buf) throws SQLException
> {
>     if (p_buf == null)
>         return null;
>     StringBuffer l_strbuf = new StringBuffer(p_buf.length);
>
>or even
>
>     StringBuffer l_strbuf = new StringBuffer(p_buf.length * 2);
>
>because of special characters, to avoid repeated re-allocation of the
>internal char[] array in the StringBuffer.

While "new StringBuffer(p_buf.length)" is probably an improvement, it is difficult to predict what size buffer you will really need, because depending on the data you will see anywhere from no expansion up to four-times expansion. The protocol postgres uses to talk between the client and server is string based, so the binary data needs to be encoded in an ASCII-safe way. The encoding for the bytea datatype is \OOO octal escaping, so each byte of data may take up to four bytes in the output. However, if the data is mostly printable 7-bit ASCII, there will be little expansion.

I think your idea of initializing the buffer to the size of the byte[] is a good one. I will apply that change unless someone has a better suggestion.

thanks,
--Barry

PS. It should also be possible not to prebuild the entire SQL statement in memory (which includes the entire escaped binary value), but to stream the value to the server. Thus instead of building the following String in memory:

  insert into table foo (bar, baz, boo) values (1000, '\123\432\543\567...
  continuing on for possibly megabytes...', 'boo value')

the driver should be able to send the pieces and stream the bytea value from the InputStream:

  send -> "insert into table foo (bar, baz, boo) values ("
  send -> 1000 (value for column bar)
  send -> ", "
  send -> (value for column baz by reading from the input stream here,
           encoding the results byte by byte and sending them)
  send -> ", "
  send -> "'boo value'" (value for column boo)
  send -> ")"

I am not sure how much work this will be, but it is probably the correct long-term solution. Any volunteers for looking into this and submitting a patch?
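(For illustration, a rough sketch of the streaming idea in the PS might look like the following. All class and method names here are made up, not the driver's internals, and the escaping is the simplified version from the sketch earlier in the thread.)

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Send the statement in pieces and encode the bytea column straight off
// the InputStream, so the escaped value is never materialized in memory.
public class StreamingInsertSketch
{
    static void streamInsert(OutputStream toServer, int bar,
                             InputStream baz, String booLiteral)
        throws IOException
    {
        send(toServer, "insert into foo (bar, baz, boo) values (");
        send(toServer, Integer.toString(bar));
        send(toServer, ", '");
        int b;
        while ((b = baz.read()) != -1)
        {
            // escape each byte, then write it out immediately
            send(toServer, escapeByte(b & 0xFF));
        }
        send(toServer, "', ");
        send(toServer, booLiteral);  // e.g. "'boo value'", already quoted
        send(toServer, ")");
        toServer.flush();
    }

    // \OOO octal escape, backslash doubled for the SQL string literal
    static String escapeByte(int v)
    {
        if (v == '\\' || v == '\'' || v < 32 || v > 126)
            return "\\\\" + ((v >> 6) & 3) + ((v >> 3) & 7) + (v & 7);
        return String.valueOf((char) v);
    }

    static void send(OutputStream out, String s) throws IOException
    {
        out.write(s.getBytes("US-ASCII"));
    }
}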
>While "new StringBuffer(p_buf.length)" is probably an improvement, it is
>difficult to predict what size buffer you will really need, because
>depending on the data you will see anywhere from no expansion up to
>four-times expansion. The protocol postgres uses to talk between the
>client and server is string based, so the binary data needs to be encoded
>in an ASCII-safe way. The encoding for the bytea datatype is \OOO octal
>escaping, so each byte of data may take up to four bytes in the output.
>However, if the data is mostly printable 7-bit ASCII, there will be
>little expansion.
>
>I think your idea of initializing the buffer to the size of the byte[]
>is a good one. I will apply that change unless someone has a better
>suggestion.

I think it's at least better than initializing the StringBuffer with the default capacity, which is 16. And as long as the StringBuffer is used only internally (as a local variable) in a private method, no other parts of the code should be affected. Of course you cannot predict the final size of the created string.

There are also other places where StringBuffer usage could be improved, in my opinion:

(1) org.postgresql.jdbc1.AbstractJdbc1Statement#setString()

  // Some performance caches
  private StringBuffer sbuf = new StringBuffer();
  ...

current:

  public void setString(int parameterIndex, String x) throws SQLException
  {
      ....
      synchronized (sbuf)
      {
          sbuf.setLength(0);

proposed:

  StringBuffer sbuf = new StringBuffer(x.length());

--> Use a local, non-synchronized variable and initialize the StringBuffer with a smart capacity. Please note that I have not fully explored the use of synchronized and the reuse of the StringBuffer, but as the synchronized keyword indicates, this variable will only be accessed by one thread at a time. Additionally, the contents of the StringBuffer are always discarded at the beginning of the method, so a local variable should be fine - and faster than a synchronized instance variable.

(2) org.postgresql.jdbc1.AbstractJdbc1Statement#compileQuery()

protected synchronized String compileQuery() throws SQLException
{
    sbuf.setLength(0);
    int i;

    if (isFunction && !returnTypeSet)
        throw new PSQLException("postgresql.call.noreturntype");
    if (isFunction)
    { // set entry 1 to dummy entry..
        inStrings[0] = ""; // dummy entry which ensured that no one overrode
        // and calls to setXXX (2,..) really went to first arg in a function call..
    }

    for (i = 0 ; i < inStrings.length ; ++i)
    {
        if (inStrings[i] == null)
            throw new PSQLException("postgresql.prep.param", new Integer(i + 1));
        sbuf.append(templateStrings[i]).append(inStrings[i]);
    }
    sbuf.append(templateStrings[inStrings.length]);
    return sbuf.toString();
}

In this case, too, the StringBuffer should be initialized with a smart capacity - something like the sum of the lengths of all strings to be appended (a sketch follows below). I'm a bit in a rush today, but I'll try to work out a proper algorithm in the next few days.
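(For illustration, such a capacity pre-pass over the same arrays might look like this; it is an untested sketch and the method name is made up.)

// Sketch (untested): compute the final length before appending, so the
// StringBuffer never has to grow. templateStrings and inStrings are the
// same arrays compileQuery() walks over.
private static String compileQuerySketch(String[] templateStrings,
                                         String[] inStrings)
{
    int capacity = templateStrings[inStrings.length].length();
    for (int i = 0; i < inStrings.length; ++i)
        capacity += templateStrings[i].length() + inStrings[i].length();

    StringBuffer sbuf = new StringBuffer(capacity);
    for (int i = 0; i < inStrings.length; ++i)
        sbuf.append(templateStrings[i]).append(inStrings[i]);
    sbuf.append(templateStrings[inStrings.length]);
    return sbuf.toString();
}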
>PS. It should also be possible not to prebuild the entire SQL statement
>in memory (which includes the entire escaped binary value), but to
>stream the value to the server. Thus instead of building the following
>String in memory:
>
>  insert into table foo (bar, baz, boo) values (1000, '\123\432\543\567...
>  continuing on for possibly megabytes...', 'boo value')
>
>the driver should be able to send the pieces and stream the bytea value
>from the InputStream:
>
>  send -> "insert into table foo (bar, baz, boo) values ("
>  send -> 1000 (value for column bar)
>  send -> ", "
>  send -> (value for column baz by reading from the input stream here,
>           encoding the results byte by byte and sending them)
>  send -> ", "
>  send -> "'boo value'" (value for column boo)
>  send -> ")"

Of course the prebuilding of the statement is the root of all the trouble. With all the escaping that has to be done, the arrays will always build up in memory. The organization of the source is pretty good in terms of OOP, I think, because the responsibilities are easy to understand. However, this approach is only workable for regular types (string, int etc.) or small BLOBs.

The "perfect" way would be to stream the BLOB (which is passed as a stream to setBinaryStream) directly to the db server. Escaping could be done by nesting streams (a sketch follows below); in that case, no arrays would be needed at all. While probably being the best way, this would imply a fairly big overhaul of the code.

>I am not sure how much work this will be, but it is probably the correct
>long-term solution. Any volunteers for looking into this and submitting
>a patch?

I would be interested in doing/helping this, but as I am not familiar with the PostgreSQL JDBC source code, I'd have to work together with somebody who really knows what's going on.
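(To make the "nesting streams" idea concrete: a sketch of an escaping filter stream, with an illustrative class name and the same simplified escaping as the earlier sketches - not actual driver code. setBinaryStream() would wrap the connection's output stream in this filter and simply copy the user's InputStream into it.)

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// The escaped value never exists as one big array: each byte is escaped
// and passed through to the underlying stream as it is written.
public class ByteaEscapingOutputStream extends FilterOutputStream
{
    public ByteaEscapingOutputStream(OutputStream out)
    {
        super(out);
    }

    // FilterOutputStream routes array writes through write(int),
    // so overriding this one method is enough for a sketch.
    public void write(int b) throws IOException
    {
        int v = b & 0xFF;
        if (v == '\\' || v == '\'' || v < 32 || v > 126)
        {
            // \OOO octal escape, backslash doubled for the SQL literal
            out.write('\\');
            out.write('\\');
            out.write('0' + ((v >> 6) & 3));
            out.write('0' + ((v >> 3) & 7));
            out.write('0' + (v & 7));
        }
        else
        {
            out.write(v);
        }
    }
}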
There might be a workaround: as of JDBC 2.0, the interface java.sql.Blob allows BLOBs to be manipulated. AbstractJdbc2Statement.setBlob() (code posted below) seems to create the large object up front and store it in the database. The actual INSERT command then only contains the OID, which avoids all the memory problems.

I'll let you know how it goes.

public void setBlob(int i, Blob x) throws SQLException
{
    InputStream l_inStream = x.getBinaryStream();
    int l_length = (int) x.length();
    LargeObjectManager lom = connection.getLargeObjectAPI();
    int oid = lom.create();
    LargeObject lob = lom.open(oid);
    OutputStream los = lob.getOutputStream();
    try
    {
        // could be buffered, but then the OutputStream returned by LargeObject
        // is buffered internally anyhow, so there would be no performance
        // boost gained, if anything it would be worse!
        int c = l_inStream.read();
        int p = 0;
        while (c > -1 && p < l_length)
        {
            los.write(c);
            c = l_inStream.read();
            p++;
        }
        los.close();
    }
    catch (IOException se)
    {
        throw new PSQLException("postgresql.unusual", se);
    }
    // lob is closed by the stream so don't call lob.close()
    setInt(i, oid);
}
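(A sketch of what the calling side might look like on that route, under two assumptions worth flagging: large object access must run inside a transaction, and the cast used to reach getLargeObjectAPI() depends on the driver version.)

import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

public class LargeObjectInsertSketch
{
    // assumes: create table blobdemo(name varchar(50), content oid)
    static void insertViaLargeObject(Connection con, String name,
                                     InputStream in) throws Exception
    {
        con.setAutoCommit(false);  // large objects require a transaction

        LargeObjectManager lom =
            ((org.postgresql.Connection) con).getLargeObjectAPI();
        int oid = lom.create();
        LargeObject lob = lom.open(oid);
        OutputStream los = lob.getOutputStream();

        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1)
            los.write(buf, 0, n);  // fixed-size buffer, no big array
        los.close();

        PreparedStatement ps = con.prepareStatement
            ("insert into blobdemo ( name, content ) values ( ?, ? )");
        ps.setString(1, name);
        ps.setInt(2, oid);  // only the OID goes into the row
        ps.executeUpdate();
        ps.close();
        con.commit();
    }
}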
Make sure you have read and understood the limitations in postgres of this approach. The java.sql.Blob support in the postgres JDBC driver uses the postgres LargeObject API. There is a section in the postgres JDBC documentation that explains the pros and cons of the LargeObject API as compared to using the bytea datatype. If you can live with the limitations, then this is a fine solution.

thanks,
--Barry

hhaag@gmx.de wrote:

>There might be a workaround: as of JDBC 2.0, the interface java.sql.Blob
>allows BLOBs to be manipulated. AbstractJdbc2Statement.setBlob() (code
>posted below) seems to create the large object up front and store it in
>the database. The actual INSERT command then only contains the OID, which
>avoids all the memory problems.
>
>I'll let you know how it goes.
>
> public void setBlob(int i, Blob x) throws SQLException
> {
>     InputStream l_inStream = x.getBinaryStream();
>     int l_length = (int) x.length();
>     LargeObjectManager lom = connection.getLargeObjectAPI();
>     int oid = lom.create();
>     LargeObject lob = lom.open(oid);
>     OutputStream los = lob.getOutputStream();
>     try
>     {
>         // could be buffered, but then the OutputStream returned by LargeObject
>         // is buffered internally anyhow, so there would be no performance
>         // boost gained, if anything it would be worse!
>         int c = l_inStream.read();
>         int p = 0;
>         while (c > -1 && p < l_length)
>         {
>             los.write(c);
>             c = l_inStream.read();
>             p++;
>         }
>         los.close();
>     }
>     catch (IOException se)
>     {
>         throw new PSQLException("postgresql.unusual", se);
>     }
>     // lob is closed by the stream so don't call lob.close()
>     setInt(i, oid);
> }
hhaag@gmx.de wrote:

>I think it's at least better than initializing the StringBuffer with the
>default capacity, which is 16. And as long as the StringBuffer is used only
>internally (as a local variable) in a private method, no other parts of the
>code should be affected. Of course you cannot predict the final size of the
>created string.

I agree. I will check in a patch for this later today.

>There are also other places where StringBuffer usage could be improved, in
>my opinion:
>
>(1) org.postgresql.jdbc1.AbstractJdbc1Statement#setString()
>
>  // Some performance caches
>  private StringBuffer sbuf = new StringBuffer();
>  ...
>
>current:
>
>  public void setString(int parameterIndex, String x) throws SQLException
>  {
>      ....
>      synchronized (sbuf)
>      {
>          sbuf.setLength(0);
>
>proposed:
>
>  StringBuffer sbuf = new StringBuffer(x.length());
>
>--> Use a local, non-synchronized variable and initialize the StringBuffer
>with a smart capacity.

The member variable sbuf is there to solve the same problem. Instead of creating a new StringBuffer every time you call setString() on a PreparedStatement, it reuses the one object, leading to less object creation and less memory usage. However, as above, it probably makes sense not to use the default size, and to call ensureCapacity() to resize it before use. I will check in fixes for this today as well.

>(2) org.postgresql.jdbc1.AbstractJdbc1Statement#compileQuery()
>
>protected synchronized String compileQuery() throws SQLException
>{
>    sbuf.setLength(0);
>    int i;
>
>    if (isFunction && !returnTypeSet)
>        throw new PSQLException("postgresql.call.noreturntype");
>    if (isFunction)
>    { // set entry 1 to dummy entry..
>        inStrings[0] = ""; // dummy entry which ensured that no one overrode
>        // and calls to setXXX (2,..) really went to first arg in a function call..
>    }
>
>    for (i = 0 ; i < inStrings.length ; ++i)
>    {
>        if (inStrings[i] == null)
>            throw new PSQLException("postgresql.prep.param", new Integer(i + 1));
>        sbuf.append(templateStrings[i]).append(inStrings[i]);
>    }
>    sbuf.append(templateStrings[inStrings.length]);
>    return sbuf.toString();
>}
>
>In this case, too, the StringBuffer should be initialized with a smart
>capacity.

This one is a little tricky. I am open to suggestions on what would be reasonable.

thanks,
--Barry
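(A small self-contained sketch of the reuse-plus-ensureCapacity pattern described above; the class and method names are made up, and real code would also escape quotes and backslashes in the value.)

// Reuse one shared buffer across calls, but resize it up front so it
// never grows append by append.
public class SetStringSketch
{
    private final StringBuffer sbuf = new StringBuffer();

    public String quote(String x)
    {
        synchronized (sbuf)
        {
            sbuf.setLength(0);
            sbuf.ensureCapacity(x.length() + 2);  // value plus two quotes
            sbuf.append('\'');
            sbuf.append(x);  // real code would escape ' and \ here
            sbuf.append('\'');
            return sbuf.toString();
        }
    }
}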
hhaag@gmx.de wrote:

>>I am not sure how much work this will be, but it is probably the correct
>>long-term solution. Any volunteers for looking into this and submitting
>>a patch?
>
>I would be interested in doing/helping this, but as I am not familiar with
>the PostgreSQL JDBC source code, I'd have to work together with somebody who
>really knows what's going on.

Feel free to contact me with any questions you may have. I am always willing to help others get up to speed with the source.

thanks,
--Barry