Thread: ps.setCharacterStream() and memory usage
Hi,

Coming back to another problem I have with the following insert statement (Postgres 7.4, driver build 215, JDBC 3):

    ps.setCharacterStream(1, reader, (int) messageFile.length());
    ps.executeUpdate();

The reason I do this (using a character stream to load a text file into a TEXT field) is that I wish to avoid loading the file into memory completely. I already noticed that the postgres driver does not stream the data to the backend (I don't know if postgres actually supports this). What is worse is that the file actually gets copied THREE times into memory, causing my Java application to get a java.lang.OutOfMemoryError on an insert of a 10M text file (Java allocates 20M for the text, since it uses 2-byte chars, and the driver makes (at least) 3 copies that are referenced at the same time, so none of them can be GC'ed, making 60M).

First of all, in setCharacterStream() the driver loads the file into a char array. This is then copied into a String and passed to setString(). (Here one could already let the char[] go out of scope, at least allowing it to be cleaned up.) Secondly, setString() causes another copy when escapeString() is called. (Ideally, one could read the stream into a StringBuffer big enough to allow it to be escaped, and thereby load the file into memory only once instead of three times.)

Finally, the driver (possibly) keeps large objects in memory too long: for example, in setString(), right after the bind, one can already do a "sbuf.setLength(0)". This way, one does not have to wait for setString() (or another setter) to be called before the sbuf variable is cleared.

Greetings,
Sebastiaan van Erk
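For reference, a minimal sketch of the usage pattern described above. The table name ("messages"), column name ("body"), connection URL and file name are invented for illustration; only the setCharacterStream()/executeUpdate() lines come from the report itself.

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CharacterStreamInsert {
        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver");

            File messageFile = new File("message.txt");   // e.g. a ~10M text file
            Reader reader = new FileReader(messageFile);

            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "password");
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO messages (body) VALUES (?)");

            // File#length() is in bytes; it is used here only as a length hint.
            ps.setCharacterStream(1, reader, (int) messageFile.length());
            ps.executeUpdate();

            ps.close();
            reader.close();
            conn.close();
        }
    }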
Sebastiaan van Erk wrote:

> ps.setCharacterStream(1, reader, (int) messageFile.length());
> ps.executeUpdate();
>
> I already noticed that the postgres driver does not stream the data to
> the backend (I don't know if postgres actually supports this).

Character streams are going to be hard to support, as we need to know the length, in bytes, of the parameter before we start sending it to the backend. With a character stream the number of bytes is unpredictable, because the data is being encoded into UTF-8, which uses a variable number of bytes per character. Binary streams we can stream, and the development driver already does so without intermediate copies.

> First of all, in setCharacterStream() it loads the file into a char array.
> Then this is copied into a String and passed to setString(). (Here one can
> already cause the char[] to go out of scope, at least allowing it to be
> cleaned up.) Secondly, setString() causes another copy in escapeString().
> [...]

I suggest you look at the current development driver before looking at solutions in this area. The parameter storage mechanisms have changed substantially compared to the stable driver. It should be much easier to, for example, encode the stream directly into a byte array and then write from that array to the server on demand.

-O
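To illustrate why the byte length is unpredictable: the UTF-8 byte count of a string depends on which characters it contains, not just on how many there are. A small stand-alone example (not driver code):

    public class Utf8LengthDemo {
        public static void main(String[] args) throws Exception {
            String ascii    = "hello";          // 5 chars, 1 byte each in UTF-8
            String accented = "h\u00e9llo";     // 5 chars, but the accented char is 2 bytes
            String cjk      = "h\u4e16\u754c";  // 3 chars, the CJK chars are 3 bytes each

            System.out.println(ascii.getBytes("UTF-8").length);    // 5
            System.out.println(accented.getBytes("UTF-8").length); // 6
            System.out.println(cjk.getBytes("UTF-8").length);      // 7
        }
    }

So the byte length of a Reader's contents can only be known after encoding all of it, which is why the driver ends up buffering the whole value.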
On Friday 29 October 2004 07:56, Sebastiaan van Erk wrote:

> Coming back to another problem I have with the following insert
> statement: (Postgres 7.4, driver build 215, JDBC 3)
>
> ps.setCharacterStream(1, reader, (int) messageFile.length());
> ps.executeUpdate();
>
> The reason I do this (using a character stream to load a text file
> to a TEXT field) is that I wish to avoid loading the file into
> memory completely.
[...]
> [...] the file actually gets copied THREE times into memory, causing
> my Java application to get a java.lang.OutOfMemoryError on an
> insert of a 10M text file (Java allocates 20M for this, since it has
> 2-byte chars, and the driver makes (at least) 3 copies (referenced
> at the same time, not allowing one to be GC'ed), making 60M).

I assume you're talking about org/postgresql/jdbc2/AbstractJdbc2Statement.java, which has this piece of code (slightly reformatted to fit on an 80-column page):

    public void setCharacterStream(int i, Reader x, int length) throws SQLException {
        checkClosed();
        if (length < 0)
            throw new PSQLException(GT.tr("Invalid stream length {0}.",
                                          new Integer(length)));

        if (connection.haveMinimumCompatibleVersion("7.2")) {
            char[] l_chars = new char[length];
            int l_charsRead = 0;
            try {
                while (true) {
                    int n = x.read(l_chars, l_charsRead, length - l_charsRead);
                    if (n == -1)
                        break;

                    l_charsRead += n;

                    if (l_charsRead == length)
                        break;
                }
            } catch (IOException l_ioe) {
                throw new PSQLException(GT.tr("Provided Reader failed."),
                                        PSQLState.UNEXPECTED_ERROR, l_ioe);
            }

            setString(i, new String(l_chars, 0, l_charsRead));
        } else {
        }
    }

This does appear to be optimized for reading smallish chunks of text. If the character source behind the Reader is sufficiently large, I believe we'd be better off building up a StringBuffer rather than a character array. Something like this:

    if (connection.haveMinimumCompatibleVersion("7.2")) {
        StringBuffer sb = new StringBuffer(length);
        BufferedReader br = new BufferedReader(x);
        try {
            while (true) {
                String chunk = br.readLine();
                if (chunk == null) {
                    break;
                } else {
                    sb.append(chunk);
                }
            }
        } catch (IOException l_ioe) {
            throw new PSQLException(GT.tr("Provided Reader failed."),
                                    PSQLState.UNEXPECTED_ERROR, l_ioe);
        }
        setString(i, sb.toString());
    }

The reason this is better is because

(a) the String(char[],int,int) constructor always defensively copies
    the passed-in character array. So, as you point out, if you
    pass in a 10-million-character array, a new 10-million-character
    array will be allocated.

(b) in contrast, if you construct a String from a StringBuffer, they
    can share the underlying character array:
    http://www.google.com/search?q=Heinz+Kabutz+StringBuffer+Issue+068&btnI
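One caveat with the readLine() variant above: BufferedReader.readLine() strips line terminators, so the text stored in the database would lose its newlines. A sketch of the same single-copy idea that preserves them, reading through a small reusable char buffer (intended as a drop-in for the block above, with the same surrounding declarations assumed):

    StringBuffer sb = new StringBuffer(length);
    char[] buf = new char[8 * 1024];
    try {
        while (true) {
            int n = x.read(buf);
            if (n == -1) {
                break;
            }
            sb.append(buf, 0, n);   // append only the chars actually read
        }
    } catch (IOException l_ioe) {
        throw new PSQLException(GT.tr("Provided Reader failed."),
                                PSQLState.UNEXPECTED_ERROR, l_ioe);
    }
    setString(i, sb.toString());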
Vadim Nasardinov wrote:

> I assume you're talking about
> org/postgresql/jdbc2/AbstractJdbc2Statement.java, which has this piece
> of code (slightly reformatted to fit on an 80-column page):

[snip]

> This does appear to be optimized for reading smallish chunks of text.
> If the character source behind the Reader is sufficiently large, I
> believe we'd be better off building up a StringBuffer rather than a
> character array. Something like this:
>
>     if (connection.haveMinimumCompatibleVersion("7.2")) {
>         StringBuffer sb = new StringBuffer(length);
>         BufferedReader br = new BufferedReader(x);
>         try {
>             while (true) {
>                 String chunk = br.readLine();
>                 if (chunk == null) {
>                     break;
>                 } else {
>                     sb.append(chunk);
>                 }
>             }
>         } catch (IOException l_ioe) {
>             throw new PSQLException(GT.tr("Provided Reader failed."),
>                                     PSQLState.UNEXPECTED_ERROR, l_ioe);
>         }
>         setString(i, sb.toString());
>     }
>
> The reason this is better is because
>
> (a) the String(char[],int,int) constructor always defensively copies
>     the passed-in character array. So, as you point out, if you
>     pass in a 10-million-character array, a new 10-million-character
>     array will be allocated.
>
> (b) in contrast, if you construct a String from a StringBuffer, they
>     can share the underlying character array:
>     http://www.google.com/search?q=Heinz+Kabutz+StringBuffer+Issue+068&btnI

This is indeed part of the code I was talking about. And indeed a StringBuffer is already a definite improvement.

However, the setString() method will also cause the data to be duplicated when escapeString() is called. A version of escapeString() which works on a StringBuffer would be another improvement in my opinion, saving (if the allocated StringBuffer is made a bit larger to allow for the escaping), in most cases, another allocation of the whole string.

Greetings,
Sebastiaan van Erk
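A rough sketch of what an in-place escaping pass over a StringBuffer could look like. This assumes the escaping amounts to doubling single quotes and backslashes; it is an illustration of the idea, not the driver's actual escapeString() code:

    // Escape in place instead of building a second full-size String.
    private static void escapeInPlace(StringBuffer sb) {
        for (int i = 0; i < sb.length(); i++) {
            char c = sb.charAt(i);
            if (c == '\'' || c == '\\') {
                sb.insert(i, c);  // double the character in place
                i++;              // skip over the copy we just inserted
            }
        }
    }

Each insert() shifts the tail of the buffer, so this trades some CPU for memory; the point is that if the buffer is allocated with a little headroom, no second full-size copy of the data is ever needed.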
On Friday 29 October 2004 07:56, Sebastiaan van Erk wrote:

> ps.setCharacterStream(1, reader, (int) messageFile.length());
> ps.executeUpdate();
>
> The reason I do this (using a character stream to load a text file
> to a TEXT field) is that I wish to avoid loading the file into
> memory completely.

By doing this, you may overallocate. Note that File#length() [1] returns the length of the file in _bytes_. AbstractJdbc2Statement#setCharacterStream uses the passed in number to allocate a _character_ array of that length. If your file is encoded in, say, UTF-16, its byte length is twice its character length. Half of the character array allocated by the setCharacterStream method is wasted in this case.

Vadim

Footnotes
1. http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#length%28%29
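A small stand-alone illustration of the mismatch (the file name and encoding are invented for the example):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class ByteVsCharLength {
        public static void main(String[] args) throws Exception {
            File f = new File("utf16.txt");
            Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-16BE");
            String text = "ten chars!";                            // 10 characters
            w.write(text);
            w.close();

            System.out.println("characters: " + text.length());   // 10
            System.out.println("bytes:      " + f.length());      // 20
        }
    }

Passing f.length() to setCharacterStream() here would allocate a char[20] for only 10 characters of data.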
Sebastiaan van Erk wrote:

> This is indeed part of the code I was talking about. And indeed a
> StringBuffer is already a definite improvement.
>
> However, the setString() method will also cause the data to be duplicated
> when escapeString() is called. A version of escapeString() which works
> on a StringBuffer would be another improvement in my opinion, saving
> (if the allocated StringBuffer is made a bit larger to allow for the
> escaping), in most cases, another allocation of the whole string.

I would point out again that this is all different in the current development driver. escapeString() no longer exists, and the V3 protocol path does not do any additional string escaping, as the string is passed directly as a parameter in a Bind message.

There's still much scope for improvement, but I'd strongly suggest you use the development driver as a starting point.

-O
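For readers unfamiliar with the V3 protocol: in a Bind message the parameter value travels as a length-prefixed block of bytes, separate from the SQL text, which is why no string escaping is needed on that path. A rough sketch of the message shape (an illustration based on the documented protocol layout, not actual driver code):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class BindMessageSketch {
        // Build a Bind message carrying one text-format parameter value.
        static byte[] bind(String portal, String statement, byte[] paramValue)
                throws IOException {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(body);

            writeCString(out, portal);       // destination portal name
            writeCString(out, statement);    // source prepared statement name
            out.writeShort(0);               // no parameter format codes => text format
            out.writeShort(1);               // one parameter value follows
            out.writeInt(paramValue.length); // length prefix for the value
            out.write(paramValue);           // the raw value bytes, no escaping
            out.writeShort(0);               // no result-column format codes

            byte[] payload = body.toByteArray();

            ByteArrayOutputStream msg = new ByteArrayOutputStream();
            DataOutputStream header = new DataOutputStream(msg);
            header.writeByte('B');               // message type: Bind
            header.writeInt(payload.length + 4); // length field counts itself
            header.write(payload);
            return msg.toByteArray();
        }

        private static void writeCString(DataOutputStream out, String s)
                throws IOException {
            out.write(s.getBytes("UTF-8"));
            out.writeByte(0);                    // null terminator
        }
    }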