Thread: ps.setCharacterStream() and memory usage

ps.setCharacterStream() and memory usage

From

Sebastiaan van Erk

Date:

29 October 2004, 20:56:30

Hi,

Coming back to another problem I have with the following insert statement:
(Postgres 7.4, driver build 215, jdbc 3)

   ps.setCharacterStream(1, reader, (int) messageFile.length());
   ps.executeUpdate();

The reason I do this (using a character stream to load a text file to a
TEXT field) is that I wish to avoid loading the file into memory completely.

I already noticed that the postgres driver does not stream the data to
the backend (I don't know if postgres actually supports this).

But what is worse, is that the file actually gets copied THREE times
into memory, causing my Java file to already get a java.lang.OutOfMemory
error with an insert of a 10M text file (Java allocates 20M for this,
since it has 2 byte chars, and the driver makes (at least) 3 copies
(referenced at the same time, not allowing one to be GC'ed), making 60M).

First of all, in setCharacterStream() it loads the file into a char array.
Then this is cloned into a String and passed to setString. (Here one can
already cause the char[] to go out of scope, at least allowing it to be
cleaned up). Secondly, setString causes another clone in escapeString().
(Ideally, one could read the stream into a StringBuffer big enough to allow
it to be escaped, and thereby only load the file into memory once instead
of three times). Finally, the driver (possibly) keeps large objects in
memory too long: for example, in setString() right after the bind, one can
already do a "sbuf.setLength(0)". This way, one does not have to wait
for setString (or another setter) to be called before the sbuf variable
is cleared.

Greetings,
Sebastiaan van Erk

Re: ps.setCharacterStream() and memory usage

From

Oliver Jowett

Date:

29 October 2004, 22:29:19

Sebastiaan van Erk wrote:

>    ps.setCharacterStream(1, reader, (int) messageFile.length());
>    ps.executeUpdate();

> I already noticed that the postgres driver does not stream the data to
> the backend (I don't know if postgres actually supports this).

Character streams are going to be hard to support as we need to know the
length, in bytes, of the parameter before we start sending it to the
backend. With a character stream the number of bytes is unpredictable
because it's being encoded into UTF8, which has a variable number of
bytes per character.

Binary streams we can stream, and the development driver already does so
without intermediate copies.

> First of all, in setCharacterStream() it loads the file into a char array.
> Then this is cloned into a String and passed to setString. (Here one can
> already cause the char[] to go out of scope, at least allowing it to be
> cleaned up). Secondly, setString causes another clone in escapeString(). [...]

I suggest you look at the current development driver before looking at
solutions in this area. The parameter storage mechanisms have changed
substantially compared to the stable driver. It should be much easier
to, for example, encode the stream directly into a bytearray, then write
from that array to the server on demand.

-O

Re: ps.setCharacterStream() and memory usage

From

Vadim Nasardinov

Date:

02 November 2004, 14:40:15

On Friday 29 October 2004 07:56, Sebastiaan van Erk wrote:
> Coming back to another problem I have with the following insert
> statement: (Postgres 7.4, driver build 215, jdbc 3)
>
>    ps.setCharacterStream(1, reader, (int) messageFile.length());
>    ps.executeUpdate();
>
> The reason I do this (using a character stream to load a text file
> to a TEXT field) is that I wish to avoid loading the file into
> memory completely.
[...]
> [...] the file actually gets copied THREE times into memory, causing
> my Java file to already get a java.lang.OutOfMemory error with an
> insert of a 10M text file (Java allocates 20M for this, since it has
> 2 byte chars, and the driver makes (at least) 3 copies (referenced
> at the same time, not allowing one to be GC'ed), making 60M).

I assume you're talking about
org/postgresql/jdbc2/AbstractJdbc2Statement.java which has this piece
of code (slightly reformatted to fit on a 80-column page):


    public void setCharacterStream(int i, Reader x, int length)
        throws SQLException {

        checkClosed();
        if (length < 0)
            throw new PSQLException(GT.tr("Invalid stream length {0}.",
                                          new Integer(length)));

        if (connection.haveMinimumCompatibleVersion("7.2")) {
            char[] l_chars = new char[length];
            int l_charsRead = 0;
            try {
                while(true) {
                    int n = x.read(l_chars,
                                   l_charsRead,
                                   length - l_charsRead);
                    if (n == -1)
                        break;
                    l_charsRead += n;
                    if (l_charsRead == length)
                        break;
                }
            } catch (IOException l_ioe) {
                throw new PSQLException(GT.tr("Provided Reader failed."),
                                        PSQLState.UNEXPECTED_ERROR, l_ioe);
            }
            setString(i, new String(l_chars, 0, l_charsRead));
        } else {

        }
    }


This does appear to be optimized for reading smallish chunks of text.
If the character source behind the Reader is sufficiently large, I
believe we'd be better off building up a StringBuffer rather than a
character array.  Something like this:

    if (connection.haveMinimumCompatibleVersion("7.2")) {
        StringBuffer sb = new StringBuffer(length);
        BufferedReader br = new BufferedReader(x);
        try {
            while(true) {
                String chunk = br.readLine();
                if (chunk == null) { break; }
                else {sb.append(chunk); }
            }
        } catch (IOException l_ioe) {
            throw new PSQLException(GT.tr("Provided Reader failed."),
                                    PSQLState.UNEXPECTED_ERROR, l_ioe);
        }
        setString(i, sb.toString());
    }

The reason this is better is because

  (a) the String(char[],int,int) constructor always defensively copies
      the passed in character array.  So, as you point out, if you
      pass in a 10-million-character array, a new 10-million character
      array will be allocated.

  (b) in contrast, if you construct a String from a StringBuffer, they
      can share the underlying character array:
      http://www.google.com/search?q=Heinz+Kabutz+StringBuffer+Issue+068&btnI

Re: ps.setCharacterStream() and memory usage

From

Sebastiaan van Erk

Date:

02 November 2004, 16:19:54

Vadim Nasardinov wrote:

> I assume you're talking about
> org/postgresql/jdbc2/AbstractJdbc2Statement.java which has this piece
> of code (slightly reformatted to fit on a 80-column page):

[snip]

> This does appear to be optimized for reading smallish chunks of text.
> If the character source behind the Reader is sufficiently large, I
> believe we'd be better off building up a StringBuffer rather than a
> character array.  Something like this:
>
>     if (connection.haveMinimumCompatibleVersion("7.2")) {
>         StringBuffer sb = new StringBuffer(length);
>         BufferedReader br = new BufferedReader(x);
>         try {
>             while(true) {
>                 String chunk = br.readLine();
>                 if (chunk == null) { break; }
>                 else {sb.append(chunk); }
>             }
>         } catch (IOException l_ioe) {
>             throw new PSQLException(GT.tr("Provided Reader failed."),
>                                     PSQLState.UNEXPECTED_ERROR, l_ioe);
>         }
>         setString(i, sb.toString());
>     }
>
> The reason this is better is because
>
>   (a) the String(char[],int,int) constructor always defensively copies
>       the passed in character array.  So, as you point out, if you
>       pass in a 10-million-character array, a new 10-million character
>       array will be allocated.
>
>   (b) in contrast, if you construct a String from a StringBuffer, they
>       can share the underlying character array:
>       http://www.google.com/search?q=Heinz+Kabutz+StringBuffer+Issue+068&btnI

This is indeed part of the code I was talking about. And indeed a
StringBuffer is already a definate improvement.

However, the setString() method will also cause the data to be duplicated
when escapeString() is called. A version of escapeString() which works
on a StringBuffer would be another improvement in my opinion, saving
(if the allocated StringBuffer is made a bit larger to allow for the
escaping), in most cases another allocation of the whole string.

Greetings,
Sebastiaan van Erk

Re: ps.setCharacterStream() and memory usage

From

Vadim Nasardinov

Date:

02 November 2004, 16:49:26

On Friday 29 October 2004 07:56, Sebastiaan van Erk wrote:
>    ps.setCharacterStream(1, reader, (int) messageFile.length());
>    ps.executeUpdate();
>
> The reason I do this (using a character stream to load a text file
> to a TEXT field) is that I wish to avoid loading the file into
> memory completely.

By doing this, you may overallocate.  Note that File#length() [1]
returns the length of the file in _bytes_.
AbstractJdbc2Statement#setCharacterStream uses the passed in number to
allocate a _character_ array of that length.  If your file is encoded
in, say, UTF-16, its byte length is twice its character length.  Half
of the character array allocated by the setCharacterStream method is
wasted in this case.

Vadim

Footnotes

  1. http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#length%28%29

Re: ps.setCharacterStream() and memory usage

From

Oliver Jowett

Date:

02 November 2004, 19:44:20

Sebastiaan van Erk wrote:

> This is indeed part of the code I was talking about. And indeed a
> StringBuffer is already a definate improvement.
>
> However, the setString() method will also cause the data to be duplicated
> when escapeString() is called. A version of escapeString() which works
> on a StringBuffer would be another improvement in my opinion, saving
> (if the allocated StringBuffer is made a bit larger to allow for the
> escaping), in most cases another allocation of the whole string.

I would point out again that this is all different in the current
development driver. escapeString() no longer exists, and the V3 protocol
path does not do any additional string escaping as the string is passed
directly as a parameter in a Bind message.

There's still much scope for improvement, but I'd strongly suggest you
use the development driver as a starting point.

-O