Thread: Inserting large BLOBs via JDBC - OutOfMemoryError

Inserting large BLOBs via JDBC - OutOfMemoryError

From
hhaag@gmx.de
Date:
(apologies if I'm double-posting, but I assume my first mail did not go
through because I was using the wrong mail account)

Hi all,

I am facing a problem when writing BLOBs via JDBC.

Small BLOBs work, but with BLOBs of a certain size my code throws a
java.lang.OutOfMemoryError.
The file I tried is about 2-3 MB.


Environment:
- PostgreSQL 7.2 on i686-pc-linux-gnu, compiled by GCC 2.96
- Mandrake 8.2
- 192 MB RAM
- blob column created as datatype "bytea" (maybe that is the problem)
- code posted below (BlobPostgresTest.java)
- console output posted below
- java command to run program:
    java -verbose:gc -Xmx128m -classpath .:pgjdbc2.jar BlobPostgresTest
- java version as follows from "java -version" command:
    java version "1.4.0_01"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_01-b03)
    Java HotSpot(TM) Client VM (build 1.4.0_01-b03, mixed mode)


To me it seems that the implementation of setBinaryStream() could be
improved.
Details follow below.

Let me know what you think.
Holger



Console output (including GC activities):
=========================================

java -verbose:gc -Xmx128m -classpath .:pgjdbc2.jar BlobPostgresTest

BLOB/PostgreSQL Demo started
[GC 511K->230K(1984K), 0.0163708 secs]
preparing statement
setting binary stream
[GC 402K->273K(1984K), 0.0092325 secs]
[Full GC 273K->273K(1984K), 0.0771294 secs]
[GC 4351K->4058K(4908K), 0.0046329 secs]
[Full GC 4058K->3770K(4908K), 0.0831070 secs]
[GC 4922K->4922K(6860K), 0.0009556 secs]
[Full GC 4922K->4346K(6860K), 0.0890570 secs]
[GC 6650K->6650K(7820K), 0.0011110 secs]
[Full GC 6650K->5496K(7820K), 0.1420860 secs]
[GC 10104K->10104K(14480K), 0.0012827 secs]
[Full GC 10104K->7800K(14480K), 0.1236744 secs]
[GC 17016K->17016K(24084K), 0.0015421 secs]
[Full GC 17016K->12408K(24084K), 0.1688843 secs]
[GC 30840K->30840K(43224K), 0.0020800 secs]
[Full GC 30840K->21624K(43224K), 0.2547274 secs]
[GC 76920K->74616K(81500K), 0.0041685 secs]
[Full GC 74616K->49272K(81500K), 0.5688448 secs]
[GC 67704K->67704K(88332K), 0.0033407 secs]
[Full GC 67704K->58488K(88332K), 0.2558231 secs]
executing update
[GC 95352K->95352K(104844K), 0.0932741 secs]
[Full GC 95352K->40056K(104844K), 0.9644251 secs]
[GC 69245K->69245K(104844K), 0.0036631 secs]
[Full GC 69245K->69245K(104844K), 0.0814962 secs]
[Full GC 69245K->66324K(129728K), 1.1439123 secs]
Exception in thread "main" java.lang.OutOfMemoryError


As you can see, memory consumption explodes once setBinaryStream() is
executed, and the allocated memory exceeds the 128 MB maximum heap (the whole
thing fails with a maximum heap of 192 MB as well); the final OutOfMemoryError
is thrown while executeUpdate() is running.

-> My assumption is that setBinaryStream() is allocating a huge amount of
memory.

Looking into the PostgreSQL source code, I find

org.postgresql.jdbc1.AbstractJdbc1Statement#setBinaryStream
org.postgresql.util.PGbytea#toPGString

Mainly, toPGString seems to be programmed inefficiently in terms of memory
consumption, e.g.

the string buffer allocation

    public static String toPGString(byte[] p_buf) throws SQLException
    {
        if (p_buf == null)
            return null;
        StringBuffer l_strbuf = new StringBuffer();


should be given a sensible initial size:


    public static String toPGString(byte[] p_buf) throws SQLException
    {
        if (p_buf == null)
            return null;
        StringBuffer l_strbuf = new StringBuffer(p_buf.length);

or even

        StringBuffer l_strbuf = new StringBuffer(p_buf.length*2);

to account for special characters.


The point of the initial size is to avoid repeated reallocation of the internal
char[] array in the StringBuffer.
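
For a sense of scale (assuming the roughly-doubling growth strategy of Sun's
StringBuffer implementation): growing from the default capacity of 16 to the
several million characters produced for a 2-3 MB file takes on the order of 20
reallocations, each of which copies the entire contents and briefly keeps two
copies of the data alive.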



BlobPostgresTest.java
=====================

import java.sql.*;
import java.io.*;

/*

drop table blobdemo;

drop sequence blobdemo_id_seq;

create table blobdemo(
  id serial not null primary key,
  name varchar(50),
  content bytea);

*/

public final class BlobPostgresTest {
    private final static String NAME = "TEST";
    //private final static String FILE_NAME = "/tmp/blob/2mb.xxx";
    private final static String FILE_NAME = "BlobPostgresTest.java";

    public final static void main(String[] args) throws Throwable {
    Connection con = null;
    Statement statement = null;
    PreparedStatement insertStatement = null;
    ResultSet rs = null;
    File file = null;
    FileInputStream fis = null;
    BufferedInputStream bis = null;

    try {
        System.out.println("BLOB/PostgreSQL Demo started");

        Class.forName("org.postgresql.Driver");

        con = DriverManager.getConnection
        ("jdbc:postgresql://localhost/template1",
         "postgres", "");
        con.setAutoCommit(true);

        statement = con.createStatement();

        // cleanup
        statement.executeUpdate("delete from blobdemo");

        // file
        file = new File(FILE_NAME);
        fis = new FileInputStream(file);
        bis = new BufferedInputStream(fis);

        // insert one record
        System.out.println("preparing statement");
        insertStatement = con.prepareStatement
        ("insert into blobdemo ( name, content ) values ( ?, ? )");

        insertStatement.setString(1, NAME);

        System.out.println("setting binary stream");
        insertStatement.setBinaryStream(2, bis, (int)file.length());

        System.out.println("executing update");
        insertStatement.executeUpdate();

        // retrieve
        rs = statement.executeQuery
        ("select id, name, content from blobdemo");

        while(rs.next()) {
            System.out.println("id=" + rs.getObject(1));
            System.out.println("name=" + rs.getObject(2));

            byte[] bytes = rs.getBytes(3);
            String content = new String(bytes);

            //System.out.println(content);
            System.out.println("retrieved " + bytes.length + " bytes");
        }
    } finally {

        if(rs != null) rs.close();
        if(statement != null) statement.close();
        if(insertStatement != null) insertStatement.close();
        if(con != null) con.close();

        if(fis != null) fis.close();
        if(bis != null) bis.close();
    }

    System.out.println("BLOB/PostgreSQL Demo complete");
    }
}



Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
Barry Lind
Date:
Holger,

There certainly are areas where the driver can be improved.  And this is
probably one of them.  But like many things the solution isn't as easy
as it first appears.

hhaag@gmx.de wrote:

>mainly toPGString seems to be programmed inefficiently in terms of memory
>consumption, e.g.
>
>the string buffer allocation
>
>    public static String toPGString(byte[] p_buf) throws SQLException
>    {
>        if (p_buf == null)
>            return null;
>        StringBuffer l_strbuf = new StringBuffer();
>
>
>should be given a sensible initial size
>
>
>    public static String toPGString(byte[] p_buf) throws SQLException
>    {
>        if (p_buf == null)
>            return null;
>        StringBuffer l_strbuf = new StringBuffer(p_buf.length);
>
>or even
>
>        StringBuffer l_strbuf = new StringBuffer(p_buf.length*2);
>
>because of special characters
>
>
>to avoid a repeated re-allocation of the internal char[] array in the
>stringbuffer.
>
>
>
>
While "new StringBuffer(p_buf.length)" is probably an improvement, it is
difficult to predict what size buffer you will really need.  This is
because depending on the data you will see between zero and four times
data expansion.  Because the  protocol postgres uses to talk between the
client and server is string based, the binary data needs to be encoded
in an ascii safe way.  The encoding for the bytea datatype is to use
\OOO octal escaping.  Therefore each byte of data may take up to four
bytes in the output.  However if the data is mostly printable 7bit ascii
bytes then there will be little expansion.
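
To put rough numbers on it (assuming mostly non-printable input): a 3 MB file
escaped at four characters per byte becomes about 12 million characters, i.e.
roughly 24 MB as a Java String (two bytes per char), on top of the byte[] read
from the stream and the temporary copies made while the buffer grows and the
final statement string is assembled.  That is consistent with the heap growth
in your GC log.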

I think your idea of initializing the buffer to be the size of the
byte[] is a good idea.  I will apply that change unless someone has a
better suggestion.

thanks,
--Barry

PS.  It should also be possible to not prebuild the entire sql statement
in memory (which includes the entire escaped binary value), but to
stream the value back to the server.  Thus instead of building the
following String in memory:
insert into foo (bar, baz, boo) values (1000, '\123\432\543\567...
continuing on for possibly megabytes...', 'boo value')
The driver should be able to just send back the pieces and stream the
bytea value from the inputstream:
send -> "insert int table foo (bar, baz, boo) values("
send -> 1000            (value for column bar)
send -> ", "
send -> (value for column baz by reading from the input stream here,
encoding the results byte by byte and sending them)
send -> ", "
send -> "'boo value'"  (value for column boo)
send -> ")"
I am not sure how much work this will be, but it is probably the correct
long term solution.  Any volunteers for looking into this and submitting
a patch?
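
A rough sketch of what that piecewise send could look like (the names
sendInsert and escapeByteaTo are made up for illustration and do not exist in
the driver; the real code would write to the driver's internal backend stream,
and the escaping shown here is simplified - it ignores the extra doubling
needed when the value sits inside a SQL string literal):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical sketch of streaming an insert piece by piece instead of
    // building the whole statement string in memory.
    final class StreamingInsertSketch {
        static void sendInsert(OutputStream backend, int bar, InputStream baz,
                               String boo) throws IOException {
            write(backend, "insert into foo (bar, baz, boo) values (");
            write(backend, Integer.toString(bar));   // value for column bar
            write(backend, ", '");
            escapeByteaTo(baz, backend);             // bytea value, never buffered
            write(backend, "', '");
            write(backend, boo);                     // value for column boo
            write(backend, "')");                    // (real code would escape boo too)
            backend.flush();
        }

        private static void write(OutputStream out, String s) throws IOException {
            out.write(s.getBytes("US-ASCII"));
        }

        private static void escapeByteaTo(InputStream in, OutputStream out)
                throws IOException {
            int b;
            while ((b = in.read()) != -1) {
                if (b < 040 || b > 0176 || b == '\\' || b == '\'') {
                    out.write('\\');                  // \OOO octal escape
                    out.write('0' + ((b >> 6) & 07));
                    out.write('0' + ((b >> 3) & 07));
                    out.write('0' + (b & 07));
                } else {
                    out.write(b);                     // printable ASCII as-is
                }
            }
        }
    }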



Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
hhaag@gmx.de
Date:
>While "new StringBuffer(p_buf.length)" is probably an improvement, it is
>difficult to predict what size buffer you will really need.  This is
>because depending on the data you will see between zero and four times
>data expansion.  Because the  protocol postgres uses to talk between the
>client and server is string based, the binary data needs to be encoded
>in an ascii safe way.  The encoding for the bytea datatype is to use
>\OOO octal escaping.  Therefore each byte of data may take up to four
>bytes in the output.  However if the data is mostly printable 7bit ascii
>bytes then there will be little expansion.
>
>I think your idea of initializing the buffer to be the size of the
>byte[] is a good idea.  I will apply that change unless someone has a
>better suggestion.

I think it's at least better than initializing the stringbuffer with the
default capacity, which is 16. And as long as the stringbuffer is used only
internally (as a local variable) in a private method, no other parts of the code
should be affected. Of course you cannot predict the final size of the
created string.

There are also other places where StringBuffer usage could be improved in my
opinion:

(1) org.postgresql.jdbc1.AbstractJdbc1Statement#setString()

    // Some performance caches
    private StringBuffer sbuf = new StringBuffer();
...

current:

    public void setString(int parameterIndex, String x) throws SQLException {
                           ....
          synchronized (sbuf) {
                              sbuf.setLength(0);


proposed:

          StringBuffer sbuf = new StringBuffer(x.length());

--> Use a local, non-synchronized variable and initialize the StringBuffer with
a smart capacity.

Please note that I have not fully explored the usage of synchronized and the
reuse of the StringBuffer, but as the synchronized keyword indicates, this
variable will only be accessed by one thread at a time. Additionally, the
actual contents of the StringBuffer are always discarded at the beginning of
the method, so a local variable should be fine - and faster than a synchronized
instance variable. A sketch of the idea follows below.
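
To make the proposal concrete, here is a minimal sketch (escapeLiteral is a
made-up name, and the escaping is simplified compared to the driver's):

    // Sketch of the proposed change: a local, pre-sized StringBuffer instead
    // of the shared, synchronized instance field.  Not the driver's actual code.
    static String escapeLiteral(String x) {
        StringBuffer sbuf = new StringBuffer(x.length() + 2);  // value plus quotes
        sbuf.append('\'');
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            if (c == '\'' || c == '\\')
                sbuf.append('\\');       // escape single quote and backslash
            sbuf.append(c);
        }
        sbuf.append('\'');
        return sbuf.toString();
    }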


(2) org.postgresql.jdbc1.AbstractJdbc1Statement#compileQuery()


protected synchronized String compileQuery()
throws SQLException
{
    sbuf.setLength(0);
    int i;

    if (isFunction && !returnTypeSet)
        throw new PSQLException("postgresql.call.noreturntype");
    if (isFunction) { // set entry 1 to dummy entry..
        inStrings[0] = ""; // dummy entry which ensured that no one overrode
        // and calls to setXXX (2,..) really went to first arg in a function
call..
    }

    for (i = 0 ; i < inStrings.length ; ++i)
    {
        if (inStrings[i] == null)
            throw new PSQLException("postgresql.prep.param", new Integer(i + 1));
        sbuf.append (templateStrings[i]).append (inStrings[i]);
    }
    sbuf.append(templateStrings[inStrings.length]);
    return sbuf.toString();
}


Also in this case the StringBuffer should be initialized with a smart
capacity - something like the sum of all string lengths to be appended (see the
sketch below). I'm a bit in a rush today, but I'll try to find an algorithm in
the next few days.
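
As a starting point, the capacity could simply be computed from the pieces that
are about to be appended (a sketch only, reusing the fields from the method
quoted above):

    // Sketch: sum the lengths of all templates and parameter strings that
    // compileQuery() will append, so the buffer never has to grow.
    static int estimateCapacity(String[] templateStrings, String[] inStrings) {
        int capacity = templateStrings[inStrings.length].length();
        for (int i = 0; i < inStrings.length; i++)
            capacity += templateStrings[i].length() + inStrings[i].length();
        return capacity;
    }

compileQuery() could then call
sbuf.ensureCapacity(estimateCapacity(templateStrings, inStrings)) right after
sbuf.setLength(0).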



Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
hhaag@gmx.de
Date:
>PS.  It should also be possible to not prebuild the entire sql statement
>in memory (which includes the entire escaped binary value), but to
>stream the value back to the server.  Thus instead of building the
>following String in memory:
>insert into foo (bar, baz, boo) values (1000, '\123\432\543\567...
>continuing on for possibly megabytes...', 'boo value')
>The driver should be able to just send back the pieces and stream the
>bytea value from the inputstream:
>send -> "insert int table foo (bar, baz, boo) values("
>send -> 1000            (value for column bar)
>send -> ", "
>send -> (value for column baz by reading from the input stream here,
>encoding the results byte by byte and sending them)
>send -> ", "
>send -> "'boo value'"  (value for column boo)
>send -> ")"

Of course the prebuilding of the statement is the root of all the trouble:
with all the escaping that has to be done, the arrays will always build up in
memory. The organization of the source is pretty good in terms of OOP, I think,
because the responsibilities are simple to understand. However, this approach
is only workable for regular types (string, int etc.) or small BLOBs.


The "perfect" way would be to stream the BLOB (which is passed as a stream
to setBinaryStream) directly to the db server. escaping could be done by
nesting streams. in this case, no arrays would be needed at all. While being
probably the best way, this would imply some sort of big overhaul of the code.
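
The nesting idea could look roughly like this (an illustration only, not driver
code; it applies the same simplified \OOO escaping discussed earlier in the
thread):

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Sketch of escaping by nesting streams: wrap the stream to the backend
    // in a filter that escapes on the fly, so the BLOB is never materialized
    // in memory.
    class ByteaEscapingOutputStream extends FilterOutputStream {
        ByteaEscapingOutputStream(OutputStream out) {
            super(out);
        }

        public void write(int b) throws IOException {
            b &= 0xFF;
            if (b < 040 || b > 0176 || b == '\\' || b == '\'') {
                out.write('\\');                  // three octal digits follow
                out.write('0' + ((b >> 6) & 07));
                out.write('0' + ((b >> 3) & 07));
                out.write('0' + (b & 07));
            } else {
                out.write(b);                     // printable 7-bit ASCII as-is
            }
        }
    }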

>I am not sure how much work this will be, but it is probably the correct
>long term solution.  Any volunteers for looking into this and submitting
>a patch?

I would be interested in doing/helping this, but as I am not familiar with
the PostgreSQL JDBC source code, I'd have to work together with somebody who
really knows what's going on.




Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
hhaag@gmx.de
Date:
There might be a workaround:

From JDBC 2.0 on, the interface java.sql.Blob makes it possible to manipulate
BLOBs. AbstractJdbc2Statement.setBlob() (code posted below) seems to create the
BLOB up front, storing it in the database as a large object. The actual INSERT
command will then only contain the OID, avoiding all memory problems.

I'll let you know how it goes.



    public void setBlob(int i, Blob x) throws SQLException
    {
        InputStream l_inStream = x.getBinaryStream();
        int l_length = (int) x.length();
        LargeObjectManager lom = connection.getLargeObjectAPI();
        int oid = lom.create();
        LargeObject lob = lom.open(oid);
        OutputStream los = lob.getOutputStream();
        try
        {
            // could be buffered, but then the OutputStream returned by LargeObject
            // is buffered internally anyhow, so there would be no performance
            // boost gained, if anything it would be worse!
            int c = l_inStream.read();
            int p = 0;
            while (c > -1 && p < l_length)
            {
                los.write(c);
                c = l_inStream.read();
                p++;
            }
            los.close();
        }
        catch (IOException se)
        {
            throw new PSQLException("postgresql.unusual", se);
        }
        // lob is closed by the stream so don't call lob.close()
        setInt(i, oid);
    }
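
For completeness, here is roughly what using the LargeObject API directly looks
like from application code (a sketch under assumptions: the table has an oid
column, the connection can be cast to org.postgresql.Connection to reach
getLargeObjectAPI() - the exact cast depends on the driver version - and the
large object work happens inside a transaction):

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

// Assumes: create table blobdemo2(name varchar(50), content oid);
public final class LargeObjectInsertSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost/template1", "postgres", "");
        con.setAutoCommit(false);            // large objects need a transaction

        LargeObjectManager lom =
            ((org.postgresql.Connection) con).getLargeObjectAPI();
        int oid = lom.create();              // create the large object
        LargeObject lob = lom.open(oid);
        InputStream in = new FileInputStream("/tmp/blob/2mb.xxx");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0)
            lob.write(buf, 0, n);            // stream the file in chunks
        lob.close();
        in.close();

        PreparedStatement ps = con.prepareStatement(
            "insert into blobdemo2 (name, content) values (?, ?)");
        ps.setString(1, "TEST");
        ps.setInt(2, oid);                   // only the OID goes into the row
        ps.executeUpdate();
        con.commit();
        ps.close();
        con.close();
    }
}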





Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
Barry Lind
Date:
Make sure you have read and understand the limitations in postgres for
this approach.  The java.sql.Blob support in the postgres jdbc driver
uses the postgres LargeObject API.  There is a section in the postgres
jdbc documentation that explains the pros/cons of the LargeObject API as
compared to using the bytea datatype.  If you can live with the
limitations, then this is a fine solution.

thanks,
--Barry

hhaag@gmx.de wrote:

>There might be a workaround:
>
>From JDBC 2.0 on, the interface java.sql.Blob allows to manipulate BLOBs.
>AbstractJdbc2Statement.setBlob() (code posted below) seems to create a BLOB
>upfront storing it in the database. The actual INSERT command will then only
>contain the OID, avoiding all memory problems.
>
>I'll let you know how it goes.
>
>
>
>    public void setBlob(int i, Blob x) throws SQLException
>    {
>        InputStream l_inStream = x.getBinaryStream();
>        int l_length = (int) x.length();
>        LargeObjectManager lom = connection.getLargeObjectAPI();
>        int oid = lom.create();
>        LargeObject lob = lom.open(oid);
>        OutputStream los = lob.getOutputStream();
>        try
>        {
>            // could be buffered, but then the OutputStream returned by LargeObject
>            // is buffered internally anyhow, so there would be no performance
>            // boost gained, if anything it would be worse!
>            int c = l_inStream.read();
>            int p = 0;
>            while (c > -1 && p < l_length)
>            {
>                los.write(c);
>                c = l_inStream.read();
>                p++;
>            }
>            los.close();
>        }
>        catch (IOException se)
>        {
>            throw new PSQLException("postgresql.unusual", se);
>        }
>        // lob is closed by the stream so don't call lob.close()
>        setInt(i, oid);
>    }
>
>
>
>
>



Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
Barry Lind
Date:

hhaag@gmx.de wrote:

>I think it's at least better than initializing the stringbuffer with the
>default capacity, which is 16. And as long as the stringbuffer is used only
>internally (as a local variable) in a private method, no other parts of the code
>should be affected. Of course you cannot predict the final size of the
>created string.
>
I agree.  I will check in a patch for this later today.

>There are also other places where StringBuffer usage could be improved in my
>opinion:
>
>(1) org.postgresql.jdbc1.AbstractJdbc1Statement#setString()
>
>    // Some performance caches
>    private StringBuffer sbuf = new StringBuffer();
>...
>
>current:
>
>    public void setString(int parameterIndex, String x) throws SQLException {
>                           ....
>          synchronized (sbuf) {
>                              sbuf.setLength(0);
>
>
>proposed:
>
>          StringBuffer sbuf = new StringBuffer(x.length());
>
>--> use a local, non-synchronized variable. initialize the stringbuffer with
>a smart capacity.
>
>
The use of the member variable sbuf is there to solve the same problem.
Instead of creating new StringBuffers every time you call setString() on a
PreparedStatement, it reuses the one object, thus leading to less object
creation and less memory usage. However, as above, it probably makes sense not
to use the default size, and to call ensureCapacity() to resize it before use.
I will check in fixes for this today as well.
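
In other words, something along these lines (a sketch of the pattern, not the
actual patch; quoteString is a made-up name and the escaping is omitted):

    // Keep the shared buffer to avoid object churn, but reset it and grow it
    // once up front instead of letting it expand piecemeal from capacity 16.
    private StringBuffer sbuf = new StringBuffer();

    private String quoteString(String x) {
        synchronized (sbuf) {
            sbuf.setLength(0);                    // discard previous contents
            sbuf.ensureCapacity(x.length() + 2);  // value plus surrounding quotes
            sbuf.append('\'').append(x).append('\'');
            return sbuf.toString();
        }
    }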

>(2) org.postgresql.jdbc1.AbstractJdbc1Statement#compileQuery()
>
>
>protected synchronized String compileQuery()
>throws SQLException
>{
>    sbuf.setLength(0);
>    int i;
>
>    if (isFunction && !returnTypeSet)
>        throw new PSQLException("postgresql.call.noreturntype");
>    if (isFunction) { // set entry 1 to dummy entry..
>        inStrings[0] = ""; // dummy entry which ensured that no one overrode
>        // and calls to setXXX (2,..) really went to first arg in a function
>call..
>    }
>
>    for (i = 0 ; i < inStrings.length ; ++i)
>    {
>        if (inStrings[i] == null)
>            throw new PSQLException("postgresql.prep.param", new Integer(i + 1));
>        sbuf.append (templateStrings[i]).append (inStrings[i]);
>    }
>    sbuf.append(templateStrings[inStrings.length]);
>    return sbuf.toString();
>}
>
>
>also in this case the stringbuffer should be initialized with a smart
>capacity.
>
>
>
This one is a little tricky.  I am open to suggestions on what would be
reasonable.

thanks,
--Barry



Re: Inserting large BLOBs via JDBC - OutOfMemoryError

From
Barry Lind
Date:

hhaag@gmx.de wrote:

>>I am not sure how much work this will be, but it is probably the correct
>>long term solution.  Any volunteers for looking into this and submitting
>>a patch?
>>
>>
>
>I would be interested in doing/helping this, but as I am not familiar with
>the PostgreSQL JDBC source code, I'd have to work together with somebody who
>really knows what's going on.
>

Feel free to contact me with any questions you may have.  I am always
willing to help others get up to speed with the source.

thanks,
--Barry