Thread: bytea size limit?
Hi. I have Java code that stores binary into the DB. When I use MS SQL Server 2000 (type image) it works fine. But when I use Postgres 7.4 it fails. Is there a size limit setting somewhere I need to set up? Thanks, Michael
Michael Privat wrote:
> Hi. I have Java code that stores binary into the DB. When I use MS SQL
> Server 2000 (type image) it works fine. But when I use Postgres 7.4 it
> fails. Is there a size limit setting somewhere I need to set up?

Can you provide the schema and code that is failing, and details of how it fails (with stack traces of any exceptions you see)? Otherwise it's hard to help.

-O
No, there is no size limit. However, you may have better luck with large objects.

Dave

On Sun, 2004-04-11 at 10:57, Michael Privat wrote:
> Hi. I have Java code that stores binary into the DB. When I use MS SQL
> Server 2000 (type image) it works fine. But when I use Postgres 7.4 it
> fails. Is there a size limit setting somewhere I need to set up?
>
> Thanks,
> Michael
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
Dave Cramer
519 939 0336
ICQ # 14675561
Here's the full code sample for a minimal reproduction of the error. When you run it and increase the size (through the command-line argument), you get an out-of-memory error. On my machine it happens every time once the size gets higher than about 1400000 bytes.

The DB table has two fields:

id: integer
data: bytea

import java.sql.*;
import java.io.*;
import java.util.*;

public class BlobTest {
    public static void main(String[] args) {
        Connection c = null;
        try {
            Class.forName("org.postgresql.Driver");
            String url = "jdbc:postgresql://myserver/mydb";
            c = DriverManager.getConnection(url, "myuser", "mypass");

            String sql = "INSERT INTO blobtest (id, data) VALUES(?,?)";

            int size = Integer.parseInt(args[0]);
            byte[] data = new byte[size];
            int id = Math.abs(new Random().nextInt());

            PreparedStatement stmt = c.prepareStatement(sql);
            stmt.setInt(1, id);
            stmt.setBinaryStream(2, new ByteArrayInputStream(data), data.length);
            stmt.executeUpdate();
            stmt.close();
        }
        catch(Throwable t) {
            t.printStackTrace();
        }
        finally {
            try { c.close(); } catch(Exception e) {}
        }
    }
}

Sunday, April 11, 2004, 9:42:16 PM, you wrote:

DC> No, there is no size limit. However you may have better luck with
DC> largeobjects.
DC> Dave
Michael,

That doesn't surprise me, as the postgresql driver currently buffers that internally, so you end up with two buffers of 1400000 bytes. Have a look through the archives for out-of-memory errors.

Dave

On Sun, 2004-04-11 at 21:48, Michael Privat wrote:
> Here's the full code sample for a minimal reproduction of the error.
> When you run it and increase the size (through the command-line
> argument), you get an out-of-memory error. On my machine it happens
> every time once the size gets higher than about 1400000 bytes.
>
> [code snipped]

--
Dave Cramer
519 939 0336
ICQ # 14675561
Dave Cramer wrote:
> Michael,
>
> that doesn't surprise me, as the postgresql driver currently buffers
> that internally, so you end up with two buffers of 1400000 bytes, have a
> look through the archives for out of memory errors.

It's worse than that, Jim..

From memory, we actually end up with a char[] of size 3*array length on average (assuming 50% of the data needs escaping to a '\\nnn' form), i.e. we need about 6 times the array's size in temporary storage (8.4mb in this case).

>> byte[] data = new byte[size];
>>
>> int id = Math.abs(new Random().nextInt());
>>
>> PreparedStatement stmt = c.prepareStatement(sql);
>> stmt.setInt(1, id);
>> stmt.setBinaryStream(2, new ByteArrayInputStream(data), data.length);

setBinaryStream ends up allocating a new byte array and reading into it, then passing the result to setBytes. So you need a total of almost 10mb of temporary storage to insert a 1.4mb byte array. Yes, this sucks.

You'd be better off in this case (where the data is already in a byte array) to call setBytes() directly. Once we have a streaming implementation, though, calling setBinaryStream() will be better, as setBytes() will have to take an immediate copy of the array to avoid seeing later changes before the statement is executed and the data streamed.

If you can't throw memory at the problem, using LOs as Dave suggested earlier is probably a better idea. The LO manager uses 4k blocks (IIRC) when moving data to/from the backend, so the memory overhead is much lower.

-O
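[Editor's note] The escape arithmetic described above can be made concrete with a small sketch. This is a hedged, hypothetical illustration of the old bytea escape-into-SQL-literal format (printable ASCII passes through, a backslash quadruples, everything else becomes a 5-character '\\nnn' octal sequence); the class and method names are invented for the demo and are not driver code:

```java
// Sketch of the old bytea -> SQL-literal escaping that makes the
// temporary String so large. Hypothetical helper, not driver code.
public class ByteaEscapeDemo {
    // Escape one byte as it appears inside a single-quoted SQL literal:
    // printable ASCII stays as-is, backslash becomes four backslashes,
    // single quote is backslash-escaped, everything else becomes \\nnn.
    static String escape(int b) {
        if (b == '\\') return "\\\\\\\\";       // four characters in the literal
        if (b == '\'') return "\\'";
        if (b >= 32 && b <= 126) return String.valueOf((char) b);
        return String.format("\\\\%03o", b);    // five characters: \\nnn
    }

    static String escapeAll(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) sb.append(escape(b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        new java.util.Random(42).nextBytes(data);
        String escaped = escapeAll(data);
        // Roughly half of random bytes need the 5-char form, so the
        // char[] lands near 3x the input length -- about 6x in bytes,
        // since a Java char is two bytes.
        System.out.println("input bytes:   " + data.length);
        System.out.println("escaped chars: " + escaped.length());
    }
}
```

Running it on random input shows the roughly 3x character (6x byte) blow-up claimed above, before counting the extra byte[] copies made along the way.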
Mmmh, well thanks guys. Only problem with changing to LOs is that I already have data in production, so changing the DB schema will be a little complicated. I guess I could also contribute to the driver to provide a streaming implementation. Do you know where that char[] is in the code?

Sunday, April 11, 2004, 10:44:47 PM, you wrote:

OJ> It's worse than that, Jim..
OJ>
OJ> From memory, we actually end up with a char[] of size 3*array length on
OJ> average (assuming 50% of the data needs escaping to a '\\nnn' form),
OJ> i.e. we need about 6 times the array's size in temporary storage (8.4mb
OJ> in this case).
OJ>
OJ> [rest snipped]
Michael Privat wrote:
> Mmmh, well thanks guys. Only problem with changing to LOs is that I
> already have data in production so changing the DB schema will be a
> little complicated. I guess I could also contribute to the driver to
> provide a streaming implementation. Do you know where that char[] is
> in the code?

The char[] arises from the fact that the driver turns each parameter into a string representation at the time setBinaryStream etc. is called. It's this string representation that is big. Take a look at AbstractJdbc1Statement.setBytes() and the call to PGbytea.toPGString().

To fix this the Right Way involves:

- changing setBytes() and setBinaryStream() to store the parameter value for later use, not turn it into a string immediately.

- using the V3 extended query protocol (this requires a number of other driver changes, as at a minimum the driver will need to split up queries that contain multiple statements) to allow use of a separate Bind message.

- using a binary-format parameter in the Bind message to represent the bytea field, and streaming from the byte[]/InputStream to the socket directly when writing the parameter.

This is not a trivial piece of work, unfortunately.

There may be a way to do a temporary fix that provides streaming without the other work. For V2, it may be possible to stream while still using a string representation of the bytea, as V2 queries are null-terminated and we don't need to know the length in advance. I haven't looked into this in detail.

It may be possible to do this for V3 too, as we can in theory predict the length of the stringized parameter (necessary as V3 messages are preceded by a total message length field): the bytea string format is predictable, and the driver is using a predetermined encoding (unicode/utf8). It'd be ugly, though..

-O
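[Editor's note] The first step in the list above — keeping the raw parameter instead of stringizing it at setBytes() time — can be sketched like this. The class and method names are hypothetical, the "socket" is stood in for by any OutputStream, and the escaping shown is a simplified version of the bytea text form, not the driver's actual code path:

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of deferred parameter binding: setBytes() stores the raw
// array, and escaping happens only at execute time, streamed directly
// into the connection's output stream. Hypothetical names throughout.
public class DeferredBindSketch {
    private Object[] params = new Object[10];

    public void setBytes(int index, byte[] value) {
        params[index] = value;  // no big String is built here
    }

    // At execute time, write the escaped form straight to the stream
    // instead of materializing a char[] of 3x the input length first.
    public void writeParam(int index, OutputStream out) throws IOException {
        byte[] value = (byte[]) params[index];
        for (byte b : value) {
            int ub = b & 0xFF;
            if (ub >= 32 && ub <= 126 && ub != '\\' && ub != '\'') {
                out.write(ub);  // printable ASCII passes through
            } else {
                // everything awkward becomes a \\nnn octal escape
                out.write(String.format("\\\\%03o", ub).getBytes("UTF-8"));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        DeferredBindSketch stmt = new DeferredBindSketch();
        stmt.setBytes(1, new byte[]{'a', 0, '\\'});
        java.io.ByteArrayOutputStream sink = new java.io.ByteArrayOutputStream();
        stmt.writeParam(1, sink);
        System.out.println(sink.toString("UTF-8"));
    }
}
```

The point of the design is that peak memory is just the original byte[] plus a small per-byte scratch area, rather than byte[] plus a 3x char[] plus the encoded output.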
Oliver Jowett <oliver@opencloud.com> writes:
> ... using the V3 extended query protocol (this requires a number of
> other driver changes, as at a minimum the driver will need to split up
> queries that contain multiple statements)

This is a bit off the thread topic, but: does the JDBC spec actually allow/expect multiple SQL commands in a query? Some people thought that the V3 restriction to one command per query string was a Good Thing because it helps guard against SQL-injection attacks when an application is careless about inserting untrustworthy text into a command string. If you don't have a spec requirement to cope with this, then I'd advise against adding code to support it.

			regards, tom lane
Tom Lane wrote:
> This is a bit off the thread topic, but: does the JDBC spec actually
> allow/expect multiple SQL commands in a query? [...] If you don't have
> a spec requirement to cope with this, then I'd advise against adding
> code to support it.

I don't think there's such a requirement. The JDBC spec is vague, but Sun's JDBC tutorial says:

===
This code does not end with a DBMS statement terminator, which can vary from DBMS to DBMS. For example, Oracle uses a semicolon (;) to indicate the end of a statement, and Sybase uses the word go. The driver you are using will automatically supply the appropriate statement terminator, and you will not need to include it in your JDBC code.
===

That said, executing multiple commands in a query seems quite common, and works ok with the current driver. I'd be reluctant to break existing apps by forbidding it.

Also, the JDBC spec recommends:

===
Drivers should provide access to every feature implemented by the underlying data source, including features that extend the JDBC API. When a feature is not supported, the corresponding methods throw an SQLException. The intent is for applications using the JDBC API to have access to the same feature set as native applications.
===

-O
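[Editor's note] A splitter of the kind the driver would need for the V3 one-command-per-message rule might look like the rough sketch below. It is hypothetical and deliberately naive: it only understands single-quoted literals, and real SQL would also need dollar quoting, comments, and escape handling:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch: split a query string into individual statements at
// top-level semicolons, ignoring semicolons inside single-quoted
// literals. Not the driver's actual parser.
public class StatementSplitter {
    public static List<String> split(String sql) {
        List<String> stmts = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < sql.length(); i++) {
            char ch = sql.charAt(i);
            if (ch == '\'') inQuote = !inQuote;  // toggle on quote chars
            if (ch == ';' && !inQuote) {
                if (cur.toString().trim().length() > 0)
                    stmts.add(cur.toString().trim());
                cur.setLength(0);                // start the next statement
            } else {
                cur.append(ch);
            }
        }
        if (cur.toString().trim().length() > 0)
            stmts.add(cur.toString().trim());
        return stmts;
    }

    public static void main(String[] args) {
        System.out.println(split("SELECT 1; INSERT INTO t VALUES ('a;b')"));
    }
}
```

Each resulting statement could then be sent as its own Parse/Bind/Execute sequence, which is exactly the extra work Tom's point about SQL injection argues against taking on without a spec requirement.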
Okay, I fixed the implementation of the BYTEA support in the driver based on your guidelines, and it is very happy now. I want to clean up a few things first, but what is the process for submitting a patch to the dev team?

Sunday, April 11, 2004, 11:14:01 PM, you wrote:

OJ> The char[] arises from the fact that the driver turns each parameter
OJ> into a string representation at the time setBinaryStream etc. is called.
OJ> It's this string representation that is big. Take a look at
OJ> AbstractJdbc1Statement.setBytes() and the call to PGbytea.toPGString().
OJ>
OJ> [rest snipped]
On Mon, 12 Apr 2004, Michael Privat wrote:
> Okay, I fixed the implementation of the BYTEA support in the driver
> based on your guidelines and it is very happy now. I want to clean
> up a few things before I do that, but what is the process for
> submitting a patch to the dev team?

Make sure you are working against the CVS version of the driver, available at http://gborg.postgresql.org/project/pgjdbc/projdisplay.php, and then produce a context diff (-c) and send it to this list.

Kris Jurka
Here you go. Coded against the CVS HEAD. Let me know if you want me to change anything. Obviously, because I have to send the size ahead, there is no way around storing the entire byte[] in memory (can't stream), but with this code I was able to load a blob of 30Mb+ with a heap size of 32Mb, versus about 1.5Mb yesterday with the same heap. So there is some improvement.

http://jupiter.incodea.com/trans/pgsql-jdbc.diff

Michael

Monday, April 12, 2004, 7:20:08 PM, you wrote:

KJ> Make sure you are working against the cvs version of the driver, available
KJ> at http://gborg.postgresql.org/project/pgjdbc/projdisplay.php and then
KJ> produce a context diff ( -c ) and send it to this list.
KJ>
KJ> Kris Jurka
Michael Privat wrote:
> Here you go. Coded against the CVS HEAD. Let me know if you want me to
> change anything. Obviously, because I have to send the size ahead,
> there is no other way but store the entire byte[] in memory (can't
> stream), but with this code, I was able to load a blob of 30Mb+ (with
> a heap size of 32Mb) against 1.5Mb yesterday with the same heap. So
> there is some improvement.
>
> http://jupiter.incodea.com/trans/pgsql-jdbc.diff

Comments:

Does this work for V2 connections? I didn't see a patch to the V2 execution path, but you've changed how setBytes() binds its argument.

The argument to setBytes() is not copied when bound; changes made to the array after the setBytes() call but before query execution will be visible. Personally I don't mind this behaviour but others disagree :) -- see earlier discussion about mutable parameters in the pgsql-jdbc archives.

Individually escaping and encoding each byte (in SendBytea and GetByteaSize) seems overly expensive. We should be able to do this in blocks at a minimum.

Why is EncodedStream necessary? Can't you just unroll that logic directly in SendBytea()?

It's probably possible to work around needing the whole array in memory (i.e. we could stream from an InputStream + length via setBinaryStream). If we escape *every* byte, we need exactly 5 bytes of UTF8 output for every input byte ('\' and '0'-'9' are all single bytes in UTF8, I believe), so we could compute the output size directly from the input length.

====

That said, this patch is a lot simpler than I expected it to be! It looks like a good solution until the driver knows how to speak V3 fluently. If you like, I'll see if I can rework your patch over the next couple of days to address the above points.

-O
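[Editor's note] The "escape every byte" idea above makes the output length trivially predictable. A minimal sketch, assuming every byte is emitted as a 5-character '\\nnn' octal sequence (each character a single byte in UTF-8, as Oliver notes); the class and method names are invented for illustration:

```java
// Sketch of the "escape every byte" trick: emit \\nnn for each input
// byte, so the stringized parameter is exactly 5 * length bytes of
// UTF-8 -- computable before writing the V3 message length field.
public class UniformEscapeSketch {
    static long predictedSize(int inputLength) {
        return 5L * inputLength;  // '\' '\' plus three octal digits, all ASCII
    }

    static String escapeAll(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 5);
        for (byte b : data)
            sb.append(String.format("\\\\%03o", b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {65, 0, (byte) 255};
        String s = escapeAll(data);
        System.out.println(s);  // uniform 5-char escapes, even for 'A'
        System.out.println(s.length() == predictedSize(data.length));
    }
}
```

The cost of the trick is output size — 5 bytes on the wire for every input byte, versus roughly 2 on average with selective escaping — traded for being able to announce the message length up front and then stream.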
Frankly, I haven't touched the V2 path, and you are right, I did not apply the patch to it. It's the first time I've taken a look at the driver, so it took me a little adaptation -- bear with me :)

With regards to individually escaping: it's just as expensive as it was before, except that it doesn't make several copies of the array like it used to (hence the memory saving). I don't think there is any performance impact at all. Basically just a memory gain.

As far as the encoding: I think in your original email you mentioned that the driver used UTF-8 (in which case there is an obvious optimization that can be made), but I couldn't find it in the driver. Everything looked like it was inheriting from the encoding scheme set in the connection.

The stream is not necessary, but I didn't like the idea of having bytea logic inside the stream, so I put it into that inner class to make it easier to move if need be (although I couldn't find a good place for it). Plus it makes it easier to help stream directly (once we figure this out).

Good point about over-escaping. This would probably resolve this whole problem of pre-determining the size. I didn't think about it (but then again, it assumes UTF-8 in all cases, and I'm not familiar enough with how the postgresql driver works to make that call).

Feel free to rework the patch as you want. I just wanted to provide a starting point, and I'm happy to help out.

Monday, April 12, 2004, 10:08:44 PM, you wrote:

OJ> Comments:
OJ>
OJ> Does this work for V2 connections? I didn't see a patch to the V2
OJ> execution path, but you've changed how setBytes() binds its argument.
OJ>
OJ> [rest snipped]
On Mon, 12 Apr 2004, Michael Privat wrote:
> As far as the encoding. I think in your original email you had
> mentioned that the driver used UTF-8 (in which case there is an
> obvious optimization that can be made), but I couldn't find it in the
> driver. Everything looked like it was inheriting from the encoding
> scheme set in the connection.

The driver's internal encoding (the client_encoding GUC) is unicode for all 7.3+ databases, but not necessarily for 7.2 and before. The 7.3 release was the first that came with multibyte support compiled by default, and before that time the charSet URL parameter was relied upon to determine the correct database encoding.

Kris Jurka
Michael Privat wrote:
> With regards to individually escaping, it's just as expensive as how
> it was before, except that it doesn't make several copies of the array
> like it used to (hence the memory saving). I don't think there is any
> performance impact at all. Basically just a memory gain.

Doesn't it end up creating an individual String (and backing char[]) for each source byte? While these objects are very short-lived, they are extra garbage and will cause more frequent GCs.

> As far as the encoding. I think in your original email you had
> mentioned that the driver used UTF-8 (in which case there is an
> obvious optimization that can be made), but I couldn't find it in the
> driver. Everything looked like it was inheriting from the encoding
> scheme set in the connection.

Everything does indeed use the encoding object on the connection. The trick is that if you look at the V3 connection setup code, the encoding is always set to UNICODE, which maps to Java's UTF8.

V2 connections can have a variety of encodings. The driver sets the encoding to UNICODE for server versions >= 7.3, but uses the database's encoding for earlier versions. But we don't need to know the encoded length in advance when talking V2 anyway..

-O
I don't think it will create more frequent GCs (the VM GC works by allocated size, not object count). Either way, it's a moot point if we assume UTF-8. In that case, it is possible to forecast the size of that byte array in advance, and we don't need to mess around with encoding just for sizing. Also, the SendBytea() method can be optimized to encode a lot faster and more efficiently, directly at the byte level instead of using the Encoder, pushing the data directly into the stream.

But anyway, I didn't mean to start a whole debate on this. Just thought I'd share the code. The driver does handle a lot bigger binary objects than before, and we all agree that, with the UTF-8 assumption, there are good optimizations that can be made that had not been noticed before.

Monday, April 12, 2004, 10:49:18 PM, you wrote:

OJ> Doesn't it end up creating an individual String (and backing char[]) for
OJ> each source byte? While these objects are very short-lived, they are
OJ> extra garbage and will cause more frequent GCs.
OJ>
OJ> [rest snipped]