Thread: JDBC driver's (non-)handling of InputStream:s
Hello,

Tonight I rewrote a part of an application that deals with http uploads,
because it turned out it has to handle larger files than originally intended
- and one was getting OutOfMemory errors.

So I rewrote everything so that an InputStream is passed to the JDBC driver
and the files are never completely loaded into memory. However, I am still
getting an OutOfMemory error for large files. While it is difficult to
pinpoint exactly where due to the lack of a stack trace, it does look like
the driver is causing it.

Does the JDBC driver handle InputStream:s intelligently at all? If so, does it
do so under all circumstances? In this case I am putting data into a column of
type 'bytea' and am using PreparedStatement.setBinaryStream().

The backend is PostgreSQL 7.4.1, and I am using the driver for 7.4.1
(pg74.1jdbc3.jar). Running under JDK 1.4.2.

Do I need to use some other type in the database in order for input streams to
be handled properly? Do I have to use some PostgreSQL-specific API? Does the
JDBC driver need to be changed to support this?

I can always fall back to using files on the filesystem, but then I will lose
all the niceties that come with ACID transactions, which I automatically get
if I keep it all in the database.

Thanks!

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
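For context, a minimal sketch of the kind of code being discussed - the
table name "uploads", its columns, and the surrounding method are
hypothetical illustrations, not taken from the original application:

    import java.io.InputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class UploadExample {
        /**
         * Hand the upload to the driver as a stream, in the hope that it
         * is never buffered in memory as a whole. Table and column names
         * are hypothetical.
         */
        public static void storeUpload(Connection conn, String name,
                                       InputStream in, int length)
                throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO uploads (name, data) VALUES (?, ?)");
            try {
                ps.setString(1, name);
                ps.setBinaryStream(2, in, length); // "data" is of type bytea
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }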
Peter,

It would be great if you could supply a test case which exhibits this
behaviour.

Dave

On Mon, 2004-03-29 at 20:10, Peter Schuller wrote:
> Hello,
>
> Tonight I rewrote a part of an application that deals with http uploads,
> because it turned out it has to handle larger files than originally intended
> - and one was getting OutOfMemory errors.
>
> So I rewrote everything so that an InputStream is passed to the JDBC driver
> and the files are never completely loaded into memory. However, I am still
> getting an OutOfMemory error for large files. While it is difficult to
> pinpoint exactly where due to the lack of a stack trace, it does look like
> the driver is causing it.
>
> Does the JDBC driver handle InputStream:s intelligently at all? If so, does
> it do so under all circumstances? In this case I am putting data into a
> column of type 'bytea' and am using PreparedStatement.setBinaryStream().
>
> The backend is PostgreSQL 7.4.1, and I am using the driver for 7.4.1
> (pg74.1jdbc3.jar). Running under JDK 1.4.2.
>
> Do I need to use some other type in the database in order for input streams
> to be handled properly? Do I have to use some PostgreSQL-specific API? Does
> the JDBC driver need to be changed to support this?
>
> I can always fall back to using files on the filesystem, but then I will
> lose all the niceties that come with ACID transactions, which I
> automatically get if I keep it all in the database.
>
> Thanks!

--
Dave Cramer
519 939 0336
ICQ # 14675561
Peter Schuller wrote:
> Does the JDBC driver handle InputStream:s intelligently at all? If so, does
> it do so under all circumstances? In this case I am putting data into a
> column of type 'bytea' and am using PreparedStatement.setBinaryStream().

The short answer is no, it's not smart about InputStreams. It treats the
stream essentially the same as if you'd read the entire stream into a byte
array and then called setBytes().

> The backend is PostgreSQL 7.4.1, and I am using the driver for 7.4.1
> (pg74.1jdbc3.jar). Running under JDK 1.4.2.
>
> Do I need to use some other type in the database in order for input streams
> to be handled properly? Do I have to use some PostgreSQL-specific API? Does
> the JDBC driver need to be changed to support this?

It'll need non-trivial driver modifications. It's on my list of things to do,
but I doubt I'm going to get to any of the postgresql work on that list for
quite some time now :(

The "right way" to do it is to expand the driver's use of the V3 protocol to
use the extended query protocol; then the stream can be streamed directly to
the backend, without further translation/escaping/etc., using a binary Bind
parameter. But there's some infrastructure work to do before that can happen.

I think there was a patch floating around on the list some time ago that got
a similar effect by using the LO interface to stream the data to a temporary
LO, which was then inserted into the bytea from the LO. I can't find it right
now, though; the archives.postgresql.org search page seems to be broken again
(it's returning 0 results for many searches -- e.g. searching for
"inputstream" in pgsql-jdbc in the last year returns 0 results, which is
obviously wrong).

You could use LOs instead of bytea if you have some flexibility in your
schema; the LO interface should stream nicely. It's nastier to deal with LOs
on the JDBC side, though: you'll need to drop the driver into a
previous-version-compatibility mode (7.1?) or use the LO manager directly.
Also, LOs don't play well with autocommit.

-O
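In other words, the current behaviour amounts to something like the sketch
below - this is not the driver's actual code, just an illustration of the
effect described above: the whole stream ends up on the heap before the
query is sent.

    import java.io.IOException;
    import java.io.InputStream;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BufferedSetBinaryStream {
        /**
         * Roughly what setBinaryStream() amounts to today: buffer the whole
         * stream into one byte[] and fall back to setBytes(). Sketch only,
         * not the driver's real internals.
         */
        public static void emulate(PreparedStatement ps, int index,
                                   InputStream in, int length)
                throws IOException, SQLException {
            byte[] buf = new byte[length];   // a 5 MB upload = a 5 MB array
            int off = 0;
            while (off < length) {
                int n = in.read(buf, off, length - off);
                if (n < 0) {
                    break;                   // premature end of stream
                }
                off += n;
            }
            ps.setBytes(index, buf);         // then escaped into the query text
        }
    }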
Hello,

> It would be great if you could supply a test case which exhibits this
> behaviour.

I can probably do that relatively easily (though of course it is subject to
the behavior of the VM memory manager). But given Oliver Jowett's response it
seems pretty obvious what is happening - allocating a 5 MB byte array is
failing, and/or the byte array is being built in a tight loop that allocates
incrementally larger arrays until the data fits.

Allocating such large arrays is bound to be unreliable in any case, so as long
as that is what's happening I fully understand it and there's probably not
much to be done...

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
Hello,

> The short answer is no, it's not smart about InputStreams. It treats the
> stream essentially the same as if you'd read the entire stream into a byte
> array and then called setBytes().

Ok. That's what I suspected. Thanks!

[lots of interesting stuff snipped]

> You could use LOs instead of bytea if you have some flexibility in your
> schema; the LO interface should stream nicely. It's nastier to deal with
> LOs on the JDBC side, though: you'll need to drop the driver into a
> previous-version-compatibility mode (7.1?) or use the LO manager directly.
> Also, LOs don't play well with autocommit.

Hmm, okay. So in versions greater than 7.1, I cannot simply use a normal
PreparedStatement and setBinaryStream(), as suggested in the documentation?
Is this for some technical reason, or is this method now deprecated?

I am only getting timeouts on postgresql.org at the moment, so perhaps this is
explicitly stated in the docs, but can I safely assume that any usage of the
Large Object Manager, as part of a transaction whose queries are otherwise
done through normal JDBC channels, will not cause any interference between the
JDBC API and the LO manager? I.e., is it "safe" to intermix usage of large
objects with normal statements through JDBC?

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
On Tue, 30 Mar 2004, Oliver Jowett wrote:

> The "right way" to do it is to expand the driver's use of the V3 protocol
> to use the extended query protocol; then the stream can be streamed
> directly to the backend, without further translation/escaping/etc., using
> a binary Bind parameter. But there's some infrastructure work to do before
> that can happen.

Can we really stream an InputStream directly to the backend? Don't we need to
prefix the message with a length argument? And the only way to get this with
the InputStream API is to read the whole thing. Yes, we can avoid the
translation/escaping, but I don't think we can avoid persisting the
InputStream at some point.

Kris Jurka
Kris Jurka wrote:
> On Tue, 30 Mar 2004, Oliver Jowett wrote:
>
>> The "right way" to do it is to expand the driver's use of the V3 protocol
>> to use the extended query protocol; then the stream can be streamed
>> directly to the backend, without further translation/escaping/etc., using
>> a binary Bind parameter. But there's some infrastructure work to do before
>> that can happen.
>
> Can we really stream an InputStream directly to the backend? Don't we need
> to prefix the message with a length argument? And the only way to get this
> with the InputStream API is to read the whole thing. Yes, we can avoid the
> translation/escaping, but I don't think we can avoid persisting the
> InputStream at some point.

setBinaryStream() provides a length in addition to the stream. If the length
is correct, we can calculate the query size ahead of time without reading the
input stream.

The spec is not entirely clear as to which is the length to insert; it implies
it's an error to have a stream that doesn't match the provided length, but
it's not explicit about what happens in that case. If the length doesn't match
the stream, we can deal with it: we can truncate to the specified length
easily (emit a SQLWarning perhaps?), and an unexpected EOF can be treated like
an IOException.

-O
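A sketch of the copy loop this implies - purely illustrative, assuming the
declared length is trusted for the message header: the stream is read up to
that length (any extra data is effectively truncated), and a short stream
surfaces as an IOException.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class BoundedCopy {
        /**
         * Copy exactly 'length' bytes from the parameter stream to the wire
         * (illustrative sketch). Bytes beyond 'length' are never read, so
         * the value is truncated to the declared length; running out of
         * data early is reported as an IOException.
         */
        public static void copyExactly(InputStream in, OutputStream out,
                                       int length) throws IOException {
            byte[] buf = new byte[8192];
            int remaining = length;
            while (remaining > 0) {
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    throw new IOException("Premature end of stream: expected "
                            + length + " bytes, got " + (length - remaining));
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
    }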
Peter Schuller wrote:
>> You could use LOs instead of bytea if you have some flexibility in your
>> schema; the LO interface should stream nicely. It's nastier to deal with
>> LOs on the JDBC side, though: you'll need to drop the driver into a
>> previous-version-compatibility mode (7.1?) or use the LO manager directly.
>> Also, LOs don't play well with autocommit.
>
> Hmm, okay. So in versions greater than 7.1, I cannot simply use a normal
> PreparedStatement and setBinaryStream(), as suggested in the documentation?
> Is this for some technical reason, or is this method now deprecated?

It's a technical reason. Some background...

In 7.1, calling setBinaryStream()/setBytes()/etc. (and the equivalent
ResultSet methods) would assume the field contained a Large Object OID; the
driver would talk to the LO API to handle retrieving/storing the actual LO
data as needed. This worked but had some odd quirks, since a field containing
a LO OID is not the same as a field containing the data itself -- you had to
be careful with transaction demarcation etc.

In 7.2+, calling those methods assumed the field was a bytea and the data
could be directly stored/retrieved as part of the query, without a separate
operation via the LO API. 'bytea' is a much better match to JDBC's
LONGVARBINARY, so (at least to me) this change makes sense.

Since the driver doesn't know much about the context of a parameter in a
query, and has to pick one approach or the other at the point where
setBinaryStream() is called, the only way to switch behaviours was to globally
change the driver's behaviour by telling it "behave like you're a 7.1 driver"
(a 'compatible=7.1' URL parameter).

> I am only getting timeouts on postgresql.org at the moment, so perhaps this
> is explicitly stated in the docs, but can I safely assume that any usage of
> the Large Object Manager, as part of a transaction whose queries are
> otherwise done through normal JDBC channels, will not cause any interference
> between the JDBC API and the LO manager? I.e., is it "safe" to intermix
> usage of large objects with normal statements through JDBC?

Yes, it's safe.

-O
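For reference, the LO-manager route described above looks roughly like the
sketch below, modelled on the example in the driver's documentation. The
table name "uploads_lo" and the surrounding method are hypothetical; note
the explicit setAutoCommit(false), since large objects must be used inside
a transaction.

    import java.io.InputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    import org.postgresql.PGConnection;
    import org.postgresql.largeobject.LargeObject;
    import org.postgresql.largeobject.LargeObjectManager;

    public class LargeObjectUpload {
        /**
         * Stream an upload into a new large object and record its OID in a
         * (hypothetical) table. Large objects only work inside a
         * transaction, hence the explicit autocommit handling.
         */
        public static void storeUpload(Connection conn, String name,
                                       InputStream in) throws Exception {
            conn.setAutoCommit(false);

            LargeObjectManager lom =
                    ((PGConnection) conn).getLargeObjectAPI();
            int oid = lom.create(LargeObjectManager.READ
                    | LargeObjectManager.WRITE);
            LargeObject lo = lom.open(oid, LargeObjectManager.WRITE);
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) > 0) {
                    lo.write(buf, 0, n); // streamed to the backend in chunks
                }
            } finally {
                lo.close();
            }

            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO uploads_lo (name, data_oid) VALUES (?, ?)");
            try {
                ps.setString(1, name);
                ps.setInt(2, oid);       // the column stores the LO's OID
                ps.executeUpdate();
            } finally {
                ps.close();
            }

            conn.commit();
        }
    }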
>> It would be great if you could supply a test case which exhibits this
>> behaviour.
>
> I can probably do that relatively easily (though of course it is subject to
> the behavior of the VM memory manager). But given Oliver Jowett's response
> it seems pretty obvious what is happening - allocating a 5 MB byte array is
> failing, and/or the byte array is being built in a tight loop that allocates
> incrementally larger arrays until the data fits.

It's slightly more "intelligent" than that - see the code...

> Allocating such large arrays is bound to be unreliable in any case, so as
> long as that is what's happening I fully understand it and there's probably
> not much to be done...

That's true as long as the consumer needs the data fully in RAM, e.g. an
Image-Label or similar. If the consumer is also stream-based, it could be
avoided. Unfortunately, the backend would still hold the results of the query
fully in memory, which is also not always desired...

Guido
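For the read side mentioned here, a stream-based consumer would look
something like the following sketch (table/column names hypothetical);
whether it actually avoids buffering depends on the driver, and, as noted,
the backend still materialises the full result.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamingConsumer {
        /**
         * Read the bytea column as a stream and forward it to an
         * OutputStream (e.g. a servlet response) without building the whole
         * byte[] in user code.
         */
        public static void sendUpload(Connection conn, int id,
                                      OutputStream out) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT data FROM uploads WHERE id = ?");
            try {
                ps.setInt(1, id);
                ResultSet rs = ps.executeQuery();
                if (rs.next()) {
                    InputStream in = rs.getBinaryStream(1);
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                }
                rs.close();
            } finally {
                ps.close();
            }
        }
    }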