Thread: OutOfMemory when inserting stream of unknown length
When writing a Java stream of unknown length to a bytea column with
PreparedStatement.setBinaryStream, the JDBC driver always throws an
OutOfMemoryError.

I'm calling the method with the following parameters, as I do not know the
real length of the stream:

    stm.setBinaryStream(1, stream, Integer.MAX_VALUE);

It seems that somewhere along the way the JDBC driver wrongly assumes that
the stream contains gigabytes of data, and a memory allocation fails. The
other JDBC drivers I have tried (Oracle 9.2, Hypersonic 1.7.2.3) accept
this.

The javadoc itself (reproduced at the bottom of this mail) is somewhat
contradictory: it says that the stream will have 'length' bytes, but on the
other hand says that all data will be read until end-of-file. I have
interpreted this to mean that the length is just a hint: the JDBC driver
must not store more bytes than the length, but end-of-file before 'length'
bytes is still valid and shouldn't cause any exceptions. The javadoc might
also mean that if the stream hasn't reached end-of-file when 'length' bytes
have been read, the driver should continue reading but discard the
remaining bytes.

In summary, my understanding:

    stream length | length parameter | result
    --------------+------------------+--------------------------------------
    10            | 5                | 5 bytes inserted, 5 bytes discarded
    10            | 10               | 10 bytes inserted
    10            | 15               | 10 bytes inserted

--- javadoc snippet:

PreparedStatement.setBinaryStream(int parameterIndex, InputStream x,
                                  int length)

    Sets the designated parameter to the given input stream, which will
    have the specified number of bytes... The data will be read from the
    stream as needed until end-of-file is reached.

    Parameters:
        parameterIndex - the first parameter is 1, the second is 2, ...
        x - the java input stream which contains the binary parameter value
        length - the number of bytes in the stream
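A minimal sketch of the workaround available with the stable driver,
assuming the goal is only to avoid the bogus length argument: buffer the
whole stream in heap to learn its real size, then call setBytes(). The
images(data) table is hypothetical.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class BufferedByteaInsert {
        // Buffer the unknown-length stream in heap to learn its real size,
        // then pass the exact bytes to the driver via setBytes().
        static void insert(Connection conn, InputStream stream) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            try (PreparedStatement stm =
                     conn.prepareStatement("INSERT INTO images(data) VALUES (?)")) {
                stm.setBytes(1, buf.toByteArray()); // exact length is now known
                stm.executeUpdate();
            }
        }
    }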
On Thu, 19 Aug 2004, Mikko T. wrote:

> When writing a Java stream of unknown length to a bytea column with
> PreparedStatement.setBinaryStream, the JDBC driver always throws an
> OutOfMemoryError.
>
> I'm calling the method with the following parameters, as I do not know
> the real length of the stream:
>
>     stm.setBinaryStream(1, stream, Integer.MAX_VALUE);
>
> It seems that somewhere along the way the JDBC driver wrongly assumes
> that the stream contains gigabytes of data, and a memory allocation
> fails.

This is true of the stable driver branch: it will try to allocate a huge
amount of memory. The development driver has the ability to stream data to
the backend, but unfortunately for you it requires an exact length
argument.

The problem is that the frontend/backend protocol messages are preceded by
a length field. So the options are to read the whole stream to determine
the length (which requires huge amounts of memory to store the whole
stream), or to take the provided length argument at face value and stream
data to the server, hoping the user was right. In the streaming case it's
OK if the length argument is less than the length of the stream, because
then we can just stop reading, but not if the stream itself is short.

Kris Jurka
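One way around the dilemma Kris describes, sketched below under the
assumption that spilling to local disk is acceptable: spool the
unknown-length stream to a temporary file, which yields an exact length
without holding the data in heap. The images(data) table is again
hypothetical.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class SpooledByteaInsert {
        // Spool the stream to a temp file so the exact length is known
        // without buffering the data in heap, then hand the driver a
        // fresh stream plus an accurate length argument.
        static void insert(Connection conn, InputStream stream) throws Exception {
            File spool = File.createTempFile("bytea", ".tmp");
            try {
                try (FileOutputStream out = new FileOutputStream(spool)) {
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = stream.read(chunk)) != -1) {
                        out.write(chunk, 0, n);
                    }
                }
                try (FileInputStream in = new FileInputStream(spool);
                     PreparedStatement stm =
                         conn.prepareStatement("INSERT INTO images(data) VALUES (?)")) {
                    stm.setBinaryStream(1, in, (int) spool.length()); // exact length
                    stm.executeUpdate();
                }
            } finally {
                spool.delete();
            }
        }
    }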
Mikko T. wrote:

> The javadoc itself (reproduced at the bottom of this mail) is somewhat
> contradictory: it says that the stream will have 'length' bytes, but on
> the other hand says that all data will be read until end-of-file. I have
> interpreted this to mean that the length is just a hint: the JDBC driver
> must not store more bytes than the length, but end-of-file before
> 'length' bytes is still valid and shouldn't cause any exceptions. The
> javadoc might also mean that if the stream hasn't reached end-of-file
> when 'length' bytes have been read, the driver should continue reading
> but discard the remaining bytes.

This is not my interpretation (and not the driver's interpretation
either). The driver reads exactly 'length' bytes and does not subsequently
touch the stream. It is an error to provide fewer than 'length' bytes in
the stream (and doing so will actually also toast your server connection).

Your OOME is due to older drivers not having proper streaming logic: they
read the stream into the heap and then call setBytes(). Newer drivers
stream the data directly to the server.

> PreparedStatement.setBinaryStream(int parameterIndex, InputStream x,
>                                   int length)
>
> Sets the designated parameter to the given input stream, which will have
> the specified number of bytes.

This sounds like a requirement to me -- i.e. it is an error to pass a
stream that does not have the specified number of bytes.

> .. The data will be read from the stream as needed until end-of-file is
> reached.

.. but as usual the JDBC javadoc goes on to contradict itself. Sigh. I
wish Sun could come up with a proper spec for a change, not just a
collection of partially documented APIs.

As I see it, the reason for having a length value there is so that the
driver can stream the data directly to the DB even when it needs to know
the length ahead of time. This is exactly the case with the PostgreSQL
driver. If we can't trust the length field to be accurate, then we must
read the entire stream into the heap before starting. In that case
setBinaryStream() is no better than setBytes().

I could live with an interpretation that says "always store exactly
'length' bytes, but then read to EOF if there are extra bytes left over".
It would still be an error to supply fewer than 'length' bytes in the
stream; I think this is better than padding with 0 bytes or anything
similar (by the time the driver sees EOF, it is committed at the protocol
level to writing a parameter of 'length' bytes, so it can't just stop at
EOF). But I don't know if that's a better interpretation.

-O
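An application-side illustration of the "exactly 'length' bytes" contract
Oliver describes (this is not driver code): a wrapper stream that exposes
at most 'length' bytes to the driver, so trailing extra bytes are never
read. A short underlying stream still surfaces as an early EOF, which
under this strict reading is the caller's error.

    import java.io.IOException;
    import java.io.InputStream;

    // Presents exactly 'length' bytes of the underlying stream, then
    // reports EOF, so any extra trailing bytes are never consumed.
    class ExactLengthInputStream extends InputStream {
        private final InputStream in;
        private long remaining;

        ExactLengthInputStream(InputStream in, long length) {
            this.in = in;
            this.remaining = length;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;  // synthetic EOF at declared length
            int b = in.read();
            if (b != -1) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
    }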
Mikko T. wrote:

> But now the current implementation forces the user of setBinaryStream to
> buffer all bytes in memory just to know how many bytes the stream will
> actually hold. And after that it can already call setBytes, making the
> whole setBinaryStream useless when the stream length is not known
> beforehand.

Not true -- consider the case where you know the length but the data is
held outside the heap (for example, a large NIO buffer or a file on disk).

-O
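A sketch of the case Oliver means, using an NIO channel (the table and
names are illustrative): the length is known up front from the channel
size, and the data flows through a small buffer rather than ever being
materialized as one large byte[] in heap.

    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class ChannelByteaInsert {
        // Length is known (channel size) but the data stays off-heap;
        // a streaming driver copies it through a small buffer.
        static void insert(Connection conn, Path path) throws Exception {
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ);
                 InputStream in = Channels.newInputStream(ch);
                 PreparedStatement stm =
                     conn.prepareStatement("INSERT INTO images(data) VALUES (?)")) {
                stm.setBinaryStream(1, in, (int) ch.size()); // exact, known length
                stm.executeUpdate();
            }
        }
    }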
On Fri, 20 Aug 2004, Oliver Jowett wrote:

>> PreparedStatement.setBinaryStream(int parameterIndex, InputStream x,
>>                                   int length)
>>
>> Sets the designated parameter to the given input stream, which will
>> have the specified number of bytes.
>
> This sounds like a requirement to me -- i.e. it is an error to pass a
> stream that does not have the specified number of bytes.
>
>> .. The data will be read from the stream as needed until end-of-file is
>> reached.
>
> .. but as usual the JDBC javadoc goes on to contradict itself. Sigh. I
> wish Sun could come up with a proper spec for a change, not just a
> collection of partially documented APIs.
>
> As I see it, the reason for having a length value there is so that the
> driver can stream the data directly to the DB even when it needs to know
> the length ahead of time. This is exactly the case with the PostgreSQL
> driver. If we can't trust the length field to be accurate, then we must
> read the entire stream into the heap before starting. In that case
> setBinaryStream() is no better than setBytes().

But now the current implementation forces the user of setBinaryStream to
buffer all bytes in memory just to know how many bytes the stream will
actually hold. And after that it can already call setBytes, making the
whole setBinaryStream useless when the stream length is not known
beforehand.

> I could live with an interpretation that says "always store exactly
> 'length' bytes, but then read to EOF if there are extra bytes left
> over". It would still be an error to supply fewer than 'length' bytes in
> the stream; I think this is better than padding with 0 bytes or anything
> similar (by the time the driver sees EOF, it is committed at the
> protocol level to writing a parameter of 'length' bytes, so it can't
> just stop at EOF).

I think that committing to send very large data values that are not even
guaranteed to exist (or be available) makes it impossible to get the
protocol back to a known state if for some reason the data cannot be sent
or the last command should be aborted. In the current solution the only
option seems to be to kill the whole connection, but I find that quite an
extreme thing to do.

In my opinion the protocol level should support sending the data in chunks
(4K-16K could be optimal). This way the JDBC driver only has to buffer one
chunk ahead, not the whole stream, and it always knows there won't be any
IOExceptions or EOFs occurring during the sending. Supporting chunked
sending would also allow the JDBC driver to cancel or roll back the
streaming if it needs to, or is asked to do so by the database, even if
there were still megabytes of data to send.

-Mikko
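A sketch of the chunked framing Mikko proposes, modelled on HTTP/1.1
chunked transfer. This wire format is purely hypothetical -- the actual
frontend/backend protocol has no such message. Each chunk is a 4-byte
length prefix plus payload, and a zero-length chunk terminates the value,
so no total length is needed up front; a real design could also add an
abort chunk to cancel mid-stream.

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class ChunkedWriter {
        static final int CHUNK_SIZE = 8192; // within Mikko's 4K-16K range

        // Hypothetical framing: [int length][payload]... [int 0] terminator.
        static void writeChunked(InputStream data, OutputStream wire)
                throws IOException {
            DataOutputStream out = new DataOutputStream(wire);
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            while ((n = data.read(buf)) != -1) {
                out.writeInt(n);        // chunk header: payload length
                out.write(buf, 0, n);   // chunk payload
            }
            out.writeInt(0);            // zero-length chunk = end of data
            out.flush();
        }
    }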
Oliver Jowett wrote:

> Mikko T. wrote:
>
>> But now the current implementation forces the user of setBinaryStream
>> to buffer all bytes in memory just to know how many bytes the stream
>> will actually hold. And after that it can already call setBytes, making
>> the whole setBinaryStream useless when the stream length is not known
>> beforehand.
>
> Not true -- consider the case where you know the length but the data is
> held outside the heap (for example, a large NIO buffer or a file on
> disk).

Eh, sorry, I misread your last sentence. Certainly if you don't know the
length then you have to buffer internally, and setBinaryStream() is of no
use. The common case I've seen so far on the list is inserting from a
file; it seems much less common not to know the length of your stream
ahead of time.

-O
Mikko T. wrote:

> I think that committing to send very large data values that are not even
> guaranteed to exist (or be available) makes it impossible to get the
> protocol back to a known state if for some reason the data cannot be
> sent or the last command should be aborted. In the current solution the
> only option seems to be to kill the whole connection, but I find that
> quite an extreme thing to do.

I discussed some alternatives with Kris a while ago, but no implementation
has happened yet. The most promising approach was to pad out the parameter
in the Bind message, then immediately Close the resulting portal without
executing it and return an error to the application.

> In my opinion the protocol level should support sending the data in
> chunks (4K-16K could be optimal). This way the JDBC driver only has to
> buffer one chunk ahead, not the whole stream, and it always knows there
> won't be any IOExceptions or EOFs occurring during the sending.
> Supporting chunked sending would also allow the JDBC driver to cancel or
> roll back the streaming if it needs to, or is asked to do so by the
> database, even if there were still megabytes of data to send.

Right -- something like HTTP/1.1's chunked transfer scheme. We would have
to have some "partially bound portal" state floating around on the server
side, which seems ugly (and coordinating multiple parameters is going to
get nasty). Is it worth it? I have a feeling that pgsql-hackers may take
some convincing; sending data of unknown length doesn't seem like it would
be widely used, and it'd be quite a lot of complexity to add.

There's nothing much we can do about this with the current protocol,
unfortunately :(

-O
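A sketch of the padding recovery Oliver describes (not actual driver
code): having already promised 'length' bytes in the Bind message, the
driver pads with zeros on a premature EOF to keep the protocol in a known
state, then reports the failure to the application; the resulting portal
would then be Closed without ever being executed.

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class PaddedCopy {
        // Copy exactly 'length' bytes to the wire; on early EOF, pad with
        // zeros so the protocol message stays well-formed, then signal the
        // error so the caller can abandon (Close) the portal.
        static void copyExactly(InputStream in, OutputStream wire, long length)
                throws IOException {
            byte[] buf = new byte[8192];
            long remaining = length;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n == -1) {
                    byte[] zeros = new byte[8192];
                    while (remaining > 0) {
                        int pad = (int) Math.min(zeros.length, remaining);
                        wire.write(zeros, 0, pad);  // keep the message well-formed
                        remaining -= pad;
                    }
                    throw new EOFException("stream ended before promised length");
                }
                wire.write(buf, 0, n);
                remaining -= n;
            }
        }
    }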