Thread: Streaming ResultSet
Hello all.

Within the PostgreSQL JDBC driver, the rows of a ResultSet are, by default, streamed back to the client as they are generated by the server and stored in a collection in RAM. An attempt to get the first row from the ResultSet blocks until every last row is stored in RAM. If various conditions are met, a cursor is used instead, which allows a large result set to be processed without running out of RAM.

I'm curious about the reasons this approach was taken -- why not return the rows as they are retrieved from the server? I can think of some possible explanations:

(1) The time that locks are held on the server is minimized by the default behavior.

(2) If an exception occurs, you do not receive partial data up to the point of the problem. (For example, divide by zero, serialization (deadlock) problems, or broken connections.)

(3) The connection can be used for other purposes without conflict. (Other drivers I've used, which normally stream the results, fall back on reading the whole result set into RAM when a statement is issued which cannot be run while the result set is pending.)

The disadvantages of the current default approach are:

(1) The entire result set might not fit in the client's RAM.

(2) You might want to see partial results from a query which ultimately generates an exception.

(3) You can't overlap retrieval of the result set with processing of the rows.

The disadvantages of the cursored approach are:

(1) It might not be easy to create all of the right conditions for the cursored approach to work.

(2) You might think you've met all the conditions for the cursored approach and be wrong. The default behavior might kick in and cause an OutOfMemoryError.

(3) There is overhead in creating a cursor for a SELECT statement. Granted, in my tests it was only about 20 microseconds, but that boosted runtime for my simple test case by 3%.

If there were a connection property which allowed result sets to return rows as they are retrieved, I would use it. I'm not lobbying very hard for it, however, since it is only 20 microseconds per SELECT statement to use cursors, and I can easily modify my code to use them, although it will have to be an "all or nothing" change in the framework, so the cost is paid on every SELECT statement. I'm just expressing interest, should the feature be under consideration.

-Kevin
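[Editor's note: a minimal sketch of the cursor-based fetching conditions Kevin describes, per the pgjdbc documentation (autocommit off, a forward-only read-only result set, a single SQL statement, and a non-zero fetch size). The connection details, table name, and fetch size are placeholders.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorFetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");

        // Cursor-based fetching only kicks in when autocommit is off ...
        conn.setAutoCommit(false);

        // ... the statement produces a forward-only, read-only result set ...
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

        // ... and a non-zero fetch size is requested.
        stmt.setFetchSize(50);

        // If any condition is not met, the driver silently falls back to
        // buffering the whole result set in memory -- the failure mode
        // Kevin describes as a potential OutOfMemoryError.
        ResultSet rs = stmt.executeQuery("SELECT * FROM big_table");
        while (rs.next()) {
            // Process one row at a time; only ~50 rows are buffered.
        }
        rs.close();
        stmt.close();
        conn.commit();
        conn.close();
    }
}
```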
Kevin,

It's mostly because of the way the server returns result sets, and also the fact that Java doesn't (didn't) support non-blocking I/O.

Dave

On 19-Jul-05, at 4:22 PM, Kevin Grittner wrote:
> I'm curious about the reasons this approach was taken -- why not
> return the rows as they are retrieved from the server?
Kevin Grittner wrote:
> I'm curious about the reasons this approach was taken -- why not
> return the rows as they are retrieved from the server?

The short answer is that streaming on the fly is a much more complex approach and no one has stepped forward to implement it.

> (1) It might not be easy to create all of the right conditions for the
> cursored approach to work.

Another solution here is to work on making the cursor support code more flexible. Again, the main reason this has not been done is that no one has come up with a robust implementation yet.

> (2) You might think you've met all the conditions for the cursored
> approach and be wrong. The default behavior might kick in and cause an
> OutOfMemoryError.

The same is true with a streaming approach. The driver likes to issue its own queries internally in many places; if you happen to trip up on one of those while processing your resultset, boom!

Your best bet to see this implemented is to implement it yourself and send us a patch. It's a fairly large chunk of work with debatable benefit, and personally I'd rather spend the time on better cursor support. (If I *had* time!)

-O
Kevin Grittner wrote:
> (3) There is overhead in creating a cursor for a SELECT statement.
> Granted, in my tests it was only about 20 microseconds, but that boosted
> runtime for my simple test case by 3%.

This seems a bit high; what's your query and which driver version are you using? The current drivers just name the result portal at the protocol level (which is essentially a FETCH_FORWARD_ONLY cursor, but not in so many words) and ask for N rows at a time; otherwise, all the processing should be the same. I'm surprised that the act of naming the portal and splitting the output into batches adds as much as 20ms to query execution time.

-O
20 microseconds (1/50th of a millisecond), not 20 milliseconds. I could run it a few more times to make sure that my results are statistically significant, and not a run of bad luck for the cursored tests, but that will have to wait for tomorrow.

-Kevin

>>> Oliver Jowett <oliver@opencloud.com> 07/19/05 6:12 PM >>>
Kevin Grittner wrote:
> Granted, in my tests it was only about 20 microseconds

I'm surprised that the act of naming the portal and splitting the output into batches adds as much as 20ms
Kevin Grittner wrote:
> 20 microseconds (1/50th of a millisecond), not 20 milliseconds.

Oh, sorry, misread your email. That sounds more reasonable :)

You are considering using resultset streaming in a query that takes only ~600us (from your 3% number)? How big is that resultset anyway?

-O
Kevin Grittner wrote:
> The disadvantages of the current default approach are:
> (2) You might want to see partial results from a query which ultimately
> generates an exception.

I don't see this as something that any sort of portable application is ever going to need. It's very unpredictable where you get an error, if an error occurs, since it's so dependent on backend / driver / protocol implementations.

> The disadvantages of the cursored approach are:
> (1) It might not be easy to create all of the right conditions for the
> cursored approach to work.

It occurs to me that the only condition streaming would mitigate is having to have autocommit off. You can't usefully use streaming with TYPE_SCROLL_* resultsets: even though you can stream results just fine while the resultset's current row is moving strictly forward, you have to keep all previously retrieved results in memory anyway in case the application asks to move backwards (perfectly legal with a scrollable resultset). Keeping the whole resultset in memory defeats the purpose of streaming in the first place.

-O
I'm considering using PostgreSQL in an environment where there is a middle-tier framework running the queries, which has no way of knowing whether the query it is about to run will return one row or millions of rows. When it comes to returning millions of rows, I don't much care about 20us. Some of the one-row queries run in as little as 265us (and I'm thinking of submitting changes which would bring that down), so 20us might creep into the range of significance.

The tests which showed a 20us difference were run across the LAN, while the production middle tier will be running on the same box as the database; so, if much of the 20us is in network latency, it becomes much less significant.

-Kevin

>>> Oliver Jowett <oliver@opencloud.com> 07/19/05 6:38 PM >>>
You are considering using resultset streaming in a query that takes only ~600us (from your 3% number)? How big is that resultset anyway?
Hi Kevin,

How is your question / suggestion going?

I have run into the same problem. Our company uses PostgreSQL as the OLTP system. I am working on an ETL project which reads data from a PostgreSQL db, transforms the data and then loads the data into another PostgreSQL db. The ETL tool uses JDBC.

My problem is that the ETL won't start to transform the data before the whole result set has been loaded into memory. I would like to do the data transforming and loading while extracting the data.

Do you know if there has been any progress regarding the streaming result set? As far as I know, MySQL does support it.

Thank you.

Regards,
Albert
albertwang wrote on 23.06.2017 at 02:08:
> My problem is that the ETL won't start to transform the data before the
> whole result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting the data.
>
> Do you know if there has been any progress regarding the streaming result set?
> As far as I know, MySQL does support it.

Not sure what you mean by "streaming result set", but if you don't want the driver to buffer the complete result, there is a solution for that: retrieval based on a cursor. See the documentation for details:

https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor

Thomas
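[Editor's note: a rough sketch of how the cursor-based retrieval Thomas points to could drive Albert's ETL loop, streaming from the source while writing to the target in batches. Connection details, table and column names, and the batch size are placeholders, not part of the original discussion.]

```java
import java.sql.*;

public class StreamingEtlSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for source and target databases.
        Connection src = DriverManager.getConnection(
                "jdbc:postgresql://source-host/srcdb", "user", "password");
        Connection dst = DriverManager.getConnection(
                "jdbc:postgresql://target-host/dstdb", "user", "password");

        // Cursor-based retrieval on the source: autocommit off + fetch size,
        // with a plain forward-only statement and a single SELECT.
        src.setAutoCommit(false);
        Statement read = src.createStatement();
        read.setFetchSize(1000);

        dst.setAutoCommit(false);
        PreparedStatement write = dst.prepareStatement(
                "INSERT INTO target_table (id, val) VALUES (?, ?)");

        ResultSet rs = read.executeQuery("SELECT id, val FROM source_table");
        int pending = 0;
        while (rs.next()) {
            // Transform and load each row as it arrives, instead of waiting
            // for the whole result set to be buffered in memory.
            write.setLong(1, rs.getLong("id"));
            write.setString(2, rs.getString("val")); // transform values here
            write.addBatch();
            if (++pending % 1000 == 0) {
                write.executeBatch();
            }
        }
        write.executeBatch();
        dst.commit();

        rs.close();
        read.close();
        src.commit();
        src.close();
        dst.close();
    }
}
```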
albertwang wrote:
> My problem is that the ETL won't start to transform the data before the
> whole result set has been loaded into memory.
>
> I would like to do the data transforming and loading while extracting the data.

How about, instead of using the JDBC intermediary, extracting the data into a local file/memory db, then transforming and loading from it into the destination PostgreSQL db? Set up three separate parallel processes to perform the task.

danap.
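[Editor's note: one way to realize the "extract to a local file first" step danap suggests is pgjdbc's CopyManager API. This is only a sketch under assumed connection details, table names, and file paths; the transform step between the two COPY operations is left as its own process.]

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyExtractSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: extract the source table to a local CSV file with COPY,
        // which streams rows out without buffering them in the JVM heap.
        Connection src = DriverManager.getConnection(
                "jdbc:postgresql://source-host/srcdb", "user", "password");
        CopyManager copyOut = src.unwrap(PGConnection.class).getCopyAPI();
        try (FileWriter out = new FileWriter("/tmp/source_table.csv")) {
            copyOut.copyOut("COPY source_table TO STDOUT WITH (FORMAT csv)", out);
        }
        src.close();

        // Step 2 (transform) would run as a separate process over the file,
        // producing /tmp/transformed.csv.

        // Step 3: load the transformed file into the target database.
        Connection dst = DriverManager.getConnection(
                "jdbc:postgresql://target-host/dstdb", "user", "password");
        CopyManager copyIn = dst.unwrap(PGConnection.class).getCopyAPI();
        try (FileReader in = new FileReader("/tmp/transformed.csv")) {
            copyIn.copyIn("COPY target_table FROM STDIN WITH (FORMAT csv)", in);
        }
        dst.close();
    }
}
```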