Thread: musings about more snappy result set retrieval, multiplexed, multithreaded, ...
musings about more snappy result set retrieval, multiplexed, multithreaded, ...
From
Gunther Schadow
Date:
Hi, I have a problem with selecting large result sets taking a long time. I was told that the problem is basically in the client, where most clients block any processing until *all* of the data resulting from a query had been received. This is clearly a problem for large tables. I had done a legacy system data interface before where I moved the result receiver into a separate thread such that it could receive in the background while the application could move ahead with showing / processing the data already received early on. This is easy to do in Java and thus I wonder if the JDBC driver doesn't do (something like) this already? When I tested retrieving a long table with pgjdbc2 and iSQL the query basically fails probably because of the long result set (iSQL will simply forget the job after a while.) Don't worry about iSQL's faults, I wonder if the jdbc driver actually wants to create the full result set in memory before returning from the query statement? Then I wonder if something like this doesn't also belong into the libpq directly. Of course threading is harder in C and may not be possible at all without significant review and conversion of all the libpq code. But also, in either case, there is an issue with having to open multiple connections in order to access the database while result sets are still being received. What I want is to make all of this as transparent to the application programmer as it possibly gets. Notably having the application programmer deal with cursors and limit/offset I do not consider solutions (they are merely workarounds.) One obvious choices would be for the libpq / jdbc (and whatever have you) to make a separate connection for every transaction that can possibly return a larger result set (i.e. all SELECT statements.) That, however, would open up a new server process every time, which quickly becomes a burden on the server host. There is an IETF protocol for multiplexed TCP, and even without that one could easily define a protocol in which messages on the TCP connection have a transaction id such that they can be received in the correct thread/queue. However, this now opens a can of worms, because the server process itself needs to become multi-threaded to process the different transactions in parallel. So, may be the compromise is to use multiple connections, since that is what one would have to do today anyway(?). One could do that even without multi-threading on the client side, because the flow-control inherent in the TCP protocol will regulate the network transfer of the result sets into operating system buffers for the different connections. Hmm, that would kind of work but it isn't pretty. So, may be the client could transparently use a cursor and so the multiplexing would occur on the client side only. That's what I understand the ODBC driver can (optionally) do already today. The advantage is that one should be able to use a single server process and can still process multiple transactions in a parallel (well, multiplexed) fashion. Once again, one wouldn't even need a multithreaded client, although in Java I'd still do a multi-threaded read-ahead so that the main thread doesn't have to wait for new rows. I am quite likely to hack on this problem for real within the next 6 months. I'd appreciate your thoughts, comments, warnings. Thanks, -Gunther -- Gunther Schadow, M.D., Ph.D. gschadow@regenstrief.org Medical Information Scientist Regenstrief Institute for Health Care Adjunct Assistant Professor Indiana University School of Medicine tel:1(317)630-7960 http://aurora.regenstrief.org