Thread: musings about more snappy result set retrieval, multiplexed, multithreaded, ...

musings about more snappy result set retrieval, multiplexed, multithreaded, ...

From
Gunther Schadow
Date:
15 April 2002, 16:52:05
Hi, I have a problem with selecting large result sets taking a
long time. I was told that the problem is basically in the client,
where most clients block any processing until *all* of the data
resulting from a query had been received. This is clearly a
problem for large tables.

I had done a legacy system data interface before where I moved
the result receiver into a separate thread such that it could
receive in the background while the application could move ahead
with showing / processing the data already received early on.

This is easy to do in Java and thus I wonder if the JDBC driver
doesn't do (something like) this already? When I tested retrieving
a long table with pgjdbc2 and iSQL the query basically fails
probably because of the long result set (iSQL will simply forget
the job after a while.) Don't worry about iSQL's faults, I wonder
if the jdbc driver actually wants to create the full result set
in memory before returning from the query statement?

Then I wonder if something like this doesn't also belong into the
libpq directly. Of course threading is harder in C and may not be
possible at all without significant review and conversion of all
the libpq code.

But also, in either case, there is an issue with having to open
multiple connections in order to access the database while result
sets are still being received.

What I want is to make all of this as transparent to the application
programmer as it possibly gets. Notably having the application
programmer deal with cursors and limit/offset I do not consider
solutions (they are merely workarounds.)

One obvious choices would be for the libpq / jdbc (and
whatever have you) to make a separate connection for every
transaction that can possibly return a larger result set (i.e.
all SELECT statements.) That, however, would open up a new
server process every time, which quickly becomes a burden on
the server host.

There is an IETF protocol for multiplexed TCP, and even without
that one could easily define a protocol in which messages on the
TCP connection have a transaction id such that they can be received
in the correct thread/queue.

However, this now opens a can of worms, because the server
process itself needs to become multi-threaded to process the
different transactions in parallel.

So, may be the compromise is to use multiple connections, since
that is what one would have to do today anyway(?). One could do
that even without multi-threading on the client side, because the
flow-control inherent in the TCP protocol will regulate the
network transfer of the result sets into operating system
buffers for the different connections. Hmm, that would kind
of work but it isn't pretty.

So, may be the client could transparently use a cursor and so
the multiplexing would occur on the client side only. That's
what I understand the ODBC driver can (optionally) do already
today. The advantage is that one should be able to use a single
server process and can still process multiple transactions in
a parallel (well, multiplexed) fashion. Once again, one wouldn't
even need a multithreaded client, although in Java I'd still do
a multi-threaded read-ahead so that the main thread doesn't have
to wait for new rows.

I am quite likely to hack on this problem for real within the
next 6 months. I'd appreciate your thoughts, comments, warnings.

Thanks,
-Gunther


-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org