Re: libpq pipelining - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: libpq pipelining
Date
Msg-id 5480D476.2090306@vmware.com
In response to Re: libpq pipelining  (Matt Newell <newellm@blur.com>)
Responses Re: libpq pipelining  (Matt Newell <newellm@blur.com>)
List pgsql-hackers
On 12/04/2014 09:11 PM, Matt Newell wrote:
> With the API I am proposing, only 2 new functions (PQgetFirstQuery,
> PQgetLastQuery) are required to be able to match each result to the query that
> caused it.  Another function, PQgetNextQuery allows iterating through the
> pending queries, and PQgetQueryCommand permits getting the original query
> text.
>
> Adding the ability to set a user supplied pointer on the PGquery struct might
> make it much easier for some frameworks, and other users might want a
> callback, but I don't think either are required.

I don't like exposing the PGquery struct to the application like that. 
Access to all other libpq objects is done via functions. The application 
can't (or shouldn't, anyway) directly access the fields of PGresult, for 
example. It has to call PQnfields(), PQntuples() etc.
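
For example, this is the existing convention (assuming an open PGconn
*conn and a table "foo"; just to illustrate the accessor style):

    /* PGresult is opaque to the application; all access goes through
       functions. */
    PGresult *res = PQexec(conn, "SELECT id, name FROM foo");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        for (int i = 0; i < PQntuples(res); i++)
            printf("%s\n", PQgetvalue(res, i, 1));
    }
    PQclear(res);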

The user-supplied pointer seems quite pointless. It would make sense if 
the pointer were passed to PQsendQuery(), and you'd get it back in 
PGquery. You could then use it to tag the query when you send it with 
whatever makes sense for the application, and use the tag in the result 
to match it with the original query. But as it stands, I don't see the 
point.
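
Something like this is what I'd find useful (PQsendQueryTagged() and
PQresultTag() are invented names, only meant to illustrate the idea;
ctx1, ctx2 and my_handler are placeholders):

    /* Hypothetical sketch: tag the query when sending it, get the tag
       back with the result. These functions do not exist in libpq. */
    PQsendQueryTagged(conn, "UPDATE accounts SET balance = balance - 10", ctx1);
    PQsendQueryTagged(conn, "UPDATE accounts SET balance = balance + 10", ctx2);

    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)
    {
        void *ctx = PQresultTag(res);   /* ctx1, then ctx2 */
        my_handler(ctx, res);
        PQclear(res);
    }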

The original query string might be handy for some things, but for others 
it's useless. It's not enough as a general method to identify the query 
the result belongs to. A common use case for this is to execute the same 
query many times with different parameters.

So I don't think you've quite nailed the problem of how to match the 
results to the commands that originated them, yet. One idea is to add a 
function that can be called after PQgetResult(), to get some identifier 
of the original command. But there needs to be a mechanism to tag the 
PQsendQuery() calls. Or you can assign each call a unique ID 
automatically, and have a way to ask for that ID after calling 
PQsendQuery().
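
For instance, something along these lines (PQlastQueryId() and
PQresultQueryId() are names I just made up for illustration):

    /* Hypothetical sketch: an automatically assigned id per sent query. */
    PQsendQuery(conn, "SELECT 1");
    int id1 = PQlastQueryId(conn);      /* id of the query just sent */
    PQsendQuery(conn, "SELECT 2");
    int id2 = PQlastQueryId(conn);

    PGresult *res = PQgetResult(conn);
    if (PQresultQueryId(res) == id1)
    {
        /* this result came from "SELECT 1" */
    }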

The explanation of PQgetFirstQuery makes it sound pretty hard to match 
up the result with the query. You have to pay attention to PQisBusy().

It would be good to make it explicit when you start a pipelined 
operation. Currently, you get an error if you call PQsendQuery() twice 
in a row without reading the result in between. That's a good thing: it 
catches application errors when you're not trying to do pipelining. 
Otherwise, if you forget to get the result of a query you've sent, and 
then send another query, you'll merrily read the result of the first 
query and think that it belongs to the second.
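
For reference, this is the pattern the current API enforces (assuming an
open PGconn *conn):

    /* Current behaviour: only one query may be in flight at a time. */
    PQsendQuery(conn, "SELECT 1");
    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)   /* drain until NULL */
        PQclear(res);

    /* Only now may the next query be sent; doing it earlier fails with
       "another command is already in progress". */
    PQsendQuery(conn, "SELECT 2");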

Are you trying to support "continuous pipelining", where you send new 
queries all the time, and read results as they arrive, without ever 
draining the pipe? Or are you just trying to do "batches", where you 
send a bunch of queries, and wait for all the results to arrive, before 
sending more? A batched API would be easier to understand and work with, 
although a "continuous" pipeline could be more efficient for an 
application that can take advantage of it.
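
A batched API might look roughly like this (PQbatchBegin() and
PQbatchEnd() are names I just made up to show the shape of the thing):

    /* Hypothetical batch-style sketch; these two functions are invented. */
    PQbatchBegin(conn);
    PQsendQuery(conn, "INSERT INTO t VALUES (1)");
    PQsendQuery(conn, "INSERT INTO t VALUES (2)");
    PQsendQuery(conn, "SELECT count(*) FROM t");
    PQbatchEnd(conn);               /* flush; no new queries until drained */

    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);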

>> Consideration of implicit transactions (autocommit), the whole pipeline
>> being one transaction, or multiple transactions is needed.
> The more I think about this the more confident I am that no extra work is
> needed.
>
> Unless we start doing some preliminary processing of the query inside of
> libpq, our hands are tied wrt sending a sync at the end of each query.  The
> reason for this is that we rely on the ReadyForQuery message to indicate the
> end of a query, so without the sync there is no way to tell if the next result
> is from another statement in the current query, or the first statement in the
> next query.
>
> I also don't see a reason to need multiple queries without a sync statement.
> If the user wants all queries to succeed or fail together it should be no
> problem to start the pipeline with begin and complete it with commit.  But I may be
> missing some detail...

True. It makes me a bit uneasy, though, to not be sure that the whole 
batch is committed or rolled back as one unit. There are many ways the 
user can shoot himself in the foot with that. Error handling would be a 
lot simpler if you would only send one Sync for the whole batch. Tom 
explained it better on this recent thread: 
http://www.postgresql.org/message-id/32086.1415063405@sss.pgh.pa.us.
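
To spell out what I mean: as I understand the protocol, after an error
the backend discards incoming messages until it sees a Sync, so where
the Syncs are placed decides how much of the batch is abandoned:

    With a Sync after every query:
      query 1  -> ok
      Sync
      query 2  -> error
      Sync
      query 3  -> still executed
      Sync

    With a single Sync at the end of the batch:
      query 1  -> ok
      query 2  -> error
      query 3  -> skipped (discarded until the Sync)
      Sync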

Another thought is that for many applications, it would actually be OK 
to not know which query each result belongs to. For example, if you 
execute a bunch of inserts, you often just want to get back the total 
number of rows inserted, or maybe not even that. Or you might execute a 
"CREATE TEMPORARY TABLE ... ON COMMIT DROP", followed by some insertions 
to it, some more data manipulations, and finally a SELECT to get the 
results back; all you want is the last result set.
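
I.e. something like this, assuming the pipelined sends go through (table
and column names are invented, and the batch is wrapped in an explicit
transaction so that ON COMMIT DROP makes sense):

    /* Sketch only, assuming the proposed pipelining. */
    PQsendQuery(conn, "BEGIN");
    PQsendQuery(conn, "CREATE TEMPORARY TABLE tmp_ids (id int) ON COMMIT DROP");
    PQsendQuery(conn, "INSERT INTO tmp_ids VALUES (1), (2), (3)");
    PQsendQuery(conn, "UPDATE items SET flagged = true"
                      " WHERE id IN (SELECT id FROM tmp_ids)");
    PQsendQuery(conn, "SELECT * FROM items WHERE flagged");
    PQsendQuery(conn, "COMMIT");

    /* Keep only the result set of the SELECT, throw the rest away. */
    PGresult *res, *keep = NULL;
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            if (keep)
                PQclear(keep);
            keep = res;
        }
        else
            PQclear(res);
    }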

If we could modify the wire protocol, we'd want to have a MiniSync 
message that is like Sync except that it wouldn't close the current 
transaction. The server would respond to it with a ReadyForQuery message 
(which could carry an ID number, to match it up with the MiniSync 
command). But I really wish we'd find a way to do this without changing 
the wire protocol.
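
The message flow I have in mind would be roughly this, MiniSync being
the hypothetical new message:

    client:  Parse/Bind/Execute     (query 1)
             MiniSync               (does not close the transaction)
             Parse/Bind/Execute     (query 2)
             Sync
    server:  ... results of query 1 ...
             ReadyForQuery          (for the MiniSync, could carry an ID)
             ... results of query 2 ...
             ReadyForQuery          (for the final Sync)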

- Heikki



