Re: libpq pipelining - Mailing list pgsql-hackers

From Matt Newell
Subject Re: libpq pipelining
Date
Msg-id 5812260.A9DG1YyGp2@obsidian
Whole thread Raw
In response to Re: libpq pipelining  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: libpq pipelining  (Matt Newell <newellm@blur.com>)
List pgsql-hackers
On Thursday, December 04, 2014 11:39:02 PM Heikki Linnakangas wrote:
> > Adding the ability to set a user supplied pointer on the PGquery struct
> > might make it much easier for some frameworks, and other users might want
> > a callback, but I don't think either are required.
> 
> I don't like exposing the PGquery struct to the application like that.
> Access to all other libpq objects is done via functions. The application
> can't (or shouldn't, anyway) directly access the fields of PGresult, for
> example. It has to call PQnfields(), PQntuples() etc.
> 
Right, my patch doesn't expose it.  I was thinking of adding two new functions 
to get/set the user tag/pointer.

> The user-supplied pointer seems quite pointless. It would make sense if
> the pointer was passed to PQsendquery(), and you'd get it back in
> PGquery. You could then use it to tag the query when you send it with
> whatever makes sense for the application, and use the tag in the result
> to match it with the original query.
That's exactly what I envisioned, but with a separate call to avoid having to 
modify/duplicate the PQsendQuery functions:

PQsendQuery(conn,...)
query = PQgetLastQuery(conn);
PQquerySetUserPointer(query,userPtr);

...
result = PQgetResult(conn);
query = PQgetFirstQuery(conn);
userPtr = PQqueryGetUserPointer(query);

> But as it stands, I don't see the
> point.
I don't need it since it should be easy to keep track without it.  It was just 
an idea.

> The original query string might be handy for some things, but for others
> it's useless. It's not enough as a general method to identify the query
> the result belongs to. A common use case for this is to execute the same
> query many times with different parameters.
> 
Right, I'm only saving the query text because that's how things were done 
already.  Since it's already there I didn't see a reason not to expose it.

> So I don't think you've quite nailed the problem of how to match the
> results to the commands that originated them, yet. One idea is to add a
> function that can be called after PQgetResult(), to get some identifier
> of the original command. But there needs to be a mechanism to tag the
> PQsendQuery() calls. Or you can assign each call a unique ID
> automatically, and have a way to ask for that ID after calling
> PQsendQuery().
PGquery IS the unique ID, and it is available after calling PQsendQuery by 
calling PQgetLastQuery.  

> 
> The explanation of PQgetFirstQuery makes it sound pretty hard to match
> up the result with the query. You have to pay attention to PQisBusy.
> 
It's not hard at all and is very natural to use since the whole point of an 
async api is to avoid blocking, so it's natural to only call PQgetResult when 
it's not going to block.  PQgetFirstQuery should also be valid after calling 
PQgetResult and then you don't have to worry about PQisBusy, so I should 
probably change the documentation to indicate that is the preferred usage, or 
maybe make that the only guaranteed usage, and say the results are undefined if 
you call it before calling PQgetResult.  That usage also makes it consistent 
with PQgetLastQuery being called immediately after PQsendQuery.

Another option would be a function to get the PGquery for any PGresult.  This 
would make things a bit more straightforward for the user, but more 
complicated in the implementation since multiple PGresults will share the same 
PGquery.  However it's nothing that a reference count wouldn't solve.

> It would be good to make it explicit when you start a pipelined
> operation. Currently, you get an error if you call PQsendQuery() twice
> in a row, without reading the result inbetween. That's a good thing, to
> catch application errors, when you're not trying to do pipelining.
> Otherwise, if you forget to get the result of a query you've sent, and
> then send another query, you'll merrily read the result of the first
> query and think that it belongs to the second.
Agreed, and I think this is the only behavior change currently. An easy fix to 
restore existing behavior by default:

PQsetPipelining(PGconn *conn, int arg); 

should work.

> 
> Are you trying to support "continous pipelining", where you send new
> queries all the time, and read results as they arrive, without ever
> draining the pipe? Or are you just trying to do "batches", where you
> send a bunch of queries, and wait for all the results to arrive, before
> sending more? A batched API would be easier to understand and work with,
> although a "continuous" pipeline could be more efficient for an
> application that can take advantage of it.
> 
I don't see any reason to limit it to batches, though it can certainly be used 
that way.  My first test case does continuous pipelining and it provides a huge  
throughput gain when there's any latency in the connection.  I can envision a 
lot of uses for the continuous approach.

> >> Consideration of implicit transactions (autocommit), the whole pipeline
> >> being one transaction, or multiple transactions is needed.
> > 
> > The more I think about this the more confident I am that no extra work is
> > needed.
> > 
> > Unless we start doing some preliminary processing of the query inside of
> > libpq, our hands are tied wrt sending a sync at the end of each query. 
> > The
> > reason for this is that we rely on the ReadyForQuery message to indicate
> > the end of a query, so without the sync there is no way to tell if the
> > next result is from another statement in the current query, or the first
> > statement in the next query.
> > 
> > I also don't see a reason to need multiple queries without a sync
> > statement. If the user wants all queries to succeed or fail together it
> > should be no problem to start the pipeline with begin and complete it
> > commit.  But I may be missing some detail...
> 
> True. It makes me a bit uneasy, though, to not be sure that the whole
> batch is committed or rolled back as one unit. There are many ways the
> user can shoot himself in the foot with that. Error handling would be a
> lot simpler if you would only send one Sync for the whole batch. Tom
> explained it better on this recent thread:
> http://www.postgresql.org/message-id/32086.1415063405@sss.pgh.pa.us.
> 
I've read tom's email and every other one i could find related to pipelining 
and I simply don't see the problem.  

What's the advantage of being able to pipeline multiple queries without sync 
and without an explicit transaction, vs an explicit transaction with a sync 
per query?

The usage I have in mind for pipelining needs each query to succeed or fail 
independently of any query before or after it, unless the caller explicitly 
uses a transaction.

> Another thought is that for many applications, it would actually be OK
> to not know which query each result belongs to. For example, if you
> execute a bunch of inserts, you often just want to get back the total
> number of inserted, or maybe not even that. Or if you execute a "CREATE
> TEMPORARY TABLE ... ON COMMIT DROP", followed by some insertions to it,
> some more data manipulations, and finally a SELECT to get the results
> back. All you want is the last result set.
Easy, grab the PGquery from only the last select, call PQgetResult until you 
get a matching result, PQclear the rest.  We could provide a function to do 
just that as a convenience to the user, if it's going to be a common use-case.  
It would be no different than how PQexec processes and discards results.

Matt



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: New wal format distorts pg_xlogdump --stats
Next
From: Jim Nasby
Date:
Subject: Re: duplicated partial-column assigns allowed by checkInsertTargets